#!/bin/tcsh
#PBS -l walltime=24:00:00
#PBS -N Relion
#PBS -S /bin/tcsh
#PBS -q batch
#PBS -l nodes=6:ppn=8
#PBS -l mem=2gb
#PBS -k oe
# Environment
source ~/.cshrc
cd /cbio/ski/pavletich/home/pavletin/test_onenode/20151002small/
mpirun -n 48 `which relion_refine_mpi` --o Class2D/run3a --i merged3e.star --particle_diameter 340 --angpix 2.82 --ctf --iter 25 --tau2_fudge 1.5 --K 50 --flatten_solvent --zero_mask --strict_highres_exp 12 --oversampling 1 --psi_step 10 --offset_range 10 --offset_step 4 --norm --scale --j 1 --memory_per_thread 4
Judging from the environment we see on submitted jobs, I don't believe your method of setting up that environment is making it into the qsub. I don't believe your source ~/.cshrc is working.
For starters, your login shell is bash. Would you like that changed? It would remove a level of debugging complexity here for us (rather than running qsubs through a non-login shell).
Then, I'd really like to understand how your mpirun command is getting the Torque scheduler's list of hosts. Normally that is provided in Torque and Moab via a file on the scheduled host that the environment points to as $PBS_NODEFILE, and I usually see an extra argument to mpirun referencing that environment variable, something like -machinefile $PBS_NODEFILE.
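Concretely, taking the mpirun line from your script above, that change would look something like this (only the -machinefile argument is new; everything after the RELION binary is unchanged):

mpirun -machinefile $PBS_NODEFILE -n 48 `which relion_refine_mpi` (rest of your arguments as before)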
setenv PATH /cbio/ski/pavletich/home/pavletin/npp_openmpi_install/bin:$PATH
setenv LD_LIBRARY_PATH /cbio/ski/pavletich/home/pavletin/npp_openmpi_install/lib:$LD_LIBRARY_PATH
setenv LD_LIBRARY_PATH /cbio/ski/pavletich/home/pavletin/relion-1.4/lib:$LD_LIBRARY_PATH
setenv PATH /cbio/ski/pavletich/home/pavletin/relion-1.4/bin:$PATH
That depends on some compile options. It should, but I assumed you were using one of our system-provided OpenMPI builds.
I am double-checking whether your source-built mpirun got the proper configure options to automatically detect the machine list or not (I show you are using your own OpenMPI build).
Give me a moment to look.
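One quick way you can check that yourself (assuming the ompi_info from your own build is first in your PATH) is to list the Torque-related MCA components; a build with native Torque support shows tm entries, and if this prints nothing the build cannot query the scheduler for its host list on its own:

ompi_info | grep ': tm'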
If you want to try something just to see if the behavior changes, edit the mpirun line in your qsub script, dropping the -n 48 and adding the machinefile argument, so it reads:
mpirun --machinefile $PBS_NODEFILE (rest of line)
I am still looking at your source build.
Reviewing your build based on this item: https://www.open-mpi.org/faq/?category=building#build-rte-tm
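For reference, building OpenMPI with that native Torque (tm) support generally looks like the sketch below. The --with-tm path is an assumption and should point at wherever Torque is installed on this cluster; the --prefix matches the install location you are already using:

./configure --prefix=/cbio/ski/pavletich/home/pavletin/npp_openmpi_install --with-tm=/opt/torque
make && make install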
echo $PBS_NODEFILE returns nothing. Is this OK?
Shall I kill 6122600? Elapsed time 46 min, no output.
Where are you issuing that command for $PBS_NODEFILE? It's only defined in the context of the qsub job.
(One sec...long suggestion coming)
Submit an interactive qsub with the same resource items you are already using but add -I
on the end please. Do NOT submit the script. (Note that is a capital I)
qsub -l (your node and ppn choice) -I
Then, make sure your Relion and OpenMPI are in the path by running tcsh and sourcing your environment. Verify with which mpirun
Then see whether cat $PBS_NODEFILE
lists a set of hosts (some repeated); the number of lines should match nodes * ppn from above.
Then what I'd like you to do is issue
mpirun --machinefile $PBS_NODEFILE (rest of arguments, but leave off -n XX)
I want to know if that at least correctly invokes your Relion code; see the consolidated sketch below. Then we'll work on having your OpenMPI correctly talk to Torque...
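Putting those steps together, the interactive test would look roughly like the following (the resource request mirrors your script; the RELION arguments are the same ones you are already using; the which and cat lines are just the checks described above):

qsub -l nodes=6:ppn=8 -I
tcsh
source ~/.cshrc
which mpirun
cat $PBS_NODEFILE
mpirun --machinefile $PBS_NODEFILE `which relion_refine_mpi` (rest of your arguments, leaving off -n)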
Shall I kill 6122600 before that?
Sure, just to start fresh.
Yeah, no, I need you to run an interactive qsub, not submit your script with "-I" added to it.
Start on the head node by just running:
qsub -l nodes=4:ppn=4 -I
You'll get a shell on a node. Do the cat $PBS_NODEFILE there.
You should also add -q active
to your interactive qsub.
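That is, the full interactive submission becomes:

qsub -q active -l nodes=4:ppn=4 -I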
qsub -l nodes=4:ppn=4 -I
qsub: waiting for job 6124843.mskcc-fe1.local to start
qsub: job 6124843.mskcc-fe1.local ready
-bash-4.1$ cat $PBS_NODEFILE
gpu-2-16
gpu-2-16
gpu-2-16
gpu-2-16
gpu-2-13
gpu-2-13
gpu-2-13
gpu-2-13
gpu-2-13
gpu-2-13
gpu-2-13
gpu-2-13
gpu-2-13
gpu-2-13
gpu-2-13
gpu-2-13
which mpirun
/opt/mpich2/gcc/eth/bin/mpirun
That's wrong.
You want to make sure your OpenMPI is in the path first.
So source .cshrc from that qsub.
So, to be precise: repeat the process to get an interactive shell, and make sure your environment items are set up the same way as in your qsub (i.e., you were running tcsh and sourcing .cshrc).
I expect to see your copy of mpirun as the result of which mpirun.
Then, consider running it with the --machinefile $PBS_NODEFILE, because I don't think it's talking to Torque right.
And really, if you prefer to work in tcsh, we should change your shell. This is adding unnecessary confusion.
From .cshrc:
setenv PATH /cbio/ski/pavletich/home/pavletin/npp_openmpi_install/bin:$PATH
setenv LD_LIBRARY_PATH /cbio/ski/pavletich/home/pavletin/npp_openmpi_install/lib:$LD_LIBRARY_PATH
tcsh
[pavletin@gpu-2-16 ~]$ source .cshrc
[pavletin@gpu-2-16 ~]$ which mpirun
/cbio/ski/pavletich/home/pavletin/npp_openmpi_install/bin/mpirun
[pavletin@gpu-2-16 ~]$
Ok, that's looking good. Confirm for me that $PBS_NODES contains multiple hosts...
$PBS_NODEFILE...sorry
cat $PBS_NODEFILE
qsub -l nodes=4:ppn=4 -I
qsub: waiting for job 6128812.mskcc-fe1.local to start
qsub: job 6128812.mskcc-fe1.local ready
-bash-4.1$ cat $PBS_NODEFILE
gpu-2-15
gpu-2-15
gpu-2-15
gpu-2-15
gpu-2-15
gpu-2-15
gpu-2-15
gpu-2-15
gpu-2-15
gpu-2-15
gpu-2-15
gpu-2-15
gpu-1-5
gpu-1-5
gpu-1-5
gpu-1-5
-bash-4.1$ which mpirun
/opt/mpich2/gcc/eth/bin/mpirun
-bash-4.1$
Well, you've dropped into bash again.
How did I?
Your user login shell is bash.
Which is why I'm recommending, if you prefer tcsh (based on your setting things up in .cshrc), that you let me change your login shell to tcsh.
Or, if you prefer bash, you should make those environment changes in a bash-related config file.
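For reference, if you did go the bash route, the equivalents of your .cshrc lines (same paths, just exported from something like ~/.bashrc) would be roughly:

export PATH=/cbio/ski/pavletich/home/pavletin/npp_openmpi_install/bin:$PATH
export LD_LIBRARY_PATH=/cbio/ski/pavletich/home/pavletin/npp_openmpi_install/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/cbio/ski/pavletich/home/pavletin/relion-1.4/lib:$LD_LIBRARY_PATH
export PATH=/cbio/ski/pavletich/home/pavletin/relion-1.4/bin:$PATH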
I prefer tcsh already
Logout completely. It is best to live in the shell you intend to configure things for.
logged out
OK. Log back in. Your default shell will now be tcsh. Please verify that your environment choices are now correct BEFORE we run another interactive qsub. Aka
which mpirun
Should return your built copy as you've defined that in .cshrc
it is tcsh.
qsub -l nodes=4:ppn=4 -I
qsub: waiting for job 6128814.mskcc-fe1.local to start
qsub: job 6128814.mskcc-fe1.local ready
[pavletin@gpu-1-9 ~]$ which mpirun
/cbio/ski/pavletich/home/pavletin/npp_openmpi_install/bin/mpirun
Cool. Now, also verify that cat $PBS_NODEFILE
in that same shell has lots of lines from the nodes you've been assigned.
Now the output looks the way it should
mpirun --machinefile $PBS_NODEFILE `which relion_refine_mpi` --o Class2D/run3a --i merged3e.star --particle_diameter 340 --angpix 2.82 --ctf --iter 25 --tau2_fudge 1.5 --K 50 --flatten_solvent --zero_mask --strict_highres_exp 12 --oversampling 1 --psi_step 10 --offset_range 10 --offset_step 4 --norm --scale --j 1 --memory_per_thread 4
=== RELION MPI setup ===
+ Number of MPI processes = 16
+ Master (0) runs on host = gpu-1-9.local
+ Slave 1 runs on host = gpu-1-9.local
+ Slave 2 runs on host = gpu-1-9.local
+ Slave 3 runs on host = gpu-1-9.local
+ Slave 4 runs on host = gpu-1-9.local
+ Slave 5 runs on host = gpu-1-9.local
+ Slave 6 runs on host = gpu-1-9.local
+ Slave 7 runs on host = gpu-1-9.local
+ Slave 8 runs on host = gpu-1-9.local
+ Slave 9 runs on host = gpu-1-9.local
+ Slave 10 runs on host = gpu-1-9.local
+ Slave 11 runs on host = gpu-1-9.local
+ Slave 12 runs on host = gpu-1-9.local
+ Slave 13 runs on host = gpu-1-9.local
+ Slave 14 runs on host = gpu-1-9.local
+ Slave 15 runs on host = gpu-1-9.local
=================
Running in double precision.
Estimating initial noise spectra
24/ 24 sec ............................................................~~(,_,">
WARNING: There are only 2 particles in group 131
WARNING: There are only 3 particles in group 142
WARNING: There are only 4 particles in group 161
WARNING: There are only 1 particles in group 163
WARNING: There are only 1 particles in group 165
WARNING: There are only 2 particles in group 174
WARNING: There are only 2 particles in group 175
WARNING: There are only 1 particles in group 176
WARNING: There are only 4 particles in group 177
WARNING: There are only 1 particles in group 178
WARNING: There are only 1 particles in group 179
Yeah! Ok. Now, this still leaves an unexplained item. You should not HAVE to define the machinefile; OpenMPI is supposed to come with support for talking to Torque to get that data. I am reviewing your build and the code itself to understand the situation better.
We may have one further "MPI placement refinement" we will get back to you with for a qsub.
Does that mean my qsub.csh will run now?
Yes. Make sure for now you add the --machinefile $PBS_NODEFILE
And make sure you add it as the variable, i.e. write $PBS_NODEFILE literally in the script, not the file path it expanded to in your interactive qsub.
What happened?
[cpu-6-2][[11535,1],12][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.56.1 failed: Connection refused ( 111)
That IP address is rather odd, as it's not anywhere in our environment.
Huh. It's picked up somebody's VirtualBox running on cpu-6-2. I think we need another argument to mpirun.
Here is the script; what needs to be changed?
cat relion-1.4/bin/qsub-cbio_YG_nodefile.csh
#!/bin/tcsh
#PBS -l walltime=24:00:00
#PBS -N Relion
#PBS -S /bin/tcsh
#PBS -q batch
#PBS -l nodes=6:ppn=8
#PBS -l mem=2gb
#PBS -k oe
# Environment
source ~/.cshrc
cd /cbio/ski/pavletich/home/pavletin/test_onenode/20151002small/
mpirun --machinefile $PBS_NODEFILE `which relion_refine_mpi` --o Class2D/run3a --i merged3e.star --particle_diameter 340 --angpix 2.82 --ctf --iter 25 --tau2_fudge 1.5 --K 50 --flatten_solvent --zero_mask --strict_highres_exp 12 --oversampling 1 --psi_step 10 --offset_range 10 --offset_step 4 --norm --scale --j 1 --memory_per_thread 4
I am especially interested in --j 1
Not sure at the moment.
Not sure at the moment what argument limits the interfaces used by MPI. Reading. But it's picking up an interface being used by a virtual machine on that node.
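For what it's worth, a commonly used OpenMPI knob for this (my assumption; it is not confirmed anywhere in this thread) is the btl_tcp_if_exclude / btl_tcp_if_include MCA parameter, which restricts which interfaces or subnets the TCP transport may use. Excluding the VirtualBox host-only subnet seen in the error above (and the loopback, since overriding the parameter replaces the default exclude list) would look roughly like:

mpirun --mca btl_tcp_if_exclude 192.168.56.0/24,lo --machinefile $PBS_NODEFILE (rest of arguments as before)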
Hello, we need to install Relion
http://www2.mrc-lmb.cam.ac.uk/groups/scheres/relion13_tutorial.pdf
on the Hal cluster. Assistance will be much appreciated.
Regards,
Yehuda