cBio / cbio-cluster

MSKCC cBio cluster documentation

Relion install #329

Closed: goldgury closed this issue 9 years ago

goldgury commented 9 years ago

Hello, we need to install Relion

http://www2.mrc-lmb.cam.ac.uk/groups/scheres/relion13_tutorial.pdf

on the Hal cluster. Assistance will be much appreciated.

Regards,

Yehuda

goldgury commented 9 years ago
#!/bin/tcsh
#PBS -l walltime=24:00:00
#PBS -N Relion 
#PBS -S /bin/tcsh
#PBS -q batch
#PBS -l nodes=6:ppn=8 
#PBS -l mem=2gb
#PBS -k oe

# Environment
source ~/.cshrc
cd /cbio/ski/pavletich/home/pavletin/test_onenode/20151002small/
mpirun -n 48 `which relion_refine_mpi` --o Class2D/run3a --i merged3e.star --particle_diameter 340 --angpix 2.82 --ctf --iter 25 --tau2_fudge 1.5 --K 50 --flatten_solvent  --zero_mask  --strict_highres_exp 12 --oversampling 1 --psi_step 10 --offset_range 10 --offset_step 4 --norm --scale  --j 1 --memory_per_thread 4 
tatarsky commented 9 years ago

Judging from the environment we see on submitted jobs, I don't believe your method of setting up that environment is making it into the qsub. In other words, your source .cshrc does not appear to be working.

For starters, your login shell is bash. Would you like that changed? It would remove a level of debugging complexity for us here (rather than involving qsubs via a non-login shell).

tatarsky commented 9 years ago

Next, I'd really like to understand how your mpirun command is getting the Torque scheduler's list of hosts. Normally Torque and Moab provide that via a file on the scheduled host that the environment points to as $PBS_NODEFILE, and usually I see an extra argument to mpirun that references that variable, something like -machinefile $PBS_NODEFILE.
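
Inside a running job, the checks look roughly like this (the spool path shown is only illustrative, not the actual location here):

echo $PBS_NODEFILE       # e.g. /var/spool/torque/aux/<jobid> (illustrative path)
cat $PBS_NODEFILE        # one line per allocated slot; nodes * ppn lines in total
mpirun -machinefile $PBS_NODEFILE -n 48 `which relion_refine_mpi` (rest of arguments)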

goldgury commented 9 years ago
setenv PATH            /cbio/ski/pavletich/home/pavletin/npp_openmpi_install/bin:$PATH
setenv LD_LIBRARY_PATH /cbio/ski/pavletich/home/pavletin/npp_openmpi_install/lib:$LD_LIBRARY_PATH

setenv LD_LIBRARY_PATH /cbio/ski/pavletich/home/pavletin/relion-1.4/lib:$LD_LIBRARY_PATH
setenv PATH            /cbio/ski/pavletich/home/pavletin/relion-1.4/bin:$PATH
tatarsky commented 9 years ago

That depends on some compile options. It should, but I assumed you were using one of our system-provided OpenMPIs.

I am double-checking whether your source-built mpirun got the proper configure options to automatically detect the machine list or not... (I see you are using your own OpenMPI build.)

tatarsky commented 9 years ago

Give me a moment to look.

tatarsky commented 9 years ago

If you want to try something just to see whether the behavior changes, edit the mpirun line in your qsub script and replace the -n 48 with a machinefile argument:

mpirun --machinefile $PBS_NODEFILE  (rest of line)

I am still looking at your source build.

tatarsky commented 9 years ago

Reviewing your build based on this item: https://www.open-mpi.org/faq/?category=building#build-rte-tm
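
For reference, that FAQ entry covers configuring OpenMPI with Torque (tm) support, which is what lets mpirun discover the allocated hosts without a machinefile. A minimal sketch of such a build, where the Torque install prefix /opt/torque is an assumption:

./configure --prefix=/cbio/ski/pavletich/home/pavletin/npp_openmpi_install --with-tm=/opt/torque   # --with-tm points at the Torque install (path assumed)
make all install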

goldgury commented 9 years ago

echo $PBS_NODEFILE returns nothing. Is this OK?

goldgury commented 9 years ago

Shall I kill 6122600? Elapsed time 46 min, no output.

tatarsky commented 9 years ago

Where are you issuing that command for $PBS_NODEFILE? It's only defined in the context of the qsub.

tatarsky commented 9 years ago

(One sec...long suggestion coming)

tatarsky commented 9 years ago

Submit an interactive qsub with the same resource items you are already using, but add -I at the end, please. Do NOT submit the script. (Note that is a capital I.)

qsub -l (your node and ppn choice) -I

Then, make sure your Relion and OpenMPI are in the path by running tcsh and sourcing your environment. Verify with which mpirun

Then check that cat $PBS_NODEFILE lists a set of hosts (some repeated); the number of lines should match the nodes * ppn count from the resource request above.

Then what I'd like you to do is issue:

mpirun --machinefile $PBS_NODEFILE  (rest of arguments but leave off -n XX)

I want to know whether that at least correctly invokes your Relion code. Then we'll work on having your OpenMPI build talk to Torque correctly...
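
Put together, the interactive check would look roughly like this (a sketch that mirrors your batch resource request; adjust as needed):

qsub -l nodes=6:ppn=8 -I
tcsh
source ~/.cshrc
which mpirun                     # should point at your npp_openmpi_install copy
wc -l $PBS_NODEFILE              # should equal nodes * ppn (48 here)
sort $PBS_NODEFILE | uniq -c     # each assigned host with its slot count
mpirun --machinefile $PBS_NODEFILE `which relion_refine_mpi` (rest of arguments, leaving off -n)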

goldgury commented 9 years ago

Shall I kill 6122600 before that?

tatarsky commented 9 years ago

Sure, just to start fresh.

tatarsky commented 9 years ago

Yeah, no. I need you to run an interactive qsub from the command line, not submit your script with "-I" added to it.

Start on the head node by just running:

qsub -l nodes=4:ppn=4 -I

You'll get a shell on a node. Do the cat $PBS_NODEFILE there.

akahles commented 9 years ago

You should also add -q active to your interactive qsub.
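
That is, something like:

qsub -l nodes=4:ppn=4 -q active -I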

goldgury commented 9 years ago

qsub -l nodes=4:ppn=4 -I
qsub: waiting for job 6124843.mskcc-fe1.local to start
qsub: job 6124843.mskcc-fe1.local ready

-bash-4.1$ cat $PBS_NODEFILE
gpu-2-16
gpu-2-16
gpu-2-16
gpu-2-16
gpu-2-13
gpu-2-13
gpu-2-13
gpu-2-13
gpu-2-13
gpu-2-13
gpu-2-13
gpu-2-13
gpu-2-13
gpu-2-13
gpu-2-13
gpu-2-13

goldgury commented 9 years ago

which mpirun
/opt/mpich2/gcc/eth/bin/mpirun

tatarsky commented 9 years ago

which mpirun
/opt/mpich2/gcc/eth/bin/mpirun

That's wrong.

You want to make sure your OpenMPI is in the path first.

So source .cshrc from that qsub.

tatarsky commented 9 years ago

So, to be precise: repeat the process to get an interactive shell, and make sure your environment items are set up the same way as in your qsub script (i.e. you were running tcsh and sourcing .cshrc).

I expect to see your copy of mpirun as the result of which mpirun.

Then, consider running it with --machinefile $PBS_NODEFILE, because I don't think it's talking to Torque correctly.

tatarsky commented 9 years ago

And really, if you prefer to work in tcsh we should change your login shell. The mismatch is adding unnecessary confusion to this.

goldgury commented 9 years ago

From .cshrc:

setenv PATH            /cbio/ski/pavletich/home/pavletin/npp_openmpi_install/bin:$PATH
setenv LD_LIBRARY_PATH /cbio/ski/pavletich/home/pavletin/npp_openmpi_install/lib:$LD_LIBRARY_PATH

tcsh
[pavletin@gpu-2-16 ~]$ source .cshrc
[pavletin@gpu-2-16 ~]$ which mpirun
/cbio/ski/pavletich/home/pavletin/npp_openmpi_install/bin/mpirun
[pavletin@gpu-2-16 ~]$

tatarsky commented 9 years ago

OK, that's looking good. Confirm for me that $PBS_NODES contains multiple hosts...

tatarsky commented 9 years ago

$PBS_NODEFILE...sorry

tatarsky commented 9 years ago

cat $PBS_NODEFILE

goldgury commented 9 years ago

qsub -l nodes=4:ppn=4 -I
qsub: waiting for job 6128812.mskcc-fe1.local to start
qsub: job 6128812.mskcc-fe1.local ready

-bash-4.1$ cat $PBS_NODEFILE
gpu-2-15
gpu-2-15
gpu-2-15
gpu-2-15
gpu-2-15
gpu-2-15
gpu-2-15
gpu-2-15
gpu-2-15
gpu-2-15
gpu-2-15
gpu-2-15
gpu-1-5
gpu-1-5
gpu-1-5
gpu-1-5
-bash-4.1$ which mpirun
/opt/mpich2/gcc/eth/bin/mpirun
-bash-4.1$

tatarsky commented 9 years ago

Well, you've dropped into bash again.

goldgury commented 9 years ago

How did I?

tatarsky commented 9 years ago

Your user login shell is bash.

tatarsky commented 9 years ago

Which is why I'm really recommending, if you prefer tcsh (based on your setting of items in .cshrc), that you let me change your login shell to tcsh.

tatarsky commented 9 years ago

Or, if you prefer bash, you should make those environment changes in a bash-related config file.
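
For instance, the bash equivalents of your setenv lines would go in ~/.bashrc (a sketch, only relevant if you stay with bash):

export PATH=/cbio/ski/pavletich/home/pavletin/npp_openmpi_install/bin:$PATH
export LD_LIBRARY_PATH=/cbio/ski/pavletich/home/pavletin/npp_openmpi_install/lib:$LD_LIBRARY_PATH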

goldgury commented 9 years ago

I prefer tcsh already

tatarsky commented 9 years ago

Log out completely. It is best to live in the shell you intend to configure things for.

goldgury commented 9 years ago

logged out

tatarsky commented 9 years ago

OK. Log back in. Your default shell will now be tcsh. Please verify that your environment choices are now correct BEFORE we run another interactive qsub. That is:

which mpirun

should return your own built copy, as you've defined it in .cshrc.

goldgury commented 9 years ago

it is tcsh.

qsub -l nodes=4:ppn=4 -I
qsub: waiting for job 6128814.mskcc-fe1.local to start
qsub: job 6128814.mskcc-fe1.local ready

[pavletin@gpu-1-9 ~]$ which mpirun
/cbio/ski/pavletich/home/pavletin/npp_openmpi_install/bin/mpirun

tatarsky commented 9 years ago

Cool. Now, also verify that cat $PBS_NODEFILE in that same shell has lots of lines from the nodes you've been assigned.

goldgury commented 9 years ago

Now the output looks the way it should:

mpirun --machinefile $PBS_NODEFILE `which relion_refine_mpi` --o Class2D/run3a --i merged3e.star --particle_diameter 340 --angpix 2.82 --ctf --iter 25 --tau2_fudge 1.5 --K 50 --flatten_solvent  --zero_mask  --strict_highres_exp 12 --oversampling 1 --psi_step 10 --offset_range 10 --offset_step 4 --norm --scale  --j 1 --memory_per_thread 4 
 === RELION MPI setup ===
 + Number of MPI processes             = 16
 + Master  (0) runs on host            = gpu-1-9.local
 + Slave     1 runs on host            = gpu-1-9.local
 + Slave     2 runs on host            = gpu-1-9.local
 + Slave     3 runs on host            = gpu-1-9.local
 + Slave     4 runs on host            = gpu-1-9.local
 + Slave     5 runs on host            = gpu-1-9.local
 + Slave     6 runs on host            = gpu-1-9.local
 + Slave     7 runs on host            = gpu-1-9.local
 + Slave     8 runs on host            = gpu-1-9.local
 + Slave     9 runs on host            = gpu-1-9.local
 + Slave    10 runs on host            = gpu-1-9.local
 + Slave    11 runs on host            = gpu-1-9.local
 + Slave    12 runs on host            = gpu-1-9.local
 + Slave    13 runs on host            = gpu-1-9.local
 + Slave    14 runs on host            = gpu-1-9.local
 + Slave    15 runs on host            = gpu-1-9.local
 =================
 Running in double precision. 
 Estimating initial noise spectra 
  24/  24 sec ............................................................~~(,_,">
WARNING: There are only 2 particles in group 131
WARNING: There are only 3 particles in group 142
WARNING: There are only 4 particles in group 161
WARNING: There are only 1 particles in group 163
WARNING: There are only 1 particles in group 165
WARNING: There are only 2 particles in group 174
WARNING: There are only 2 particles in group 175
WARNING: There are only 1 particles in group 176
WARNING: There are only 4 particles in group 177
WARNING: There are only 1 particles in group 178
WARNING: There are only 1 particles in group 179
tatarsky commented 9 years ago

Yeah! OK. Now, this still leaves one unexplained item: you should not HAVE to define the machinefile. OpenMPI is supposed to come with support for talking to Torque to get that data. I am reviewing your build and the code itself to understand the situation better.
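
One way to check whether a given OpenMPI build has that Torque (tm) support is to look for the tm components in ompi_info (assuming your own build's ompi_info is first in the PATH):

ompi_info | grep -i tm    # a build with Torque support lists tm components (e.g. under ras/plm); no output suggests it was configured without --with-tm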

tatarsky commented 9 years ago

We may have one further "MPI placement refinement" for the qsub that we will get back to you with.

goldgury commented 9 years ago

Does this mean my qsub.csh will run now?

tatarsky commented 9 years ago

Yes. For now, make sure you add --machinefile $PBS_NODEFILE, and make sure you pass it as the variable (not the literal value it expanded to in your interactive qsub).

goldgury commented 9 years ago

What happened?

[cpu-6-2][[11535,1],12][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.56.1 failed: Connection refused ( 111)

tatarsky commented 9 years ago

That IP address is rather odd, as it's not anywhere in our environment.

tatarsky commented 9 years ago

Huh. It's picked up somebody's VirtualBox interface running on cpu-6-2. I think we need another argument to mpirun.

goldgury commented 9 years ago

Here is the script; what needs to be changed?

cat relion-1.4/bin/qsub-cbio_YG_nodefile.csh 
#!/bin/tcsh
#PBS -l walltime=24:00:00
#PBS -N Relion 
#PBS -S /bin/tcsh
#PBS -q batch
#PBS -l nodes=6:ppn=8 
#PBS -l mem=2gb
#PBS -k oe

# Environment
source ~/.cshrc
cd /cbio/ski/pavletich/home/pavletin/test_onenode/20151002small/
mpirun --machinefile $PBS_NODEFILE `which relion_refine_mpi` --o Class2D/run3a --i merged3e.star --particle_diameter 340 --angpix 2.82 --ctf --iter 25 --tau2_fudge 1.5 --K 50 --flatten_solvent  --zero_mask  --strict_highres_exp 12 --oversampling 1 --psi_step 10 --offset_range 10 --offset_step 4 --norm --scale  --j 1 --memory_per_thread 4 
goldgury commented 9 years ago

I am especially interested in --j 1

tatarsky commented 9 years ago

Not sure at the moment.

tatarsky commented 9 years ago

Not sure at the moment which argument limits the interfaces used by MPI. Reading. But it's picking up an interface being used by a virtual machine on that node.
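
A likely candidate is OpenMPI's TCP BTL interface selection. A sketch of the idea; the interface names (lo, virbr0, vboxnet0, eth0) are assumptions that would need to be checked against the actual nodes:

# skip loopback and virtual-machine interfaces:
mpirun --machinefile $PBS_NODEFILE --mca btl_tcp_if_exclude lo,virbr0,vboxnet0 `which relion_refine_mpi` (rest of arguments)
# or list only the cluster-facing interface instead:
mpirun --machinefile $PBS_NODEFILE --mca btl_tcp_if_include eth0 `which relion_refine_mpi` (rest of arguments)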