cBio / cbio-cluster

MSKCC cBio cluster documentation

Relion install #329

Closed: goldgury closed this issue 9 years ago

goldgury commented 9 years ago

Hello, we need to install Relion

http://www2.mrc-lmb.cam.ac.uk/groups/scheres/relion13_tutorial.pdf

on Hal cluster. Assistance will be much appreciated.

Regards,

Yehuda

tatarsky commented 9 years ago

What class of assistance are you asking for here? Building of software on the Hal cluster is normally done by the groups involved. Are you missing some dependency or system level integration?

goldgury commented 9 years ago

I am new to the cluster. Will try to install it myself. I might need extra permissions.

tatarsky commented 9 years ago

What do you mean by "extra permissions"?

goldgury commented 9 years ago

The default install is done by root. I will try to get around that by changing the target directories; if that does not work, I will ask for help.

tatarsky commented 9 years ago

We do not give users root. What group are you in? Most project software is installed in directories shared by the group; your home directory tree should have such a group-writable directory one level above your homedir. If you need a module to make it easier to set paths and other environment variables, let me know. We are happy to make those for project software so the whole group can use it more easily.

goldgury commented 9 years ago

I am not sure what group I am in. I was not expecting to get root. I will let you know if I need soft links, etc.

tatarsky commented 9 years ago

I am happy to review what you are doing to avoid the need for non-standard soft links. From a glance at the install document, it looks like a module would be helpful. When you get ready to build, perhaps we can chat in a more real-time format so I can give you some advice on making it easier.

tatarsky commented 9 years ago

I'm going to note a few things for you to review before starting.

There are several MPI libraries on this cluster, all of which were installed by previous administrators. I have never been fully clear on which is preferred, but perhaps other Hal users can give you advice.

There are the RPM-installed OpenMPI libraries, which I believe have a helper module called "openmpi-x86_64" to set the environment variables:

module add openmpi-x86_64

There is an additional module-loaded OpenMPI library whose purpose I am unclear on:

module add openmpi_eth

And then there is also MPICH, in both RPM and module form.

This will influence an early decision in the build process: which MPI to try. I believe I agree with the suggestion in the Relion documentation to use OpenMPI.
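
If it helps, a quick way to see what is available and confirm which MPI ends up in your environment (a sketch using the module names above; the exact list on Hal may differ):

module avail 2>&1 | grep -i mpi       # "module avail" prints to stderr, hence the redirect
module add openmpi_eth/1.6.3          # pick one MPI and use the same one for build and run
which mpicc mpirun                    # confirm both point into the same MPI tree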

I will as a test do a build using that module to see if you are going to hit dependency issues.

jchodera commented 9 years ago

You probably already are aware of this, but you need to compile with the same version of MPI that you intend to launch with via mpirun.

tatarsky commented 9 years ago

That is why I am warning ahead of time of the many MPI copies out there....

tatarsky commented 9 years ago

Just as an example I built the code with:

module add openmpi_eth/1.6.3

And because the system already provides a copy of the FFTW library, I set BUILD_FFTW=false.

Does that code then work? I do not know ;)

If you want a module's path and settings made permanent in your login, use the syntax:

module initadd openmpi_eth/1.6.3

But I just built it in a test area to confirm it got the needed dependencies....hope that is helpful.
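
For reference, the overall flow was roughly the following; the unpack location is a placeholder and the install script's exact name in the tarball may differ from this sketch:

module add openmpi_eth/1.6.3              # build against the same MPI you will run with
cd $HOME/src/relion-1.4                   # hypothetical unpack location
# in the install script, set BUILD_FFTW=false so the system FFTW is used
./INSTALL.sh 2>&1 | tee install.log       # installs into a user-writable prefix, no root needed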

tatarsky commented 9 years ago

And when you get to this part of the install document:

http://www2.mrc-lmb.cam.ac.uk/relion/index.php/Download_%26_install

" Edit the environment set-up"

Instead of adding those settings to your .bashrc, you might want a quick module called relion that captures all of them in a way anyone needing the code can use. Again, optional, but I am happy to make one.
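
For comparison, the .bashrc route would look roughly like this (the install prefix here is hypothetical); a relion module would set the same variables centrally for the whole group:

export PATH=$HOME/relion-1.4/bin:$PATH
export LD_LIBRARY_PATH=$HOME/relion-1.4/lib:$LD_LIBRARY_PATH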

tatarsky commented 9 years ago

You appear to be running your code on the head node.

goldgury commented 9 years ago

What do I need to do?

tatarsky commented 9 years ago

Normally the head node is only where people submit Torque/Moab jobs from. I assume your code can be submitted as a job; that would place it properly on compute nodes with the needed memory and CPU resources.

goldgury commented 9 years ago

I stopped it. How do I submit as a job?

akahles commented 9 years ago

https://github.com/cBio/cbio-cluster/wiki/MSKCC-cBio-Cluster-User-Guide

tatarsky commented 9 years ago

Yep. Consult the user guide and if you have questions we will attempt to answer them.

You might want to start with trying this in an interactive job out on a node to determine the resources you are going to need:

https://github.com/cBio/cbio-cluster/wiki/MSKCC-cBio-Cluster-User-Guide#interactive-jobs
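
Something along these lines should get you a shell on a compute node to experiment with (resource values are placeholders; check the user guide for the site defaults):

qsub -I -l nodes=1:ppn=8 -l mem=16gb -l walltime=04:00:00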

jchodera commented 9 years ago

@goldgury: I am happy to spend a few minutes helping you get running through the batch queue today if you are still having trouble this afternoon! Are you in RRL?

goldgury commented 9 years ago

Yes rrl214, ext 3944

jchodera commented 9 years ago

I could drop by your office at 1:30 PM. Could that work?

goldgury commented 9 years ago

Great, thank you

jchodera commented 9 years ago

Quick update: I spent an hour with Yehuda (@goldgury) debugging a Torque batch script, and we figured out that Relion seems to work when run on the same node (e.g. with nodes=1:ppn=8) but not when MPI processes are scattered across different nodes (e.g. nodes=8,tpn=1). I'm looking into Relion docs to see if there is some other flag we need to feed it to get it to be happy scaling across multiple nodes, since they would like to parallelize to the 100-thread level if possible.
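
For anyone following along, a minimal single-node script for the case that works would look roughly like this (resource values and the trailing Relion arguments are placeholders):

#!/bin/bash
#PBS -l nodes=1:ppn=8
#PBS -l mem=16gb
#PBS -l walltime=24:00:00
cd $PBS_O_WORKDIR
module add openmpi_eth/1.6.3
mpirun -np 8 relion_refine_mpi ...        # actual Relion options omitted here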

jchodera commented 9 years ago

The Relion FAQ makes reference to a hybrid parallelization model where it uses a combination of MPI (to parallelize across nodes) and threads (to parallelize within nodes):

I am buying a new cluster, what do you recommend to run RELION on? This will of course depend on how much money you are willing to spend, and what kind of jobs you are planning to run. RELION is memory-intensive. Fortunately, it's hybrid-parallelisation allows to make use of modern clusters that consist of many multi-core nodes. In this set-up, MPI-parallelisation provides scalability across the many nodes, while pthreads allow to share the memory available on each of the nodes without leaving its multiple cores idle. Therefore, as long as each node has in total sufficient memory, one can always run multiple threads (and only one or a few MPI job) on each node. Therefore, RAM/node is probably a more important feature than RAM/core. The bigger the size of the boxed particles, the higher the RAM usage. For our high-resolution ribosome refinements (in boxes of ~400x400 pixels) we use somewhere between 15-25Gb of RAM per MPI process (the most expensive part in terms of RAM is the last iteration, which is done at the full image scale). We have 12-cores with 60Gb of RAM in total, so can run 2 MPI processes on each node. If you're planning to do atomic-resolution structures I wouldn't recommend buying anything that has less than 32Gb per node. Having 64Gb or more will probably keep your cluster up-to-date for longer. Then how many of those nodes you buy will probably depend on your budget (and possibly cooling limitations). We do 3.x Angstrom ribosome reconstructions from say 100-200 thousand particles in approximately two weeks using around 200-300 cores in parallel. Using more nodes in parallel (e.g. 1,000) may cause serious scalability issues.
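
In command-line terms, the hybrid model amounts to something like this for a 12-core node (the rank and thread counts are illustrative; --j is Relion's per-process thread count):

# 2 MPI ranks per node, 6 threads each: all 12 cores stay busy,
# but only 2 copies of the large memory footprint are resident
mpirun -np 2 relion_refine_mpi --j 6 ...  # remaining Relion options omitted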

jchodera commented 9 years ago

@tatarsky has found this thread in Japanese that seems to explain how to properly adapt Relion to running in a Torque environment across multiple nodes. It may take a bit of tinkering to get this to work.

@goldgury: Can you chmod -R a+r one of the example data directories so we can test, and provide an example command-line to test with?

goldgury commented 9 years ago

Did it for test_onenode.

jchodera commented 9 years ago

Did the test with one node seem to work out OK?

goldgury commented 9 years ago

It seems to be running, but the time shows 00:00.

goldgury commented 9 years ago

The time started moving, but there is no output.

jchodera commented 9 years ago

Could you also chmod a+r ~/Relion.* so we can take a peek at the stdout?

jchodera commented 9 years ago

I see output in the Class2D/ subdirectory:

ls -ltr /cbio/ski/pavletich/home/pavletin/test_onenode/20151002small/Class2D/
...
-rw-r--r-- 1 pavletin pavletich 13883948 Oct 12 17:22 run3a_it000_data.star
-rw-r--r-- 1 pavletin pavletich     3809 Oct 12 17:23 run3a_it000_optimiser.star
-rw-r--r-- 1 pavletin pavletich  6772224 Oct 12 17:23 run3a_it000_classes.mrcs
-rw-r--r-- 1 pavletin pavletich  4543788 Oct 12 17:23 run3a_it000_model.star
-rw-r--r-- 1 pavletin pavletich      404 Oct 12 17:23 run3a_it000_sampling.star

Those files were created a little while ago.

goldgury commented 9 years ago

Done.

goldgury commented 9 years ago

I saw it also. It takes a while. Latency? How are hal nodes communicating with each other?

jchodera commented 9 years ago

> I saw it also. It takes a while. Latency? How are hal nodes communicating with each other?

The hal nodes are connected via 10GE ethernet (as documented in the User Guide). Almost none of our codes use tightly-coupled parallelism, which is why we didn't go with a much more expensive faster interconnect, but we haven't had much trouble with the network speed.

You can check the cluster network load, which shows there has been an increase in network activity, but not to the point where it is outrageous.

Since you're running on a single node, all communication is handled within the node, so I doubt latency is a problem here. I am guessing it was simply some preprocessing that was required before the refinement rounds started, which will hopefully be sped up by using more MPI processes once we figure out how to scale you beyond a full node. You should be able to request up to 32 processes for now (the maximum a single node can handle), though the job will have to wait for a whole node to free up, so start times may be longer on the current cluster (but would be reduced if we added more nodes or changed the job assignment algorithm to leave more free nodes).
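
Concretely, the whole-node request is just the following (the mem value is a placeholder; size it from the per-process RAM numbers in the FAQ above):

#PBS -l nodes=1:ppn=32
#PBS -l mem=60gb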

goldgury commented 9 years ago

I will continue tomorrow

tatarsky commented 9 years ago

From the job log here are the resources it used before the scheduler killed it:

resources_used.cput=146:46:36
resources_used.mem=1156432kb
resources_used.vmem=4731500kb
resources_used.walltime=21:06:59

jchodera commented 9 years ago

You can increase that default value by adding -l mem=2gb for 2 GB.

In the submission script, this would be

#PBS -l mem=2gb

You might need to ask for much more than 2 GB, though.

KjongLehmann commented 9 years ago

May I respectfully suggest solving issues like this through interaction with more experienced cluster users, or by revisiting the wiki? Using the issue list for them seems to defeat its purpose, which is to identify actual cluster issues.

jchodera commented 9 years ago

We've been working with @goldgury one-on-one as well, @KjongLehmann.

goldgury commented 9 years ago

May I respectfully point out that the issue was never resolved. I am sorry, but I cannot find the answers anywhere. Queue output:

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
 === RELION MPI setup ===
 + Number of MPI processes             = 48
 + Master  (0) runs on host            = gpu-2-8.local
 + Slave     1 runs on host            = gpu-2-8.local

How many processes are really running?

qstat -n 6121532

mskcc-fe1.local: 
                                                                                  Req'd    Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
6121532.mskcc-fe1.loca  pavletin    batch    Relion            23432     6     48    2gb  24:00:00 R  00:15:25
   gpu-2-8/16,18,21,26-30+gpu-1-17/3,9,11,20,22-23,25-26+gpu-3-9/3-9,15
   +gpu-1-12/0-3,5,8-10,12-14,16-17,25-27+gpu-1-6/10,17-20,23-25

Looks like there are 48 threads. 
Each is using close to 100% of CPU.
21952 pavletin  20   0  371m 8176 5316 R 100.3  0.0  18:10.25 relion_refine_m                                            
21951 pavletin  20   0  371m 8176 5320 R 100.0  0.0  18:10.26 relion_refine_m                                            
21953 pavletin  20   0  371m 8184 5324 R 100.0  0.0  18:10.15 relion_refine_m                                            
21954 pavletin  20   0  371m 8172 5312 R 100.0  0.0  18:09.87 relion_refine_m                                            
21956 pavletin  20   0  371m 8196 5312 R 100.0  0.0  18:09.88 relion_refine_m                                            
21949 pavletin  20   0  371m 8168 5304 R 99.7  0.0  18:10.23 relion_refine_m                                             
21950 pavletin  20   0  371m 8192 5320 R 99.7  0.0  18:10.05 relion_refine_m                                             
21955 pavletin  20   0  371m 8188 5320 R 99.7  0.0  18:10.26 relion_refine_m  

No output has been produced after 20 minutes of running. The same job on my desktop takes 10 minutes to do the first iteration. Please advise.

tatarsky commented 9 years ago

Where are you expecting to see output, so I can look on the node in question?

goldgury commented 9 years ago

The output is written to /cbio/ski/pavletich/home/pavletin/test_onenode/20151002small. There is nothing there from the last run.

tatarsky commented 9 years ago

We show that 6121532 just completed. What does the expected output look like?

goldgury commented 9 years ago

I killed it and moved the Class2D directory where it was supposed to place the output. I submitted 6122593, but it has not started yet.

juanperin commented 9 years ago

Yehuda,

Please let it run to completion. The output may be spooled, so it may not look like it is writing at first. Does anything in the original output location look partially correct or written? I see some items that look fairly recent, which I believe are associated with that job…

Juan

goldgury commented 9 years ago

6122593 status:

showstart 6122593

INFO: cannot determine start time for job 6122593

Is this OK?

goldgury commented 9 years ago

Corrected the error in the csh script. New job: 6122600.

goldgury commented 9 years ago

Here is a sample saba output:

 === RELION MPI setup ===

Please advise

tatarsky commented 9 years ago

How are you passing the machinelist to mpirun? I show you are using:

~/relion-1.4/bin/qsub-cbio_YG.csh

tatarsky commented 9 years ago

Also, I show a mix of /opt/mpich2/gcc/eth/bin and your own OpenMPI build in use. Are you setting your PATH so that they match what you built Relion with?
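
As a sketch of what the mpirun line in the batch script could look like once the paths are consistent (the binary path is a guess based on the script location above, and whether -machinefile is needed depends on how this OpenMPI was built against Torque):

module add openmpi_eth/1.6.3                  # the same MPI module Relion was built with
NP=$(wc -l < $PBS_NODEFILE)                   # one entry per core Torque assigned
mpirun -np $NP -machinefile $PBS_NODEFILE \
    ~/relion-1.4/bin/relion_refine_mpi ...    # remaining Relion options omitted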