Closed: goldgury closed this issue 9 years ago.
What class of assistance are you asking for here? Building of software on the Hal cluster is normally done by the groups involved. Are you missing some dependency or system level integration?
I am new to the cluster. Will try to install it myself. I might need extra permissions.
What do you mean by "extra permissions"?
The default install is done by root. I will try to get around it by changing the target directories. If not, I will ask for help.
We do not provide users root. What group are you in? Most project software is installed in directories shared by the group; your homedir tree should have such a group-writable directory above your homedir. If you need a module to make it easier to set paths and other environment variables, let me know. We are happy to make those for project software to make things easier for a group.
I am not sure what group I am in. I was not expecting to get root. I will let you know if I need soft links, etc.
I am happy to review what you are doing, to prevent a need for non-standard soft links. It looks from a glance at the install document that a module would be helpful. When you get ready to build, perhaps we can chat in some more real-time format so I can provide some advice on making it easier.
I'm going to note a few things for you to review before starting.
There are several MPI libraries on this cluster, all of which were installed by previous administrators. I have never been fully clear on which is preferred, but perhaps some other users of Hal might give you advice.
There are the RPM-installed openmpi libraries, which I believe have a helper module called "openmpi-x86_64" to assist with environment variables:
module add openmpi-x86_64
There is an additional module-loaded openmpi library whose purpose I am unclear on:
module add openmpi_eth
And then there is also mpich, in both RPM and module form.
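The module system itself can list everything that is installed; this is standard environment-modules usage, nothing Hal-specific:

# "module avail" prints to stderr, hence the redirect
module avail 2>&1 | grep -i mpi
# "module list" shows what is loaded in the current shell
module list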
This will influence a decision you make early in the build process: which MPI to try. I believe I agree with their suggestion to use openmpi.
I will as a test do a build using that module to see if you are going to hit dependency issues.
You probably already are aware of this, but you need to compile with the same version of MPI that you intend to launch with via mpirun. That is why I am warning you ahead of time about the many MPI copies out there....
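A quick sanity check after loading a module is to confirm that the compiler wrapper and the launcher resolve into the same MPI installation (standard commands, nothing Relion-specific):

module add openmpi_eth/1.6.3
# both of these should point into the same installation prefix
which mpicc
which mpirun
# most MPI builds will also report their version directly
mpirun --version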
Just as an example I built the code with:
module add openmpi_eth/1.6.3
And, because the system already has a version of the FFTW library, I changed BUILD_FFTW=false.
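In shell terms, that test build was roughly the following. This is a sketch only; it assumes the BUILD_FFTW flag lives in the tarball's INSTALL.sh, and the tarball name is a placeholder:

module add openmpi_eth/1.6.3
tar zxf relion-1.4.tar.gz && cd relion-1.4
# flip the flag so the build uses the system FFTW instead of building its own
sed -i 's/^BUILD_FFTW=.*/BUILD_FFTW=false/' INSTALL.sh
./INSTALL.sh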
Does that code then work? I do not know ;)
If you want a module's path and settings made permanent in your login, use the syntax:
module initadd openmpi_eth/1.6.3
But I just built it in a test area to confirm it got the needed dependencies....hope that is helpful.
And when you get to this part of the install document:
http://www2.mrc-lmb.cam.ac.uk/relion/index.php/Download_%26_install
" Edit the environment set-up"
Instead of adding to your .bashrc, you might want a quick module called relion, which I am happy to make, that captures all those settings in a way anyone needing the code can use. Again, optional, but I am happy to make one.
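For reference, that step boils down to a couple of path exports; a minimal sketch, where the install prefix is a guess you would replace with your actual build location:

# assumed install location -- substitute your own
RELION_HOME=$HOME/relion-1.4
export PATH=$RELION_HOME/bin:$PATH
export LD_LIBRARY_PATH=$RELION_HOME/lib:$LD_LIBRARY_PATH

A module would wrap exactly these settings so each user does not have to maintain their own copy.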
You appear to be running your code on the head node.
What do I need to do?
Normally the head node is where people submit Torque/Moab jobs. I assume your code can be submitted as a job. That would then place it properly on the nodes with the needed memory and cpu requirements.
I stopped it. How do I submit it as a job?
Yep. Consult the user guide and if you have questions we will attempt to answer them.
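As a minimal illustration (the resource numbers below are placeholders, not recommendations), a Torque submission script looks something like:

#!/bin/bash
#PBS -N Relion
#PBS -l nodes=1:ppn=8
#PBS -l walltime=24:00:00
cd $PBS_O_WORKDIR
# ...your relion command here...

You would submit it with qsub and monitor it with qstat -u $USER.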
You might want to start with trying this in an interactive job out on a node to determine the resources you are going to need:
https://github.com/cBio/cbio-cluster/wiki/MSKCC-cBio-Cluster-User-Guide#interactive-jobs
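The short version of that wiki section: qsub -I gives you an interactive shell on a compute node, for example:

# one node, 8 processors, 4 hours -- adjust to what you actually need
qsub -I -l nodes=1:ppn=8,walltime=04:00:00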
@goldgury: I am happy to spend a few minutes helping you get running through the batch queue today if you are still having trouble this afternoon! Are you in RRL?
Yes rrl214, ext 3944
I could drop by your office at 130P. Could that work?
Great, thank you
Quick update: I spent an hour with Yehuda (@goldgury) debugging a Torque batch script, and we figured out that Relion seems to work when run on the same node (e.g. with nodes=1:ppn=8) but not when MPI processes are scattered across different nodes (e.g. nodes=8,tpn=1). I'm looking into Relion docs to see if there is some other flag we need to feed it to get it to be happy scaling across multiple nodes, since they would like to parallelize to the 100-thread level if possible.
The Relion FAQ makes reference to a hybrid parallelization model where it uses a combination of MPI (to parallelize across nodes) and threads (to parallelize within nodes):
I am buying a new cluster, what do you recommend to run RELION on? This will of course depend on how much money you are willing to spend, and what kind of jobs you are planning to run. RELION is memory-intensive. Fortunately, it's hybrid-parallelisation allows to make use of modern clusters that consist of many multi-core nodes. In this set-up, MPI-parallelisation provides scalability across the many nodes, while pthreads allow to share the memory available on each of the nodes without leaving its multiple cores idle. Therefore, as long as each node has in total sufficient memory, one can always run multiple threads (and only one or a few MPI job) on each node. Therefore, RAM/node is probably a more important feature than RAM/core. The bigger the size of the boxed particles, the higher the RAM usage. For our high-resolution ribosome refinements (in boxes of ~400x400 pixels) we use somewhere between 15-25Gb of RAM per MPI process (the most expensive part in terms of RAM is the last iteration, which is done at the full image scale). We have 12-cores with 60Gb of RAM in total, so can run 2 MPI processes on each node. If you're planning to do atomic-resolution structures I wouldn't recommend buying anything that has less than 32Gb per node. Having 64Gb or more will probably keep your cluster up-to-date for longer. Then how many of those nodes you buy will probably depend on your budget (and possibly cooling limitations). We do 3.x Angstrom ribosome reconstructions from say 100-200 thousand particles in approximately two weeks using around 200-300 cores in parallel. Using more nodes in parallel (e.g. 1,000) may cause serious scalability issues.
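Translated into a Torque script, that hybrid layout would look roughly like the sketch below. This assumes Relion's --j flag sets the threads per MPI process and that the OpenMPI build is Torque-aware (otherwise add -machinefile $PBS_NODEFILE); the input/output names are placeholders:

#PBS -l nodes=4:ppn=8
cd $PBS_O_WORKDIR
# 4 MPI processes, one per node, each running 8 threads
mpirun -np 4 --npernode 1 relion_refine_mpi --j 8 --i particles.star --o Class2D/run1
# (remaining relion_refine flags as in your existing runs)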
@tatarsky has found this thread in Japanese that seems to explain how to properly adapt Relion to running in a Torque environment across multiple nodes. It may take a bit of tinkering to get this to work.
@goldgury: Can you chmod -R a+r one of the example data directories so we can test, and provide an example command-line to test with?
Did it to test_onenode.
Did the test with one node seem to work out OK?
It looks as if it is running, but the time is 00:00.
The time started moving, but there is no output.
Could you also chmod a+r ~/Relion.* so we can take a peek at the stdout?
I see output in the Class2D/ subdirectory:
ls -ltr /cbio/ski/pavletich/home/pavletin/test_onenode/20151002small/Class2D/
...
-rw-r--r-- 1 pavletin pavletich 13883948 Oct 12 17:22 run3a_it000_data.star
-rw-r--r-- 1 pavletin pavletich 3809 Oct 12 17:23 run3a_it000_optimiser.star
-rw-r--r-- 1 pavletin pavletich 6772224 Oct 12 17:23 run3a_it000_classes.mrcs
-rw-r--r-- 1 pavletin pavletich 4543788 Oct 12 17:23 run3a_it000_model.star
-rw-r--r-- 1 pavletin pavletich 404 Oct 12 17:23 run3a_it000_sampling.star
Those files were created a little while ago.
Done.
I saw it also. It takes a while. Latency? How are hal nodes communicating with each other?
The hal nodes are connected via 10GE ethernet (as documented in the User Guide). Almost none of our codes use tightly-coupled parallelism, which is why we didn't go with a much more expensive faster interconnect, but we haven't had much trouble with the network speed.
You can check the cluster network load, which shows there has been an increase in network activity, but not to the point where it is outrageous.
Since you're running on a single node, all communication is handled within-node, so I doubt latency is at all a problem here. I am guessing it was simply some preprocessing that was required before the refinement rounds started, which will hopefully be sped up by using more MPI processes once we figure out how to scale you beyond a full node. You should be able to request up to 32 processes for now (the maximum a single node can handle), though the job will have to wait for a whole node to free up, so job start times may be longer on the current cluster (but would be reduced if we added more nodes or changed the job assignment algorithm to leave more free nodes).
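For the single-node case, that request is simply:

#PBS -l nodes=1:ppn=32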
I will continue tomorrow
From the job log, here are the resources it used before the scheduler killed it:
resources_used.cput=146:46:36
resources_used.mem=1156432kb
resources_used.vmem=4731500kb
resources_used.walltime=21:06:59
You can increase that default value by adding -l mem=2gb for 2gb. In the submission script, this would be:
#PBS -l mem=2gb
You might need to ask for much more than 2gb, though.
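One way to gauge how much to ask for is to let a job run for a while and check what it has actually consumed; standard Torque commands, with the job ID below as a placeholder:

# show the resources_used fields for a running or recently finished job
qstat -f 6121532 | grep resources_used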
May I respectfully suggest solving issues like this through interaction with more experienced cluster users, or by revisiting the wiki? Otherwise it seems to defeat the purpose of this list, which is to actually identify issues.
We've been working with @goldgury one-on-one as well, @KjongLehmann.
May I respectfully point out that the issue was never resolved. I am sorry, but I cannot find the answers anywhere. Queue output:
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
=== RELION MPI setup ===
+ Number of MPI processes = 48
+ Master (0) runs on host = gpu-2-8.local
+ Slave 1 runs on host = gpu-2-8.local
How many processes are really running?
qstat -n 6121532
mskcc-fe1.local:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
6121532.mskcc-fe1.loca pavletin batch Relion 23432 6 48 2gb 24:00:00 R 00:15:25
gpu-2-8/16,18,21,26-30+gpu-1-17/3,9,11,20,22-23,25-26+gpu-3-9/3-9,15
+gpu-1-12/0-3,5,8-10,12-14,16-17,25-27+gpu-1-6/10,17-20,23-25
Looks like there are 48 threads.
Each is using close to 100% of CPU.
21952 pavletin 20 0 371m 8176 5316 R 100.3 0.0 18:10.25 relion_refine_m
21951 pavletin 20 0 371m 8176 5320 R 100.0 0.0 18:10.26 relion_refine_m
21953 pavletin 20 0 371m 8184 5324 R 100.0 0.0 18:10.15 relion_refine_m
21954 pavletin 20 0 371m 8172 5312 R 100.0 0.0 18:09.87 relion_refine_m
21956 pavletin 20 0 371m 8196 5312 R 100.0 0.0 18:09.88 relion_refine_m
21949 pavletin 20 0 371m 8168 5304 R 99.7 0.0 18:10.23 relion_refine_m
21950 pavletin 20 0 371m 8192 5320 R 99.7 0.0 18:10.05 relion_refine_m
21955 pavletin 20 0 371m 8188 5320 R 99.7 0.0 18:10.26 relion_refine_m
There is no output produced after 20 minutes of running. The same job on my desktop takes 10 minutes to do the first iteration. Please advise.
Where are you expecting to see output, so I can look on the node in question?
The output is written to /cbio/ski/pavletich/home/pavletin/test_onenode/20151002small. There is nothing there from the last run.
We show 6121532 just completed. What does the expected output look like?
I killed it and moved the Class2D directory where it was supposed to place the output. I submitted 6122593, but it has not started yet.
Yehuda,
Please let it run to completion. The output may be spooled, so it may not look like it is writing at first. Does anything in the original output location look like it is partially correct or written? I see some items that look fairly recent that I believe are associated with that job…
Juan
6122593 status:
showstart 6122593
INFO: cannot determine start time for job 6122593
Is this OK?
Corrected the error in csh. New job # 6122600
Here is a sample saba output:
=== RELION MPI setup ===
On hal, I get:
cat Relion.o6122600
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
=== RELION MPI setup ===
Please advise
How are you passing the machinelist to mpirun? I show you are using:
~/relion-1.4/bin/qsub-cbio_YG.csh
Also, I show a mix of using /opt/mpich2/gcc/eth/bin and your own OpenMPI build. Are you setting your PATH so they match what you built Relion with?
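For reference, the usual Torque pattern is to hand mpirun the node list the scheduler provides, using the same mpirun that Relion was built with; a sketch, assuming an OpenMPI-style launcher:

# Torque lists the allocated hosts, one line per slot, in $PBS_NODEFILE
mpirun -np $(wc -l < $PBS_NODEFILE) -machinefile $PBS_NODEFILE relion_refine_mpi ...
# (replace ... with the actual relion_refine flags from your run)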
Hello, we need to install Relion
http://www2.mrc-lmb.cam.ac.uk/groups/scheres/relion13_tutorial.pdf
on the Hal cluster. Assistance will be much appreciated.
Regards,
Yehuda