KryoEM / AAProject

Accelerate and automate the RELION 2.0 workflow

adapt qsub to run plain mpirun #42

Closed kuybeda closed 7 years ago

kuybeda commented 7 years ago

Create a qsub script that invokes mpirun directly instead of calling the cluster resource manager. This task is complete when the following version of the command runs by pressing the "Run" button in the RELION GUI:

mpirun -n 16 -hostfile ./motionhost `which relion_run_motioncorr_mpi` --i Import/job001/movies.star --o MotionCorr/job003/ --save_movies --first_frame_sum 1 --last_frame_sum 16 --use_unblur --j 1 --unblur_exe /jasper/relion/Unblur/unblur_1.0.2/bin/unblur_openmp_7_17_15.exe --summovie_exe /jasper/relion/Summovie/summovie_1.0.2/bin/sum_movie_openmp_7_17_15.exe --angpix 1.77
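For reference, a rough sketch of what such a qsub-replacement script could look like. The XXX...XXX placeholders follow my understanding of RELION's submission-script template convention and are assumptions, as is the assumption that XXXcommandXXX expands to the relion_*_mpi executable plus its arguments without an mpirun prefix; ./motionhost is the static hostfile from the command above.

#!/bin/bash
# Hypothetical submission-script template that skips the scheduler entirely.
# Assumes RELION substitutes XXXmpinodesXXX, XXXcommandXXX, XXXoutfileXXX and
# XXXerrfileXXX before the script is called, and that XXXcommandXXX is the
# relion_*_mpi command plus arguments (no mpirun prefix).
mpirun -n XXXmpinodesXXX -hostfile ./motionhost XXXcommandXXX \
    > XXXoutfileXXX 2> XXXerrfileXXX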

arestifo commented 7 years ago

Does this involve setting up a batch management system? Or should it just call qsub, which runs mpirun?

kuybeda commented 7 years ago

I think we should start with the latter, i.e., without a resource scheduler. The idea is to evolve the RELION GUI environment to work for us on our systems as fast as possible.

I just added Python support that corrects movie motion starting from the compressed movie files we have by default; this is also different and needs to be integrated into the GUI.

Let's talk about that when we meet.

arestifo commented 7 years ago

Update: I am not able to run the command from the original post because I do not have access to the data it references. When I run the following 3D classification command (through the "Run" button in the RELION GUI):

mpirun -n 25 `which relion_refine_mpi` --o Class3D/job008/run --i Particles/shiny_2sets.star --ref emd_2660.map:mrc --firstiter_cc --ini_high 60 --dont_combine_weights_via_disc --scratch_dir /scratch --pool 100 --ctf --ctf_corrected_ref --iter 25 --tau2_fudge 4 --particle_diameter 360 --K 6 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 --sym C1 --norm --scale --j 4 --gpu "0:1:2:3:4:5:6:7" --random_seed 0

It runs with full GPU acceleration across all nodes. The 25 processes correspond to the 24 GPUs being requested plus one master process. The GPU usage on three machines (8 GPUs each) stays at a constant ~60-70% (see the nvidia-smi logs: log 1 https://github.com/KryoEM/AAProject/files/682589/nvidia-smi-1.txt and log 2 https://github.com/KryoEM/AAProject/files/682588/nvidia-smi-2.txt). The RELION arguments are as follows:

Compute tab: [image: relion_args_compute] https://cloud.githubusercontent.com/assets/3360300/21613595/df99eed0-d1a3-11e6-81b0-cc5e4620e731.png

Running tab: [image: relion_args_run] https://cloud.githubusercontent.com/assets/3360300/21613613/f48a27ba-d1a3-11e6-983f-a72536ddd7d0.png

I have accomplished the resource distribution by editing the default hostfile located at /etc/openmpi/openmpi-default-hostfile on all the nodes. Each node has 8 slots, one per GPU (except h1018, which has 9 so it can also function as the master node). When mpirun is called with, for example, mpirun -np 25, it automatically distributes the processes among the worker nodes, with each worker node receiving at most 8 processes.
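For reference, the default hostfile on each node looks roughly like the sketch below (the hostnames other than h1018 are placeholders; slot counts follow the one-slot-per-GPU rule described above, with the extra slot on h1018 for the master rank):

# /etc/openmpi/openmpi-default-hostfile (sketch)
# One slot per GPU on each worker; h1018 gets one extra slot for the master.
h1018 slots=9
node02 slots=8
node03 slots=8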

However, this solution has a few caveats:

  • Sometimes the 3D classification completely stops at a random iteration. This is the same issue I was having when we met two weeks ago. I am still not sure why this is happening, but I am investigating.
  • This solution does not allow switching environments at run time the way a batch scheduler would. Without a batch server/scheduler, I am not sure how we would accomplish this, but I am doing more research on this question.

Tell me what you think of this method, and whether there is anything I can change to improve it or fix the problems I mentioned.

kuybeda commented 7 years ago

I am not sure exactly what problem, or set of problems, we are trying to resolve. Could you please summarize them as bullet points?

1) Generally, any system we will use is going to be composed of a number of identical nodes, so if MPI works with mpirun -np 25 now, the only thing to change would be the number of ranks.

2) With the Open MPI interface, you can specify a custom host file using the -hostfile argument (see https://docs.google.com/document/d/1AnYH_BgtgxXu6JBnvjn_lRJNMk4ByngwKt31JAFnqV0/edit#heading=h.2cvu2fe8uckg for an example). This could become part of a GUI-based or automated query-based setup: for example, you could query each machine for how many GPUs it has and allocate slots according to the reported number of GPUs (see the sketch below).
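As a rough illustration of that query-based setup (the node names, the output path, and the extra master slot on h1018 are assumptions; nvidia-smi -L prints one line per GPU):

#!/bin/bash
# Hypothetical helper: build an Open MPI hostfile by asking each node how
# many GPUs it has. Node names and the output path are placeholders.
set -euo pipefail

NODES="h1018 node02 node03"   # placeholder worker list
OUT=hostfile.gpus             # generated hostfile

: > "$OUT"
for node in $NODES; do
    # count GPUs on the remote node: nvidia-smi -L prints one line per GPU
    ngpu=$(ssh "$node" nvidia-smi -L | wc -l)
    slots=$ngpu
    if [ "$node" = "h1018" ]; then
        slots=$((ngpu + 1))   # extra slot for the RELION master rank
    fi
    echo "$node slots=$slots" >> "$OUT"
done

cat "$OUT"

The resulting file could then be passed to mpirun with -hostfile, or copied over the default hostfile.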

Please let me know if that helps.

Thanks, Oleg.

arestifo commented 7 years ago

The main problem with this solution is that the environment of the worker machines is not changeable at run time. For example, on biowulf you can run module load x to load the environment for a given program, and that environment is then the same on all the worker nodes you allocate. With our current solution this is not possible: the environment you get is whatever is already set up on the machines, and it cannot be changed at run time. We talked about this issue at our meeting two weeks ago, and you mentioned that it would be nice to have this kind of switchable environment.
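One possible workaround, sketched below, is to launch each rank through a small wrapper that loads the desired environment before exec'ing the real binary; the sourced file name and path are placeholders for whatever "module load"-style setup we want on the workers, and the wrapper assumes a shared filesystem so it is visible on every node. Open MPI's -x VAR flag, which forwards an environment variable from the launching shell to all ranks, is another partial option.

#!/bin/bash
# env_wrapper.sh -- hypothetical per-rank wrapper. Every MPI rank starts here,
# loads the environment we want on the workers, then replaces itself with the
# real command passed as arguments.
source /jasper/relion/env/relion-2.0.sh   # placeholder environment file
exec "$@"

Usage would then look something like: mpirun -n 25 -hostfile hostfile.gpus ./env_wrapper.sh relion_refine_mpi --o Class3D/job008/run ...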

For point #1 in your last comment, I am not sure what you mean by changing the number of ranks. Currently I am able to utilize all the GPUs in our machines by using mpirun -np 33.

For point #2, I am using that solution, except that the hostfile is located at /etc/openmpi/openmpi-default-hostfile, which eliminates the need to pass --hostfile <hostfile> as an argument to mpirun. The only issue right now is that the default hostfile is not updated when the number of machines changes. Should I automate this process in one of our current playbooks?
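If it helps, the update step itself is small enough to script directly (node names and paths below are again placeholders, and passwordless ssh plus sudo rights are assumed); in practice it would presumably become a task in one of the existing playbooks rather than a standalone script:

#!/bin/bash
# Hypothetical update step: push a freshly generated hostfile to the Open MPI
# default location on every node.
set -euo pipefail

NODES="h1018 node02 node03"                  # placeholder node list
SRC=hostfile.gpus                            # e.g. generated from a GPU query
DEST=/etc/openmpi/openmpi-default-hostfile

for node in $NODES; do
    scp "$SRC" "$node:/tmp/openmpi-default-hostfile"
    ssh "$node" "sudo mv /tmp/openmpi-default-hostfile $DEST"
done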

Alex

kuybeda commented 7 years ago

This can be done trivially by unchecking the queuing option.