3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

Setting Number of MPI procs and threads #784

Closed NuwandaCZ closed 3 years ago

NuwandaCZ commented 3 years ago

Hello there,

I've been thrown into the SPA / cryo-EM data-processing world recently, and one thing keeps coming back at me every time I use RELION: computational efficiency. I'm having a hard time finding a good source from which to learn how to set the numbers of MPI processes and threads effectively. I've found some notes and the official RELION docs PDF from May 13th, and even the people I've joined seem to base these parameters on experience and experimentation, which doesn't suit me at the moment. Even more so when I'm switching between computing stations and the hardware varies.

Does anyone have a good source where I could learn how to set these parameters based on data size, RAM, CPUs and GPUs? Or some kind of "rule of thumb" from which I could start optimizing them? Believe me, I'm not glad to be spamming here; I saw it as the last option, and my google-fu has failed me.

biochem-fan commented 3 years ago

Please write your hardware configuration and details of your dataset, because this really depends on them.

For example, there is no point in my advising "use 5 MPI processes with 8 threads each for Refine3D" if you don't have 4 GPUs and 32 cores in the node.

NuwandaCZ commented 3 years ago

> Please write your hardware configuration and details of your dataset, because this really depends on them.

Hi @biochem-fan,

Firstly, I'm sorry for the delay; I didn't have access to the computing stations over the past few days. I was looking for a general strategy and didn't expect such concrete help, which would be great! Let me give the hardware configuration and dataset info for two sample stations. I don't want to copy-paste all the info and spam you, but if I forget to mention something important, let me know.

PC1

- CPU: x86_64, 24 CPU(s), 2 thread(s) per core, 12 core(s) per socket, 1 socket
- GPU: 3x 8 GB (GDDR6), CUDA 10.1
- RAM: 64 GB

PC2

- CPU: x86_64, 64 CPU(s), 2 thread(s) per core, 32 core(s) per socket, 1 socket
- GPU: 4x 24 GB (GDDR6), CUDA 10.1
- RAM: 256 GB

As I mentioned, so far I've been concentrating on SPA, so motion correction, CTF estimation, auto-picking, 3D auto-refinement, polishing etc. (as you probably already know) are my biggest interest right now. I know the processes vary and the MPI (and thread) settings vary with them, but if you could mention which settings you would choose (and a brief note why), that would be... lovely!

biochem-fan commented 3 years ago

For most GPU-accelerated jobs (Refine3D, Class2D, Class3D, MultiBody), use one MPI process per GPU plus one, with 8-10 threads per process. The first MPI rank does not perform any real calculation. For example, for a Class2D job on PC2 I would use 4 + 1 = 5 MPI processes (to use 4 GPUs) with 8 threads each. Since the first MPI rank does not really use the CPU much, this actually uses only 4 * 8 = 32 cores. You have 32 cores left to run other non-GPU-accelerated tasks (e.g. CtfRefine, MotionCorr) simultaneously.

Refine3D (but not the others) needs an odd number of MPI processes: one for the first MPI rank and the same number for each of the two half-sets. Thus, you cannot fully use the 3 GPUs on PC1 for a Refine3D job; you can use only two of them.
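If it helps to see the arithmetic written down, here is a minimal Python sketch (hypothetical, not part of RELION) that just encodes the rule of thumb above for GPU-accelerated jobs, including the odd-MPI-count constraint of Refine3D; the GPU and core counts in the examples are taken from PC1 and PC2 above.

```python
# Hypothetical helper (not part of RELION) that encodes the rule of thumb
# above for GPU-accelerated jobs (Class2D/Class3D/Refine3D/MultiBody).

def suggest_gpu_job(n_gpus, n_logical_cpus, refine3d=False, threads_per_proc=8):
    """Return (n_mpi, n_threads) as a starting point: one worker rank per GPU.

    The first (leader) rank does no real computation, hence the "+ 1".
    Refine3D needs equal numbers of worker ranks for the two half-sets,
    i.e. an even worker count and an odd total MPI count, so an odd
    number of GPUs cannot be fully used.
    """
    workers = n_gpus
    if refine3d and workers % 2 == 1:
        # Refine3D needs an even worker count: drop a GPU (or, with a
        # single GPU, put both half-set ranks on it).
        workers = workers - 1 if workers > 1 else 2
    n_mpi = workers + 1                  # + 1 for the leader rank
    # Only the worker ranks use the CPU heavily:
    n_threads = min(threads_per_proc, max(1, n_logical_cpus // workers))
    return n_mpi, n_threads

# PC2 (4 GPUs, 64 logical CPUs), Class2D: 5 MPI processes, 8 threads each
print(suggest_gpu_job(4, 64))                   # -> (5, 8)
# PC1 (3 GPUs, 24 logical CPUs), Refine3D: 3 MPI processes (only 2 GPUs used)
print(suggest_gpu_job(3, 24, refine3d=True))    # -> (3, 8)
```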

You can put more than one MPI process on a GPU if the GPU memory is sufficiently large. If GPU utilization (you can check it with nvidia-smi while you are in the E step) remains constantly low (< 70 %), you might want to try running 2 MPI processes per GPU. But the 8 GB of GPU memory on PC1 will probably allow this only for Class2D, so you cannot run 3 * 2 + 1 = 7 MPI processes on PC1 to fully use its 3 GPUs in a Refine3D job.

For most CPU-only jobs (RELION's own motion correction, CtfRefine, Polish), I would use 4 or 6 threads per MPI process and increase the number of MPI processes to use all available CPU cores (e.g. 6 MPI processes with 4 threads each = 24 cores in total; note that the first MPI rank of these jobs performs actual work, unlike Refine3D/Class2D/Class3D/MultiBody). Unfortunately, PC1 has only 64 GB of system memory, so this might run out of memory. In that case, reduce the number of MPI processes and assign more threads per process (e.g. 3 MPI processes with 8 threads each). The memory usage is roughly proportional to the number of MPI processes, not the total number of threads.
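A similar sketch for the CPU-only jobs, assuming a made-up per-process memory figure; the real number depends on your data (box size, number of movie frames), so measure it rather than trusting the placeholder.

```python
# Hypothetical helper (not part of RELION) for CPU-only jobs
# (RELION's own motion correction, CtfRefine, Polish), where memory use
# grows with the number of MPI processes rather than the total thread count.

def suggest_cpu_job(n_logical_cpus, ram_gb, ram_per_proc_gb, threads_per_proc=4):
    """Return (n_mpi, n_threads) that fills the cores without exceeding RAM.

    ram_per_proc_gb is dataset-dependent; measure it on your own data.
    """
    n_mpi = max(1, n_logical_cpus // threads_per_proc)   # fill all cores first
    while n_mpi > 1 and n_mpi * ram_per_proc_gb > ram_gb:
        n_mpi -= 1                        # fewer processes to fit in RAM ...
    n_threads = max(1, n_logical_cpus // n_mpi)           # ... more threads each
    return n_mpi, n_threads

# PC1 (24 logical CPUs, 64 GB RAM), assuming ~15 GB per process (made up):
print(suggest_cpu_job(24, 64, 15))        # -> (4, 6)
```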

Some jobs (CTFFind, Extract, AutoPick) do not use threading. Use one MPI process per CPU (or per GPU for AutoPick). Here, too, the first MPI rank does real work.

Also note that you need very fast storage to fully utilize your CPUs and GPUs. If access to your movies is slow (e.g. a 1 Gbps connection to the file server or a non-RAID disk), that can severely limit your processing speed.

NuwandaCZ commented 3 years ago

This is exactly what I was looking for! Thank you so much for this guide and for providing such fast responses. This will be a big help. I'm closing this issue 👍

CharlesCongdon commented 3 years ago

FYI, you'll also find recommendations here: https://www3.mrc-lmb.cam.ac.uk/relion/index.php/Benchmarks_%26_computer_hardware. We have also observed that refine performance can vary a lot depending on the numbers of ranks and threads and the numbers of CPU cores and GPUs you have, sometimes by a large amount. Often the golden configuration is only found by experimentation. The recommendations also differ depending on whether or not you are on a system with GPUs.

We have also seen good speedups on the GPU if you use NVIDIA MPS (see https://docs.nvidia.com/deploy/mps/index.html and https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf).