3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

RELION behaves as if there is GPU sharing in Refine3D jobs on a Slurm cluster, causing VRAM errors #1074

Open DimitriosBellos opened 5 months ago

DimitriosBellos commented 5 months ago

Dear RELION dev team,

Hi, my name is Dimitrios Bellos and I am a member of the AI & I Core team at the Rosalind Franklin Institute. Our team helps support the Franklin RELION users when they run into issues. Our users typically run RELION on the Baskerville HPC (https://www.baskerville.ac.uk/ , https://docs.baskerville.ac.uk/request-access/ ), taking advantage of the multiple compute nodes housed there, each comprising 4 NVIDIA A100 GPUs (read more about the Baskerville system at https://docs.baskerville.ac.uk/system/).

Recently, during the execution of Refine3D (aka auto-refine) on the Baskerville HPC, our users have noticed their jobs crashing and the VRAM of the GPUs being under-utilised.

As is typical when running RELION under a Slurm scheduler, each MPI process corresponds to one Slurm task.

The users have set each MPI process, aka Slurm task, to have access to 1 GPU (these are different GPUs per task by default). Some of the tasks land on the same compute node, but when this happens they still run independently of each other. E.g. if one task uses the 2nd GPU of a node and another uses the 4th, running the nvidia-smi command inside either task shows only 1 GPU, and that GPU has index 0. This is done to prevent different Slurm jobs/tasks from accidentally sharing the same GPU.

However, it seems that in a Refine3D RELION job, if e.g. 2 tasks land on the same node (same node id), the VRAM gets limited to only half of what is available, because both tasks use the GPU with index 0 (the same index, but they are different GPUs, since within each Slurm task the GPUs are reindexed according to how many GPUs that task can access). If 4 Slurm tasks land on the same node, the accessible VRAM gets limited to one quarter.
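
To illustrate, here is a minimal diagnostic sketch (assuming the same allocation as the run_submit.script further below, i.e. --gpus-per-task=1) that prints what each Slurm task actually sees:

# Each task is expected to report CUDA_VISIBLE_DEVICES=0 and a single GPU,
# even when several tasks of the same job land on the same node.
srun bash -c 'echo "task ${SLURM_PROCID} on $(hostname): CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"; nvidia-smi -L'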

As we discussed this issue with Colin Palmer from CCP-EM (https://www-test.ccpem.ac.uk/about/colin-palmer/): because each MPI process, aka Slurm task, has access to 1 GPU and in every task the index of that GPU is 0, RELION probably thinks there is GPU sharing whenever multiple tasks land on the same node.

Furthermore, following Colin Palmer's advice, we added the Slurm argument --ntasks-per-node=1, forcing the tasks to land on different compute nodes. This allowed each task to use the GPU's VRAM fully, but the RELION job still crashed. Our users were able to fix the issue, but only by adding the --free_gpu_memory 1000 flag. I guess that even when the tasks land on different nodes, RELION still believes there is some GPU sharing (because in every task the NVIDIA index is 0); otherwise the --free_gpu_memory flag would not have proved useful (since it is a flag meant for when GPUs are shared).
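
For clarity, a sketch of the two changes relative to the run_submit.script below (the rest of the relion_refine_mpi command line is unchanged):

#SBATCH --ntasks-per-node=1    # force each task onto a different compute node

srun relion_refine_mpi ... --free_gpu_memory 1000 --gpu ""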

Can you please help us with this issue? What we would like is to check why RELION believes there is GPU sharing, even though it is normal with Slurm schedulers that when multiple Slurm tasks (aka MPI processes) are spawned and each gets 1 GPU, each task's GPU is always a different physical device; and because each task can only access 1 GPU, from the task's point of view that GPU always has index 0.

Below I also include the run_submit.script:

#!/bin/bash
#SBATCH -J Relion
#SBATCH -n 5
#SBATCH -c 36
#SBATCH -e Refine3D/job037/run.err
#SBATCH -o Refine3D/job037/run.out
#SBATCH -q <SECRET>
#SBATCH -t 0-1:00:00
#SBATCH -A <SECRET>
#SBATCH --gpus-per-task=1
#SBATCH --export=NONE
#SBATCH --get-user-env
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=<SECRET>
#SBATCH --ntasks-per-node=1

module purge
module load baskerville
module load RELION/4.0.0-foss-2021a-CUDA-11.3.1

srun echo $CUDA_VISIBLE_DEVICES
srun nvidia-smi -L
srun relion_refine_mpi --o Refine3D/job037/run --auto_refine --split_random_halves --i star/70s_ribodist_frontback.star --ref refs/run_class001.mrc --ini_high 15 --dont_combine_weights_via_disc --pool 16 --pad 2  --ctf --particle_diameter 300 --flatten_solvent --zero_mask --solvent_mask MaskCreate/job009/mask_224px.mrc --solvent_correct_fsc  --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale  --j 36 --gpu ""  --pipeline_control Refine3D/job037/

You may also contact me and/or Colin Palmer with any further questions.


Environment:

Dataset:

Job options:

Error message:

in: /dev/shm/build-branfosj-admin-live/RELION/4.0.0/foss-2021a-CUDA-11.3.1/relion-4.0.0/src/acc/cuda/cuda_fft.h, line 226
ERROR: 

When trying to plan one or more Fourier transforms, it was found that the available
GPU memory was insufficient. Relion attempts to reduce the memory by segmenting
the required number of transformations, but in this case not even a single
transform could fit into memory. Either you are (1) performing very large transforms,
or (2) the GPU had very little available memory.

    (1) may occur during autopicking if the 'shrink' parameter was set to 1. The 
    recommended value is 0 (--shrink 0), which is argued in the RELION-2 paper (eLife).
    This reduces memory requirements proportionally to the low-pass used. 

    (2) may occur if multiple processes were using the same GPU without being aware
    of each other, or if there were too many such processes. Parallel execution of 
    relion binaries ending with _mpi ARE aware, but you may need to reduce the number
    of mpi-ranks to equal the total number of GPUs. If you are running other instances 
    of GPU-accelerated programs (relion or other), these may be competing for space.
    Relion currently reserves all available space during initialization and distributes
    this space across all sub-processes using the available resources. This behaviour 
    can be escaped by the auxiliary flag --free_gpu_memory X [MB]. You can also go 
    further and force use of full dynamic runtime memory allocation, relion can be 
    built with the cmake -DCachedAlloc=OFF

[The same "insufficient GPU memory" error message was printed by each of the other MPI ranks.]

follower 3 encountered error: === Backtrace  ===
/bask/apps/live/EL8-ice/software/RELION/4.0.0-foss-2021a-CUDA-11.3.1/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x63) [0x4c40e3]
/bask/apps/live/EL8-ice/software/RELION/4.0.0-foss-2021a-CUDA-11.3.1/bin/relion_refine_mpi() [0x4a5741]
/bask/apps/live/EL8-ice/software/RELION/4.0.0-foss-2021a-CUDA-11.3.1/bin/relion_refine_mpi() [0x6ddd04]
/bask/apps/live/EL8-ice/software/GCCcore/10.3.0/lib64/libgomp.so.1(+0x1a046) [0x149ab47ce046]
/lib64/libpthread.so.0(+0x81ca) [0x149aa2e391ca]
/lib64/libc.so.6(clone+0x43) [0x149aa220be73]
==================
ERROR: 

[Same "insufficient GPU memory" message as above.]

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 633813.2 ON bask-pg0308u05a CANCELLED AT 2024-01-30T17:17:19 ***
srun: error: bask-pg0308u05a: task 0: Killed
srun: error: bask-pg0309u32a: task 4: Killed
srun: error: bask-pg0308u37a: task 2: Killed
srun: error: bask-pg0308u24a: task 1: Killed
srun: error: bask-pg0309u31a: task 3: Exited with exit code 1
==================
biochem-fan commented 5 months ago

I cannot directly address your question because I am not a SLURM expert, but the following might give some insight.

On our cluster, we use SLURM only to control the visibility of GPUs per job, not per MPI process (i.e. per task). For example, our job template for a "half node" job looks like:

#SBATCH -N 1
#SBATCH --ntasks-per-node=36 # grab half node
#SBATCH --gres=gpu:2 # with 2 (out of 4) GPUs
#SBATCH --mem=256G # and half memory
...

mpirun --oversubscribe -n XXXmpinodesXXX XXXcommandXXX

Users are free to divide 36 cores into e.g. 5 MPI processes x 9 threads, 3 MPI processes x 18 threads etc. Because the first MPI rank of a Refine3D/Class3D job does not perform real computation and does not need a GPU, this is more efficient. --oversubscribe allows this. Because SLURM limits available cores by cgroups, this does not harm other jobs that are running on the same node. Your job script wastes an expensive A100 on the first rank by using --gpus-per-task=1.
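
As a concrete (purely illustrative) instance of the template above, the 36 cores and 2 GPUs could be used as 5 MPI processes x 9 threads:

# Rank 0 acts as the leader and does not use a GPU; the other 4 ranks are
# distributed by RELION across the 2 visible GPUs (indices 0 and 1).
mpirun --oversubscribe -n 5 relion_refine_mpi ... --j 9 --gpu ""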

All RELION processes see two GPUs, 0 and 1 (out of four). They might physically be 0 and 1, or 0 and 3, or whatever, but cgroups renumber the allocated GPUs to 0 and 1 and hide the others.

DimitriosBellos commented 5 months ago

Thank you for your answer.

The issue has nothing to do with the number of MPI processes, the number of threads per MPI process (aka Slurm task), or what their best combination is.

The issue concerns the GPUs. Our users want to use more than 4 GPUs per job (maybe 5, 6, 7, etc.) to accelerate their processing pipelines.

The reason they have set --cpus-per-task (aka -c) to 36 is that they have also set --gpus-per-task=1: each node in the Baskerville HPC has 4 GPUs and 144 cores, so if each task gets its own independent GPU, each task uses one quarter of the GPUs of a node, and one quarter of the CPUs is 36.
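
Restated as the corresponding SBATCH lines (a sketch of the reasoning above):

#SBATCH --gpus-per-task=1    # 1 of the node's 4 GPUs per task
#SBATCH --cpus-per-task=36   # 144 cores / 4 GPUs = 36 cores per task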

Ideally, we would like the tasks to be node-agnostic. E.g. for a job with 8 tasks, 3 may run on the same node, another 2 on a second node, and the last 3 on 3 different nodes (5 nodes in total). This way the job does not have to wait for all 8 GPUs to be freed up on exactly 2 nodes (4+4 GPUs).

Fixing the issue is therefore important. Just because e.g. 3 MPI processes (aka Slurm tasks) are on the same node and each process's respective GPU has index 0, RELION should not deduce that they are using the same GPU. It is simply that Slurm reindexes the NVIDIA GPU indices so that each MPI process (aka Slurm task) can 'see' only 1 GPU, and that GPU has index 0.

I understand that the first MPI process does not do any computation and its GPU is not utilised, but we want to run multi-node jobs, and as far as I know there is no way in Slurm to give different resources to different tasks of the same job, i.e. to have the first task use 0 GPUs while all the rest use 1.

I believe this is an issue with how Refine3D MPI has been implemented, because in principle the first MPI process with rank 0 should also act as a parallel process that does computations; it simply has additional functionality to perform. Typically this is done by guarding the extra work with if rank == 0 blocks in the code, executed only by the first MPI process, while all the computational code is executed by all MPI processes (including rank 0).

biochem-fan commented 5 months ago

I don't recommend running RELION GPU jobs over multiple nodes. This is inefficient and leads to fragmentation of jobs.

If you restrict RELION to a single node, you can allocate one task for RELION, 2 or 4 GPUs for the task and run multiple MPI processes within the one task. Many people use RELION in this way on SLURM.

Our users want to use more than 4 GPUs per job (maybe 5, 6, 7, etc.) to accelerate their processing pipelines.

This is a bad idea. RELION does not scale well beyond 4 GPUs. Actually, an A100 is overkill for RELION: one RELION process does not have enough parallelism to saturate an A100. I recommend running 2 or 3 MPI processes per A100 and 2 (or 3 or 4 if the job is really big) GPUs per job. There are discussions regarding this on the CCPEM mailing list.

I believe this is an issue with how Refine3D MPI has been implemented,

You are correct but changing the current implementation requires huge refactoring and is very unlikely to happen.

colinpalmer commented 5 months ago

To address the specific issue of GPU memory sharing: would it be possible for RELION to use the GPU UUIDs to determine when a GPU is shared between multiple processes, rather than just using the single-digit IDs?
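
The UUIDs are already visible outside RELION; for example, a quick shell check (a sketch, not RELION code) of whether two tasks were really handed the same physical device could be:

# Identical UUIDs reported by two tasks would indicate genuine sharing;
# different UUIDs that both appear as index 0 would not.
srun bash -c 'echo "task ${SLURM_PROCID} on $(hostname): $(nvidia-smi --query-gpu=uuid --format=csv,noheader)"'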

biochem-fan commented 5 months ago

To address the specific issue of GPU memory sharing: would it be possible for RELION to use the GPU UUIDs to determine when a GPU is shared between multiple processes, rather than just using the single-digit IDs?

If someone writes, tests and sends a pull request for this, I am happy to review it.

But in general, I don't like SLURM controlling how resources are allocated to each process ("task") on the same node. It is completely fine to hide and block access to CPU cores, memory and GPUs allocated to other jobs but resources for a job should be shared among all processes of the same job on the same node. In other words, RELION likes one task per node, regardless of the actual number of MPI processes running on the node.

It is fine to use compartmentalization, virtualization, containerization etc but problems caused by these additional layers of abstraction and complication should be dealt with by those layers, not by RELION.

biochem-fan commented 5 months ago

In our cluster above, although we use multiple tasks,

#SBATCH -N 1
#SBATCH --ntasks-per-node=36 # grab half node
#SBATCH --gres=gpu:2 # with 2 (out of 4) GPUs

nvidia-smi within this script shows 2 GPUs and all MPI processes see 2 GPUs. In this case, GPUs are allocated to the entire job, not per task. This is what I want.

DimitriosBellos commented 5 months ago

Our use case is multi-node, since due to the size of the data using more than 4 GPUs at a time is necessary. Regarding your prior reply:

"But in general, I don't like SLURM controlling how resources are allocated to each process ("task") on the same node. It is completely fine to hide and block access to CPU cores, memory and GPUs allocated to other jobs but resources for a job should be shared among all processes of the same job on the same node. In other words, RELION likes one task per node, regardless of the actual number of MPI processes running on the node."

I understand your argument, but unfortunately this is how the Slurm scheduler operates. Slurm tasks run independently of each other and are completely agnostic of whether they land on the same node; even when they do, they remain independent, meaning that one cannot access the GPUs of the other, even if they are tasks spawned by the same job. Slurm schedulers are used on a large number of modern HPCs (ARCHER2, Baskerville, etc.), and thus having RELION run optimally on them would accelerate scientific discoveries and, of course, the acknowledgements and citations of RELION.

I am just mentioning this because if RELION had an option to ignore which node the tasks run on, and instead to function as if all tasks were being processed on different nodes, then this issue would be resolved.

Furthermore, the fact that the first MPI process with rank 0 is not used to run computations is, in my opinion, a very suboptimal choice, though I totally understand the difficulty and time required to refactor the software to resolve it. The reason I believe it is suboptimal is that even on HPCs that only have CPUs, the RAM allocation is expected to be proportional to how much RAM is installed on a node and how many of the node's CPUs are used per task. Because of this, there might be use cases (depending on the dataset and processing options being used) where the number of threads per MPI process/task should be high (to also accommodate a high RAM allocation per task). In these use cases, and of course in use cases where 1 high-performance GPU is allocated per task, the fact that the first MPI process with rank 0 performs no computations leads to under-utilising the compute resources of the HPC(s).

As I already mentioned, I understand the difficulty of refactoring RELION in order to:

However, the points above might be useful to keep in mind, at least when developing future versions of RELION, since they can lead to better and more optimised performance of RELION on HPCs, accelerate scientific discoveries, and increase the number of acknowledgements and citations of the RELION suite.

biochem-fan commented 5 months ago

We do use SLURM ourselves. The difference is that we don't use SLURM's task concept.

I understand that this is not compatible with completely arbitrary node allocation. But in practice we use only 4 GPUs on one node or 2 GPUs on half a node, so this never became a problem. We don't let a job run on 2+1+1 GPUs over 3 nodes, because this is inefficient (i.e. 3x the storage access and synchronisation over the network).

Our use case is multi-node, since due to the size of the data using more than 4 GPUs at a time is necessary.

We did extensive tests on scaling and concluded that going beyond 4 GPUs is a very bad idea. When we have 2 million particles, we split the dataset into smaller chunks (say 0.5 M particles each) and run four independent jobs. Such an "embarrassingly parallel" mode of operation is far more efficient than running one big job on 16 GPUs. Certainly the parallelization efficiency of a single job can be improved, but somebody has to work on this.

RAM allocation is expected to be proportional to how much RAM is installed on a node and how many of the node's CPUs are used per task.

Please look at my example above. We just oversubscribe mpirun, so no CPU or RAM is wasted.

I see your points and welcome pull requests to address these issues, but I'm afraid it is unlikely that I myself can work on them. I just don't have the time and motivation for it; most users (including SLURM users) use RELION in the way I described above.