Closed: oleuns closed this issue 9 months ago.
This happens when you have some really bad particles which are hard to align. There are many discussions about this on the CCPEM mailing list, for example: https://www.jiscmail.ac.uk/cgi-bin/wa-jisc.exe?A2=ind2212&L=CCPEM&P=R43497&K=2 and https://www.jiscmail.ac.uk/cgi-bin/wa-jisc.exe?A2=ind1610&L=CCPEM&P=R52316.
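One quick way to check whether this applies to a given run is to inspect the _rlnNrOfSignificantSamples column of the latest _data.star file, as the error message below also suggests. A minimal sketch, assuming the third-party starfile package is installed (pip install starfile) and the file has the usual optics/particles block layout:

```python
# check_sigsamples.py: count particles whose orientational search did not
# narrow down (large _rlnNrOfSignificantSamples), as described in the RELION
# out-of-memory message. Assumes the third-party 'starfile' package; the
# threshold of 10000 is the value quoted in the error message.
import sys
import starfile

star = starfile.read(sys.argv[1])          # e.g. run_it015_data.star
particles = star["particles"] if isinstance(star, dict) else star

col = "rlnNrOfSignificantSamples"
if col not in particles.columns:
    sys.exit(f"{col} not found in {sys.argv[1]}")

n_bad = int((particles[col] > 10000).sum())
print(f"{n_bad} of {len(particles)} particles have {col} > 10000")
print(particles[col].describe())
```

If a large fraction of particles exceed that threshold, cleaning up the particle set or revisiting the reference is usually more productive than only capping memory with --maxsig.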
Describe your problem
Hi, I encountered a CUDA memory allocation error during both 3D classification and 3D auto-refinement, with similar error messages in each case. In my hands the error occurs occasionally, at a random iteration of the run. I have seen it on two different workstations with two different particle sets.
Environment:
Dataset:
Job options (see note.txt in the job directory):

Error message:
RELION version: 5.0-beta-0-commit-90d239
exiting with an error ...
hwloc/linux: Ignoring PCI device with non-16bit domain. Pass --enable-32bits-pci-domain to configure to support such devices (warning: it would break the library ABI, don't enable unless really needed).
in: /home/supervisor/relion/src/acc/cuda/custom_allocator.cuh, line 539
ERROR:
You ran out of memory on the GPU(s).
Each MPI-rank running on a GPU increases the use of GPU-memory. Relion tries to distribute load over multiple GPUs to increase performance, but doing this in a general and memory-efficient way is difficult.
Check the device-mapping presented at the beginning of each run, and be particularly wary of 'device X is split between N followers', which will result in a higher memory cost on GPU X. In classifications, GPU-sharing between MPI-ranks is typically fine, whereas it will usually cause out-of-memory during the last iteration of high-resolution refinement.
If you are not GPU-sharing across MPI-follower ranks, then you might be using a too-big box-size for the GPU memory. Currently, N-pixel particle images will require roughly

    (1.1e-8) * (N*2)^3 GB

of memory (per rank) during the final iteration of refinement (using single-precision GPU code, which is default). 450-pixel images can therefore just about fit into a GPU with 8GB of memory, since 1.1e-8 * (450*2)^3 ~= 8.02. During classifications, resolution is typically lower and N is suitably reduced, which means that memory use is much lower.
If the above estimation fits onto (all of) your GPU(s), you may have a very large number of orientations which are found as possible during the expectation step, which results in large arrays being needed on the GPU. If this is the case, you should find large (>10'000) values of '_rlnNrOfSignificantSamples' in your _data.star output files. You can try adding the --maxsig P flag, where P is an integer limit, but you should probably also consult expertise or re-evaluate your data and/or input reference. Seeing large such values means relion is finding nothing to align.
If none of the above applies, please report the error to the relion developers at github.com/3dem/relion/issues.
in: /home/supervisor/relion/src/acc/cuda/custom_allocator.cuh, line 539
ERROR:
[the same out-of-memory message as above is printed again]
follower 2 encountered error: === Backtrace ===
/home/supervisor/relion/build/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x7d) [0x55584e7c528d]
/home/supervisor/relion/build/bin/relion_refine_mpi(+0xe4ec4) [0x55584e79bec4]
/home/supervisor/relion/build/bin/relion_refine_mpi(+0x3770cd) [0x55584ea2e0cd]
/lib/x86_64-linux-gnu/libgomp.so.1(+0x1dc0e) [0x7f1326665c0e]
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f1325894ac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f1325926850]
ERROR:
[the same out-of-memory message as above is printed again]
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.
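For reference, the per-rank memory estimate quoted in the error message is easy to check with a few lines of Python. This is only a sketch of the (1.1e-8)*(N*2)^3 GB rule of thumb from the message above; the real footprint also depends on how many MPI ranks share a GPU and on the resolution reached.

```python
# Rough per-rank GPU memory estimate for the final refinement iteration,
# using the rule of thumb quoted in the RELION error message:
#   memory_GB ~= 1.1e-8 * (2 * N)^3   for an N-pixel box, single precision.
def refine_gpu_gb(box_size_px: int) -> float:
    return 1.1e-8 * (2 * box_size_px) ** 3

for n in (256, 360, 450, 512):
    print(f"box {n:4d} px  ->  ~{refine_gpu_gb(n):.1f} GB per rank")
# box 450 px -> ~8.0 GB, matching the example in the error message.
```

Under this rule of thumb, two MPI followers sharing one GPU roughly double the requirement on that card, which is consistent with the advice about device-sharing in the message.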
I accidentally deleted the job folders for the 3D classification, so I cannot post those job options right now. The next time I see this error there, I will post them too.
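When the job folders are available, the device mapping printed at the start of the run output is also worth checking for the 'device X is split between N followers' situation mentioned in the error. A minimal sketch for pulling those lines out of a log; the exact wording searched for is an assumption taken from the error text above and may differ between RELION versions:

```python
# Scan a RELION stdout log for device-mapping lines. The phrases searched for
# ("split between", "mapped to device") are assumptions based on the error
# message above, not a guaranteed match for every RELION version.
import sys

log_path = sys.argv[1]                      # e.g. Refine3D/job0XX/run.out
with open(log_path) as f:
    for line in f:
        if "split between" in line or "mapped to device" in line:
            print(line.rstrip())
```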