Closed: oleuns closed this issue 9 months ago.
This happens when you have some really bad particles which are hard to align. There are many discussions about this on the CCPEM mailing list, for example: https://www.jiscmail.ac.uk/cgi-bin/wa-jisc.exe?A2=ind2212&L=CCPEM&P=R43497&K=2 and https://www.jiscmail.ac.uk/cgi-bin/wa-jisc.exe?A2=ind1610&L=CCPEM&P=R52316.
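One quick way to check whether this applies to a given run is to inspect the _rlnNrOfSignificantSamples column of the latest _data.star file, as the error message below also suggests. A minimal sketch, assuming the third-party starfile package is installed (pip install starfile) and the file has the usual optics/particles block layout:

```python
# check_sigsamples.py: count particles whose orientational search did not
# narrow down (large _rlnNrOfSignificantSamples), as described in the RELION
# out-of-memory message. Assumes the third-party 'starfile' package; the
# threshold of 10000 is the value quoted in the error message.
import sys
import starfile

star = starfile.read(sys.argv[1])          # e.g. run_it015_data.star
particles = star["particles"] if isinstance(star, dict) else star

col = "rlnNrOfSignificantSamples"
if col not in particles.columns:
    sys.exit(f"{col} not found in {sys.argv[1]}")

n_bad = int((particles[col] > 10000).sum())
print(f"{n_bad} of {len(particles)} particles have {col} > 10000")
print(particles[col].describe())
```

If a large fraction of particles exceed that threshold, cleaning up the particle set or revisiting the reference is usually more productive than only capping memory with --maxsig.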
Describe your problem
Hi, I encountered a CUDA memory allocation error during both 3D classification and 3D auto-refinement, with similar error messages in each case. In my hands the error occurs occasionally, at a random iteration of the run. I have seen it on two different workstations with two different particle sets.
Environment:
Dataset:
Job options (see note.txt in the job directory):

Error message:
RELION version: 5.0-beta-0-commit-90d239
exiting with an error ...
hwloc/linux: Ignoring PCI device with non-16bit domain. Pass --enable-32bits-pci-domain to configure to support such devices (warning: it would break the library ABI, don't enable unless really needed).
in: /home/supervisor/relion/src/acc/cuda/custom_allocator.cuh, line 539
ERROR:
You ran out of memory on the GPU(s).
Each MPI-rank running on a GPU increases the use of GPU-memory. Relion tries to distribute load over multiple GPUs to increase performance, but doing this in a general and memory-efficient way is difficult.
Check the device-mapping presented at the beginning of each run, and be particularly wary of 'device X is split between N followers', which will result in a higher memory cost on GPU X. In classifications, GPU-sharing between MPI-ranks is typically fine, whereas it will usually cause out-of-memory during the last iteration of high-resolution refinement.
If you are not GPU-sharing across MPI-follower ranks, then you might be using a too-big box-size for the GPU memory. Currently, N-pixel particle images will require roughly

    (1.1e-8) * (N*2)^3 GB

of memory (per rank) during the final iteration of refinement (using single-precision GPU code, which is default). 450-pixel images can therefore just about fit into a GPU with 8GB of memory, since 1.1e-8 * (450*2)^3 ~= 8.02. During classifications, resolution is typically lower and N is suitably reduced, which means that memory use is much lower.
If the above estimation fits onto (all of) your GPU(s), you may have a very large number of orientations which are found as possible during the expectation step, which results in large arrays being needed on the GPU. If this is the case, you should find large (>10'000) values of '_rlnNrOfSignificantSamples' in your _data.star output files. You can try adding the --maxsig P flag, where P is an integer limit, but you should probably also consult expertise or re-evaluate your data and/or input reference. Seeing large such values means relion is finding nothing to align.
If none of the above applies, please report the error to the relion developers at github.com/3dem/relion/issues.
in: /home/supervisor/relion/src/acc/cuda/custom_allocator.cuh, line 539
ERROR:
[the same out-of-memory message as above is printed again]
follower 2 encountered error: === Backtrace ===
/home/supervisor/relion/build/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x7d) [0x55584e7c528d]
/home/supervisor/relion/build/bin/relion_refine_mpi(+0xe4ec4) [0x55584e79bec4]
/home/supervisor/relion/build/bin/relion_refine_mpi(+0x3770cd) [0x55584ea2e0cd]
/lib/x86_64-linux-gnu/libgomp.so.1(+0x1dc0e) [0x7f1326665c0e]
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f1325894ac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f1325926850]
ERROR:
[the same out-of-memory message as above is printed again]
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.
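For reference, the per-rank memory estimate quoted in the error message is easy to check with a few lines of Python. This is only a sketch of the (1.1e-8)*(N*2)^3 GB rule of thumb from the message above; the real footprint also depends on how many MPI ranks share a GPU and on the resolution reached.

```python
# Rough per-rank GPU memory estimate for the final refinement iteration,
# using the rule of thumb quoted in the RELION error message:
#   memory_GB ~= 1.1e-8 * (2 * N)^3   for an N-pixel box, single precision.
def refine_gpu_gb(box_size_px: int) -> float:
    return 1.1e-8 * (2 * box_size_px) ** 3

for n in (256, 360, 450, 512):
    print(f"box {n:4d} px  ->  ~{refine_gpu_gb(n):.1f} GB per rank")
# box 450 px -> ~8.0 GB, matching the example in the error message.
```

Under this rule of thumb, two MPI followers sharing one GPU roughly double the requirement on that card, which is consistent with the advice about device-sharing in the message.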
I accidentally deleted the job folders for the 3D classification, so I cannot post those job options right now. The next time I see this error there, I will post them too.
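When the job folders are available, the device mapping printed at the start of the run output is also worth checking for the 'device X is split between N followers' situation mentioned in the error. A minimal sketch for pulling those lines out of a log; the exact wording searched for is an assumption taken from the error text above and may differ between RELION versions:

```python
# Scan a RELION stdout log for device-mapping lines. The phrases searched for
# ("split between", "mapped to device") are assumptions based on the error
# message above, not a guaranteed match for every RELION version.
import sys

log_path = sys.argv[1]                      # e.g. Refine3D/job0XX/run.out
with open(log_path) as f:
    for line in f:
        if "split between" in line or "mapped to device" in line:
            print(line.rstrip())
```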