3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

Relion-5.0: GPU memory errors in Class3D on subtomograms #1061

Open pherepanov opened 9 months ago

pherepanov commented 9 months ago

I encountered several frustrating issues when attempting 3D classification of subtomograms in Relion-5.0.

1) When subtomograms are created in the old style (i.e. with the "2D stacks" option set to "NO"), 3D classification invariably fails with GPU memory errors:

ERROR: CudaCustomAllocator out of memory [requestedSpace: 3420000 B] [largestContinuousFreeSpace: 3359232 B] [totalFreeSpace: 6432256 B]

This happens whether or not an initial model is provided. In this case, GPU memory cannot really be the limitation, since my particles are only 36^3 pixels and I only have 5-10k of them (depending on the job). The corresponding run_XXX_data.star files all have low _rlnNrOfSignificantSamples values (well under 2000). Limiting --maxsig to 2000 or 500 does not help; reducing the number of threads to 1 does not help either. The job always fails with the same error at a random iteration: sometimes at the very beginning (iteration 1-2), sometimes at iteration 6-7, or anywhere in between. Submitting a Continue on such a failed job produces an immediate crash with the exact same error. It seems that Relion asks for just a little more space than is available as "continuous space" on the GPUs. By contrast, the same jobs run without errors on CPUs.
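
For anyone who wants to repeat the check on the significant-samples column, below is a minimal, dependency-free Python sketch; it is a rough illustration only, the file path is a made-up example, and it assumes the label sits in a loop_ block exactly as RELION writes it in its *_data.star files.

```python
# Summarise the _rlnNrOfSignificantSamples column of a RELION *_data.star file.
star_path = "Class3D/job187/run_it006_data.star"  # hypothetical example path

col = None          # 0-based index of the label column, once found
n_labels = 0
in_header = False
values = []

with open(star_path) as fh:
    for raw in fh:
        token = raw.strip()
        if token == "loop_":
            in_header, n_labels, col = True, 0, None
            continue
        if in_header and token.startswith("_"):
            if token.split()[0] == "_rlnNrOfSignificantSamples":
                col = n_labels
            n_labels += 1
            continue
        in_header = False
        # Collect only genuine data rows of the block that carries the label.
        if col is not None and token and not token.startswith(("_", "data_", "#")):
            fields = token.split()
            if len(fields) == n_labels:
                values.append(float(fields[col]))

if values:
    print(f"{len(values)} particles; "
          f"max _rlnNrOfSignificantSamples = {max(values):.0f}; "
          f"values above 2000: {sum(v > 2000 for v in values)}")
else:
    print("_rlnNrOfSignificantSamples not found in", star_path)
```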

Running Relion-5.0 3D classification in an old project created with relion-4beta fails on a job that runs beautifully with relion-4beta (commit a26bd4). On the same job, however, the stable version of relion-4 (commit 138b9c) fails with the same errors as relion-5.0. (The same job still runs with relion-4.0beta as before, with the same results.) Something must have changed between a26bd4 and 138b9c that causes GPU memory errors on pseudo-subtomograms.

2) When subtomograms are extracted as 2D stacks and 3D classification is started without an initial model ("None" in the initial model field), 3D classification does not fail with an error and runs to completion, but without generating any 3D volumes. Instead, classes.mrcs files are produced, which are stacks of 2D images (each stack contains as many 2D images as the number of 3D classes requested in the job). Is this expected behaviour? Even on good data, which would produce good classes, the 2D images are not very informative.
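
As a quick way to confirm what such a job actually wrote, the header of the classes.mrcs file can be inspected, for example with the third-party mrcfile Python package; the snippet below is only illustrative and the file name is a made-up example.

```python
# Check whether a Class3D output file is a 3D volume or a stack of 2D images.
import mrcfile

path = "Class3D/job191/run_it025_classes.mrcs"  # hypothetical example path

with mrcfile.open(path, permissive=True) as mrc:
    print(f"{path}: data shape = {mrc.data.shape}")  # (n_images, ny, nx) for a stack
    print("image stack:", mrc.is_image_stack(), "| volume:", mrc.is_volume())
```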

3) However, when subtomograms are generated as 2D stacks and 3D classification is started from an initial 3D model, Relion-5.0 fails just as described above, with the same GPU memory errors.

Environment:

Dataset:

Job options:

Two examples (with and without an initial model):

  `which relion_refine_mpi` --o Class3D/job187/run --i PseudoSubtomo/job157/particles.star --tomograms ReconstructTomograms/job055/tomograms.star --ref model.mrc --firstiter_cc --ini_high 60 --dont_combine_weights_via_disc --scratch_dir $TMPDIR --pool 3 --pad 2  --ctf --iter 35 --tau2_fudge 4 --particle_diameter 240 --K 5 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 --sym C1 --norm --scale  --j 1 --gpu ""  --sigma_tilt 20 --sigma_psi 20 --pipeline_control Class3D/job187/
`which relion_refine_mpi` --o Class3D/job191/run --i PseudoSubtomo/job157/particles.star --tomograms ReconstructTomograms/job055/tomograms.star --ini_high 60 --dont_combine_weights_via_disc --scratch_dir $TMPDIR --pool 3 --pad 2  --ctf --iter 35 --tau2_fudge 4 --particle_diameter 240 --K 3 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 --sym C1 --norm --scale  --j 1 --gpu ""  --sigma_tilt 10 --sigma_psi 10 --pipeline_control Class3D/job191/

Error message:

.... Expectation iteration 7 of 35 35/ 35 sec ............................................................~~(,,">
Maximization... 0/ 0 sec ............................................................~~(,,">
ERROR: CudaCustomAllocator out of memory [requestedSpace: 3420000 B] [largestContinuousFreeSpace: 3359232 B] [totalFreeSpace: 6432256 B] [3359232B] (3068928B) [3073024B] = 9501184B
(the same allocator error is printed once by each of the four MPI ranks)

in: /camp/apps/misc/stp/sbstp/relion-5.0/src/acc/cuda/custom_allocator.cuh, line 539 ERROR:

You ran out of memory on the GPU(s).

Each MPI-rank running on a GPU increases the use of GPU-memory. Relion tries to distribute load over multiple GPUs to increase performance, but doing this in a general and memory-efficient way is difficult.

  1. Check the device-mapping presented at the beginning of each run, and be particularly wary of 'device X is split between N followers', which will result in a higher memory cost on GPU X. In classifications, GPU-sharing between MPI-ranks is typically fine, whereas it will usually cause out-of-memory during the last iteration of high-resolution refinement.

  2. If you are not GPU-sharing across MPI-follower ranks, then you might be using a too-big box-size for the GPU memory. Currently, N-pixel particle images will require roughly

        (1.1e-8)*(N*2)^3  GB  

    of memory (per rank) during the final iteration of refinement (using single-precision GPU code, which is default). 450-pixel images can therefore just about fit into a GPU with 8GB of memory, since (1.1e-8)*(450*2)^3 ≈ 8.02. During classifications, resolution is typically lower and N is suitably reduced, which means that memory use is much lower.

  3. If the above estimation fits onto (all of) your GPU(s), you may have a very large number of orientations which are found as possible during the expectation step, which results in large arrays being needed on the GPU. If this is the case, you should find large (>10'000) values of '_rlnNrOfSignificantSamples' in your _data.star output files. You can try adding the --maxsig <P> flag, where P is an integer limit, but you should probably also consult expertise or re-evaluate your data and/or input reference. Seeing large such values means relion is finding nothing to align. If none of the above applies, please report the error to the relion developers at github.com/3dem/relion/issues.

[gpu005:137014] Process received signal
[gpu005:137014] Signal: Segmentation fault (11)
[gpu005:137014] Signal code: Address not mapped (1)
[gpu005:137014] Failing at address: 0x28
[gpu005:137014] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7ff0f3caa630]
[gpu005:137014] [ 1] /camp/apps/misc/stp/sbstp/5.0beta-tomo-GPU/bin/relion_refine_mpi(_ZN6AccPtrI8tComplexIdEE9freeIfSetEv+0x48)[0x4bb998]
[gpu005:137014] [ 2] /camp/apps/misc/stp/sbstp/5.0beta-tomo-GPU/bin/relion_refine_mpi(_ZN9Projector26computeFourierTransformMapER13MultidimArrayIdES2_iibbiPKS1_b+0x333e)[0x4ba3ce]
[gpu005:137014] [ 3] /camp/apps/misc/stp/sbstp/5.0beta-tomo-GPU/bin/relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0x8e1)[0x6275a1]
[gpu005:137014] [ 4] /camp/apps/misc/stp/sbstp/5.0beta-tomo-GPU/bin/relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5a)[0x64591a]
[gpu005:137014] [ 5] /camp/apps/misc/stp/sbstp/5.0beta-tomo-GPU/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x2fb)[0x47fdeb]
[gpu005:137014] [ 6] /camp/apps/misc/stp/sbstp/5.0beta-tomo-GPU/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x105)[0x491d25]
[gpu005:137014] [ 7] /camp/apps/misc/stp/sbstp/5.0beta-tomo-GPU/bin/relion_refine_mpi(main+0x54)[0x44b014]
[gpu005:137014] [ 8] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0f31a6555]
[gpu005:137014] [ 9] /camp/apps/misc/stp/sbstp/5.0beta-tomo-GPU/bin/relion_refine_mpi[0x44e65e]
[gpu005:137014] End of error message

(the same out-of-memory message and segmentation-fault backtrace were printed by MPI ranks 137013, 137015 and 137016)

srun: error: gpu005: tasks 2-4: Segmentation fault
srun: error: gpu005: task 1: Segmentation fault
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 63128037 ON gpu005 CANCELLED AT 2023-12-31T17:13:26 ***
slurmstepd: error: *** STEP 63128037.0 ON gpu005 CANCELLED AT 2023-12-31T17:13:26 ***
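
For what it is worth, plugging the 36-pixel box used here into the rough per-rank estimate quoted in the help text above gives a tiny number, which supports the point that raw GPU capacity should not be the limiting factor (a back-of-envelope check only):

```python
# Rough per-rank GPU memory estimate from the RELION help text above:
# (1.1e-8) * (N*2)^3 GB for N-pixel particle images.
N = 36
est_gb = 1.1e-8 * (N * 2) ** 3
print(f"~{est_gb:.4f} GB per rank")  # ~0.004 GB, i.e. roughly 4 MB
```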

pherepanov commented 9 months ago

An update on this issue: the GPU memory errors arise in Relion-5.0 when refining binned subtomograms. Refining unbinned subtomograms works fine. However, binned subtomograms can be refined with an earlier version of relion (I use 4.0beta, commit a26bd4) in the same project directory.

pherepanov commented 9 months ago

Final update: the bug affected relion_refine when the box size was <44 pixels. So if someone is having a similar issue with relion-5.0beta, re-extract your subtomograms with a larger box.
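
If it is unclear whether an existing extraction falls below that threshold, the box size recorded in the particles STAR file can be checked directly; the sketch below is a rough, plain-Python illustration, the path is a made-up example, and it assumes the optics table carries _rlnImageSize as RELION-4/5 extraction jobs normally write.

```python
# Print the box size(s) recorded in the data_optics table of a particles.star.
star_path = "PseudoSubtomo/job157/particles.star"  # hypothetical example path

labels, rows, in_header, block = [], [], False, None
with open(star_path) as fh:
    for raw in fh:
        token = raw.strip()
        if token.startswith("data_"):
            block = token
            continue
        if block != "data_optics":
            continue
        if token == "loop_":
            in_header, labels = True, []
            continue
        if in_header and token.startswith("_"):
            labels.append(token.split()[0])
            continue
        in_header = False
        if labels and token and not token.startswith("#"):
            rows.append(token.split())

if "_rlnImageSize" in labels:
    i = labels.index("_rlnImageSize")
    for r in rows:
        if len(r) == len(labels):
            print("optics group box size:", r[i], "pixels")
else:
    print("_rlnImageSize not found in the optics table of", star_path)
```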

oleuns commented 8 months ago

Hey,

I am encountering similar issues: as described above, the exact same errors sometimes occur at a random iteration during 3D classification or 3D refinement with Relion v5. The dataset contains 37,000 particles with a box size of 256; I use 3 MPI ranks for 3D refinement and 1 MPI rank for 3D classification. I tested this on an RTX 3080 Ti (11 GB VRAM) and a Quadro RTX 5000 (16 GB VRAM). I reran the 3D classifications/refinements multiple times until the job finished, but could not identify a particular pattern.
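
As a rough cross-check against the per-rank estimate quoted in the error text earlier in this thread, a 256-pixel box should be well within either card, which makes the sporadic failures all the more puzzling (back-of-envelope only, ignoring classification overheads):

```python
# Rough per-rank estimate from the RELION out-of-memory help text:
# (1.1e-8) * (N*2)^3 GB for N-pixel particle images.
N = 256
per_rank_gb = 1.1e-8 * (N * 2) ** 3
print(f"~{per_rank_gb:.2f} GB per rank; ~{3 * per_rank_gb:.2f} GB if 3 MPI ranks share one GPU")
```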

pherepanov commented 8 months ago

Just to clarify: the issue with subtomo refinements I was experiencing has been solved in the current beta release (thank you, Sjors!).

oleuns commented 8 months ago

I am currently running RELION version 5.0-beta-0-commit-90d239 (I hope this is the current beta release) and still encountered these issues in 3D classification and 3D auto-refinement. However, this was SPA processing, not subtomograms (sorry, I forgot to mention this). Thanks for the response!

biochem-fan commented 8 months ago

@oleuns are you sure you replied to the right thread? Did you mean https://github.com/3dem/relion/issues/1061?

oleuns commented 8 months ago

@biochem-fan Maybe I should open a new thread describing the issue that I encountered; the error message was just identical to the one described in #1061.