3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

GPU memory error if --zero_mask not used #619

Closed cvsindelar closed 4 years ago

cvsindelar commented 4 years ago

Hi there, I discovered that, at least on our machines, Relion 3 and 3.1 (both) are entirely unable to run GPU-accelerated 3D classifications if the --zero_mask option is not used.

I attach a simple 2MB data set that reproducibly generates this problem on multiple different machines, with multiple different RELION builds. The command runs successfully if either (1) no gpu option is given or (2) --zero_mask option is used. relion_gpu_crash.zip

Job options:

relion_refine --o ./output --i particles_mini.star --ref mt_reconstruct_20A_bin12.mrc --firstiter_cc --healpix_order 1 --j 1 --gpu ""

gpu-ids not specified, threads will automatically be mapped to devices (incrementally).
Thread 0 mapped to device 0
Running CPU instructions in double precision.
Estimating initial noise spectra
 0/ 0 sec ............................................................~~(,_,">
CurrentResolution= 162.563 Angstroms, which requires orientationSampling of at least 25.7143 degrees for a particle of diameter 685.811 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 16704
OrientationalSampling= 30 NrOrientations= 576
TranslationalSampling= 2 NrTranslations= 29

Oversampling= 1 NrHiddenVariableSamplingPoints= 534528
OrientationalSampling= 15 NrOrientations= 4608
TranslationalSampling= 1 NrTranslations= 116

Expectation iteration 1 of 50
000/??? sec ~~(,_,"> [oo]
KERNEL_ERROR: out of memory in /dev/shm/be59/build/RELION/3.0.8/fosscuda-2018b/relion-3.0.8/src/acc/utilities_impl.h at line 253 (error-code 2)
in: /dev/shm/be59/build/RELION/3.0.8/fosscuda-2018b/relion-3.0.8/src/acc/cuda/cuda_settings.h, line 81
in: /dev/shm/be59/build/RELION/3.0.8/fosscuda-2018b/relion-3.0.8/src/acc/cuda/cuda_settings.h, line 81
=== Backtrace ===
relion_refine(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x66) [0x43d806]
relion_refine(_Z36globalThreadExpectationSomeParticlesR14ThreadArgument+0xf3) [0x4a46f3]
relion_refine(_Z11_threadMainPv+0x36) [0x4c36a6]
/lib64/libpthread.so.0(+0x7dd5) [0x2aba23d00dd5]
/lib64/libc.so.6(clone+0x6d) [0x2aba246b602d]

ERROR:

A GPU-function failed to execute.

If this occurred at the start of a run, you might have GPUs which are incompatible with either the data or your installation of relion. If you

-> INSTALLED RELION YOURSELF: if you e.g. specified -DCUDA_ARCH=50
   and are trying to run on a compute 3.5 GPU (-DCUDA_ARCH=3.5),
   this may happen.

-> HAVE MULTIPLE GPUS OF DIFFERENT VERSIONS: relion needs GPUS with
   at least compute 3.5. You may be trying to use a GPU older than
   this. If you have multiple generations, try specifying --gpu <X>
   with X=0. Then try X=1 in a new run, and so on. The numbering of
   GPUs may not be obvious from the driver or intuition. For a list
   of GPU compute generations, see 

   en.wikipedia.org/wiki/CUDA#Version_features_and_specifications

-> ARE USING DOUBLE-PRECISION GPU CODE: relion has been written so
   as to not require this, and may thus have unforeseen requirements
   when run in this mode. If you think it is nonetheless necessary,
   please consult the developers with this error.

If this occurred at the middle or end of a run, it might be that

-> YOUR DATA OR PARAMETERS WERE UNEXPECTED: execution on GPUs is 
   subject to many restrictions, and relion is written to work within
   common restraints. If you have exotic data or settings, unexpected
   configurations may occur. See also above point regarding 
   double precision.

If none of the above applies, please report the error to the relion developers at github.com/3dem/relion/issues

biochem-fan commented 4 years ago

Thanks for the test case. This is very useful for testing.

Unfortunately, 0001.mrcs seems broken. Can you double-check the content of the archive?

cvsindelar commented 4 years ago

relion_gpu_crash_fix.zip

My apologies! The current attachment should fix this.

bforsbe commented 4 years ago

Omitting --zero_mask makes calls to CUB, which requires dynamic allocation of GPU memory. If relion is grabbing most of the memory to manage through its own allocator, then my guess is that there's not enough dynamic memory left. I recall that we increased the dynamic allocation space when non-zero-masking was moved to GPUs, but I don't remember the specifics.
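For context, CUB's device-wide primitives follow a two-phase pattern: they are first called with a null workspace pointer to report how many temporary bytes they need, and that workspace must then come from a fresh cudaMalloc, i.e. from whatever GPU memory is left outside relion's own cached allocator. A minimal generic sketch of that pattern (illustrative only, not relion's actual code; the function name and the Sum primitive are just examples):

    #include <cub/cub.cuh>
    #include <cuda_runtime.h>

    // Generic two-phase CUB call: the temporary workspace must be allocated
    // dynamically, so it competes with whatever the static pool has not grabbed.
    void sum_on_gpu(const float *d_in, float *d_out, int num_items)
    {
        void  *d_temp_storage = nullptr;
        size_t temp_storage_bytes = 0;

        // First call: with a null workspace pointer, CUB only reports the size it needs.
        cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

        // Second call: the workspace has to come from free (non-pooled) device memory;
        // if too little is left, this allocation is where "out of memory" appears.
        cudaMalloc(&d_temp_storage, temp_storage_bytes);
        cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

        cudaFree(d_temp_storage);
    }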

With a 1080 you should have plenty; I'm just clarifying the context of this "out of memory" error. Hope that helps.

cvsindelar commented 4 years ago

Hi Bjorn, I just heard back from our cluster administrator: he found a couple of NVIDIA card models where the example ran OK. I will pass this along when I know more. Certainly the 2MB test data set should not overtax the memory on the GTX 1080, at least I hope so! :)

bforsbe commented 4 years ago

No, but that's sort of the point: relion doesn't know your input is 2 MB, so it takes a big chunk of the GPU as a static allocation, leaving only some dynamic allocation space. If other programs or circumstances reduce this dynamic allocation space even further, you could still run out. There is a flag, --free_gpu_memory, which specifies how many extra MB to leave for dynamic allocation. You could always try using that: --free_gpu_memory 1000 or so. I'm not saying you should have to, but it's a diagnostic at least.
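Applied to the command from the original report, that diagnostic run would look something like this (only --free_gpu_memory 1000 is new; the value is the one suggested above):

relion_refine --o ./output --i particles_mini.star --ref mt_reconstruct_20A_bin12.mrc --firstiter_cc --healpix_order 1 --j 1 --gpu "" --free_gpu_memory 1000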

cvsindelar commented 4 years ago

Here is a list of which graphics cards successfully ran the test case above. I'll check to see whether that '--free_gpu_memory 1000' helps things.

K80: PASS
RTX 2080: PASS
GTX 1080Ti: FAIL
RTX 5000: FAIL
RTX 8000: FAIL
P100: FAIL
V100: FAIL
TITAN V: FAIL

cvsindelar commented 4 years ago

Indeed, adding the option '--free_gpu_memory 1000' to the relion_refine command fixes the problem. This is a usable workaround. Not that I actually prefer the non-zero-masked method... it was just how I first thought to try it. Thanks, Bjorn.

bforsbe commented 4 years ago

"Non-zero-masking" means masking by random noise, and generating random numbers on the GPU does require some extra space. Not sure how much, but clearly it can be come an issue. At the time we implemented this, we set parameters that we thought were conservative. The fact that it seems not to be is another argument for a more elaborate memory estimation that makes the static allocation less greedy. I'm really surprised that some cards work and others don't though...

biochem-fan commented 4 years ago

Indeed. I often use non-zero masking (this improves the resolution of some membrane proteins by reducing overfitting) without problems. We use 1080Ti, 1080, 2080Ti.

rui--zhang commented 3 years ago

Hi, on a new machine with CUDA 11.1, a GeForce RTX 3090 and relion 3.1.1, I am having exactly the same issue in Refine3D: with non-zero masking, the GPU runs out of memory. The particle box size doesn't seem to make a difference. Either using zero masking or specifying '--free_gpu_memory 1000' fixes the issue.

Here is the error message:

 Auto-refine: Resolution= 10.0206 (no gain for 0 iter) 
 Auto-refine: Changes in angles= 999 degrees; and in offsets= 999 Angstroms (no gain for 0 iter) 
 Estimating accuracies in the orientational assignment ... 
   0/   0 sec ............................................................~~(,_,">
 Auto-refine: Estimated accuracy angles= 0.3085 degrees; offsets= 0.60554 Angstroms
 CurrentResolution= 10.0206 Angstroms, which requires orientationSampling of at least 1.65899 degrees for a particle of diameter 690 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 2730
 OrientationalSampling= 1.875 NrOrientations= 130
 TranslationalSampling= 2.74 NrTranslations= 21
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 87360
 OrientationalSampling= 0.9375 NrOrientations= 1040
 TranslationalSampling= 1.37 NrTranslations= 84
=============================
 Expectation iteration 1
000/??? sec ~~(,_,">                                                          [oo]KERNEL_ERROR: out of memory in /home/install/code/relion-3.1/src/acc/utilities_impl.h at line 253 (error-code 2)
KERNEL_ERROR: out of memory in /home/install/code/relion-3.1/src/acc/utilities_impl.h at line 253 (error-code 2)
KERNEL_ERROR: out of memory in /home/install/code/relion-3.1/src/acc/utilities_impl.h at line 253 (error-code 2)
KERNEL_ERROR: out of memory in /home/install/code/relion-3.1/src/acc/utilities_impl.h at line 253 (error-code 2)
KERNEL_ERROR: out of memory in /home/install/code/relion-3.1/src/acc/utilities_impl.h at line 253 (error-code 2)

 RELION version: 3.1.1-commit-9f3bf1
 exiting with an error ...

 RELION version: 3.1.1-commit-9f3bf1
 exiting with an error ...
KERNEL_ERROR: out of memory in /home/install/code/relion-3.1/src/acc/utilities_impl.h at line 253 (error-code 2)

 RELION version: 3.1.1-commit-9f3bf1
 exiting with an error ...

 RELION version: 3.1.1-commit-9f3bf1
 exiting with an error ...
biochem-fan commented 3 years ago

Non-zero masking makes the probability distribution wider and requires more memory, especially when particles are difficult to align.

Possible solutions:

rui--zhang commented 3 years ago

Hi, thank you for the prompt reply! I was actually doing local angular search (0.9 degrees). Using --maxsig 5000 didn't fix the issue, still the same error.

This is the command I was using:

which relion_refine_mpi --o Refine3D/job013/run --auto_refine --split_random_halves --i particles_reorder_fre2relion.star --ref Class3D/job005/run_it001_class001.mrc --ini_high 10 --dont_combine_weights_via_disc --scratch_dir /ssd --pool 3 --pad 1 --skip_gridding --ctf --ctf_corrected_ref --particle_diameter 690 --flatten_solvent --solvent_mask shapeMask_nx512_CP/mask3D_500x281.mrc --solvent_correct_fsc --oversampling 1 --healpix_order 5 --auto_local_healpix_order 5 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --helix --helical_outer_diameter 500 --ignore_helical_symmetry --sigma_tilt 5 --sigma_psi 3.33333 --sigma_rot 0 --helical_keep_tilt_prior_fixed --j 3 --gpu "" --dont_check_norm --keep_scratch --reuse_scratch --pipeline_control Refine3D/job013/

biochem-fan commented 3 years ago

What happens if you run Refine3D with zero-masking, stop it, and continue with non-zero masking from an intermediate iteration where the resolution is 4 A or so?

rui--zhang commented 3 years ago

Let me try it. I forgot to mention that the exact same data/command runs perfectly fine on another machine with a GTX 1080Ti, CUDA 10.1 and relion 3.1.1 (same version), so this issue seems to be related to the hardware/CUDA. Trying relion 3.1.1 compiled with CUDA 10.1 on the new machine with the RTX 3090 gives a different error message:

in: /home/install/code/relion-3.1-cu10.1/src/acc/cuda/cuda_fft.h, line 224 ERROR:

When trying to plan one or more Fourier transforms, it was found that the available GPU memory was insufficient. Relion attempts to reduce the memory by segmenting the required number of transformations, but in this case not even a single transform could fit into memory. Either you are (1) performing very large transforms, or (2) the GPU had very little available memory.

(1) may occur during autopicking if the 'shrink' parameter was set to 1. The 
recommended value is 0 (--shrink 0), which is argued in the RELION-2 paper (eLife).
This reduces memory requirements proportionally to the low-pass used. 

(2) may occur if multiple processes were using the same GPU without being aware
of each other, or if there were too many such processes. Parallel execution of 
relion binaries ending with _mpi ARE aware, but you may need to reduce the number
of mpi-ranks to equal the total number of GPUs. If you are running other instances 
of GPU-accelerated programs (relion or other), these may be competing for space.
Relion currently reserves all available space during initialization and distributes
this space across all sub-processes using the available resources. This behaviour 
can be escaped by the auxiliary flag --free_gpu_memory X [MB]. You can also go 
further and force use of full dynamic runtime memory allocation by building 
relion with cmake -DCachedAlloc=OFF.

in: /home/install/code/relion-3.1-cu10.1/src/acc/cuda/cuda_fft.h, line 224 ERROR: ERROR:

biochem-fan commented 3 years ago

Didn't you specify CUDA_ARCH in cmake? The 3090 and the 1080 belong to different GPU architectures, so you either need PTX in your binary or have to compile for the specific compute capability of each card. See https://docs.nvidia.com/deploy/cuda-compatibility/index.html.
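If it helps, a card's compute capability can be checked directly with the CUDA runtime, independently of relion; a small stand-alone sketch (the program itself is just an example):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Print the compute capability of every visible GPU, e.g. 8.6 for an RTX 3090.
    int main()
    {
        int n_devices = 0;
        cudaGetDeviceCount(&n_devices);
        for (int i = 0; i < n_devices; ++i)
        {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("Device %d: %s, compute capability %d.%d\n",
                   i, prop.name, prop.major, prop.minor);
        }
        return 0;
    }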

rui--zhang commented 3 years ago

Yes, we used -DCUDA_ARCH=86

biochem-fan commented 3 years ago

For 3080, you have to use CUDA >= 11.1. https://forums.developer.nvidia.com/t/can-rtx-3080-support-cuda-10-1/155849

Compile with it and also make sure with ldd that you are using the right runtime.

rui--zhang commented 3 years ago

Here is the ldd result: Can you spot anything wrong?

zhangrui@sp3.wustl.edu:/usr/local/relion/bin$ ldd relion_refine_mpi
linux-vdso.so.1 (0x00007fff04bbe000)
libcufft.so.10 => /usr/local/cuda-11.1/lib64/libcufft.so.10 (0x00007ff7d3226000)
libmpi.so.40 => /opt/openmpi/4.0.5/lib/libmpi.so.40 (0x00007ff7d30fc000)
libtiff.so.5 => /lib/x86_64-linux-gnu/libtiff.so.5 (0x00007ff7d3060000)
libfftw3.so.3 => /usr/local/relion/lib/libfftw3.so.3 (0x00007ff7d2eab000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ff7d2e86000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007ff7d2e7b000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ff7d2e75000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007ff7d2c94000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007ff7d2b45000)
libgomp.so.1 => /lib/x86_64-linux-gnu/libgomp.so.1 (0x00007ff7d2b03000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007ff7d2ae6000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ff7d28f4000)
/lib64/ld-linux-x86-64.so.2 (0x00007ff7e1cb8000)
libopen-rte.so.40 => /opt/openmpi/4.0.5/lib/libopen-rte.so.40 (0x00007ff7d2838000)
libopen-pal.so.40 => /opt/openmpi/4.0.5/lib/libopen-pal.so.40 (0x00007ff7d2780000)
libhwloc.so.15 => /lib/x86_64-linux-gnu/libhwloc.so.15 (0x00007ff7d272f000)
libwebp.so.6 => /lib/x86_64-linux-gnu/libwebp.so.6 (0x00007ff7d24c6000)
libzstd.so.1 => /lib/x86_64-linux-gnu/libzstd.so.1 (0x00007ff7d241b000)
liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00007ff7d23f2000)
libjbig.so.0 => /lib/x86_64-linux-gnu/libjbig.so.0 (0x00007ff7d21e4000)
libjpeg.so.8 => /lib/x86_64-linux-gnu/libjpeg.so.8 (0x00007ff7d215f000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007ff7d2143000)
libevent_core-2.1.so.7 => /lib/x86_64-linux-gnu/libevent_core-2.1.so.7 (0x00007ff7d2109000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007ff7d2104000)
libevent_pthreads-2.1.so.7 => /lib/x86_64-linux-gnu/libevent_pthreads-2.1.so.7 (0x00007ff7d20ff000)
libudev.so.1 => /lib/x86_64-linux-gnu/libudev.so.1 (0x00007ff7d20d2000)
libltdl.so.7 => /lib/x86_64-linux-gnu/libltdl.so.7 (0x00007ff7d20c7000)

biochem-fan commented 3 years ago

ldd looks fine. Sorry, I have no idea. We don't have 3080s at hand, so we cannot investigate locally.

rui--zhang commented 3 years ago

OK. No worries! I can use "--free_gpu_memory 1000" for now without any issue. Thanks!

biochem-fan commented 3 years ago

I mentioned this to @arom4github, our collaborator at NVIDIA, to see if something changed in cuRAND.

biochem-fan commented 3 years ago

@arom4github commented this:

rui--zhang said nothing about number of mpi ranks he used. Probably it would be enough for him to have one mpi rank per GPU.

(I assumed you were doing so, but just to make sure)

biochem-fan commented 3 years ago

On CCPEM and Twitter, there are several reports that RELION runs fine with 30x0 cards, but I am not sure if they tried non-zero masking. Can you try non-zero masking on our tutorial dataset? Does it run fine?

rui--zhang commented 3 years ago

@arom4github commented this:

rui--zhang said nothing about number of mpi ranks he used. Probably it would be enough for him to have one mpi rank per GPU.

(I assumed you were doing so, but just to make sure)

I tried one mpi rank per GPU, still the same error.

cvsindelar commented 3 years ago

Hi, for what it's worth, we consistently run into this error when we try to omit zero-masking on our GPUs of multiple flavors. This is irrespective of particle dimension or bin factor (including very highly binned data with tiny dimensions). Non-GPU and/or zero-masked runs all work fine, so I think this points towards a GPU bug, not a memory limitation. - Chuck


biochem-fan commented 3 years ago

@cvsindelar

use our GPUs of multiple flavors

Which GPU?

On our system with 1080 Ti or 2080 Ti, it works fine. As I wrote above, can you test non-zero masking on our beta-galactosidase tutorial dataset?

rui--zhang commented 3 years ago

One more comment on non-zero-masking (why I care about it): when I use a divide-and-conquer strategy to reconstruct different pieces of a big structure, the grey levels (and noise levels) of the different pieces tend not to match each other if zero-masking is used, making the final stitched map look bad.

biochem-fan commented 3 years ago

I am against the use of composite/stitched/Frankenstein maps because the interfaces between pieces are not well defined. It is fine to make a low-resolution overview for Supplementary Figures, but please don't refine atomic models against it.

the grey level (and the noise level) of different pieces

Is this run_class001.mrc or the PostProcessed map? Also note that different resolutions and sharpening B factors lead to different grey levels and background noise levels.

Since backprojection is done with unmasked particles, I don't know why non-zero masking changes the output map.

rui--zhang commented 3 years ago

I agree the interface is not well preserved. The composite map is mainly used for figure making and map deposition.

For our doublet microtubule dataset, the grey levels (and the background noise) of the run_class001.mrc maps do not perfectly match each other when zero-masking is used. https://www.sciencedirect.com/science/article/pii/S0092867419310815 (see Fig. S1; the whole structure was divided into 30 pieces)