3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

A GPU-function failed to execute #968

alexlicorteva closed this issue 1 year ago

alexlicorteva commented 1 year ago

Recently I tried to compile Relion v4.0.1 on an AWS EC2 instance with GPU support (Ubuntu 20.04, NVIDIA Tesla M60 GPUs, CUDA 11.7). There were no errors during compilation, but at runtime, after a few iterations of Refine3D on the relion30_tutorial data, Refine3D died with the error "A GPU-function failed to execute" (for details, see the bottom). Recompiling with -DCUDA_ARCH=50 and -DCUDA_ARCH=35 did not help. Could you tell me the OS type, OS version, GPU model, and CUDA version with which you can build Relion and run Refine3D with GPU enabled successfully, so that I can try the same software and hardware?

Environment:

Dataset: relion30_tutorial data

Job options:

Error message: ERROR: out of memory in /home/ccpem/src/relion/src/acc/cuda/custom_allocator.cuh at line 435 (error-code 2)

in: /home/ccpem/src/relion/src/acc/cuda/cuda_settings.h, line 65

ERROR:

A GPU-function failed to execute.

If this occured at the start of a run, you might have GPUs which are incompatible with either the data or your installation of relion. If you

        -> INSTALLED RELION YOURSELF: if you e.g. specified -DCUDA_ARCH=50
           and are trying ot run on a compute 3.5 GPU (-DCUDA_ARCH=3.5),
           this may happen.

        -> HAVE MULTIPLE GPUS OF DIFFERNT VERSIONS: relion needs GPUS with
           at least compute 3.5. You may be trying to use a GPU older than
           this. If you have multiple generations, try specifying --gpu <X>
           with X=0. Then try X=1 in a new run, and so on. The numbering of
           GPUs may not be obvious from the driver or intuition. For a list
           of GPU compute generations, see
           en.wikipedia.org/wiki/CUDA#Version_features_and_specifications

        -> ARE USING DOUBLE-PRECISION GPU CODE: relion was been written so
           as to not require this, and may thus have unforeseen requirements
           when run in this mode. If you think it is nonetheless necessary,
           please consult the developers with this error.

If this occurred at the middle or end of a run, it might be that

        -> YOUR DATA OR PARAMETERS WERE UNEXPECTED: execution on GPUs is
           subject to many restrictions, and relion is written to work within
           common restraints. If you have exotic data or settings, unexpected
           configurations may occur. See also above point regarding
           double precision.

If none of the above applies, please report the error to the relion developers at    github.com/3dem/relion/issues

=== Backtrace  ===

/opt/local/relion/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x7d) [0x55cc98c99f1d]

/opt/local/relion/bin/relion_refine_mpi(+0x2d9256) [0x55cc98ec3256]

/opt/local/relion/bin/relion_refine_mpi(_ZN14MlDeviceBundle24setupTunableSizedObjectsEm+0x664) [0x55cc98ec5134]

/opt/local/relion/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x1a49) [0x55cc98cbd919]

/opt/local/relion/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xfb) [0x55cc98ccaa5b]

/opt/local/relion/bin/relion_refine_mpi(main+0x79) [0x55cc98c88479]

/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f9c3f07a083]

/opt/local/relion/bin/relion_refine_mpi(_start+0x2e) [0x55cc98c8bcee]

biochem-fan commented 1 year ago

M60 should be OK but we have never tried it.

https://developer.nvidia.com/cuda-gpus says M60's compute capability is 5.2. Please try CUDA_ARCH=52.
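
A minimal sketch of that suggestion, assuming a standard out-of-source CMake build of RELION: the value passed to -DCUDA_ARCH is the compute capability with the dot removed. The helper name cc_to_arch is hypothetical, and the compute_cap query field is only available in reasonably recent NVIDIA drivers.

```shell
#!/bin/sh
# Hypothetical helper: turn a compute capability such as "5.2" into the
# value expected by RELION's CMake option (-DCUDA_ARCH=52).
cc_to_arch() {
    printf '%s\n' "$1" | tr -d '.'
}

# Compute capability of the Tesla M60 as quoted above from NVIDIA's list.
cc="5.2"

# If nvidia-smi is available, show what the driver reports for each GPU.
# Assumption: the driver is recent enough to support the compute_cap field.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,compute_cap --format=csv || true
fi

# Print the configure command to run from an empty build directory.
echo "cmake -DCUDA_ARCH=$(cc_to_arch "$cc") .."
```

The build directory should be emptied (or the CMake cache removed) before reconfiguring, so that the new architecture setting is actually picked up.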

biochem-fan commented 1 year ago

If this is a real compatibility issue, the program should not run any iterations.

However, your report says the failure came "after a few iterations" with "ERROR: out of memory". These messages suggest other problem(s).

How many MPI processes do you use per GPU? Try only one (or two). Make sure other programs do not share the GPU.
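
To make the "one per GPU" advice concrete, here is a rough sketch. It assumes the usual relion_refine_mpi layout in which the first MPI rank acts as the leader and does no GPU work, so the number of ranks to request is the number of workers plus one. The helper name mpi_ranks and the GPU count of 2 are illustrative assumptions, not details from this thread.

```shell
#!/bin/sh
# Hypothetical helper: total MPI ranks for NGPU GPUs with PER_GPU worker
# ranks on each GPU, plus one leader rank that does no GPU work.
mpi_ranks() {
    echo $(( $1 * $2 + 1 ))
}

ngpu=2       # assumed GPU count, adjust to the instance
per_gpu=1    # one worker rank per GPU, as suggested above

echo "MPI ranks to request: $(mpi_ranks "$ngpu" "$per_gpu")"

# List processes currently holding GPU memory, to confirm that no other
# program shares the GPUs before launching the job.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv || true
fi
```

Several worker ranks sharing one GPU each allocate their own device memory, which is one plausible route to an out-of-memory error partway through a run.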

alexlicorteva commented 1 year ago

Dear Relion Support, Thank you very much for the info! We used 5 MPIs with 6 threads each. We will try only one per GPU as you suggested. Alex

alexlicorteva commented 1 year ago

Dear Relion Support,

Refine3D ran fine without specifying GPUs, using 3 MPI processes with 6 threads each. The full Refine3D command is now:

`which relion_refine_mpi` --o Refine3D/job046/run --auto_refine --split_random_halves --i Extract/job037/particles.star --ref Class3D/job030/run_it025_class002_box256.mrc --firstiter_cc --ini_high 50 --dont_combine_weights_via_disc --pool 3 --pad 2 --auto_ignore_angles --auto_resol_angles --ctf --particle_diameter 200 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym D2 --low_resol_join_halves 40 --norm --scale --j 6 --gpu "" --pipeline_control Refine3D/job046/

Thank you very much again for your help!

Regards,

Alex
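
The command above leaves --gpu empty, which lets RELION assign devices itself. If an explicit assignment is ever needed, the --gpu argument accepts device IDs separated by colons, one entry per worker rank. A small sketch, in which the helper name gpu_arg and the GPU count are assumptions for illustration and the remaining job options are elided:

```shell
#!/bin/sh
# Hypothetical helper: build a --gpu string that gives each worker rank
# its own device, e.g. 2 GPUs -> "0:1".
gpu_arg() {
    n=$1
    out=0
    i=1
    while [ "$i" -lt "$n" ]; do
        out="$out:$i"
        i=$((i + 1))
    done
    echo "$out"
}

ngpu=2   # assumed GPU count, adjust to the instance

# One leader rank plus one worker rank per GPU; other options are elided.
echo "mpirun -n $((ngpu + 1)) relion_refine_mpi ... --j 6 --gpu \"$(gpu_arg "$ngpu")\""
```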

biochem-fan commented 1 year ago

See also a recent discussion at CCPEM: https://www.jiscmail.ac.uk/cgi-bin/wa-jisc.exe?A2=ind2304&L=CCPEM&O=D&P=63873