Closed alexlicorteva closed 1 year ago
M60 should be OK but we have never tried it.
https://developer.nvidia.com/cuda-gpus says M60's compute capability is 5.2.
Please try CUDA_ARCH=52.
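For readers unsure how the compute capability listed by NVIDIA relates to the value passed at build time, here is a minimal sketch of the conversion. The function name is hypothetical, and the printed `cmake` line assumes you configure from a build directory one level below the source tree; only the `-DCUDA_ARCH` option itself comes from this thread.

```python
def cuda_arch_from_capability(capability: str) -> str:
    """Turn a compute capability such as '5.2' into a CUDA_ARCH value such as '52'.

    Hypothetical helper for illustration; it only drops the dot between
    the major and minor version numbers.
    """
    major, minor = capability.strip().split(".")
    return f"{int(major)}{int(minor)}"


if __name__ == "__main__":
    # The M60 is listed with compute capability 5.2.
    arch = cuda_arch_from_capability("5.2")
    # Assumption: cmake is run from a build directory inside the source tree.
    print(f"cmake -DCUDA_ARCH={arch} ..")
```

The same rule explains the values tried in the original report: 5.0 becomes 50 and 3.5 becomes 35.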
If this is a real compatibility issue, the program should not run any iterations.
after a few iterations ERROR: out of memory
These messages suggest other problem(s).
How many MPI processes do you use per GPU? Try only one (or two). Make sure other programs do not share the GPU.
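To make the "one per GPU" advice concrete, here is a rough sketch of the arithmetic. It assumes, as is my understanding of relion_refine_mpi, that rank 0 only coordinates and the remaining ranks do the GPU work. The function names are hypothetical, and the GPU count of 2 in the example is only an illustration, since the number of GPUs on the instance is not stated in this thread.

```python
def worker_ranks(n_mpi: int) -> int:
    """Number of MPI ranks that do the computation.

    Assumption: rank 0 only coordinates, so it is not counted as a worker.
    """
    if n_mpi < 2:
        raise ValueError("an MPI run needs at least 2 processes")
    return n_mpi - 1


def max_workers_per_gpu(n_mpi: int, n_gpus: int) -> int:
    """Largest number of worker ranks that end up sharing a single GPU."""
    if n_gpus < 1:
        raise ValueError("need at least one GPU")
    workers = worker_ranks(n_mpi)
    return -(-workers // n_gpus)  # ceiling division


def mpi_for_one_worker_per_gpu(n_gpus: int) -> int:
    """MPI process count that gives exactly one worker rank per GPU."""
    return n_gpus + 1


if __name__ == "__main__":
    n_gpus = 2  # hypothetical GPU count, for illustration only
    print(max_workers_per_gpu(5, n_gpus))      # workers sharing a GPU with 5 MPI processes
    print(mpi_for_one_worker_per_gpu(n_gpus))  # MPI processes for one worker per GPU
```

Each worker rank allocates its own GPU memory, so several workers on one card can exhaust memory even when a single worker would fit.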
Dear Relion Support, Thank you very much for the info! We used 5 MPIs with 6 threads each. We will try only one per GPU as you suggested. Alex
From: biochem_fan; Sent: Wednesday, April 26, 2023 4:28 PM; To: 3dem/relion; Cc: Li, Alex; Author; Subject: [EXTERNAL] Re: [3dem/relion] A GPU-function failed to execute (Issue #968)
Dear Relion Support,
Refine3D ran fine without specifying GPUs, using 3 MPI processes with 6 threads each. The full Refine3D command is now:
`which relion_refine_mpi` --o Refine3D/job046/run --auto_refine --split_random_halves --i Extract/job037/particles.star --ref Class3D/job030/run_it025_class002_box256.mrc --firstiter_cc --ini_high 50 --dont_combine_weights_via_disc --pool 3 --pad 2 --auto_ignore_angles --auto_resol_angles --ctf --particle_diameter 200 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym D2 --low_resol_join_halves 40 --norm --scale --j 6 --gpu "" --pipeline_control Refine3D/job046/
Thank you very much again for your help!
Regards,
Alex
See also a recent discussion at CCPEM: https://www.jiscmail.ac.uk/cgi-bin/wa-jisc.exe?A2=ind2304&L=CCPEM&O=D&P=63873
Recently I tried to compile Relion v4.0.1 with GPU support on an AWS EC2 instance (Ubuntu v20.04, NVIDIA Tesla M60 GPUs, CUDA v11.7). There were no errors during compilation, but at runtime, after a few iterations of Refine3D on the relion30_tutorial data, Refine3D died with the error "A GPU-function failed to execute" (for details, see the bottom). Recompiling with -DCUDA_ARCH=50 and -DCUDA_ARCH=35 did not help. Could you tell me the OS type, OS version, GPU model, and CUDA version with which you can build Relion and run Refine3D successfully with GPU enabled, so that I can try the same software and hardware?
Environment:
Dataset: relion30_tutorial data
Job options: (see note.txt in the job directory)
Error message:
ERROR: out of memory in /home/ccpem/src/relion/src/acc/cuda/custom_allocator.cuh at line 435 (error-code 2)
in: /home/ccpem/src/relion/src/acc/cuda/cuda_settings.h, line 65
ERROR:
A GPU-function failed to execute.
If this occured at the start of a run, you might have GPUs which are incompatible with either the data or your installation of relion. If you
-> INSTALLED RELION YOURSELF: if you e.g. specified -DCUDA_ARCH=50 and are trying ot run on a compute 3.5 GPU (-DCUDA_ARCH=3.5), this may happen.
-> HAVE MULTIPLE GPUS OF DIFFERNT VERSIONS: relion needs GPUS with at least compute 3.5. You may be trying to use a GPU older than this. If you have multiple generations, try specifying --gpu <X>
with X=0. Then try X=1 in a new run, and so on. The numbering of
GPUs may not be obvious from the driver or intuition. For a list
of GPU compute generations, see
en.wikipedia.org/wiki/CUDA#Version_features_and_specifications
-> ARE USING DOUBLE-PRECISION GPU CODE: relion was been written so as to not require this, and may thus have unforeseen requirements when run in this mode. If you think it is nonetheless necessary, please consult the developers with this error.
If this occurred at the middle or end of a run, it might be that
-> YOUR DATA OR PARAMETERS WERE UNEXPECTED: execution on GPUs is subject to many restrictions, and relion is written to work within common restraints. If you have exotic data or settings, unexpected configurations may occur. See also above point regarding double precision.
If none of the above applies, please report the error to the relion developers at github.com/3dem/relion/issues
=== Backtrace ===
/opt/local/relion/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x7d) [0x55cc98c99f1d]
/opt/local/relion/bin/relion_refine_mpi(+0x2d9256) [0x55cc98ec3256]
/opt/local/relion/bin/relion_refine_mpi(_ZN14MlDeviceBundle24setupTunableSizedObjectsEm+0x664) [0x55cc98ec5134]
/opt/local/relion/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x1a49) [0x55cc98cbd919]
/opt/local/relion/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xfb) [0x55cc98ccaa5b]
/opt/local/relion/bin/relion_refine_mpi(main+0x79) [0x55cc98c88479]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f9c3f07a083]
/opt/local/relion/bin/relion_refine_mpi(_start+0x2e) [0x55cc98c8bcee]