FredHutch / easybuild-life-sciences

Howto and implementation documentation
https://fredhutch.github.io/easybuild-life-sciences/

RELION 3.1.2 won't run GPU jobs but 3.1.0 will? #495

Closed cazumaya closed 2 years ago

cazumaya commented 3 years ago

If we try to run a job using a GPU in RELION (using grabnode here), it exits with the error below. The exact same job settings run successfully on the same node if RELION/3.1.0...... is loaded instead. Example in /fh/scratch/delete90/campbell_m/caleigh_SR/Pearl/Tip/
Fail (3.1.2): Class2D/job022
Success (3.1.0): Class2D/job023

output file (run.out)

RELION version: 3.1.2
Precision: BASE=double, CUDA-ACC=single

=== RELION MPI setup ===

error file (run.err)

ERROR: unknown error in /app/build/RELION/3.1.2/fosscuda-2020b/relion-3.1.2/src/ml_optimiser_mpi.cpp at line 126 (error-code 999)
in: /app/build/RELION/3.1.2/fosscuda-2020b/relion-3.1.2/src/acc/cuda/cuda_settings.h, line 67
ERROR:

A GPU-function failed to execute.

If this occured at the start of a run, you might have GPUs which are incompatible with either the data or your installation of relion. If you

-> INSTALLED RELION YOURSELF: if you e.g. specified -DCUDA_ARCH=50
   and are trying ot run on a compute 3.5 GPU (-DCUDA_ARCH=3.5), 
   this may happen.

-> HAVE MULTIPLE GPUS OF DIFFERNT VERSIONS: relion needs GPUS with
   at least compute 3.5. You may be trying to use a GPU older than
   this. If you have multiple generations, try specifying --gpu <X>
   with X=0. Then try X=1 in a new run, and so on. The numbering of
   GPUs may not be obvious from the driver or intuition. For a list
   of GPU compute generations, see 

   en.wikipedia.org/wiki/CUDA#Version_features_and_specifications

-> ARE USING DOUBLE-PRECISION GPU CODE: relion was been written so
   as to not require this, and may thus have unforeseen requirements
   when run in this mode. If you think it is nonetheless necessary,
   please consult the developers with this error.

If this occurred at the middle or end of a run, it might be that

-> YOUR DATA OR PARAMETERS WERE UNEXPECTED: execution on GPUs is 
   subject to many restrictions, and relion is written to work within
   common restraints. If you have exotic data or settings, unexpected
   configurations may occur. See also above point regarding 
   double precision.

If none of the above applies, please report the error to the relion developers at github.com/3dem/relion/issues

=== Backtrace ===
/app/software/RELION/3.1.2-fosscuda-2020b/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x63) [0x4bf963]
/app/software/RELION/3.1.2-fosscuda-2020b/bin/relion_refine_mpi() [0x4d9d45]
/app/software/RELION/3.1.2-fosscuda-2020b/bin/relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x23ff) [0x4e739f]
/app/software/RELION/3.1.2-fosscuda-2020b/bin/relion_refine_mpi(main+0x4f) [0x4af1ef]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7fe0e1460b97]
/app/software/RELION/3.1.2-fosscuda-2020b/bin/relion_refine_mpi(_start+0x2a) [0x4b1f4a]

ERROR:

A GPU-function failed to execute.

If this occured at the start of a run, you might have GPUs which are incompatible with either the data or your installation of relion. If you

-> INSTALLED RELION YOURSELF: if you e.g. specified -DCUDA_ARCH=50
   and are trying ot run on a compute 3.5 GPU (-DCUDA_ARCH=3.5), 
   this may happen.

-> HAVE MULTIPLE GPUS OF DIFFERNT VERSIONS: relion needs GPUS with
   at least compute 3.5. You may be trying to use a GPU older than
   this. If you have multiple generations, try specifying --gpu <X>
   with X=0. Then try X=1 in a new run, and so on. The numbering of
   GPUs may not be obvious from the driver or intuition. For a list
   of GPU compute generations, see 

   en.wikipedia.org/wiki/CUDA#Version_features_and_specifications

-> ARE USING DOUBLE-PRECISION GPU CODE: relion was been written so
   as to not require this, and may thus have unforeseen requirements
   when run in this mode. If you think it is nonetheless necessary,
   please consult the developers with this error.

If this occurred at the middle or end of a run, it might be that

-> YOUR DATA OR PARAMETERS WERE UNEXPECTED: execution on GPUs is 
   subject to many restrictions, and relion is written to work within
   common restraints. If you have exotic data or settings, unexpected
   configurations may occur. See also above point regarding 
   double precision.

If none of the above applies, please report the error to the relion developers at github.com/3dem/relion/issues


MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.

bmcgough commented 3 years ago

This week?

fizwit commented 3 years ago

I rebuilt RELION 3.1.2 with -DCUDA_ARCH=61.

Can you please test with grabnode? 61 is specific to J nodes. The ARCH type of K nodes is 7.5. Please let me know the node name when testing.
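For context on the architecture numbers: GPU machine code built for one compute capability only runs natively on GPUs with the same major version and an equal or higher minor version. Any other GPU depends on PTX embedded in the binary being JIT-compiled by the driver. A minimal Python sketch of that rule follows. The function names are hypothetical, and whether a given RELION build embeds PTX depends on how its build system passes -DCUDA_ARCH to nvcc, which this thread does not show.

```python
def parse_arch(value):
    """Turn '61', '7.5' or 'sm_75' into a (major, minor) tuple."""
    digits = (
        value.lower()
        .replace("sm_", "")
        .replace("compute_", "")
        .replace(".", "")
    )
    if not digits.isdigit() or len(digits) < 2:
        raise ValueError(f"unrecognised architecture: {value!r}")
    # The last digit is the minor version, everything before it the major.
    return int(digits[:-1]), int(digits[-1])


def runs_natively(built_for, device):
    """True if machine code built for `built_for` can run on `device` as-is."""
    built = parse_arch(built_for)
    dev = parse_arch(device)
    return built[0] == dev[0] and dev[1] >= built[1]


def can_jit_from_ptx(built_for, device):
    """True if PTX targeting `built_for` could be JIT-compiled for `device`.

    This only checks the architecture ordering; the driver must also
    support the PTX version that the toolkit emitted.
    """
    return parse_arch(device) >= parse_arch(built_for)


if __name__ == "__main__":
    for device in ("6.1", "7.5"):
        print(device, runs_natively("61", device), can_jit_from_ptx("61", device))
```

Under this rule a build with -DCUDA_ARCH=61 runs natively on a compute 6.1 GPU, while a compute 7.5 GPU can only use it through PTX JIT compilation.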

cazumaya commented 3 years ago

Hi John,

Thanks for working on this. The first thing that happens when I ml RELION now is (k92)

Command 'k5start' not found, but can be installed with:
apt install kstart
Please ask your administrator.
You will have to enable the component called 'universe'

If I try and run it from the grabnode GUI this error message happens (k92)

There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

  /app/software/RELION/3.1.2-fosscuda-2020b/bin/relion_refine_mpi

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, Open MPI defaults to the number of processor cores

In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------

And finally, if I try to use my SLURM submission script, I think it's trying to run correctly? (k67)

Let me know if you need more info. Thanks! Caleigh


cazumaya commented 3 years ago

Never mind, with the SLURM script I eventually got this error:

ERROR: the provided PTX was compiled with an unsupported toolchain. in /app/build/RELION/3.1.2/fosscuda-2020b/relion-3.1.2/src/projector.cpp at line 204 (error-code 222)
in: /app/build/RELION/3.1.2/fosscuda-2020b/relion-3.1.2/src/acc/cuda/cuda_settings.h, line 67
ERROR:
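Error code 222 is the CUDA runtime's cudaErrorUnsupportedPtxVersion. It usually means the PTX in the binary was produced by a CUDA toolkit newer than the node's driver supports. A rough Python sketch of that comparison is below. The function names are hypothetical, and the version numbers in the example are illustrative, not values read from these nodes.

```python
def version_tuple(text):
    """Convert a dotted version string such as '11.1.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_can_jit(driver_cuda, toolkit_cuda):
    """True if the driver's supported CUDA version is at least the toolkit's.

    Only major.minor is compared, since that is what determines whether
    the driver's JIT compiler understands the toolkit's PTX version.
    """
    return version_tuple(driver_cuda)[:2] >= version_tuple(toolkit_cuda)[:2]


if __name__ == "__main__":
    # Illustrative values only: a driver that supports CUDA 11.0
    # and a binary whose PTX came from a CUDA 11.1 toolkit.
    print(driver_can_jit("11.0", "11.1"))
```

The CUDA version a driver supports is shown in the header of the `nvidia-smi` output on the node in question.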


fizwit commented 2 years ago

Issue resolved via Slurm resource request for GPU
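The thread does not record the exact request that fixed the problem. As a sketch of what asking Slurm for a GPU can look like, the Python helper below writes a batch script with a `--gres=gpu:N` directive. The helper name, the GPU count, the task count and the test command are all assumptions for illustration, not the settings used here.

```python
def gpu_batch_script(command, gpus=1, ntasks=3):
    """Build the text of a Slurm batch script that requests GPUs.

    `command` is whatever should run inside the allocation; the
    defaults for `gpus` and `ntasks` are placeholders.
    """
    if gpus < 1:
        raise ValueError("request at least one GPU")
    if ntasks < 1:
        raise ValueError("request at least one task")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --gres=gpu:{gpus}",  # generic resource request for GPUs
        f"#SBATCH --ntasks={ntasks}",  # number of tasks (MPI ranks) to allocate
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # `nvidia-smi -L` lists the GPUs visible inside the job, which is a
    # quick way to confirm the allocation before launching a real run.
    print(gpu_batch_script("nvidia-smi -L"), end="")
```

Saving the output to a file and submitting it with `sbatch` would show whether the job actually sees a GPU before a full RELION run is attempted.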