Closed by cazumaya 2 years ago
This week?
rebuild RELION 3.1.2 with -DCUDA_ARCH=61,
Can you please test with grabnode? 61 (compute capability 6.1) is specific to the J nodes. The K nodes are compute capability 7.5, which would be 75 in the same notation. Please let me know the node name when testing.
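The comment above mixes two notations for the same thing: the dotted compute capability (7.5) and the two-digit value passed to RELION's CUDA_ARCH CMake option (61). A minimal sketch of the conversion follows. The function name cc_to_arch is made up for illustration, and the node-to-capability mapping is taken only from what is stated in this thread.

```shell
#!/bin/sh
# Convert a dotted compute capability such as "7.5" into the two-digit
# form used with -DCUDA_ARCH (e.g. 75).
# cc_to_arch is a hypothetical helper, not part of RELION or CUDA.
cc_to_arch() {
    case "$1" in
        ""|*[!0-9.]*|*.*.*|.*|*.)
            # Empty, non-numeric, more than one dot, or a leading/trailing dot.
            echo "unexpected compute capability: $1" >&2
            return 1 ;;
        *.*)
            # Exactly one dot between digits: drop the dot.
            printf '%s\n' "$1" | tr -d '.' ;;
        *)
            # No dot at all, so this is not in major.minor form.
            echo "unexpected compute capability: $1" >&2
            return 1 ;;
    esac
}

# Values quoted in this thread: J nodes are 6.1, K nodes are 7.5.
for cc in 6.1 7.5; do
    echo "compute capability $cc -> -DCUDA_ARCH=$(cc_to_arch "$cc")"
done

# On a GPU node, recent NVIDIA drivers can report the capability directly
# (older drivers may not support the compute_cap field):
#   nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
```

A build that should run on both node types would need native code for both architectures, or a driver able to JIT-compile the embedded PTX on the architecture that was left out.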
Hi John,
Thanks for working on this. The first thing that happens when I ml RELION now is (k92)
Command 'k5start' not found, but can be installed with:
apt install kstart
Please ask your administrator.
You will have to enable the component called 'universe'
If I try and run it from the grabnode GUI this error message happens (k92)
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

  /app/software/RELION/3.1.2-fosscuda-2020b/bin/relion_refine_mpi

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, Open MPI defaults to the number of processor cores

In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------
And finally, if I try to use my SLURM submission script, I think it's trying to run correctly? (k67)
Let me know if you need more info. Thanks! Caleigh
On Mon, Jul 26, 2021 at 12:37 PM John Dey @.***> wrote:
rebuild RELION 3.1.2 with -DCUDA_ARCH=61,
Can you please test with grabnode? 61 is specific to J nodes. The ARCH type of K nodes is 7.5. Please let me know the node name when testing.
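The Open MPI message in the reply above says the job asked for 2 slots and the session offered fewer. Under a resource manager, Open MPI takes its slot count from the allocation, so a quick check before launching is to compare the planned number of MPI ranks with what Slurm granted. The sketch below uses SLURM_NTASKS as a rough proxy for the slot count, which is an assumption. How grabnode sets up its allocation is not described in this thread, and enough_slots is a made-up helper name.

```shell
#!/bin/sh
# Succeed if the wanted number of MPI ranks fits in the available slots.
# $1 = ranks wanted, $2 = slots available (optional).
# When $2 is omitted, fall back to SLURM_NTASKS, and to 1 if that is unset.
# Treating SLURM_NTASKS as the slot count is an approximation.
enough_slots() {
    wanted="$1"
    available="${2:-${SLURM_NTASKS:-1}}"
    [ "$wanted" -le "$available" ]
}

ranks=2   # the failing job in this thread requested 2 slots
if enough_slots "$ranks"; then
    echo "ok: $ranks ranks fit in ${SLURM_NTASKS:-1} slot(s)"
else
    echo "too few slots: want $ranks, have ${SLURM_NTASKS:-1}"
    echo "either request more tasks from Slurm, or for a quick test only:"
    echo "  mpirun --oversubscribe -n $ranks <command>"
fi
```

The --oversubscribe option is the one named in the Open MPI message itself. It only silences the slot check and does not add CPUs or GPUs to the session.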
Never mind, with the SLURM script I eventually got this error:
ERROR: the provided PTX was compiled with an unsupported toolchain. in /app/build/RELION/3.1.2/fosscuda-2020b/relion-3.1.2/src/projector.cpp at line 204 (error-code 222)
in: /app/build/RELION/3.1.2/fosscuda-2020b/relion-3.1.2/src/acc/cuda/cuda_settings.h, line 67
ERROR:
On Mon, Jul 26, 2021 at 2:03 PM Caleigh Azumaya @.***> wrote:
[quoted text of the previous message trimmed]
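Error code 222 in the CUDA runtime is cudaErrorUnsupportedPtxVersion. It usually appears when the binary carries no native code for the GPU it runs on, so the driver falls back to JIT-compiling the embedded PTX, and that PTX comes from a newer toolkit than the driver supports. A build with only CUDA_ARCH=61 running on a 7.5 node would fit that pattern, although the thread does not confirm it. One way to check is to list the architectures embedded in the binary with cuobjdump --list-elf. The sketch below parses such a listing. The sample listing is made up for illustration and the helper names are hypothetical. In a real install the kernels may also live in a shared library rather than in the executable.

```shell
#!/bin/sh
# Print the sm_XX architectures named in a `cuobjdump --list-elf` listing
# read from stdin, one per line, without duplicates.
sass_archs() {
    sed -n 's/.*\.\(sm_[0-9][0-9]*\)\.cubin.*/\1/p' | sort -u
}

# Succeed if the listing on stdin contains native code for arch $1 (e.g. 75).
has_sass_for() {
    sass_archs | grep -qx "sm_$1"
}

# Illustrative listing for a binary built with only -DCUDA_ARCH=61.
# The cubin file names are invented; real output would come from:
#   cuobjdump --list-elf /app/software/RELION/3.1.2-fosscuda-2020b/bin/relion_refine_mpi
sample='ELF file    1: relion_refine_mpi.1.sm_61.cubin
ELF file    2: relion_refine_mpi.2.sm_61.cubin'

for arch in 61 75; do
    if printf '%s\n' "$sample" | has_sass_for "$arch"; then
        echo "sm_$arch: native code present"
    else
        echo "sm_$arch: no native code, the driver would have to JIT the PTX"
    fi
done
```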
Issue resolved via Slurm resource request for GPU
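The thread does not record the exact resource request that fixed the problem, so the following is only a sketch of what an explicit GPU request in a Slurm batch script can look like. The job name, the task and GPU counts, and the helper write_gpu_job are all assumptions. Some sites use --gres=gpu:N instead of --gpus=N, and the RELION command line is left as a placeholder.

```shell
#!/bin/sh
# Write a minimal Slurm batch script that asks for GPUs explicitly.
# $1 = output file, $2 = number of GPUs, $3 = number of MPI ranks.
# All values are illustrative; adjust them to the site's policy.
write_gpu_job() {
    cat > "$1" <<EOF
#!/bin/bash
#SBATCH --job-name=relion-gpu-test
#SBATCH --ntasks=$3
#SBATCH --gpus=$2

# Show which GPUs the job can actually see before RELION starts.
nvidia-smi -L

# Placeholder: replace with the real RELION command line.
echo "would run: mpirun -n $3 relion_refine_mpi <options>"
EOF
}

job_file="${TMPDIR:-/tmp}/relion_gpu_job.sh"
write_gpu_job "$job_file" 1 2
cat "$job_file"
```

The generated file would be submitted with sbatch. Printing the visible GPUs at the top of the job makes it easy to tell a missing GPU request apart from a build or driver problem.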
If we try to run a job using a GPU in RELION (using grabnode here), it exits with the error below. The exact same job settings run successfully on the same node if RELION/3.1.0...... is loaded instead.
Example in /fh/scratch/delete90/campbell_m/caleigh_SR/Pearl/Tip/
Fail (3.1.2): Class2D/job022
Success (3.1.0): Class2D/job023
output file (run.out)
RELION version: 3.1.2
Precision: BASE=double, CUDA-ACC=single
=== RELION MPI setup ===
Follower 1 runs on host = gizmoj27
RELION version: 3.1.2 exiting with an error ...
error file (run.err)
ERROR: unknown error in /app/build/RELION/3.1.2/fosscuda-2020b/relion-3.1.2/src/ml_optimiser_mpi.cpp at line 126 (error-code 999)
in: /app/build/RELION/3.1.2/fosscuda-2020b/relion-3.1.2/src/acc/cuda/cuda_settings.h, line 67
ERROR:
A GPU-function failed to execute.
If this occured at the start of a run, you might have GPUs which are incompatible with either the data or your installation of relion. If you
If this occurred at the middle or end of a run, it might be that
If none of the above applies, please report the error to the relion developers at github.com/3dem/relion/issues
=== Backtrace ===
/app/software/RELION/3.1.2-fosscuda-2020b/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x63) [0x4bf963]
/app/software/RELION/3.1.2-fosscuda-2020b/bin/relion_refine_mpi() [0x4d9d45]
/app/software/RELION/3.1.2-fosscuda-2020b/bin/relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x23ff) [0x4e739f]
/app/software/RELION/3.1.2-fosscuda-2020b/bin/relion_refine_mpi(main+0x4f) [0x4af1ef]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7fe0e1460b97]
/app/software/RELION/3.1.2-fosscuda-2020b/bin/relion_refine_mpi(_start+0x2a) [0x4b1f4a]
ERROR:
A GPU-function failed to execute.
If this occured at the start of a run, you might have GPUs which are incompatible with either the data or your installation of relion. If you
If this occurred at the middle or end of a run, it might be that
If none of the above applies, please report the error to the relion developers at github.com/3dem/relion/issues
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.
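The backtrace above uses mangled C++ symbol names. Demangling them shows where in RELION the failure was raised, which helps narrow down whether it happened during setup or mid-run. A small sketch using c++filt from GNU binutils follows. It assumes c++filt is on the PATH and falls back to passing the text through unchanged if it is not. The helper name demangle is made up.

```shell
#!/bin/sh
# Demangle C++ symbol names read from stdin.
# Uses c++filt when available; otherwise passes the input through as-is.
demangle() {
    if command -v c++filt >/dev/null 2>&1; then
        c++filt
    else
        cat
    fi
}

# One of the frames from the backtrace in this issue.
printf '%s\n' '_ZN14MlOptimiserMpi10initialiseEv' | demangle
```

With binutils installed this should print MlOptimiserMpi::initialise(), which places the failure in the optimiser's initialisation, consistent with the "start of a run" case described in the RELION error text.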