3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

Relion 5.0 + AMD MI210, AMD Milan, SLURM, OpenMPI cannot allocate on GPU #1036

Open airus-pty-ltd opened 8 months ago

airus-pty-ltd commented 8 months ago

**Describe your problem**

We are seeing errors after the initial noise-spectra estimation phase of a Class2D run. We compiled RELION 5.0 by hand, following the instructions on your web page. The issue seems to crop up as soon as the allocator attempts to start running on the GPU.

Environment:

Dataset:

Job options:

#!/bin/bash -l

#SBATCH --tasks-per-node=2
#SBATCH --partition=gpu_rocm
#SBATCH --cpus-per-task=48
#SBATCH --account=a_rcc
#SBATCH --time=20:00:00
#SBATCH --mem=256G
#SBATCH --gres=gpu:mi210:1
#SBATCH --error=run.err
#SBATCH --output=run.out

module load foss/2023a
module load anaconda3/2022.05
module load rocm/5.7.1
module load cmake

export LD_LIBRARY_PATH=/opt/rocm/llvm/lib:$LD_LIBRARY_PATH
export PATH=/opt/rocm/:$PATH
export ROCM_PATH=/opt/rocm/

srun -n 2 /scratch/user/user/local_software_installs/relion_5.0_gfx90a_20231121b/bin/relion_refine_mpi --gpu 0  --o /scratch/user/user/benchmark/userdata/Class2D/job001/run --iter 25 --i Extract/job132/particles.star --dont_combine_weights_via_disc --pool 200 --pad 2  --ctf  --tau2_fudge 2 --particle_diameter 300 --K 100 --flatten_solvent  --zero_mask  --center_classes  --oversampling 1 --psi_step 12 --offset_range 5 --offset_step 2 --norm --scale  --j 4 --pipeline_control /scratch/user/user/benchmark/userdata/Class2D/job001/

Error message:

[user@supercomputer userdata]$ cat run.err 
srun: ROUTE: split_hostlist: hl=bun001 tree_width 0
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
[bun001:3003930] *** Process received signal ***
[bun001:3003930] Signal: Aborted (6)
[bun001:3003930] Signal code:  (-6)
[bun001:3003930] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7fc3acf2bcf0]
[bun001:3003930] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7fc3acba2acf]
[bun001:3003930] [ 2] /lib64/libc.so.6(abort+0x127)[0x7fc3acb75ea5]
[bun001:3003930] [ 3] /sw/auto/rocky8c/epyc3_mi210/software/GCCcore/12.3.0/lib64/libstdc++.so.6(+0xa9a69)[0x7fc3ad657a69]
[bun001:3003930] [ 4] /sw/auto/rocky8c/epyc3_mi210/software/GCCcore/12.3.0/lib64/libstdc++.so.6(+0xb50da)[0x7fc3ad6630da]
[bun001:3003930] [ 5] /sw/auto/rocky8c/epyc3_mi210/software/GCCcore/12.3.0/lib64/libstdc++.so.6(+0xb5145)[0x7fc3ad663145]
[bun001:3003930] [ 6] /sw/auto/rocky8c/epyc3_mi210/software/GCCcore/12.3.0/lib64/libstdc++.so.6(+0xb5398)[0x7fc3ad663398]
[bun001:3003930] [ 7] /sw/auto/rocky8c/epyc3_mi210/software/GCCcore/12.3.0/lib64/libstdc++.so.6(_ZSt28__throw_bad_array_new_lengthv+0x0)[0x7fc3ad65a13b]
[bun001:3003930] [ 8] /opt/rocm-5.7.1/lib/librocfft.so.0(+0x1ee246)[0x7fc39fe36246]
[bun001:3003930] [ 9] /opt/rocm-5.7.1/lib/librocfft.so.0(+0x1ea3cd)[0x7fc39fe323cd]
[bun001:3003930] [10] /opt/rocm-5.7.1/lib/librocfft.so.0(rocfft_setup+0x25)[0x7fc39fdc14f5]
[bun001:3003930] [11] /opt/rocm-5.7.1/lib/libhipfft.so.0(+0x5597)[0x7fc3afa6b597]
[bun001:3003930] [12] /opt/rocm-5.7.1/lib/libhipfft.so.0(+0x7cda)[0x7fc3afa6dcda]
[bun001:3003930] [13] /opt/rocm-5.7.1/lib/libhipfft.so.0(hipfftPlanMany+0x18a)[0x7fc3afa6a9da]
[bun001:3003930] [14] /opt/rocm-5.7.1/lib/libhipfft.so.0(hipfftEstimateMany+0x7a)[0x7fc3afa6c1ba]
[bun001:3003930] [15] /scratch/user/user/local_software_installs/relion_5.0_gfx90a_20231121b/bin/relion_refine_mpi(_ZN9Projector26computeFourierTransformMapER13MultidimArrayIdES2_iibbiPKS1_b+0x41b)[0x5b644b]
[bun001:3003930] [16] /scratch/user/user/local_software_installs/relion_5.0_gfx90a_20231121b/bin/relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0xec)[0x4f5cbc]
[bun001:3003930] [17] /scratch/user/user/local_software_installs/relion_5.0_gfx90a_20231121b/bin/relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5d)[0x4888dd]
[bun001:3003930] [18] /scratch/user/user/local_software_installs/relion_5.0_gfx90a_20231121b/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x159)[0x413849]
[bun001:3003930] [19] /scratch/user/user/local_software_installs/relion_5.0_gfx90a_20231121b/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x336)[0x423776]
[bun001:3003930] [20] /scratch/user/user/local_software_installs/relion_5.0_gfx90a_20231121b/bin/relion_refine_mpi(main+0x55)[0x3a4135]
[bun001:3003930] [21] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7fc3acb8ed85]
[bun001:3003930] [22] /scratch/user/user/local_software_installs/relion_5.0_gfx90a_20231121b/bin/relion_refine_mpi(_start+0x2e)[0x3a403e]
[bun001:3003930] *** End of error message ***
srun: error: bun001: task 1: Aborted (core dumped)
slurmstepd: error:  mpi/pmix_v3: _errhandler: bun001 [0]: pmixp_client_v2.c:212: Error handler invoked: status = -25, source = [slurm.pmix.6550605.0:1]
slurmstepd: error: *** STEP 6550605.0 ON bun001 CANCELLED AT 2023-11-21T11:35:11 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: bun001: task 0: Killed
scheres commented 7 months ago

Does the same happen without MPI?

airus-pty-ltd commented 7 months ago

Unfortunately, yes. When running relion_refine directly, with no srun, MPI ranks, -np and so forth, we still end up with...

" Running CPU instructions in double precision. 
 + WARNING: Changing psi sampling rate (before oversampling) to 11.25 degrees, for more efficient GPU calculations
 Estimating initial noise spectra from at most 1000 particles 
   0/   0 sec ............................................................~~(,_,">
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)
"

...and this happens right at the point where it tries to allocate on the MI210 GPU HBM memory.
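For reference, the serial run was essentially the same command line as in the batch script, just without srun and using the non-MPI binary, along these lines (path shortened):

# single process, no srun / MPI ranks
.../relion_5.0_gfx90a_20231121b/bin/relion_refine --gpu 0 \
    --o Class2D/job001/run --iter 25 --i Extract/job132/particles.star \
    --dont_combine_weights_via_disc --pool 200 --pad 2 --ctf \
    --tau2_fudge 2 --particle_diameter 300 --K 100 --flatten_solvent \
    --zero_mask --center_classes --oversampling 1 --psi_step 12 \
    --offset_range 5 --offset_step 2 --norm --scale --j 4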

scheres commented 7 months ago

OK, then the problem has nothing to do with OpenMPI. Perhaps @suyashtn can comment?

suyashtn commented 7 months ago

Hi @airus-pty-ltd , it looks like your dataset is ~300GB, which is quite large to fit on a single MI210. Also, your batch script requests 256GB, so my best guess is that you are running out of memory; hence the std::bad_alloc reporting a failure to allocate storage. It would be better to use multiple GPUs for this dataset.
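For example, a rough sketch only (the exact --gres syntax and device IDs depend on your cluster; rank 0 acts as the leader and does not use a GPU, so three ranks give one follower per device):

#SBATCH --gres=gpu:mi210:2
#SBATCH --tasks-per-node=3

# colon-separated device IDs map the follower ranks to GPUs 0 and 1
srun -n 3 relion_refine_mpi --gpu 0:1 <other options as before>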

airus-pty-ltd commented 7 months ago

I tried allocating 480GB of system memory and using 2 * MI210 GPUs (each with 64GB of HBM2 memory), and still got...

 Running CPU instructions in double precision. 
 + WARNING: Changing psi sampling rate (before oversampling) to 11.25 degrees, for more efficient GPU calculations
 Estimating initial noise spectra from at most 1000 particles 
   0/   0 sec ............................................................~~(,_,">
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

Further, at the point of the core dump and allocation failure, I was monitoring system memory consumption:

[root@node ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:          376Gi        21Gi       322Gi        48Mi        33Gi       353Gi
Swap:         4.0Gi          0B       4.0Gi

Based on this, I've convinced myself that this is not about system-memory malloc().
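(For completeness, GPU-side HBM usage could be watched the same way; assuming rocm-smi is available on the compute node, something like

# refresh GPU VRAM usage every 2 seconds
watch -n 2 rocm-smi --showmeminfo vram

would show whether the card itself is anywhere near full when it crashes.)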

I do wonder, though: I can run this dataset fine, with the same parameters, on an NVIDIA A100, L40 or H100, and the L40 only has 48GB of memory. If I spread it across three 48GB cards, it runs fine. If I try to spread it across two 64GB AMD cards, it crashes out.

What am I missing? The sheer amount of memory doesn't seem to be the issue here, unless 3 x 48GB = 144GB is somehow drastically different from 2 x 64GB = 128GB, and neither relates to the 300GB dataset in any case.

Ideas?

biochem-fan commented 7 months ago

@suyashtn

it looks like your dataset is ~300GB, which is quite large to fit on a single MI210.

Because we don't load all particles into a GPU simultaneously, the total number of particles should not affect GPU memory requirement. What matters is the box size.
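As a rough, purely illustrative calculation (the box size here is an assumption, not taken from this dataset): with a ~400-px box padded by 2 and references stored as single-precision complex values, even 100 classes amount to well under 1 GB on the GPU, nowhere near 64GB of HBM:

# hypothetical 400-px box, padded x2 -> 800x800 complex floats (8 bytes each),
# times K=100 class references
echo $(( 800 * 800 * 8 * 100 / 1024 / 1024 )) MB    # prints: 488 MB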

airus-pty-ltd commented 7 months ago

@suyashtn

it looks like your dataset is ~300GB, which is quite large to fit on a single MI210.

Because we don't load all particles into a GPU simultaneously, the total number of particles should not affect GPU memory requirement. What matters is the box size.

Yes, and given that I've allocated almost 500GB of system memory (DRAM) to this problem on a 512GB node, I don't think system memory is the problem; our issue may be elsewhere. I hope we can solve it. We really want to make RELION 5.0 work well on AMD! There are just too many issues with relying on NVIDIA alone for this to be sustainable for scientific communities in this day and age...

biochem-fan commented 7 months ago

Does this happen on a tiny basic dataset, such as our beta-galactosidase tutorial dataset or ribosome Class3D benchmark? The code is often tested against these datasets.

airus-pty-ltd commented 7 months ago

Does this happen on a tiny basic dataset, such as our beta-galactosidase tutorial dataset or ribosome Class3D benchmark? The code is often tested against these datasets.

Happy to try! Do you have a link to the dataset and some parameters to run with?

biochem-fan commented 7 months ago

https://www3.mrc-lmb.cam.ac.uk/relion//index.php?title=Benchmarks_%26_computer_hardware#Standard_benchmarks
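The download and the authoritative command lines are on that page; the ribosome Class3D benchmark is run roughly along these lines (reproduced from memory, so please take the exact flags from the page; adjust -n, --j and --gpu to your node):

mpirun -n 3 relion_refine_mpi \
    --i Particles/shiny_2sets.star --ref emd_2660.map:mrc --firstiter_cc \
    --ini_high 60 --ctf --ctf_corrected_ref --iter 25 --tau2_fudge 4 \
    --particle_diameter 360 --K 6 --flatten_solvent --zero_mask \
    --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 \
    --sym C1 --norm --scale --random_seed 0 \
    --dont_combine_weights_via_disc --pool 100 --j 4 --gpu 0:1 --o class3d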