ROCm / rocHPL

High Performance Linpack for Next-Generation AMD HPC Accelerators

[Issue]: Srun failed at test #14

Open billcsm opened 2 weeks ago

Problem Description

We installed Slurm 21.08.8-2 and OpenMPI 4.1.6 on an AMD MI250X GPU system. With MPI, we ran "mpirun_rochpl -P 2 -Q 4 -N 256000 --NB 512" on 8 GCDs, and it passed. With Slurm, we ran "srun -N 1 -n 8 run_rochpl -P 2 -Q 4 -p 2 -q 4 -N 128000 --NB 512", and it failed with the following output:

libgomp: Number of places reduced from 61 to 1 because some places didn't contain any usable logical CPUs
libgomp: Number of places reduced from 61 to 1 because some places didn't contain any usable logical CPUs
libgomp: Logical CPU number 96 out of range
libgomp: Invalid value for environment variable OMP_PLACES
libgomp: Number of places reduced from 61 to 1 because some places didn't contain any usable logical CPUs
libgomp: Logical CPU number 112 out of range
libgomp: Invalid value for environment variable OMP_PLACES
libgomp: Number of places reduced from 61 to 1 because some places didn't contain any usable logical CPUs
libgomp: Logical CPU number 64 out of range
libgomp: Invalid value for environment variable OMP_PLACES
libgomp: Logical CPU number 80 out of range
===============================

Operating System

Ubuntu 22.04.4 LTS (Jammy Jellyfish)

CPU

AMD EPYC 7713 64-Core Processor X 2

GPU

AMD Instinct MI250X, AMD Instinct MI250

ROCm Version

ROCm 6.2.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

Further debugging showed that the environment variable OMP_PLACES differed between the OpenMPI and Slurm runs. For OpenMPI, its value was OMP_PLACES = '{32},{33},{34},{35},{36},{37},{38},{39},{40},{41},{42},{43},{44},{45},{46},{47},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{17},{18},{19},{20},{21},{22},{23},{24},{25},{26},{27},{28},{29},{30},{31},{49},{50},{51},{52},{53},{54},{55},{56},{57},{58},{59},{60},{61},{62},{63}'

For Slurm, its value was OMP_PLACES = '{3}'
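
To help narrow this down, here is a minimal Python sketch that compares the places listed in OMP_PLACES with the CPUs the current process is actually allowed to run on. It rests on an assumption that this issue has not confirmed: that the "places reduced" messages appear because each Slurm task is confined to a small CPU set. The helper names parse_places and usable_places are hypothetical, and the parser only handles explicit lists such as '{1},{2,3}', not the interval or abstract-name forms of OMP_PLACES.

```python
import os
import re


def parse_places(omp_places):
    """Parse a simple OMP_PLACES string like '{0},{1,2}' into a list of CPU sets.

    Only explicit CPU lists are handled; interval/stride syntax and
    abstract names (threads, cores, sockets) are out of scope for this sketch.
    """
    places = []
    for body in re.findall(r"\{([^{}]*)\}", omp_places):
        places.append({int(tok) for tok in body.split(",") if tok.strip()})
    return places


def usable_places(places, allowed_cpus):
    """Keep only the part of each place that overlaps the allowed CPU set."""
    return [p & allowed_cpus for p in places if p & allowed_cpus]


if __name__ == "__main__":
    # CPUs this process may run on (Linux); fall back to all CPUs elsewhere.
    if hasattr(os, "sched_getaffinity"):
        allowed = set(os.sched_getaffinity(0))
    else:
        allowed = set(range(os.cpu_count() or 1))

    # Rank variables set by Slurm and Open MPI respectively, if present.
    rank = os.environ.get("SLURM_PROCID",
                          os.environ.get("OMPI_COMM_WORLD_RANK", "?"))

    places = parse_places(os.environ.get("OMP_PLACES", ""))
    kept = usable_places(places, allowed)
    print(f"rank {rank}: allowed CPUs {sorted(allowed)}")
    print(f"rank {rank}: {len(kept)} of {len(places)} places usable")
```

The script reads OMP_PLACES from its own environment, so if the variable is only set inside a wrapper script, it would need to be exported or passed in by hand before launching the check under srun or mpirun. If the allowed CPU set printed for a Slurm task is much smaller than under mpirun, that would be consistent with the difference in OMP_PLACES seen above.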