Problem Description
We installed Slurm 21.08.8-2 and OpenMPI 4.1.6 on an AMD MI250X GPU system.
With OpenMPI, we ran "mpirun_rochpl -P 2 -Q 4 -N 256000 --NB 512" on 8 GCDs, and it passed.
With Slurm, we ran "srun -N 1 -n 8 run_rochpl -P 2 -Q 4 -p 2 -q 4 -N 128000 --NB 512", and it failed with the following errors:
libgomp: Number of places reduced from 61 to 1 because some places didn't contain any usable logical CPUs
libgomp: Number of places reduced from 61 to 1 because some places didn't contain any usable logical CPUs
libgomp: Logical CPU number 96 out of range
libgomp: Invalid value for environment variable OMP_PLACES
libgomp: Number of places reduced from 61 to 1 because some places didn't contain any usable logical CPUs
libgomp: Logical CPU number 112 out of range
libgomp: Invalid value for environment variable OMP_PLACES
libgomp: Number of places reduced from 61 to 1 because some places didn't contain any usable logical CPUs
libgomp: Logical CPU number 64 out of range
libgomp: Invalid value for environment variable OMP_PLACES
libgomp: Logical CPU number 80 out of range
===============================
Operating System
Ubuntu 22.04.4 LTS (Jammy Jellyfish)
CPU
AMD EPYC 7713 64-Core Processor X 2
GPU
AMD Instinct MI250X, AMD Instinct MI250
ROCm Version
ROCm 6.2.0
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
Further debugging showed that the OMP_PLACES environment variable differed between the OpenMPI and Slurm runs.
For OpenMPI, its value was
OMP_PLACES = '{32},{33},{34},{35},{36},{37},{38},{39},{40},{41},{42},{43},{44},{45},{46},{47},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{17},{18},{19},{20},{21},{22},{23},{24},{25},{26},{27},{28},{29},{30},{31},{49},{50},{51},{52},{53},{54},{55},{56},{57},{58},{59},{60},{61},{62},{63}'
For Slurm, its value was
OMP_PLACES = '{3}'
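To compare the two launchers side by side, a small script along the following lines could be run under both mpirun and srun. It is only a sketch: it assumes OMP_PLACES is an explicit list of braced CPU numbers like the values above (it does not handle interval syntax or abstract names such as "cores"), and it assumes that a place counts as usable when at least one of its CPUs is in the affinity mask the launcher gave the process. The function names and the script itself are hypothetical, not part of rocHPL.

```python
import os
import re


def parse_places(value):
    """Parse an explicit OMP_PLACES list such as '{32},{33},{1,2}'
    into a list of CPU-number sets, one set per place."""
    places = []
    for body in re.findall(r"\{([^}]*)\}", value):
        cpus = {int(tok) for tok in body.split(",") if tok.strip()}
        places.append(cpus)
    return places


def usable_places(places, allowed):
    """Keep only the places that share at least one CPU with the
    allowed set, trimmed to the CPUs that are actually allowed."""
    return [p & allowed for p in places if p & allowed]


def report():
    """Print the affinity mask and OMP_PLACES summary for this process."""
    value = os.environ.get("OMP_PLACES", "")
    allowed = os.sched_getaffinity(0)  # Linux only: CPUs this process may run on
    places = parse_places(value)
    usable = usable_places(places, allowed)
    # Rank variables set by Slurm and OpenMPI respectively; "?" if neither is set.
    rank = (os.environ.get("SLURM_PROCID")
            or os.environ.get("OMPI_COMM_WORLD_RANK")
            or "?")
    print(f"rank {rank}: allowed CPUs = {sorted(allowed)}")
    print(f"rank {rank}: OMP_PLACES has {len(places)} places, "
          f"{len(usable)} overlap the allowed CPUs")


if __name__ == "__main__":
    report()
```

Saved under a hypothetical name such as check_places.py, it could be launched with "srun -N 1 -n 8 python3 check_places.py" to show the affinity mask each Slurm task receives. Note that OMP_PLACES appears to be set after launch rather than by the launcher, so the script will only see it if it is exported in the same way as for the failing run. Comparing the masks with the OpenMPI run may show whether Slurm's CPU binding is what leaves most places without usable CPUs.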