Hi, I'm trying to run gpt-neox on LUMI HPC, but sadly I'm getting errors that look like this:
GPU core dump failed
Memory access fault by GPU node-9 (Agent handle: 0x7d5f990) on address 0x14a1cfe01000. Reason: Unknown.
Memory access fault by GPU node-6 (Agent handle: 0x7d5b060) on address 0x14c2c7e01000. Reason: Unknown.
GPU core dump failed
Memory access fault by GPU node-11 (Agent handle: 0x810fd10) on address 0x152be7e01000. Reason: Unknown.
GPU core dump failed
Memory access fault by GPU node-8 (Agent handle: 0x7d5c290) on address 0x15098be01000. Reason: Unknown.
Memory access fault by GPU node-4 (Agent handle: 0x7d581a0) on address 0x153d9fe01000. Reason: Unknown.
Memory access fault by GPU node-7 (Agent handle: 0x7d5c100) on address 0x153e07e01000. Reason: Unknown.
I think the error is occurring during the training step.
Mainly I have two questions:
1) Can you give a pointer to a GitHub repo (if it's public) that managed to launch gpt-neox on LUMI?
2) Is the following the right process for launching on LUMI? (LUMI uses Slurm and requires Singularity containers.)
Modify the DeepSpeed multinode runner to launch the train.py/eval.py/generate.py script in a Singularity container.
Set "launcher": "slurm" and "deepspeed_slurm": true in the configuration yaml file (sketched below).
Do sbatch on a script that contains deepy.py train.py config.yml.
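The config part of that process would then look something like this (a minimal sketch; every other model, optimizer and data setting is omitted):

```yaml
# Minimal sketch of the launcher-related keys in the gpt-neox yaml;
# all other settings are omitted.
"launcher": "slurm"
"deepspeed_slurm": true
```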
Previously I had some success launching Megatron-DeepSpeed training on LUMI, but there the slurm task launching was under the user's control, so I suspect I may be launching gpt-neox incorrectly.
My current approach to launching gpt-neox is:
I have a conda environment activated on the LUMI login node with these packages:
I perform an sbatch on this script:
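Roughly, the script looks like the sketch below; the account name, environment name, node count and time limit are placeholders rather than my real values:

```bash
#!/bin/bash
#SBATCH --job-name=neox-train
#SBATCH --account=project_XXXXXXX   # placeholder LUMI project id
#SBATCH --partition=standard-g      # LUMI-G GPU partition
#SBATCH --nodes=2                   # placeholder node count
#SBATCH --gpus-per-node=8           # 8 GCDs per LUMI-G node
#SBATCH --time=01:00:00

# Activate the conda environment described above.
source activate neox-env            # placeholder env name

# deepy.py parses the configs; with "launcher": "slurm" and
# "deepspeed_slurm": true it hands the per-node launch to
# DeepSpeed's SlurmRunner, which builds an srun command.
python deepy.py train.py meg_conf.yml ds_conf.yml
```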
I also modified DeepSpeed's SlurmRunner in DeepSpeed/deepspeed/launcher/multinode_runner.py to run train.py in a Singularity container with the same packages as listed previously.
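This is not the exact diff (SlurmRunner's internals vary across DeepSpeed versions, and the container path here is a placeholder), but the idea is:

```python
# Schematic change to SlurmRunner.get_cmd in
# DeepSpeed/deepspeed/launcher/multinode_runner.py; the real method
# builds a fuller srun command than shown here.

def get_cmd(self, environment, active_resources):
    # Simplified: DeepSpeed derives the process count from the
    # resource pool and adds more srun flags than this.
    srun_cmd = ['srun', '-n', str(self.total_process_count)]

    # Original behaviour: run the user script with the host python.
    #   python_exec = [sys.executable, '-u']

    # My change: run it inside the Singularity container that holds
    # the same packages as the conda environment.
    container = '/scratch/project_XXXXXXX/neox.sif'  # placeholder path
    python_exec = ['singularity', 'exec', container, 'python', '-u']

    return srun_cmd + python_exec + [self.user_script] + self.user_arguments
```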
I set "launcher": "slurm" and "deepspeed_slurm": true in meg_conf.yml.
I've attached meg_conf.yml, ds_conf.yml and the full output.
Any help would be appreciated.
Thanks!
Ingus

Attachments: output.txt, meg_conf.yml.txt, ds_conf.yml.txt