Hi, I'm trying to run gpt-neox on LUMI HPC, but sadly I'm getting errors that look like this:
GPU core dump failed
Memory access fault by GPU node-9 (Agent handle: 0x7d5f990) on address 0x14a1cfe01000. Reason: Unknown.
Memory access fault by GPU node-6 (Agent handle: 0x7d5b060) on address 0x14c2c7e01000. Reason: Unknown.
GPU core dump failed
Memory access fault by GPU node-11 (Agent handle: 0x810fd10) on address 0x152be7e01000. Reason: Unknown.
GPU core dump failed
Memory access fault by GPU node-8 (Agent handle: 0x7d5c290) on address 0x15098be01000. Reason: Unknown.
Memory access fault by GPU node-4 (Agent handle: 0x7d581a0) on address 0x153d9fe01000. Reason: Unknown.
Memory access fault by GPU node-7 (Agent handle: 0x7d5c100) on address 0x153e07e01000. Reason: Unknown.
I think the error is occurring during the training step.
Mainly I have two questions:
1) Can you give a pointer to a GitHub repo (if it's public) that managed to launch gpt-neox on LUMI?
2) Is the following the right process for launching on LUMI? (LUMI uses Slurm and requires Singularity containers.)
Modify the DeepSpeed multinode runner to launch the train.py/eval.py/generate.py script in a Singularity container.
Set "launcher": "slurm" and "deepspeed_slurm": true in the configuration yaml file (sketched below).
Do sbatch on a script that contains deepy.py train.py config.yml.
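The config part of that process would then look something like this (a minimal sketch; every other model, optimizer and data setting is omitted):

```yaml
# Minimal sketch of the launcher-related keys in the gpt-neox yaml;
# all other settings are omitted.
"launcher": "slurm"
"deepspeed_slurm": true
```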
Previously I had some success launching Megatron-DeepSpeed training on LUMI, but there the slurm task launching was under the user's control, so I suspect I may be launching gpt-neox incorrectly.
My current approach to launching gpt-neox is:
I have a conda environment activated on the LUMI login node with these packages:
I perform an sbatch on this script:
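Roughly, the script looks like the sketch below; the account name, environment name, node count and time limit are placeholders rather than my real values:

```bash
#!/bin/bash
#SBATCH --job-name=neox-train
#SBATCH --account=project_XXXXXXX   # placeholder LUMI project id
#SBATCH --partition=standard-g      # LUMI-G GPU partition
#SBATCH --nodes=2                   # placeholder node count
#SBATCH --gpus-per-node=8           # 8 GCDs per LUMI-G node
#SBATCH --time=01:00:00

# Activate the conda environment described above.
source activate neox-env            # placeholder env name

# deepy.py parses the configs; with "launcher": "slurm" and
# "deepspeed_slurm": true it hands the per-node launch to
# DeepSpeed's SlurmRunner, which builds an srun command.
python deepy.py train.py meg_conf.yml ds_conf.yml
```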
I also modified DeepSpeed's SlurmRunner in DeepSpeed/deepspeed/launcher/multinode_runner.py to run train.py in a Singularity container with the same packages as listed previously.
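This is not the exact diff (SlurmRunner's internals vary across DeepSpeed versions, and the container path here is a placeholder), but the idea is:

```python
# Schematic change to SlurmRunner.get_cmd in
# DeepSpeed/deepspeed/launcher/multinode_runner.py; the real method
# builds a fuller srun command than shown here.

def get_cmd(self, environment, active_resources):
    # Simplified: DeepSpeed derives the process count from the
    # resource pool and adds more srun flags than this.
    srun_cmd = ['srun', '-n', str(self.total_process_count)]

    # Original behaviour: run the user script with the host python.
    #   python_exec = [sys.executable, '-u']

    # My change: run it inside the Singularity container that holds
    # the same packages as the conda environment.
    container = '/scratch/project_XXXXXXX/neox.sif'  # placeholder path
    python_exec = ['singularity', 'exec', container, 'python', '-u']

    return srun_cmd + python_exec + [self.user_script] + self.user_arguments
```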
I set "launcher": "slurm" and "deepspeed_slurm": true in meg_conf.yml.
I've attached meg_conf.yml, ds_conf.yml and the full output.
Any help would be appreciated.
Thanks!
Ingus

Attachments: output.txt, meg_conf.yml.txt, ds_conf.yml.txt