EleutherAI / gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
https://www.eleuther.ai/
Apache License 2.0

Runtime per step linearly increases with training step number. #1322

Open iPRET opened 1 week ago

iPRET commented 1 week ago

Describe the bug: Training step time increases linearly with the training step number.

To Reproduce: Training was launched via Slurm. My training config was: config.yml.txt The script run by sbatch was: launch.sh.txt

Expected behavior: runtime per train step should stay roughly constant. Observed behavior: runtime per train step increases linearly with the train step. [Plot: optimizer step time (ms) vs. iteration, data parsed from training logs] [Plot: forward step time (ms) vs. iteration] [Plot: samples per second vs. iteration]
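For reference, the trend in the plots above can be quantified directly from the training logs. The sketch below is a minimal, hedged example: the log line format and field names are assumptions (adjust the regex to whatever your slurm3.txt actually contains), but the least-squares fit shows how to turn "looks linear" into a slope in ms per iteration.

```python
import re

# Hypothetical log lines mimicking per-iteration timing output; the exact
# format here is an assumption -- adapt the regex to your real training log.
SAMPLE_LOG = """\
iteration 100 | optimizer_step time (ms): 210.0
iteration 200 | optimizer_step time (ms): 420.5
iteration 300 | optimizer_step time (ms): 629.8
iteration 400 | optimizer_step time (ms): 841.2
"""

PATTERN = re.compile(
    r"iteration\s+(\d+)\s+\|\s+optimizer_step time \(ms\):\s+([\d.]+)"
)

def parse_step_times(text):
    """Return parallel lists of (iteration, step time in ms) from a log."""
    pairs = [(int(m.group(1)), float(m.group(2)))
             for m in PATTERN.finditer(text)]
    iters = [i for i, _ in pairs]
    times = [t for _, t in pairs]
    return iters, times

def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept (pure stdlib, no numpy)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

if __name__ == "__main__":
    iters, times = parse_step_times(SAMPLE_LOG)
    slope, intercept = linear_fit(iters, times)
    # A clearly positive slope (ms gained per iteration) confirms the
    # per-step runtime is growing rather than hovering around a constant.
    print(f"slope: {slope:.3f} ms/iteration, intercept: {intercept:.1f} ms")
```

On a healthy run the fitted slope should sit near zero; a steady positive slope like the one in the sample data is the symptom described in this issue.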

Environment: Running on the LUMI supercomputer. GPUs: single-node run with 8x AMD MI250X GPUs. Pip list: pip-list.txt Python version: 3.6.15 DeeperSpeed was also modified to wrap each launched training process in a Singularity container via srun.

Additional info: Full training log: slurm3.txt

Any ideas what might be the issue?

iPRET commented 1 week ago

Update: this did not reproduce with the same config on a different server with NVIDIA L40S GPUs, so the problem is likely something in the environment.