Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Why does running Lightning on SLURM with python perform worse than with srun? #18650

Open Muennighoff opened 9 months ago

Muennighoff commented 9 months ago

Bug description

I'm training LLMs across multiple GPUs on a single node using NVIDIA/NeMo. When launching via python train.py inside an allocation, I get much worse performance than when launching directly via srun. In the first case PyTorch Lightning also raises the warning: The srun command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with srun. See below:

107 tflops:

srun --unbuffered --exclusive --open-mode=append --partition=ultra --nodes=1 --ntasks-per-node=4 --gpus=4 --pty bash
source /home/niklas/miniconda/bin/activate
conda activate core
CUDA_DEVICE_MAX_CONNECTIONS=1 GPU_PER_NODE=4 python train.py

172 tflops:

source /home/niklas/miniconda/bin/activate
conda activate core
CUDA_DEVICE_MAX_CONNECTIONS=1 GPU_PER_NODE=4 srun --unbuffered --exclusive --open-mode=append --partition=ultra --nodes=1 --ntasks-per-node=4 --gpus=4 python -u train.py

Why does the first one perform worse? Is there perhaps a difference in how these two strategies launch the torch.distributed process group? (https://pytorch.org/docs/stable/notes/multiprocessing.html#cuda-in-multiprocessing)

What version are you seeing the problem on?

v2.0

How to reproduce the bug

No response

Error messages and logs

No response

Environment

Current environment

```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
```

More info

No response

cc @awaelchli

awaelchli commented 9 months ago

I don't have an answer to your question, but I find the first way of launching very strange; I didn't know it worked and I'm surprised it does. If you ever find out why, I'd be interested.

But I hope our guide here is good. It's our recommended way to launch multinode jobs.

stas00 commented 9 months ago

@Muennighoff, what's inside train.py?

The way you're launching it in many frameworks leads to DP and not DDP (but perhaps not in PTL), so you might not be comparing apples to apples.

To make a fair comparison with srun, you need an actual multi-GPU launcher, e.g. torchrun.

NB: @awaelchli has just fixed torchrun running under the SLURM env in https://github.com/Lightning-AI/lightning/pull/18618. If you want a workaround that doesn't require this fix (since PTL@HEAD won't work with NeMo), set SLURM_JOB_NAME=interactive torchrun ...; if you're using standalone PTL, just use its bleeding edge.
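
To check which of the two you're actually getting, here's a rough, hypothetical snippet (not something NeMo or PTL provide) you could drop into train.py after distributed init, to print the world each process sees:

import os

import torch
import torch.distributed as dist

# Each process reports its rank, the world size, and how many GPUs it can see.
rank = dist.get_rank() if dist.is_initialized() else 0
world = dist.get_world_size() if dist.is_initialized() else 1
print(f"rank={rank} world_size={world} "
      f"SLURM_PROCID={os.environ.get('SLURM_PROCID')} "
      f"visible_gpus={torch.cuda.device_count()}")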

konstantinjdobler commented 9 months ago

For single node training on a SLURM-managed cluster, I usually call SLURMEnvironment.detect() and, if it returns true, pass LightningEnvironment() as a plugin to the Trainer or Fabric. This prevents the SLURMEnvironment from being loaded and uses the regular Lightning launcher to create the processes. I haven't seen any performance degradation with this method so far. I don't use the --ntasks-per-node=4 flag in the srun, though.
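
Roughly what I mean, as a minimal sketch (imports assume Lightning 2.x; devices=4 is just an example):

from lightning.pytorch import Trainer
from lightning.pytorch.plugins.environments import LightningEnvironment, SLURMEnvironment

# If SLURM variables are detected, override the auto-selected cluster environment
# so Lightning spawns the processes itself instead of expecting one task per GPU.
plugins = [LightningEnvironment()] if SLURMEnvironment.detect() else []

trainer = Trainer(accelerator="gpu", devices=4, strategy="ddp", plugins=plugins)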

@awaelchli As an additional data point for the use case of launching inside an allocation: I am running jobs on multiple clusters, some using SLURM and some not. It's just easier if all jobs can be launched the same way, regardless of whether SLURM is available.

stas00 commented 9 months ago

Thank you for sharing your insights, Konstantin

In general the problem is that https://github.com/NVIDIA/NeMo, which is what we use, abstracts all the PTL bits away, giving the user an API that doesn't expose PTL. We only have a config file to make customizations, so such coding workarounds aren't available to us.

konstantinjdobler commented 9 months ago

My bad, I didn't realize nvidia/nemo has this limitation. Perhaps this will be useful to someone else who stumbles upon this thread because of problems with SLURM. I know it took me some time to figure out why things were working on some clusters but not on the ones using SLURM.

awaelchli commented 9 months ago

> For single node training on a SLURM-managed cluster, I usually call SLURMEnvironment.detect() and, if it returns true, pass LightningEnvironment() as a plugin to the Trainer or Fabric. This prevents the SLURMEnvironment from being loaded and uses the regular Lightning launcher to create the processes. I haven't seen any performance degradation with this method so far. I don't use the --ntasks-per-node=4 flag in the srun, though.

Don't want to diverge too much from the OP's issue, but I just wanted to point out that if you run interactively on SLURM (i.e., scheduling a machine and then connecting to it), there is a simpler way than changing the code as you described: set the job name to "interactive", and Lightning will know that processes are not being launched externally. You won't have to change the code, and it will be just like running locally on a workstation. https://lightning.ai/docs/pytorch/latest/clouds/cluster_advanced.html#interactive-mode (Apologies if you see rendering issues in the docs; we are currently going through a couple of style template updates.)
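
For completeness, the same workaround in script form (a sketch based on the SLURM_JOB_NAME=interactive hint above; it has to run before the Trainer/Fabric is created, and whether it's needed depends on your Lightning version):

import os

# Make the job look "interactive" so Lightning's SLURM auto-detection backs off
# and the regular local launcher is used instead.
os.environ["SLURM_JOB_NAME"] = "interactive"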

Kin-Zhang commented 7 months ago

I also find that srun speeds things up, but the loss curve doesn't converge as quickly. Here are the two figures. By the way: if you are logging with wandb and run srun inside sbatch, the system logging in wandb only shows one GPU's utilization, although, judging by the speed, all of them are of course being used.

[screenshots: training speed and loss curves with vs. without srun]

w/o srun the sbatch file is:

#!/bin/bash
#SBATCH -J xx
#....
python 1_train.py

w srun the sbatch file is:

#!/bin/bash
#SBATCH -J xx
#....
srun python 1_train.py

schlabrendorff commented 6 months ago

For me, the bottleneck was that by default my SLURM cluster gives a task only 1 CPU core. When not using srun, Python failed to use multiple GPUs. When using srun, each task was only allocated one CPU core (the default in my cluster's config), resulting in a dataloader bottleneck.
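
In case it helps anyone hitting the same thing, here's a rough sketch of sizing the dataloader workers from the cores the task actually got (the dataset and batch size are placeholders):

import os

import torch
from torch.utils.data import DataLoader, TensorDataset

# Cores this process may actually use (reflects the SLURM task's CPU allocation).
cores = len(os.sched_getaffinity(0))

dataset = TensorDataset(torch.randn(1024, 16))  # stand-in for the real dataset
loader = DataLoader(dataset, batch_size=32, num_workers=max(cores - 1, 0))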

Hidden in the sbatch documentation, it says that the -c, --cpus-per-task=<ncpus> flag is no longer passed on to srun; instead one needs to set the SRUN_CPUS_PER_TASK environment variable.

I achieved good performance and full GPU utilization with this script:

#!/usr/bin/bash -l

#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gpus=a100:4
#SBATCH --mem=180G
#SBATCH --cpus-per-task=8
export SRUN_CPUS_PER_TASK=8

conda activate lit
srun python myscript.py --slurm_id ${SLURM_JOB_ID} "$@"

stas00 commented 6 months ago

That's an interesting discovery, @schlabrendorff - though I'm not sure this is always the case.

At least it doesn't seem to impact my setup. I checked: I'm using SLURM 22.05.09 and I use:

#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --exclusive

and then I use srun torchrun (or another torchrun-like launcher), which launches 8 processes plus workers, and they all get at least 1 core each - note I'm using ntasks-per-node=1. I don't repeat --cpus-per-task with srun, nor do I have SRUN_CPUS_PER_TASK set.

I wonder if this has to do with --exclusive somehow, so it gives me all the cores? In the srun manpage for this option it says:

... but job steps will be allocated all CPUs available to the job on all nodes allocated to the steps.

Oddly I don't see any cpus-per-task related entries in scontrol show config

Which SLURM version are you using? (srun --version).

And can you check if everything works if you go back to the previous setup and add --exclusive?

schlabrendorff commented 6 months ago

My cluster runs SLURM 23.02.6, which might explain the difference! I will try to experiment with the --exclusive flag, but I might not get the chance to do that soon.

stas00 commented 6 months ago

Ah, that's a possibility, since perhaps they had planned to switch in 22.05 but didn't do it until 23.x.

When I get a chance I can try the reverse - removing --exclusive and checking that it still works. But it's good to prepare ahead of time regardless, so your insight is very helpful - I appreciate you sharing it.

stas00 commented 6 months ago

OK, I tested it: with or without --exclusive I get all the available cores in the setup above:

$ scontrol show -d job 3182_1 | grep CPUs/Task
   NumNodes=1 NumCPUs=48 NumTasks=1 CPUs/Task=48 ReqB:S:C:T=0:0:*:*

It'd be 1 and not 48 if the inheritance of --cpus-per-task from sbatch didn't work.

So it must be that the version I'm running is not yet affected.

schlabrendorff commented 6 months ago

I think the Request Defaults might also come into play in my scenario, since my cluster documentation specifies:

Request Defaults: Unless specified, your jobs will run with the following options to salloc and sbatch options for this partition. --time=01:00:00 --nodes=1 --ntasks=1 --cpus-per-task=1 --mem-per-cpu=5120

Could be that your default is all (available?) CPUs?

stas00 commented 6 months ago

How do I get Request Defaults?

schlabrendorff commented 6 months ago

They were listed on my university's cluster documentation page. I guess besides trying to read your /etc/slurm/slurm.conf, you could maybe submit an empty job with absolutely no resources requested and check it with scontrol show job?

stas00 commented 6 months ago

Great idea!

I did:

#SBATCH --job-name=x
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=0:10:00

srun --jobid $SLURM_JOB_ID bash -c "date; sleep 200"

and got:

$ scontrol show -d job 3185 | grep CPUs/Task
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

so the default is indeed 1.

If I do:

#SBATCH --job-name=x
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --time=0:10:00

srun --jobid $SLURM_JOB_ID bash -c "date; sleep 200"

I get 48

scontrol show -d job 3186 | grep CPUs/Task
   NumNodes=1 NumCPUs=48 NumTasks=1 CPUs/Task=48 ReqB:S:C:T=0:0:*:*

Bottom line: the "inheritance" still works in SLURM 22.05.09

stas00 commented 6 months ago

I realized I got my diagnostics wrong: scontrol show -d job shows the sbatch/salloc setting; it doesn't know anything about srun.

Using len(os.sched_getaffinity(0)) should give us the correct diagnostics, as it shows which CPU cores are eligible to be used by the current process.

So here is the updated test:

$ cat test.slurm
#!/bin/bash
#SBATCH --job-name=test-cpu-cores-per-task
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --time=0:10:00
#SBATCH --partition=x
#SBATCH --output=%x-%j.out

srun --jobid $SLURM_JOB_ID python -c 'import os; print(f"cpu cores: {len(os.sched_getaffinity(0))}")'

gives:

cpu cores: 48

The previous comment's conclusion still holds.

I documented it here

awaelchli commented 4 months ago

There was never a clear conclusion here, was there? Is there anything we need to act on from the Lightning side?

TeddLi commented 2 months ago

> Python failed to use multiple GPUs

Just wondering what the final results are with srun and without srun. Does srun give worse results?

srmsoumya commented 8 hours ago

@stas00 Thank you for maintaining the ML-engineering guide!

I noticed that you recommend setting --ntasks-per-node=1, whereas the Lightning documentation suggests --ntasks-per-node=8 (which corresponds to the number of GPUs per node). When I tried setting it to 1, I encountered an error with Lightning. Did you experience a similar issue during your implementation?
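
(For context, the pattern from the Lightning docs that I'm referring to is roughly the following sketch, with example values - devices matching --ntasks-per-node and num_nodes matching --nodes when launching with srun python train.py:)

from lightning.pytorch import Trainer

# e.g. with #SBATCH --nodes=2 --ntasks-per-node=8 and 8 GPUs per node
trainer = Trainer(accelerator="gpu", devices=8, num_nodes=2, strategy="ddp")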

stas00 commented 7 hours ago

You're correct, @srmsoumya - this was a copy-and-paste from the torchrun setup, where it's always one task. Fixed here.

Thank you very much for the heads-up.