Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Using srun on a SLURM cluster causes the job to exit prematurely, even with auto_requeue=False #20056

Open alexanderswerdlow opened 2 months ago

alexanderswerdlow commented 2 months ago

Bug description

This assertion is raised when running inside srun with multiple tasks, even after adding SLURMEnvironment(auto_requeue=False). The jobs are initially submitted through submitit-slurm.

Even after setting os.environ["SLURM_NTASKS_PER_NODE"] = os.environ["SLURM_TASKS_PER_NODE"], another error occurs.

Following the advice from this thread (#6389) to do the following also does not work. See also #15709.

from lightning.pytorch.plugins.environments import SLURMEnvironment
SLURMEnvironment.detect = lambda: False
trainer = Trainer(...)

There should be a simple and unambiguous way to disable all special SLURM handling in Lightning.

After making _validate_srun_used, _validate_srun_variables, _is_srun_used, and _is_slurm_interactive_mode all return False (or return early), my code runs fine! There should be a simple disable_slurm_detection flag, ideally also configurable as an environment variable.
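
Something along these lines is what I have in mind. This is a hypothetical sketch only: neither the PL_DISABLE_SLURM_DETECTION variable nor the choose_cluster_environment helper exists in Lightning today, they just illustrate the escape hatch I'm asking for.

import os
from lightning.pytorch.plugins.environments import LightningEnvironment, SLURMEnvironment

def choose_cluster_environment():
    # Hypothetical escape hatch: if the (made-up) variable is set, skip all
    # SLURM auto-detection and treat the job as a plain/externally launched run.
    if os.environ.get("PL_DISABLE_SLURM_DETECTION", "0") == "1":
        return LightningEnvironment()
    # Otherwise keep the existing behavior: use SLURM handling when detected.
    if SLURMEnvironment.detect():
        return SLURMEnvironment(auto_requeue=False)
    return LightningEnvironment()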

Thanks!

awaelchli commented 2 months ago

There are two ways we support running on SLURM. The first is through sbatch and the second is interactive mode. These are the ways used by most Lightning + SLURM users that we know of.

If you want to run on a SLURM environment but not make Lightning aware of SLURM, then you can set the plugin like so:

from lightning.pytorch.plugins.environments import LightningEnvironment

trainer = Trainer(..., plugins=LightningEnvironment())

I am not familiar with the submitit launcher you mentioned. If it does things differently from a regular SLURM submission script, then it's not supported. In general, if we don't document an integration with a special tool, it's likely not supported. Support can be added if there is enough demand and interest from the community to maintain it :)

alexanderswerdlow commented 2 months ago

For some context, this is the launcher I'm referencing, but it doesn't do anything very special. At the end of the day it just writes an sbatch script that amounts to srun python -u -m submitit, which unpickles and runs some code. It also handles resubmitting [like I think Lightning does by default], but it is a bit more program-agnostic and interfaces with Hydra for hyperparameter experiments.

Maybe I'm missing something obvious, but even as a relatively experienced user it is quite difficult to understand how Lightning interacts with these launchers if you want to, say, use torchelastic. I spent a long time trying various setups that work without Lightning [but didn't work with it, for me], following this repo and many others across GitHub, e.g., srun torchrun script.py with various combinations of task counts [e.g., one task per node or one per GPU] and various strategies given to Lightning [e.g., auto, explicit DDP, explicit elastic environment, or SLURM environment].

Lo and behold, Trainer(..., plugins=LightningEnvironment()) fixes things, allowing me to use torchrun/elastic around Lightning, as done here. Thank you for that!!

Perhaps it'd be good to call this out in the docs, and maybe even at runtime, as I didn't see any logs about it. One thing I've found with more specialized features such as SLURM support is that there are fewer reference examples [compared to, say, standard DDP], so it's not immediately obvious how everything interacts. That makes it especially important that things are decoupled by default, or at least that it's obvious where they interact.

That way, it's much easier to use other tools/examples that only support/demonstrate for the common case.
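
One small thing that would have helped me debug this is printing which ClusterEnvironment the Trainer actually resolved. This is just a sketch, assuming a recent 2.x release and a parallel strategy such as DDP, where the resolved cluster environment is attached to the strategy; the devices value is an arbitrary placeholder.

from lightning.pytorch import Trainer
from lightning.pytorch.plugins.environments import LightningEnvironment

trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp", plugins=LightningEnvironment())
# Prints whether SLURMEnvironment, TorchElasticEnvironment, or
# LightningEnvironment ended up being used for this run.
print(type(trainer.strategy.cluster_environment).__name__)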

awaelchli commented 2 months ago

You hadn't mentioned before that you are mixing this with torchrun. If you're already using srun to launch the processes, why is there a need to mix torchrun into it? These are two different launchers; I don't think it's a good idea to combine them, as it just makes everything more complicated than it needs to be for the user.

I'd like to understand why you've gone with an external example rather than following the official Lightning guide, which explains everything step by step: https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html#design-your-training-script. What's not working in the template sbatch script we provide? If it doesn't work, we need to find out why.

At the end of the day it just writes a sbatch script that amounts to: srun python -u -m submitit

Would you mind sharing the generated sbatch script?

alexanderswerdlow commented 2 months ago

For sure:

#!/bin/bash

# Parameters
#SBATCH --comment=commenthere
#SBATCH --constraint=A5000
#SBATCH --cpus-per-task=8
#SBATCH --gpus-per-node=8
#SBATCH --job-name=main
#SBATCH --mem=320GB
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --open-mode=append
#SBATCH --output=/folderpath2024_07_09/13_57_03/%j/%j_0_log.out
#SBATCH --partition=all
#SBATCH --signal=USR2@900
#SBATCH --time=360
#SBATCH --wckey=submitit

export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=$(( ($SLURM_JOB_ID % 20001) + 30000 ))
export NCCL_DEBUG=INFO
export NCCL_NSOCKS_PERTHREAD=4
export NCCL_SOCKET_NTHREADS=2
export OMP_NUM_THREADS=2
export HYDRA_FULL_ERROR=1
export STDOUT_PATH=$(scontrol show job $SLURM_JOB_ID | grep -oP "StdOut=\K[^ ]+")
export LOCAL_JOB_FOLDER=$(dirname $STDOUT_PATH)
export MAIN_LOG_PATH="$LOCAL_JOB_FOLDER/log.txt"
printenv > $LOCAL_JOB_FOLDER/env_$SLURM_LOCALID.txt

echo "ibstatus: $(ibstatus)"
echo "ibdev2netdev: $(ibdev2netdev)"
echo "rdma device: $(rdma link)"
echo "environment: $(env | grep NCCL)"
echo LOCAL_JOB_FOLDER: $LOCAL_JOB_FOLDER, SLURM_NNODES: $SLURM_NNODES, MASTER_ADDR: $MASTER_ADDR, MASTER_PORT: $MASTER_PORT, NCCL_DEBUG_FILE: $NCCL_DEBUG_FILE, SUBMITIT_FOLDER: $SUBMITIT_FOLDER, SLURM_PROCID: $SLURM_PROCID

# command
export SUBMITIT_EXECUTOR=slurm
srun --unbuffered --output /folderpath2024_07_09/13_57_03/%j/%j_%t_log.out /pythonenv/bin/python -u -m submitit.core._submit /folderpath2024_07_09/13_57_03/%j

When I was trying torchrun, I modified the template slightly so it became the following:

srun --unbuffered --output /folder/%j/%j_%t_log.out torchrun --nnodes $SLURM_NNODES --nproc_per_node $SLURM_GPUS_PER_NODE --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT -m submitit.core._submit /folder/2024_07_09/15_05_42/%j

I should note I was only able to get it working with one task per node.

I had a few reasons. One was that my DDP runs, especially on, say, 16 GPUs, were much slower than single-GPU runs, so I was trying to debug and make sure I was doing everything right, like using InfiniBand [in the right way]. At one point I was also trying to detect straggler GPUs following NeMo, and possibly make my training elastic so nodes could leave/join. Since there aren't that many examples out there for very specific training combinations, I wanted to use what was most common in examples for all of these things.

I didn't try that sbatch script directly, although I used it for reference [and it worked], since I wanted something more automated, like submitit. All of the examples I saw that used torchrun [not just the one I referenced] did something like srun torchrun [some with one task per node, some with one per GPU], so I was following that.

Another bonus [for me] is that my codebase isn't entirely Lightning, so making this work more generically allows me to use the same process for some other standalone scripts.
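
For completeness, on the Python side the combination above (one srun task per node, torchrun spawning one process per GPU) roughly maps to the following. The environment variable names are the standard torchrun/SLURM ones, but the fallback values and Trainer arguments are just illustrative assumptions, not taken from my actual script.

import os
from lightning.pytorch import Trainer
from lightning.pytorch.plugins.environments import LightningEnvironment

# torchrun exports LOCAL_WORLD_SIZE (processes it spawns per node) and SLURM
# exports SLURM_NNODES; the defaults below are placeholders only.
devices = int(os.environ.get("LOCAL_WORLD_SIZE", 8))
num_nodes = int(os.environ.get("SLURM_NNODES", 1))

trainer = Trainer(
    accelerator="gpu",
    devices=devices,
    num_nodes=num_nodes,
    strategy="ddp",
    plugins=LightningEnvironment(),  # keep Lightning out of the SLURM auto-detection
)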

PheelaV commented 1 month ago

Following the suggestion above to pass the plugin explicitly:

from lightning.pytorch.plugins.environments import LightningEnvironment

trainer = Trainer(..., plugins=LightningEnvironment())

This results in

python3.12/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 248, in _check_config_and_set_final_flags
    raise MisconfigurationException(
lightning_fabric.utilities.exceptions.MisconfigurationException: Found invalid type for plugin <lightning.fabric.plugins.environments.lightning.LightningEnvironment object at 0x151ccf0ed9a0>. Expected one of: Precision, CheckpointIO, ClusterEnviroment, or LayerSync.

(lightning 2.4)

Shall I open a new issue?
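
A plausible explanation, judging only from the traceback above (so treat it as an assumption, not a confirmed diagnosis): the Trainer comes from the standalone pytorch_lightning package while the plugin is imported from the unified lightning package, so the isinstance check against the other namespace's ClusterEnvironment fails. A minimal sketch that keeps everything in one namespace, assuming the standalone pytorch_lightning / lightning_fabric packages are the ones actually installed:

# The key point is that Trainer and the cluster-environment plugin must come
# from the same package family (pytorch_lightning + lightning_fabric, or both
# from lightning.*), not a mix of the two. Trainer arguments are placeholders.
from pytorch_lightning import Trainer
from lightning_fabric.plugins.environments import LightningEnvironment

trainer = Trainer(accelerator="gpu", devices=8, plugins=LightningEnvironment())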