alexanderswerdlow opened 2 months ago
There are two ways we support running on SLURM: the first is through sbatch, and the second is interactive mode. These are the ways used by most Lightning + SLURM users that we know of.
If you want to run on a SLURM environment but not make Lightning aware of SLURM, then you can set the plugin like so:
from lightning.pytorch.plugins.environments import LightningEnvironment
trainer = Trainer(..., plugins=LightningEnvironment())
I am not aware of that submitit launcher you mentioned. If it does things differently than a regular SLURM submission script, then it's not supported. In general if we don't document an integration with a special tool, it's likely not supported. Support can be added if the demand is popular and there is interest from the community to maintain it :)
For some context, this is the launcher I'm referencing, but it doesn't do anything very special. At the end of the day it just writes an sbatch script that amounts to: srun python -u -m submitit
which unpickles and runs some code. It also handles resubmitting [like I think lightning does by default] but is a bit more program agnostic and interfaces with hydra for hyperparam experiments.
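To illustrate the resubmission part: the mechanism relies on SLURM delivering a signal shortly before the time limit (the --signal=USR2@900 line in the generated script below) so the job can checkpoint and requeue itself. A minimal stdlib sketch of that mechanism, not submitit's or Lightning's actual code:

```python
import os
import signal

# Sketch of the requeue mechanism: SLURM sends SIGUSR2 some seconds before
# the time limit, and the handler checkpoints and requeues the job.
state = {"requeue_requested": False}

def handle_preemption(signum, frame):
    # In a real job this would save a checkpoint and then requeue,
    # e.g. by shelling out to `scontrol requeue $SLURM_JOB_ID`.
    state["requeue_requested"] = True

signal.signal(signal.SIGUSR2, handle_preemption)

# Simulate SLURM delivering the signal to this process.
os.kill(os.getpid(), signal.SIGUSR2)
print(state["requeue_requested"])  # True
```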
Maybe I'm missing something obvious, but even as a relatively experienced user it is quite difficult to understand how Lightning interacts with, say, torchelastic. I spent a long time trying various strategies that work without Lightning [but didn't work with it, for me], following this repo and many others across GitHub, e.g., srun torchrun script.py
with various combinations of task numbers [e.g., one task per node or one per GPU] and various strategies given to Lightning [e.g., auto, specified DDP, specified elastic env, or SLURM env].
lo and behold, Trainer(..., plugins=LightningEnvironment())
fixes things, allowing me to use torchrun/elastic around lightning, as done here. Thank you for that!!
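For anyone else who lands here: as I understand it (my own summary, not official docs), the difference comes down to which environment variables the cluster-environment plugin reads for rank information. torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK itself, which is what LightningEnvironment expects, while SLURMEnvironment derives ranks from srun's SLURM_* variables, so mixing the two launchers confuses the auto-detection. A rough sketch of the distinction:

```python
def ranks_from_torchelastic(env):
    # Variables torchrun sets for each worker it spawns
    # (roughly what LightningEnvironment reads).
    return int(env["RANK"]), int(env["WORLD_SIZE"]), int(env["LOCAL_RANK"])

def ranks_from_slurm(env):
    # Variables srun sets for each task it launches
    # (roughly what SLURMEnvironment reads).
    return int(env["SLURM_PROCID"]), int(env["SLURM_NTASKS"]), int(env["SLURM_LOCALID"])

# Example: worker 3 of 8 on a single node under each launcher.
elastic = {"RANK": "3", "WORLD_SIZE": "8", "LOCAL_RANK": "3"}
slurm = {"SLURM_PROCID": "3", "SLURM_NTASKS": "8", "SLURM_LOCALID": "3"}

print(ranks_from_torchelastic(elastic))  # (3, 8, 3)
print(ranks_from_slurm(slurm))           # (3, 8, 3)
```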
Perhaps it'd be good to call this out in the docs, and maybe even at runtime, as I didn't see any logs about this. One thing I've found with more specialized features such as SLURM support is that there are fewer reference examples [compared to, say, standard DDP], so it's not immediately obvious how everything interacts. That makes it especially important that things are decoupled by default, or at least that it's obvious where they interact.
That way, it's much easier to use other tools/examples that only support/demonstrate for the common case.
You hadn't mentioned before that you are mixing this with torchrun. If you're already using srun to launch the processes, why is there a need to mix torchrun into it? These are two different launchers, and I don't think it's a good idea to mix them; it just makes everything more complicated than it needs to be for the user.
I'd like to understand why you've gone with an external example rather than following the official Lightning guide that explains everything step by step: https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html#design-your-training-script Is the template sbatch script we provide not working for you? If it isn't, we need to find out why.
At the end of the day it just writes a sbatch script that amounts to: srun python -u -m submitit
Would you mind sharing the generated sbatch script?
For sure:
#!/bin/bash
# Parameters
#SBATCH --comment=commenthere
#SBATCH --constraint=A5000
#SBATCH --cpus-per-task=8
#SBATCH --gpus-per-node=8
#SBATCH --job-name=main
#SBATCH --mem=320GB
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --open-mode=append
#SBATCH --output=/folderpath2024_07_09/13_57_03/%j/%j_0_log.out
#SBATCH --partition=all
#SBATCH --signal=USR2@900
#SBATCH --time=360
#SBATCH --wckey=submitit
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=$(( ($SLURM_JOB_ID % 20001) + 30000 ))
export NCCL_DEBUG=INFO
export NCCL_NSOCKS_PERTHREAD=4
export NCCL_SOCKET_NTHREADS=2
export OMP_NUM_THREADS=2
export HYDRA_FULL_ERROR=1
export STDOUT_PATH=$(scontrol show job $SLURM_JOB_ID | grep -oP "StdOut=\K[^ ]+")
export LOCAL_JOB_FOLDER=$(dirname $STDOUT_PATH)
export MAIN_LOG_PATH="$LOCAL_JOB_FOLDER/log.txt"
printenv > $LOCAL_JOB_FOLDER/env_$SLURM_LOCALID.txt
echo "ibstatus: $(ibstatus)"
echo "ibdev2netdev: $(ibdev2netdev)"
echo "rdma device: $(rdma link)"
echo "environment: $(env | grep NCCL)"
echo LOCAL_JOB_FOLDER: $LOCAL_JOB_FOLDER, SLURM_NNODES: $SLURM_NNODES, MASTER_ADDR: $MASTER_ADDR, MASTER_PORT: $MASTER_PORT, NCCL_DEBUG_FILE: $NCCL_DEBUG_FILE, SUBMITIT_FOLDER: $SUBMITIT_FOLDER, SLURM_PROCID: $SLURM_PROCID
# command
export SUBMITIT_EXECUTOR=slurm
srun --unbuffered --output /folderpath2024_07_09/13_57_03/%j/%j_%t_log.out /pythonenv/bin/python -u -m submitit.core._submit /folderpath2024_07_09/13_57_03/%j
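As an aside, the MASTER_PORT line maps the job id into a fixed range so that concurrent jobs rarely collide on the same port. The same arithmetic in python:

```python
def master_port(job_id: int) -> int:
    # Mirrors $(( ($SLURM_JOB_ID % 20001) + 30000 )) from the script:
    # the result always lands in [30000, 50000].
    return (job_id % 20001) + 30000

print(master_port(0))        # 30000
print(master_port(20000))    # 50000
print(master_port(1234567))  # 44506
```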
When I was trying torchrun, I modified the template slightly so it became the following:
srun --unbuffered --output /folder/%j/%j_%t_log.out torchrun --nnodes $SLURM_NNODES --nproc_per_node $SLURM_GPUS_PER_NODE --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT -m submitit.core._submit /folder/2024_07_09/15_05_42/%j
I should note I was only able to get it working for n=1 tasks per node.
I had a few reasons. One was that my DDP, especially on, say, 16 GPUs, was much slower than single GPU, so I was trying to debug and make sure I was doing everything right, like using InfiniBand [in the right way]; at one point I was trying to see if I could detect straggler GPUs following NeMo, and possibly make my training elastic so nodes could leave/join. Since there aren't that many examples out there for very specific training combinations, I wanted to use what was most common in examples for all of these things.
I didn't try that sbatch script directly, although I used it for reference [and it worked], since I wanted something more automated like submitit. All of the examples I saw [not just the one I referenced] that used torchrun did something like srun torchrun [some with n=1 or n=gpus tasks per node], so I was following that.
Another bonus [for me] is that my codebase isn't entirely lightning so making this work more generically allows me to use the same process for some other standalone scripts.
If you want to run on a SLURM environment but not make Lightning aware of SLURM, then you can set the plugin like so:
from lightning.pytorch.plugins.environments import LightningEnvironment
trainer = Trainer(..., plugins=LightningEnvironment())
This results in
python3.12/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 248, in _check_config_and_set_final_flags
raise MisconfigurationException(
lightning_fabric.utilities.exceptions.MisconfigurationException: Found invalid type for plugin <lightning.fabric.plugins.environments.lightning.LightningEnvironment object at 0x151ccf0ed9a0>. Expected one of: Precision, CheckpointIO, ClusterEnviroment, or LayerSync.
(lightning 2.4)
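In case it helps (this is my guess at the cause, not confirmed): pytorch_lightning and lightning are shipped as separate top-level packages, so pytorch_lightning's Trainer checks isinstance against its own ClusterEnvironment base class, and a plugin imported from lightning.pytorch is a different type entirely even though the class looks identical. If that's it, importing LightningEnvironment from pytorch_lightning.plugins.environments instead should make the check pass. A toy reproduction of that failure mode with stand-in classes (hypothetical names, not the real ones):

```python
# Stand-ins for the "same" class defined in two different top-level
# packages; the real pair would be the lightning.* and pytorch_lightning.*
# variants of the cluster-environment classes.
class ClusterEnvironmentFromPytorchLightning:
    pass

class LightningEnvironmentFromLightning:
    pass

plugin = LightningEnvironmentFromLightning()
# The isinstance check in accelerator_connector fails for a foreign type,
# producing the "Found invalid type for plugin" error above.
print(isinstance(plugin, ClusterEnvironmentFromPytorchLightning))  # False
```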
Shall I open a new issue?
Bug description
This assertion raises when running inside srun with multiple tasks, even after adding SLURMEnvironment(auto_requeue=False). The jobs are initially submitted through submitit-slurm. Even after os.environ["SLURM_NTASKS_PER_NODE"] = os.environ["SLURM_TASKS_PER_NODE"], another error occurs. Also, following the advice from this thread (#6389) does not work. See also #15709.
There should be a simple and unambiguous way to disable all special SLURM handling in lightning.
After making _validate_srun_used, _validate_srun_variables, _is_srun_used, and _is_slurm_interactive_mode all return False/early, my code runs fine! There should be a simple disable_slurm_detection flag, ideally also configurable as an environment variable. Thanks!
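To illustrate the shape of what I'm asking for (both the disable_slurm_detection flag and the PL_DISABLE_SLURM_DETECTION variable below are hypothetical; nothing like them exists in Lightning today):

```python
import os

def slurm_detection_enabled(disable_flag: bool = False) -> bool:
    # Hypothetical guard: an explicit flag or env var short-circuits all of
    # the _is_srun_used / _validate_srun_used checks before they ever run.
    if disable_flag:
        return False
    if os.environ.get("PL_DISABLE_SLURM_DETECTION", "0") == "1":  # hypothetical
        return False
    return "SLURM_NTASKS" in os.environ  # rough stand-in for the real checks

os.environ["SLURM_NTASKS"] = "8"
print(slurm_detection_enabled())  # True: SLURM vars present, detection active
os.environ["PL_DISABLE_SLURM_DETECTION"] = "1"
print(slurm_detection_enabled())  # False: user explicitly opted out
```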