Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Error if SLURM_NTASKS != SLURM_NTASKS_PER_NODE #20391

Open · guarin opened this issue 3 weeks ago

guarin commented 3 weeks ago

Bug description

Would it be possible for Lightning to raise an error when SLURM_NTASKS != SLURM_NTASKS_PER_NODE and both variables are set?

With a single node, the current behavior is as follows.

Example scripts:

#!/bin/bash

#SBATCH --ntasks=1            # note: inconsistent with --ntasks-per-node=2 below; this mismatch triggers the issue
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=2   # Lightning reports 2 devices, but SLURM launches only 1 task
#SBATCH --cpus-per-task=3

source .venv/bin/activate
srun python train_lightning.py

And train_lightning.py:

from pytorch_lightning.demos.boring_classes import BoringModel, BoringDataModule
from pytorch_lightning import Trainer
import os

def main():
    print(
        f"LOCAL_RANK={os.environ.get('LOCAL_RANK', 0)}, SLURM_NTASKS={os.environ.get('SLURM_NTASKS')}, SLURM_NTASKS_PER_NODE={os.environ.get('SLURM_NTASKS_PER_NODE')}"
    )
    model = BoringModel()
    datamodule = BoringDataModule()
    trainer = Trainer(max_epochs=100)
    print(f"trainer.num_devices: {trainer.num_devices}")
    trainer.fit(model, datamodule)

if __name__ == "__main__":
    main()

This generates the following output:

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1,2]

  | Name  | Type   | Params | Mode 
-----------------------------------------
0 | layer | Linear | 66     | train
-----------------------------------------
66        Trainable params
0         Non-trainable params
66        Total params
0.000     Total estimated model params size (MB)
1         Modules in train mode
0         Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.
LOCAL_RANK=0, SLURM_NTASKS=1, SLURM_NTASKS_PER_NODE=2
trainer.num_devices: 2

MEMBER: 1/1 indicates that only 1 process (and hence 1 GPU) is used, but trainer.num_devices returns 2. nvidia-smi also shows that only a single device is in use.
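For what it's worth, here is a small diagnostic sketch (not part of the original report; the class name is hypothetical) that prints what each layer believes the world size is. Attaching it via Trainer(callbacks=[WorldSizeCheck()]) makes the 1-vs-2 discrepancy visible directly in the job output:

import os

import torch.distributed as dist
from pytorch_lightning import Callback, LightningModule, Trainer


class WorldSizeCheck(Callback):
    """Hypothetical callback: report the world size seen by Lightning, SLURM, and torch.distributed."""

    def on_fit_start(self, trainer: Trainer, pl_module: LightningModule) -> None:
        print(f"trainer.num_devices   = {trainer.num_devices}")
        print(f"trainer.world_size    = {trainer.world_size}")
        print(f"SLURM_NTASKS          = {os.environ.get('SLURM_NTASKS')}")
        print(f"SLURM_NTASKS_PER_NODE = {os.environ.get('SLURM_NTASKS_PER_NODE')}")
        if dist.is_available() and dist.is_initialized():
            # The number of processes that actually joined the process group.
            print(f"torch.distributed world size = {dist.get_world_size()}")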

I am not sure whether there is a valid use case for SLURM_NTASKS < SLURM_NTASKS_PER_NODE, but if there is not, it would be great if Lightning could raise an error in this scenario.

The same issue also occurs if --ntasks-per-node is not set: in that case Lightning assumes 2 devices (presumably based on CUDA_VISIBLE_DEVICES), but in reality only a single one is used. A sketch of the kind of check being requested follows below.
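To make the request concrete, here is a minimal sketch of such a consistency check. The function name is hypothetical and the exact place where a guard like this would live inside Lightning is left open; it simply fails fast when the SLURM task variables disagree:

import os


def _check_slurm_task_consistency() -> None:
    """Hypothetical guard: fail fast when SLURM_NTASKS and SLURM_NTASKS_PER_NODE disagree."""
    ntasks = os.environ.get("SLURM_NTASKS")
    ntasks_per_node = os.environ.get("SLURM_NTASKS_PER_NODE")
    if ntasks is None or ntasks_per_node is None:
        # Nothing to validate unless both variables are present.
        return
    nnodes = int(os.environ.get("SLURM_NNODES", "1"))
    if int(ntasks) != int(ntasks_per_node) * nnodes:
        raise ValueError(
            f"SLURM_NTASKS={ntasks} is inconsistent with SLURM_NTASKS_PER_NODE={ntasks_per_node} "
            f"and SLURM_NNODES={nnodes}. Check the #SBATCH --ntasks / --ntasks-per-node settings."
        )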

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs


Environment

Current environment

* CUDA:
  - GPU: NVIDIA GeForce RTX 4090 (x4)
  - available: True
  - version: 12.4
* Lightning:
  - lightning-utilities: 0.11.8
  - pytorch-lightning: 2.4.0
  - torch: 2.5.1
  - torchmetrics: 1.4.3
  - torchvision: 0.20.1
* Packages: aenum: 3.1.15, aiohappyeyeballs: 2.4.3, aiohttp: 3.10.10, aiosignal: 1.3.1, annotated-types: 0.7.0, antlr4-python3-runtime: 4.9.3, attrs: 24.2.0, autocommand: 2.2.2, backports.tarfile: 1.2.0, certifi: 2024.8.30, charset-normalizer: 3.4.0, eval-type-backport: 0.2.0, filelock: 3.16.1, frozenlist: 1.5.0, fsspec: 2024.10.0, hydra-core: 1.3.2, idna: 3.10, importlib-metadata: 8.0.0, importlib-resources: 6.4.0, inflect: 7.3.1, jaraco.collections: 5.1.0, jaraco.context: 5.3.0, jaraco.functools: 4.0.1, jaraco.text: 3.12.1, jinja2: 3.1.4, lightly: 1.5.13, lightning-utilities: 0.11.8, markupsafe: 3.0.2, more-itertools: 10.3.0, mpmath: 1.3.0, multidict: 6.1.0, networkx: 3.4.2, numpy: 2.1.3, nvidia-cublas-cu12: 12.4.5.8, nvidia-cuda-cupti-cu12: 12.4.127, nvidia-cuda-nvrtc-cu12: 12.4.127, nvidia-cuda-runtime-cu12: 12.4.127, nvidia-cudnn-cu12: 9.1.0.70, nvidia-cufft-cu12: 11.2.1.3, nvidia-curand-cu12: 10.3.5.147, nvidia-cusolver-cu12: 11.6.1.9, nvidia-cusparse-cu12: 12.3.1.170, nvidia-nccl-cu12: 2.21.5, nvidia-nvjitlink-cu12: 12.4.127, nvidia-nvtx-cu12: 12.4.127, omegaconf: 2.3.0, packaging: 24.1, pillow: 11.0.0, platformdirs: 4.2.2, propcache: 0.2.0, psutil: 6.1.0, pyarrow: 18.0.0, pydantic: 2.9.2, pydantic-core: 2.23.4, python-dateutil: 2.9.0.post0, pytorch-lightning: 2.4.0, pytz: 2024.2, pyyaml: 6.0.2, requests: 2.32.3, setuptools: 75.3.0, six: 1.16.0, sympy: 1.13.1, tomli: 2.0.1, torch: 2.5.1, torchmetrics: 1.4.3, torchvision: 0.20.1, tqdm: 4.66.6, triton: 3.1.0, typeguard: 4.3.0, typing-extensions: 4.12.2, urllib3: 2.2.3, wheel: 0.43.0, yarl: 1.17.1, zipp: 3.19.2
* System:
  - OS: Linux
  - architecture: 64bit, ELF
  - processor: x86_64
  - python: 3.12.3
  - release: 6.8.0-38-generic
  - version: #38-Ubuntu SMP PREEMPT_DYNAMIC Fri Jun 7 15:25:01 UTC 2024

More info

No response

lantiga commented 1 week ago

You are not required to set NTASKS explicitly; see https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html

What you should set is:

#SBATCH --nodes=4             # This needs to match Trainer(num_nodes=...)
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8   # This needs to match Trainer(devices=...)

This way you won't have a discrepancy and things will just work. We're also doing extra validation here:

https://github.com/Lightning-AI/pytorch-lightning/blob/173cb8c1d1f1cb3b42409a83909208050cb10053/src/lightning/fabric/plugins/environments/slurm.py#L158

Feel free to propose an addition here if you uncover a case that is not covered.
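For completeness, a minimal sketch of a Trainer call that stays consistent with the SBATCH directives above (the numbers are simply the 4-node / 8-GPU values from that snippet):

from pytorch_lightning import Trainer

# devices must match "#SBATCH --ntasks-per-node", num_nodes must match "#SBATCH --nodes".
# The SLURM environment is detected automatically when the SLURM variables are present,
# so no extra cluster environment plugin needs to be passed here.
trainer = Trainer(accelerator="gpu", devices=8, num_nodes=4)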