guarin opened 3 weeks ago
You are not required to set `SLURM_NTASKS` explicitly; see https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html

What you should set is:
```
#SBATCH --nodes=4             # This needs to match Trainer(num_nodes=...)
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8   # This needs to match Trainer(devices=...)
```
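To make the mapping concrete, here is a minimal stdlib-only sketch (the variable names `num_nodes` and `devices` mirror the Trainer arguments, but the snippet itself is an illustration, not Lightning code) of how the SBATCH values above line up with the values you would pass to the Trainer:

```python
import os

# On a real cluster these are exported by Slurm based on the #SBATCH
# directives; set them here only so the sketch is self-contained.
os.environ["SLURM_NNODES"] = "4"           # from: #SBATCH --nodes=4
os.environ["SLURM_NTASKS_PER_NODE"] = "8"  # from: #SBATCH --ntasks-per-node=8

num_nodes = int(os.environ["SLURM_NNODES"])         # -> Trainer(num_nodes=...)
devices = int(os.environ["SLURM_NTASKS_PER_NODE"])  # -> Trainer(devices=...)

# Total number of processes expected to participate in training.
world_size = num_nodes * devices
print(num_nodes, devices, world_size)  # 4 8 32
```

With the directives matched this way, Slurm launches exactly one task per GPU per node, which is what Lightning expects.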
This way you won't have a discrepancy and things will just work. We're also doing extra validation here:
Feel free to propose an addition here if you uncover a case that is not covered.
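As a rough sketch of the kind of check being proposed (a hypothetical helper, not Lightning's actual validation code), one could cross-check `SLURM_NTASKS` against `SLURM_NTASKS_PER_NODE * SLURM_NNODES` whenever both are set and fail fast on a mismatch:

```python
import os


def validate_slurm_task_counts() -> None:
    """Raise if SLURM_NTASKS is inconsistent with SLURM_NTASKS_PER_NODE.

    Hypothetical helper illustrating the proposed validation; it is not
    part of Lightning.
    """
    ntasks = os.environ.get("SLURM_NTASKS")
    per_node = os.environ.get("SLURM_NTASKS_PER_NODE")
    nnodes = int(os.environ.get("SLURM_NNODES", "1"))
    if ntasks is None or per_node is None:
        return  # nothing to cross-check
    if int(ntasks) != int(per_node) * nnodes:
        raise RuntimeError(
            f"SLURM_NTASKS={ntasks} does not match "
            f"SLURM_NTASKS_PER_NODE={per_node} * SLURM_NNODES={nnodes}; "
            "fewer devices than requested would actually be used."
        )


# Example: the problematic single-node case described in this issue.
os.environ.update(
    {"SLURM_NTASKS": "1", "SLURM_NTASKS_PER_NODE": "2", "SLURM_NNODES": "1"}
)
try:
    validate_slurm_task_counts()
except RuntimeError as err:
    print("caught:", err)
```

This would turn the silent single-GPU fallback described below into an explicit error at startup.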
Bug description
Would it be possible for Lightning to raise an error if `SLURM_NTASKS != SLURM_NTASKS_PER_NODE` in case both are set?

With a single node the current behavior is:

- `SLURM_NTASKS == SLURM_NTASKS_PER_NODE`: Everything is fine.
- `SLURM_NTASKS > SLURM_NTASKS_PER_NODE`: Slurm doesn't let you schedule the job and raises an error.
- `SLURM_NTASKS < SLURM_NTASKS_PER_NODE`: Lightning thinks there are `SLURM_NTASKS_PER_NODE` devices but the job only runs on `SLURM_NTASKS` devices.

Example scripts:
And `train_lightning.py`:

This generates the following output:
`MEMBER: 1/1` indicates that only 1 GPU is used, but `trainer.num_devices` returns 2. `nvidia-smi` also indicates that only a single device is used.

Not sure if there is a valid use case for `SLURM_NTASKS < SLURM_NTASKS_PER_NODE`, but if there isn't, it would be awesome if Lightning could raise an error in this scenario.

The same error also happens if `--ntasks-per-node` is not set. In this case Lightning assumes 2 devices (I guess based on `CUDA_VISIBLE_DEVICES`) but in reality only a single one is used.

What version are you seeing the problem on?
v2.4
How to reproduce the bug
No response
Error messages and logs
Environment
Current environment
* CUDA: - GPU: - NVIDIA GeForce RTX 4090 - NVIDIA GeForce RTX 4090 - NVIDIA GeForce RTX 4090 - NVIDIA GeForce RTX 4090 - available: True - version: 12.4
* Lightning: - lightning-utilities: 0.11.8 - pytorch-lightning: 2.4.0 - torch: 2.5.1 - torchmetrics: 1.4.3 - torchvision: 0.20.1
* Packages: - aenum: 3.1.15 - aiohappyeyeballs: 2.4.3 - aiohttp: 3.10.10 - aiosignal: 1.3.1 - annotated-types: 0.7.0 - antlr4-python3-runtime: 4.9.3 - attrs: 24.2.0 - autocommand: 2.2.2 - backports.tarfile: 1.2.0 - certifi: 2024.8.30 - charset-normalizer: 3.4.0 - eval-type-backport: 0.2.0 - filelock: 3.16.1 - frozenlist: 1.5.0 - fsspec: 2024.10.0 - hydra-core: 1.3.2 - idna: 3.10 - importlib-metadata: 8.0.0 - importlib-resources: 6.4.0 - inflect: 7.3.1 - jaraco.collections: 5.1.0 - jaraco.context: 5.3.0 - jaraco.functools: 4.0.1 - jaraco.text: 3.12.1 - jinja2: 3.1.4 - lightly: 1.5.13 - lightning-utilities: 0.11.8 - markupsafe: 3.0.2 - more-itertools: 10.3.0 - mpmath: 1.3.0 - multidict: 6.1.0 - networkx: 3.4.2 - numpy: 2.1.3 - nvidia-cublas-cu12: 12.4.5.8 - nvidia-cuda-cupti-cu12: 12.4.127 - nvidia-cuda-nvrtc-cu12: 12.4.127 - nvidia-cuda-runtime-cu12: 12.4.127 - nvidia-cudnn-cu12: 9.1.0.70 - nvidia-cufft-cu12: 11.2.1.3 - nvidia-curand-cu12: 10.3.5.147 - nvidia-cusolver-cu12: 11.6.1.9 - nvidia-cusparse-cu12: 12.3.1.170 - nvidia-nccl-cu12: 2.21.5 - nvidia-nvjitlink-cu12: 12.4.127 - nvidia-nvtx-cu12: 12.4.127 - omegaconf: 2.3.0 - packaging: 24.1 - pillow: 11.0.0 - platformdirs: 4.2.2 - propcache: 0.2.0 - psutil: 6.1.0 - pyarrow: 18.0.0 - pydantic: 2.9.2 - pydantic-core: 2.23.4 - python-dateutil: 2.9.0.post0 - pytorch-lightning: 2.4.0 - pytz: 2024.2 - pyyaml: 6.0.2 - requests: 2.32.3 - setuptools: 75.3.0 - six: 1.16.0 - sympy: 1.13.1 - tomli: 2.0.1 - torch: 2.5.1 - torchmetrics: 1.4.3 - torchvision: 0.20.1 - tqdm: 4.66.6 - triton: 3.1.0 - typeguard: 4.3.0 - typing-extensions: 4.12.2 - urllib3: 2.2.3 - wheel: 0.43.0 - yarl: 1.17.1 - zipp: 3.19.2
* System: - OS: Linux - architecture: - 64bit - ELF - processor: x86_64 - python: 3.12.3 - release: 6.8.0-38-generic - version: #38-Ubuntu SMP PREEMPT_DYNAMIC Fri Jun 7 15:25:01 UTC 2024

More info
No response