Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Multi-GPU/CPU DDP freezes on cluster node, but not on local machine #16313

Closed moritzschaefer closed 1 year ago

moritzschaefer commented 1 year ago

Bug description

On my server node, training a LightningModule using DDP leads to a freeze, even before entering the training loop.

The node has 2 GPUs, and the freeze occurs independently of whether accelerator is set to "gpu" or "cpu".

Notably, on my local machine, running trainer = pl.Trainer(devices=2, strategy="ddp", accelerator="cpu") does not lead to a freeze, so it appears to be a hardware/machine/environment issue.

Any idea how to debug this issue?

PS: cross-posted from Discussions: https://github.com/Lightning-AI/lightning/discussions/16223

How to reproduce the bug

The model source code I used was copied 1:1 from this lightning demo: https://colab.research.google.com/drive/1F_RNcHzTfFuQf-LeKvSlud6x7jXYkG31 (class MNISTModel). Note: I ran the code from the command line (not from within a notebook).

Here is the trainer code:

trainer = pl.Trainer(devices=2, strategy="ddp", accelerator="gpu") # neither works with GPU nor CPU
trainer.fit(mnist_model)  # freeze happens with any model/LightningModule (tried multiple)
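
For completeness, here is a self-contained sketch along the lines of the demo model (not a 1:1 copy of the notebook; the dataset download path and batch size are just placeholders):

```python
import os

import pytorch_lightning as pl
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST


class MNISTModel(pl.LightningModule):
    """Minimal MNIST classifier, roughly following the linked demo."""

    def __init__(self):
        super().__init__()
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)


if __name__ == "__main__":
    train_ds = MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor())
    train_loader = DataLoader(train_ds, batch_size=32)

    trainer = pl.Trainer(devices=2, strategy="ddp", accelerator="gpu")
    trainer.fit(MNISTModel(), train_loader)  # hangs right after "Initializing distributed: ..."
```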

Error messages and logs

This is the last output before it freezes:

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2

Environment

I installed a fresh pytorch_lightning conda environment to make sure that an old/unsupported package is not the issue here.

Current environment ``` * CUDA: - GPU: - NVIDIA A100-PCIE-40GB - available: True - version: 11.6 * Lightning: - lightning-utilities: 0.4.2 - pytorch-lightning: 1.8.5.post0 - pytorch-quantization: 2.1.2 - torch: 1.11.0a0+17540c5 - torch-geometric: 2.2.0 - torch-tensorrt: 1.1.0a0 - torchmetrics: 0.11.0 - torchtext: 0.12.0a0 - torchvision: 0.12.0a0 * Packages: - absl-py: 0.13.0 - aioeasywebdav: 2.4.0 - aiohttp: 3.8.3 - aiosignal: 1.3.1 - alabaster: 0.7.12 - alphafold-colabfold: 2.1.16 - amply: 0.1.5 - apex: 0.1 - appdirs: 1.4.4 - argon2-cffi: 21.3.0 - argon2-cffi-bindings: 21.2.0 - asttokens: 2.1.0 - astunparse: 1.6.3 - async-timeout: 4.0.2 - attmap: 0.13.2 - attrs: 22.1.0 - audioread: 2.1.9 - autopep8: 2.0.1 - babel: 2.9.1 - backcall: 0.2.0 - backports.functools-lru-cache: 1.6.4 - bcrypt: 3.2.2 - beautifulsoup4: 4.11.1 - biopython: 1.79 - black: 22.10.0 - bleach: 4.1.0 - blis: 0.7.9 - boto3: 1.26.12 - botocore: 1.29.13 - brotlipy: 0.7.0 - cached-property: 1.5.2 - cachetools: 5.2.0 - catalogue: 2.0.8 - certifi: 2022.9.24 - cffi: 1.15.1 - chardet: 5.0.0 - charset-normalizer: 2.1.1 - chex: 0.1.4 - click: 8.1.3 - cloudpickle: 2.0.0 - codecov: 2.1.12 - colabfold: 1.3.0 - colorama: 0.4.6 - commonmark: 0.9.1 - conda: 22.9.0 - conda-build: 3.23.1 - conda-package-handling: 1.9.0 - configargparse: 1.5.3 - configparser: 5.3.0 - connection-pool: 0.0.3 - contextlib2: 21.6.0 - coverage: 6.3.1 - cryptography: 38.0.3 - cudf: 21.12.0a0+293.g0930f712e6 - cugraph: 21.12.0a0+95.g4b8c1330 - cuml: 21.12.0a0+116.g4ce5bd609 - cupy-cuda115: 9.6.0 - cycler: 0.11.0 - cymem: 2.0.7 - cython: 0.29.27 - dask: 2021.11.2 - dask-cuda: 21.12.0 - dask-cudf: 21.12.0a0+293.g0930f712e6 - dataclasses: 0.8 - datrie: 0.8.2 - debugpy: 1.5.1 - decorator: 5.1.1 - deepspeed: 0.5.10 - defusedxml: 0.7.1 - dgl-cu116: 0.9.1.post1 - dglgo: 0.0.2 - distributed: 2021.11.2 - dllogger: 1.0.0 - dm-haiku: 0.0.9 - dm-tree: 0.1.7 - docker: 6.0.0 - docker-pycreds: 0.4.0 - docutils: 0.17.1 - dpath: 2.0.6 - dropbox: 11.36.0 - e3nn: 0.3.3 - einops: 0.6.0 - entrypoints: 0.3 - etils: 0.7.1 - exceptiongroup: 1.0.4 - executing: 1.2.0 - expecttest: 0.1.3 - fastjsonschema: 2.16.2 - fastrlock: 0.8 - filechunkio: 1.8 - filelock: 3.8.0 - flake8: 3.7.9 - flash-attn: 0.1 - flask: 2.0.3 - flatbuffers: 2.0.7 - fonttools: 4.29.1 - frozenlist: 1.3.3 - fsspec: 2022.1.0 - ftputil: 5.0.4 - future: 0.18.2 - gast: 0.4.0 - gitdb: 4.0.9 - gitpython: 3.1.29 - glob2: 0.7 - google-api-core: 2.10.2 - google-api-python-client: 2.66.0 - google-auth: 2.14.1 - google-auth-httplib2: 0.1.0 - google-auth-oauthlib: 0.4.6 - google-cloud-core: 2.3.2 - google-cloud-storage: 2.6.0 - google-crc32c: 1.1.2 - google-pasta: 0.2.0 - google-resumable-media: 2.4.0 - googleapis-common-protos: 1.57.0 - graphsurgeon: 0.4.5 - grpcio: 1.49.1 - guided-protein-diffusion: 0.1.0 - h5py: 3.7.0 - heapdict: 1.0.1 - hjson: 3.1.0 - httplib2: 0.21.0 - hypothesis: 4.50.8 - idna: 3.4 - imageio: 2.23.0 - imagesize: 1.3.0 - immutabledict: 2.2.1 - importlib-metadata: 4.13.0 - importlib-resources: 5.10.0 - iniconfig: 1.1.1 - ipdb: 0.13.11 - ipykernel: 6.9.0 - ipython: 8.0.1 - ipython-genutils: 0.2.0 - isort: 5.11.3 - itsdangerous: 2.0.1 - jax: 0.3.16 - jaxlib: 0.3.15+cuda11.cudnn82 - jedi: 0.18.1 - jinja2: 3.1.2 - jmespath: 1.0.1 - jmp: 0.0.2 - joblib: 1.1.0 - json5: 0.9.6 - jsonschema: 4.17.0 - jupyter-client: 7.1.2 - jupyter-core: 5.0.0 - jupyter-tensorboard: 0.2.0 - jupyterlab: 2.3.2 - jupyterlab-pygments: 0.1.2 - jupyterlab-server: 1.2.0 - jupytext: 1.13.7 - keras: 2.7.0 - keras-preprocessing: 1.1.2 - kiwisolver: 
1.3.2 - langcodes: 3.3.0 - libarchive-c: 4.0 - libclang: 14.0.6 - libmambapy: 1.0.0 - librosa: 0.9.0 - lightning-utilities: 0.4.2 - littleutils: 0.2.2 - llvmlite: 0.36.0 - lmdb: 1.3.0 - locket: 0.2.1 - logmuse: 0.2.6 - mamba: 1.0.0 - markdown: 3.3.6 - markdown-it-py: 1.1.0 - markupsafe: 2.1.1 - matplotlib: 3.1.3 - matplotlib-inline: 0.1.6 - mccabe: 0.6.1 - mdit-py-plugins: 0.3.0 - mistune: 0.8.4 - ml-collections: 0.1.1 - mock: 4.0.3 - mpmath: 1.2.1 - msgpack: 1.0.3 - multidict: 6.0.2 - murmurhash: 1.0.9 - mypy-extensions: 0.4.3 - nbclient: 0.5.11 - nbconvert: 6.4.2 - nbformat: 5.7.0 - nest-asyncio: 1.5.4 - networkx: 2.6.3 - ninja: 1.11.1 - nltk: 3.7 - notebook: 6.4.1 - numba: 0.53.1 - numpy: 1.22.2 - numpydoc: 1.5.0 - nvidia-dali-cuda110: 1.10.0 - nvidia-pyindex: 1.0.9 - nvtx: 0.2.4 - oauth2client: 4.1.3 - oauthlib: 3.2.0 - ogb: 1.3.5 - onnx: 1.10.1 - openfold: 1.0.0 - openmm: 7.5.1 - opt-einsum: 3.3.0 - opt-einsum-fx: 0.1.4 - outdated: 0.2.2 - packaging: 21.3 - pandas: 1.5.1 - pandocfilters: 1.5.0 - paramiko: 2.12.0 - parso: 0.8.3 - partd: 1.2.0 - pathspec: 0.10.2 - pathtools: 0.1.2 - pathy: 0.8.1 - pdb-tools: 2.5.0 - pdbfixer: 1.7 - peppy: 0.35.3 - pexpect: 4.8.0 - pickleshare: 0.7.5 - pillow: 9.2.0 - pip: 21.2.4 - pkginfo: 1.8.3 - pkgutil-resolve-name: 1.3.10 - plac: 1.3.5 - platformdirs: 2.5.2 - pluggy: 1.0.0 - ply: 3.11 - polygraphy: 0.33.0 - pooch: 1.6.0 - preshed: 3.0.8 - prettytable: 3.4.1 - prometheus-client: 0.13.1 - promise: 2.3 - prompt-toolkit: 3.0.32 - protobuf: 3.19.6 - psutil: 5.9.4 - ptyprocess: 0.7.0 - pulp: 2.7.0 - pure-eval: 0.2.2 - py: 1.11.0 - py-cpuinfo: 9.0.0 - py3dmol: 1.8.1 - pyarrow: 5.0.0 - pyasn1: 0.4.8 - pyasn1-modules: 0.2.8 - pybind11: 2.9.1 - pycocotools: 2.0+nv0.6.0 - pycodestyle: 2.10.0 - pycosat: 0.6.4 - pycparser: 2.21 - pydantic: 1.10.2 - pydot: 1.4.2 - pyflakes: 2.1.1 - pygments: 2.13.0 - pymol: 2.5.4 - pynacl: 1.5.0 - pynvml: 11.0.0 - pyopenssl: 22.1.0 - pyparsing: 3.0.9 - pyqt5: 5.12.3 - pyqt5-sip: 4.19.18 - pyqtchart: 5.12 - pyqtwebengine: 5.12.1 - pyrsistent: 0.19.2 - pysftp: 0.2.9 - pysocks: 1.7.1 - pytest: 7.2.0 - pytest-cov: 3.0.0 - pytest-pythonpath: 0.7.4 - python-dateutil: 2.8.2 - python-hostlist: 1.21 - python-irodsclient: 1.1.5 - python-nvd3: 0.15.0 - python-slugify: 5.0.2 - pytorch-lightning: 1.8.5.post0 - pytorch-quantization: 2.1.2 - pytz: 2022.6 - pyu2f: 0.1.5 - pyyaml: 6.0 - pyzmq: 22.3.0 - rdkit-pypi: 2022.9.3 - regex: 2022.1.18 - requests: 2.28.1 - requests-oauthlib: 1.3.1 - reretry: 0.11.1 - resampy: 0.2.2 - revtok: 0.0.3 - rich: 12.6.0 - rmm: 21.12.0a0+31.g0acbd51 - rsa: 4.9 - ruamel-yaml-conda: 0.15.80 - ruamel.yaml: 0.17.21 - ruamel.yaml.clib: 0.2.7 - s3transfer: 0.6.0 - sacremoses: 0.0.47 - scikit-learn: 0.24.0 - scipy: 1.6.3 - se3-transformer: 1.0.0 - seaborn: 0.12.2 - send2trash: 1.8.0 - sentry-sdk: 1.12.0 - setuptools: 59.5.0 - shellingham: 1.5.0 - shortuuid: 1.0.11 - six: 1.16.0 - slacker: 0.14.0 - smart-open: 5.2.1 - smmap: 3.0.5 - snakemake: 7.18.2 - snowballstemmer: 2.2.0 - sortedcontainers: 2.4.0 - soundfile: 0.10.3.post1 - soupsieve: 2.3.2.post1 - spacy: 3.2.1 - spacy-legacy: 3.0.10 - spacy-loggers: 1.0.3 - sphinx: 4.4.0 - sphinx-glpi-theme: 0.3 - sphinx-rtd-theme: 1.0.0 - sphinxcontrib-applehelp: 1.0.2 - sphinxcontrib-devhelp: 1.0.2 - sphinxcontrib-htmlhelp: 2.0.0 - sphinxcontrib-jsmath: 1.0.1 - sphinxcontrib-qthelp: 1.0.3 - sphinxcontrib-serializinghtml: 1.1.5 - srsly: 2.4.5 - stack-data: 0.6.1 - stone: 3.3.1 - stopit: 1.1.2 - subprocess32: 3.5.4 - sympy: 1.11.1 - tabulate: 0.9.0 - tblib: 1.7.0 - tensorboard: 2.8.0 - 
tensorboard-data-server: 0.6.1 - tensorboard-plugin-wit: 1.8.1 - tensorboardx: 2.5.1 - tensorflow-cpu: 2.7.3 - tensorflow-estimator: 2.7.0 - tensorflow-io-gcs-filesystem: 0.26.0 - tensorrt: 8.2.3.0 - termcolor: 1.1.0 - terminado: 0.13.1 - testpath: 0.5.0 - text-unidecode: 1.3 - thinc: 8.0.17 - threadpoolctl: 3.1.0 - throttler: 1.2.1 - toml: 0.10.2 - tomli: 2.0.1 - toolz: 0.12.0 - toposort: 1.7 - torch: 1.11.0a0+17540c5 - torch-geometric: 2.2.0 - torch-tensorrt: 1.1.0a0 - torchmetrics: 0.11.0 - torchtext: 0.12.0a0 - torchvision: 0.12.0a0 - tornado: 6.1 - tqdm: 4.64.1 - traitlets: 5.5.0 - treelite: 2.1.0 - treelite-runtime: 2.1.0 - triton: 1.0.0 - typer: 0.4.2 - typing-extensions: 4.4.0 - ubiquerg: 0.6.2 - ucx-py: 0.21.0a0+37.gbfa0450 - uff: 0.6.9 - uritemplate: 4.1.1 - urllib3: 1.26.11 - veracitools: 0.1.3 - wandb: 0.12.0 - wasabi: 0.10.1 - wcwidth: 0.2.5 - webencodings: 0.5.1 - websocket-client: 1.4.0 - werkzeug: 2.0.3 - wget: 3.2 - wheel: 0.38.4 - wrapt: 1.14.1 - xgboost: 1.5.0 - yarl: 1.8.1 - yte: 1.5.1 - zict: 2.0.0 - zipp: 3.10.0 * System: - OS: Linux - architecture: - 64bit - ELF - processor: x86_64 - python: 3.8.13 - version: #144-Ubuntu SMP Tue Sep 20 11:00:04 UTC 2022 ```

More info

No response

cc @justusschock @awaelchli

awaelchli commented 1 year ago

@moritzschaefer Difficult to say what it is. Could you do the following for me?

  1. print(os.environ) at the beginning of the script and post the output here (remove any sensitive information before posting).
  2. Run the script with NCCL_DEBUG=INFO python ... and post the debug log output, if there is any (see the sketch below).
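
In case it helps, a rough sketch of both steps (train.py is just a placeholder name for your script):

```python
# train.py (placeholder name), at the very top, before the Trainer is created:
import os

print(dict(os.environ))  # redact anything sensitive before posting

# Then launch from the shell with NCCL debug output enabled (GPU runs only):
#   NCCL_DEBUG=INFO python train.py
```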
moritzschaefer commented 1 year ago

Thank you for having a look @awaelchli! lightning_demo.txt env.txt

Setting NCCL_DEBUG=INFO did not seem to change much. I get the following output on stderr. The os.environ print is attached as a file, as well as the source code.

 GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/opt/conda/lib/python3.8/site-packages/pytorch_lightning-1.8.5.post0-py3.8.egg/pytorch_lightning/trainer/setup.py:175: PossibleUserWarning: GPU available but not used. Set `accelerator` and `devices` using `Trainer(accelerator='gpu', devices=1)`.
  rank_zero_warn(
/opt/conda/lib/python3.8/site-packages/pytorch_lightning-1.8.5.post0-py3.8.egg/pytorch_lightning/loops/utilities.py:94: PossibleUserWarning: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
  rank_zero_warn(
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
justusschock commented 1 year ago

@moritzschaefer NCCL_DEBUG=INFO only has an effect when running on GPU, as NCCL is NVIDIA's communication library for their GPUs. If you set this variable and run on GPU again, you should see more output.

awaelchli commented 1 year ago

I can see from your environment that there are many slurm variables set. In particular, SLURM_TASKS_PER_NODE=1. Why is this the case? You didn't mention that you are launching with SLURM. Are you? If so, please read here about extra steps to take: https://pytorch-lightning.readthedocs.io/en/stable/clouds/cluster_advanced.html

This is for sure the reason it is stuck: the Trainer wants to launch 2 processes, but the SLURM environment variables say there should only be one.
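
Roughly, the docs expect the submission script and the Trainer to agree on the process layout, along these lines (file names and resource numbers are just placeholders):

```
#!/bin/bash
# submit.sh (placeholder): 1 node, 2 tasks per node, 2 GPUs
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2

srun python train.py

# train.py: the Trainer must match (num_nodes == nodes, devices == ntasks-per-node)
trainer = pl.Trainer(accelerator="gpu", devices=2, num_nodes=1, strategy="ddp")
```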

moritzschaefer commented 1 year ago

@justusschock I changed "accelerator" from "cpu" to "gpu" (upon which the "GPUs available. consider using them"-warning disappeared), but there are no further outputs.

@awaelchli I am allocating nodes via SLURM in interactive mode (i.e., I log in to the compute node via SSH and run my scripts from there as on any other server/node). I did not yet intend to use any SLURM-specific features (e.g. task dispatching, let alone multi-node training).

Still, the error does seem to be SLURM-specific, as the error message is mentioned in the "Troubleshooting" section (thank you for pointing me there!).

It states that

the #SBATCH --nodes=X setting and #SBATCH --ntasks-per-node=Y settings. The numbers there need to match what is configured in your Trainer in the code: Trainer(num_nodes=X, devices=Y)

I've set these numbers (via the srun command) in accordance with my script:

# on the slurm-allocated compute node
(guided_protein_diffusion_backup) root@s0-n01:~# echo $SLURM_NTASKS_PER_NODE
2
(guided_protein_diffusion_backup) root@s0-n01:~# echo $SLURM_NNODES
1
# trainer line:
trainer = pl.Trainer(devices=2, strategy="ddp", accelerator="gpu", num_nodes=1)

However, the script still freezes. The last idea I have is that Lightning can't handle interactively allocated SLURM compute nodes (although they should behave the same as non-interactive nodes).

I'll try to dispatch a noninteractive job and get back to you. Thank you again!

awaelchli commented 1 year ago

To run in interactive mode, you need to set the job name to "bash". Otherwise, Lightning can't tell whether it is running interactively or not. Perhaps this should be documented more prominently. Can you try that?
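
For reference, an interactive allocation that Lightning should treat as interactive could look roughly like this (the exact resource flags depend on your cluster; srun usually sets the job name to the command, here "bash"):

```
srun --gres=gpu:2 --pty bash

# inside the allocation, verify what Lightning will see:
echo $SLURM_JOB_NAME   # expected: bash
```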

moritzschaefer commented 1 year ago

The name was already "bash" :/

(guided_protein_diffusion_backup) root@s0-n00:~/run_scripts# echo $SLURM_JOB_NAME
bash

The non-interactive script now runs on 2 GPUs!

Let me know if I can try anything else. I don't urgently need multi-GPU in interactive mode, but it would be great if we could find a way to fix it and add it to the documentation.

From my side, feel free to close the issue. Thank you both for your support!

awaelchli commented 1 year ago

Thanks @moritzschaefer. If it is as you say, then that seems fine to me. Thanks for the help from your side too! Maybe we have a bug? Since I don't have a SLURM cluster around, could you run this simple script in interactive mode and let me know the output?

from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment

env = SLURMEnvironment()

# in interactive mode, we expect this to return False
print("SLURM detected", env.detect())
print("Job name", env.job_name())

trainer = Trainer(accelerator="cpu", strategy="ddp", devices=2)

# This should be LightningEnvironment:
print("selectd cluster env", trainer.strategy.cluster_environment)

# This should return False:
print(trainer.strategy.cluster_environment.creates_processes_externally)
moritzschaefer commented 1 year ago

Sure, here is the output:

SLURM detected False
Job name bash
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/opt/conda/envs/guided_protein_diffusion_backup/lib/python3.8/site-packages/pytorch_lightning/trainer/setup.py:175: PossibleUserWarning: GPU available but not used. Set `accelerator` and `devices` using `Trainer(accelerator='gpu', devices=2)`.
  rank_zero_warn(
selected cluster env <lightning_lite.plugins.environments.lightning.LightningEnvironment object at 0x7f3ca6172f70>
True
awaelchli commented 1 year ago

@moritzschaefer The last output, trainer.strategy.cluster_environment.creates_processes_externally, unexpectedly returns True. This is because the environment variable LOCAL_RANK='0' is set. Why is that? It shouldn't be set by default on your machine. Any clue why that might be?
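
For context, a simplified view of what happens (not the exact library code): if I understand it correctly, a pre-set LOCAL_RANK makes Lightning assume the processes were launched externally, so it does not spawn its own and then waits for peers that never start. You can check what the environment contains with:

```python
import os

# Quick check: is LOCAL_RANK already set before Lightning starts, and to what?
print("LOCAL_RANK set:", "LOCAL_RANK" in os.environ, os.environ.get("LOCAL_RANK"))
```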

moritzschaefer commented 1 year ago

I see. The LOCAL_RANK variable is set by my SLURM/cluster environment (I instantiate a new instance running an nvidia/pytorch:22.03-py3 image from nvcr.io).

unset LOCAL_RANK resolves the issue.

I'll close the issue now and hope it will benefit future Googlers. Here is a start for adding some of the information to the documentation, if you think it's beneficial:

When running DDP on a single interactive multi-GPU node via SLURM (srun --pty bash), make sure the LOCAL_RANK variable is not set, since it interferes with the environment detection of pytorch-lightning.
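
A rough example of what that looks like in practice (the script name is just a placeholder):

```
# in the interactive shell, before launching the training script:
unset LOCAL_RANK
python train.py
```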

Darius888 commented 3 months ago

Thank you so much @moritzschaefer! It helped quite a lot.