Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

DDP training freezes immediately #17389

Open GeoffNN opened 1 year ago

GeoffNN commented 1 year ago

Bug description

I'm trying to run a job with several GPUs. My script immediately gets stuck after outputting:

python /home/negroni/deeponet-fno/src/burgers/pytorch_deeponet.py --ngpus 3

Using backend: tensorflow.compat.v1

2023-04-14 16:56:35.997710: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-14 16:56:36.145661: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-04-14 16:56:36.609342: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-04-14 16:56:36.609396: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-04-14 16:56:36.609401: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
WARNING:tensorflow:From /home/negroni/miniconda3/envs/pde/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:tensorflow:From /home/negroni/miniconda3/envs/pde/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Enable just-in-time compilation with XLA.

WARNING:tensorflow:From /home/negroni/miniconda3/envs/pde/lib/python3.9/site-packages/deepxde/nn/initializers.py:118: The name tf.keras.initializers.he_normal is deprecated. Please use tf.compat.v1.keras.initializers.he_normal instead.

WARNING:tensorflow:From /home/negroni/miniconda3/envs/pde/lib/python3.9/site-packages/deepxde/nn/initializers.py:118: The name tf.keras.initializers.he_normal is deprecated. Please use tf.compat.v1.keras.initializers.he_normal instead.

/home/negroni/miniconda3/envs/pde/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")

=============================
torch.cuda.is_available(): True
torch.cuda.get_device_name(0): NVIDIA RTX A6000
=============================

Namespace(batch=5, lr=0.001, lr_scheduler_step=2000, lr_scheduler_factor=0.9, ridge=0.0001, epochs=500, nsamples=500, nsamples_residual=250, Nbasis=75, ngpus=3, max_iterations=50, log_every_n_steps=1, viscosity=0.01) 

wandb: Currently logged in as: geoffnn. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.14.2 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.13.10
wandb: Run data is saved locally in logs/wandb/run-20230414_165640-z1m33sa0
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run azure-smoke-113
wandb: ⭐️ View project at https://wandb.ai/geoffnn/PDEs-Burgers
wandb: 🚀 View run at https://wandb.ai/geoffnn/PDEs-Burgers/runs/z1m33sa0
Loaded data
Loaded data
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:lightning_fabric.utilities.distributed:Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/3
Using backend: pytorch

Using backend: pytorch

/home/negroni/miniconda3/envs/pde/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
/home/negroni/miniconda3/envs/pde/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")

=============================
torch.cuda.is_available(): True
torch.cuda.get_device_name(0): NVIDIA RTX A6000
=============================

Namespace(batch=5, lr=0.001, lr_scheduler_step=2000, lr_scheduler_factor=0.9, ridge=0.0001, epochs=500, nsamples=500, nsamples_residual=250, Nbasis=75, ngpus=3, max_iterations=50, log_every_n_steps=1, viscosity=0.01) 

Loaded data
Loaded data

=============================
torch.cuda.is_available(): True
torch.cuda.get_device_name(0): NVIDIA RTX A6000
=============================

Namespace(batch=5, lr=0.001, lr_scheduler_step=2000, lr_scheduler_factor=0.9, ridge=0.0001, epochs=500, nsamples=500, nsamples_residual=250, Nbasis=75, ngpus=3, max_iterations=50, log_every_n_steps=1, viscosity=0.01) 

Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/3
Loaded data
Loaded data
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/3
INFO:pytorch_lightning.utilities.rank_zero:----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 3 processes
----------------------------------------------------------------------------------------------------

INFO:pytorch_lightning.utilities.rank_zero:You are using a CUDA device ('NVIDIA RTX A6000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision

What version are you seeing the problem on?

2.0+ and 1.9.x

How to reproduce the bug

I can't reduce to a small repro, but the code is here: https://github.com/GeoffNN/deeponet-fno/blob/main/src/burgers/pytorch_deeponet.py

Environment

Current environment

* CUDA:
  - GPU:
    - NVIDIA RTX A6000
    - NVIDIA RTX A6000
    - NVIDIA RTX A6000
    - NVIDIA RTX A6000
    - NVIDIA RTX A6000
    - NVIDIA RTX A6000
    - NVIDIA RTX A6000
    - NVIDIA RTX A6000
    - NVIDIA RTX A6000
    - NVIDIA RTX A6000
  - available: True
  - version: 11.7
* Lightning:
  - lightning: 2.0.1
  - lightning-cloud: 0.5.32
  - lightning-utilities: 0.7.0
  - pytorch-lightning: 1.9.3
  - torch: 2.0.0
  - torchaudio: 0.13.1
  - torchmetrics: 0.11.1
  - torchvision: 0.14.1
* Packages:
  - absl-py: 1.4.0
  - aiohttp: 3.8.4
  - aiosignal: 1.3.1
  - altair: 4.2.2
  - anyio: 3.6.2
  - appdirs: 1.4.4
  - arrow: 1.2.3
  - asttokens: 2.2.1
  - astunparse: 1.6.3
  - async-timeout: 4.0.2
  - attrs: 22.2.0
  - backcall: 0.2.0
  - backports.functools-lru-cache: 1.6.4
  - beautifulsoup4: 4.12.0
  - black: 23.3.0
  - blessed: 1.20.0
  - brotlipy: 0.7.0
  - cachetools: 5.3.0
  - certifi: 2022.12.7
  - cffi: 1.15.1
  - charset-normalizer: 2.0.4
  - click: 8.1.3
  - cmake: 3.26.1
  - colorama: 0.4.6
  - contourpy: 1.0.7
  - croniter: 1.3.8
  - cryptography: 38.0.4
  - cycler: 0.11.0
  - dateutils: 0.6.12
  - debugpy: 1.5.1
  - decorator: 5.1.1
  - deepdiff: 6.3.0
  - deepxde: 1.8.0
  - dnspython: 2.3.0
  - docker-pycreds: 0.4.0
  - email-validator: 1.3.1
  - entrypoints: 0.4
  - exceptiongroup: 1.1.0
  - executing: 1.2.0
  - fastapi: 0.88.0
  - filelock: 3.10.7
  - flatbuffers: 23.1.21
  - flit-core: 3.6.0
  - fonttools: 4.38.0
  - frozenlist: 1.3.3
  - fsspec: 2023.1.0
  - gast: 0.4.0
  - gitdb: 4.0.10
  - gitpython: 3.1.31
  - google-auth: 2.16.1
  - google-auth-oauthlib: 0.4.6
  - google-pasta: 0.2.0
  - gpustat: 1.0.0
  - grpcio: 1.51.1
  - h11: 0.14.0
  - h5py: 3.8.0
  - hcpdenn: 0.0.1
  - httpcore: 0.16.3
  - httptools: 0.5.0
  - httpx: 0.23.3
  - idna: 3.4
  - importlib-metadata: 6.0.0
  - importlib-resources: 5.12.0
  - iniconfig: 2.0.0
  - inquirer: 3.1.3
  - ipykernel: 6.15.0
  - ipython: 8.10.0
  - itsdangerous: 2.1.2
  - jax: 0.3.25
  - jaxlib: 0.3.25+cuda11.cudnn82
  - jedi: 0.18.2
  - jinja2: 3.1.2
  - joblib: 1.2.0
  - jsonschema: 4.17.3
  - jupyter-client: 7.0.6
  - jupyter-core: 4.12.0
  - keras: 2.11.0
  - kiwisolver: 1.4.4
  - libclang: 15.0.6.1
  - lightning: 2.0.1
  - lightning-cloud: 0.5.32
  - lightning-utilities: 0.7.0
  - lit: 16.0.0
  - markdown: 3.4.1
  - markdown-it-py: 2.2.0
  - markupsafe: 2.1.2
  - matplotlib: 3.7.0
  - matplotlib-inline: 0.1.6
  - mdurl: 0.1.2
  - mkl-fft: 1.3.1
  - mkl-random: 1.2.2
  - mkl-service: 2.4.0
  - ml-dtypes: 0.0.4
  - mpmath: 1.3.0
  - multidict: 6.0.4
  - mypy-extensions: 1.0.0
  - nest-asyncio: 1.5.6
  - networkx: 3.0
  - numpy: 1.23.5
  - nvidia-cublas-cu11: 11.10.3.66
  - nvidia-cuda-cupti-cu11: 11.7.101
  - nvidia-cuda-nvrtc-cu11: 11.7.99
  - nvidia-cuda-runtime-cu11: 11.7.99
  - nvidia-cudnn-cu11: 8.5.0.96
  - nvidia-cufft-cu11: 10.9.0.58
  - nvidia-curand-cu11: 10.2.10.91
  - nvidia-cusolver-cu11: 11.4.0.1
  - nvidia-cusparse-cu11: 11.7.4.91
  - nvidia-ml-py: 11.495.46
  - nvidia-nccl-cu11: 2.14.3
  - nvidia-nvtx-cu11: 11.7.91
  - oauthlib: 3.2.2
  - opt-einsum: 3.3.0
  - ordered-set: 4.1.0
  - orjson: 3.8.9
  - packaging: 23.0
  - pandas: 1.5.3
  - parso: 0.8.3
  - pathspec: 0.11.1
  - pathtools: 0.1.2
  - pexpect: 4.8.0
  - pickleshare: 0.7.5
  - pillow: 9.3.0
  - pip: 22.3.1
  - platformdirs: 3.2.0
  - pluggy: 1.0.0
  - pooch: 1.6.0
  - prompt-toolkit: 3.0.36
  - protobuf: 3.19.6
  - psutil: 5.9.4
  - ptyprocess: 0.7.0
  - pure-eval: 0.2.2
  - pyaml: 21.10.1
  - pyasn1: 0.4.8
  - pyasn1-modules: 0.2.8
  - pybind11: 2.10.3
  - pycparser: 2.21
  - pydantic: 1.10.7
  - pygments: 2.14.0
  - pyjwt: 2.6.0
  - pyopenssl: 22.0.0
  - pyparsing: 3.0.9
  - pyrsistent: 0.19.3
  - pysocks: 1.7.1
  - pytest: 7.2.1
  - python-dateutil: 2.8.2
  - python-dotenv: 1.0.0
  - python-editor: 1.0.4
  - python-multipart: 0.0.6
  - pytorch-lightning: 1.9.3
  - pytz: 2022.7.1
  - pyyaml: 6.0
  - pyzmq: 19.0.2
  - readchar: 4.0.5
  - requests: 2.28.1
  - requests-oauthlib: 1.3.1
  - rfc3986: 1.5.0
  - rich: 13.3.3
  - rsa: 4.9
  - scienceplots: 2.0.1
  - scikit-learn: 1.2.1
  - scikit-optimize: 0.9.0
  - scikit-sparse: 0.4.8
  - scipy: 1.10.1
  - seaborn: 0.12.2
  - sentry-sdk: 1.16.0
  - setproctitle: 1.3.2
  - setuptools: 65.6.3
  - six: 1.16.0
  - sklearn: 0.0.post1
  - smmap: 5.0.0
  - sniffio: 1.3.0
  - soupsieve: 2.4
  - stack-data: 0.6.2
  - starlette: 0.22.0
  - starsessions: 1.3.0
  - sympy: 1.11.1
  - tensorboard: 2.11.2
  - tensorboard-data-server: 0.6.1
  - tensorboard-plugin-wit: 1.8.1
  - tensorflow: 2.11.0
  - tensorflow-addons: 0.19.0
  - tensorflow-estimator: 2.11.0
  - tensorflow-io-gcs-filesystem: 0.30.0
  - termcolor: 2.2.0
  - theseus-ai: 0.1.4
  - threadpoolctl: 3.1.0
  - tomli: 2.0.1
  - toolz: 0.12.0
  - torch: 2.0.0
  - torchaudio: 0.13.1
  - torchmetrics: 0.11.1
  - torchvision: 0.14.1
  - tornado: 6.2
  - tqdm: 4.64.1
  - traitlets: 5.9.0
  - triton: 2.0.0
  - typeguard: 2.13.3
  - typing-extensions: 4.4.0
  - ujson: 5.7.0
  - urllib3: 1.26.14
  - uvicorn: 0.21.1
  - uvloop: 0.17.0
  - wandb: 0.13.10
  - watchfiles: 0.19.0
  - wcwidth: 0.2.6
  - websocket-client: 1.5.1
  - websockets: 10.4
  - werkzeug: 2.2.3
  - wheel: 0.38.4
  - wrapt: 1.14.1
  - yarl: 1.8.2
  - zipp: 3.14.0
* System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor: x86_64
  - python: 3.9.16
  - version: #76-Ubuntu SMP Fri Mar 17 17:19:29 UTC 2023
```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow): Trainer, LightningModule
#- PyTorch Lightning Version (e.g., 1.5.0): 2.0.1
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0): 2.0.0
#- Python version (e.g., 3.9): 3.9.16
#- OS (e.g., Linux): Linux
#- CUDA/cuDNN version: 11.7
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source): pip
#- Running environment of LightningApp (e.g. local, cloud): server
```

More info

No response

cc @justusschock @awaelchli

ryan597 commented 1 year ago

Are you able to get other DDP jobs to run? Try the script below.

import lightning as L
from lightning.pytorch.demos.boring_classes import BoringModel

# Number of GPUs to train on
ngpus = 3

# BoringModel is Lightning's built-in minimal model for debugging
model = BoringModel()
trainer = L.Trainer(max_epochs=10,
                    devices=ngpus)

trainer.fit(model)

GeoffNN commented 1 year ago

Ah. Thanks for the reduction. No, this doesn't seem to work either. Again, I get:

python ~/deeponet-fno/src/burgers/toy_ddp.py
~/miniconda3/envs/pde/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
~/miniconda3/envs/pde/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

You are using a CUDA device ('NVIDIA RTX A6000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision

and then nothing.

ryan597 commented 1 year ago

I don't understand why torchvision is outputting an error here as it wasn't in the script.

Did you install PyTorch using Miniconda or pip? Try setting up a clean env:

conda create -n testenv python=3.9 
conda activate testenv
pip install torch torchvision lightning
python -c "import torch; print(torch.__version__)"
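
The environment listing in the issue reports torch 2.0.0 next to torchvision 0.14.1 and torchaudio 0.13.1, and both lightning 2.0.1 and pytorch-lightning 1.9.3. That combination may be related to the torchvision warning, although nothing in this thread confirms it. As a rough sketch, the version check above could be extended to print all of these packages at once. The package list here is an assumption about what is worth comparing; only the standard library is used.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages to compare; this list is an illustrative assumption.
PACKAGES = ("torch", "torchvision", "torchaudio", "lightning", "pytorch-lightning")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print one line per package so old and new envs can be compared side by side.
for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```

Running this in both the old and the clean env makes it easy to spot packages that were installed separately from each other.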

Also run `export NCCL_DEBUG=INFO` and try the test script again in the new env.
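
If exporting the variable in the shell is inconvenient, the same setting can be passed from Python when starting the test script. This is a minimal sketch, not part of the original suggestion: the helper names are hypothetical, and `toy_ddp.py` stands in for wherever the test script was saved. `NCCL_DEBUG=INFO` is the only NCCL setting assumed.

```python
import os
import subprocess
import sys


def nccl_debug_env(base_env=None):
    """Return a copy of the environment with NCCL_DEBUG=INFO set."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"
    return env


def run_with_nccl_debug(script_path):
    """Run a Python script in a child process with NCCL logging enabled."""
    return subprocess.run([sys.executable, script_path], env=nccl_debug_env())


# Example call (toy_ddp.py is a placeholder for the test script):
# run_with_nccl_debug("toy_ddp.py")
```

Processes started by the script inherit this environment, so the NCCL log lines from each rank should appear in the same output.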

shoang22 commented 1 year ago

Hi, was this ever fixed? I'm running into the same issue using BoringModel.

ryan597 commented 1 year ago

@shoang22 the exact problem wasn't really identified. It looks like a problem in the installation. Have you tried creating a clean environment with the above steps?

shoang22 commented 1 year ago

I did try a clean install, but the problem persisted. I was, however, able to solve the problem. I was running my script on a SLURM cluster. It turns out that I needed to launch the script with srun in my bash file; submitting with sbatch alone wasn't enough.
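
For anyone debugging a similar hang on a cluster, one way to see how each process was launched is to print the Slurm task variables at the top of the training script. The sketch below is a diagnostic aid only, not something used in this thread. The helper names are hypothetical, and it assumes the standard Slurm variables `SLURM_JOB_ID`, `SLURM_NTASKS`, `SLURM_PROCID` and `SLURM_LOCALID`.

```python
import os

# Standard Slurm variables describing the job and the current task.
SLURM_VARS = ("SLURM_JOB_ID", "SLURM_NTASKS", "SLURM_PROCID", "SLURM_LOCALID")


def slurm_summary(env=None):
    """Collect the Slurm task variables; unset ones map to None."""
    env = os.environ if env is None else env
    return {name: env.get(name) for name in SLURM_VARS}


def format_slurm_summary(summary):
    """Render the summary as a single log line."""
    return ", ".join(
        f"{name}={value if value is not None else '<unset>'}"
        for name, value in summary.items()
    )


print(format_slurm_summary(slurm_summary()))
```

When the script is started through srun with several tasks, each process should print a different `SLURM_PROCID`. Seeing a single line of output for a multi-task job would suggest that only one process was actually started.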