Open GeoffNN opened 1 year ago
Are you able to get other DDP jobs to run? Try the script below.
import lightning as L
from lightning.pytorch.demos.boring_classes import BoringModel
ngpus = 3
model = BoringModel()
trainer = L.Trainer(max_epochs=10, devices=ngpus)
trainer.fit(model)
Ah. Thanks for the reduction. No, this doesn't seem to work either. Again, I get
python ~/deeponet-fno/src/burgers/toy_ddp.py
~/miniconda3/envs/pde/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
warn(f"Failed to load image Python extension: {e}")
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
~/miniconda3/envs/pde/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
warn(f"Failed to load image Python extension: {e}")
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
You are using a CUDA device ('NVIDIA RTX A6000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
and then nothing.
I don't understand why torchvision is emitting a warning here, as the script doesn't import it.
Did you install PyTorch using Miniconda or pip? Try setting up a clean env:
conda create -n testenv python=3.9
conda activate testenv
pip install torch torchvision lightning
python -c "import torch; print(torch.__version__)"
Also set the following before running:
export NCCL_DEBUG=INFO
Then try the test script again in the new env.
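As a complement to the one-line version check above, the sketch below prints the installed versions of several related packages at once. It may help here because the environment listing in this issue shows torch 2.0.0 next to torchvision 0.14.1 and torchaudio 0.13.1, and both lightning 2.0.1 and pytorch-lightning 1.9.3, which may not be a matching set. The helper name installed_versions and the list of packages to check are illustrative choices, not something from this thread; it only uses the standard-library importlib.metadata module.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent.

    Hypothetical helper for illustration; it reads package metadata only and
    does not import the packages themselves.
    """
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed list of packages worth comparing; adjust to your environment.
    to_check = ["torch", "torchvision", "torchaudio", "lightning", "pytorch-lightning"]
    for name, version in installed_versions(to_check).items():
        print(f"{name}: {version or 'not installed'}")
```

Running it in both the old and the clean environment makes it easy to compare the two side by side.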
Hi, was this ever fixed? I'm running into the same issue using BoringModel
@shoang22 the exact problem wasn't really identified. It looks like a problem in the installation. Have you tried creating a clean environment with the above steps?
I did try a clean install, but the problem persisted. I was, however, able to solve it. I was running my script on a SLURM cluster, and it turns out that I needed to include srun in my bash file; sbatch alone wasn't enough.
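To make the srun fix concrete, here is a minimal sketch that writes out what such a batch file might look like. The function name render_sbatch is hypothetical, the script name toy_ddp.py is taken from the log earlier in this thread, and the resource values are placeholders. The assumption that the number of tasks per node should equal the number of GPUs per node follows Lightning's general guidance for SLURM and may need adjusting for a particular cluster.

```python
def render_sbatch(script_path: str, gpus_per_node: int, nodes: int = 1) -> str:
    """Build the text of a SLURM batch file that launches the script via srun.

    Hypothetical helper for illustration. It requests one task per GPU
    (an assumption) and starts the training script with srun instead of
    calling python directly from the batch file.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gres=gpu:{gpus_per_node}",
        f"#SBATCH --ntasks-per-node={gpus_per_node}",
        "",
        # srun starts one copy of the script per task in the allocation.
        f"srun python {script_path}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values; submit the resulting file with sbatch.
    print(render_sbatch("toy_ddp.py", gpus_per_node=2), end="")
```

With a file like this, the devices argument passed to the Trainer would be expected to match the per-node task count.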
Bug description
I'm trying to run a job with several GPUs. My script immediately gets stuck after outputting:
What version are you seeing the problem on?
2.0+ and 1.9.x
How to reproduce the bug
Error messages and logs
Environment
Current environment
* CUDA: - GPU: - NVIDIA RTX A6000 - NVIDIA RTX A6000 - NVIDIA RTX A6000 - NVIDIA RTX A6000 - NVIDIA RTX A6000 - NVIDIA RTX A6000 - NVIDIA RTX A6000 - NVIDIA RTX A6000 - NVIDIA RTX A6000 - NVIDIA RTX A6000 - available: True - version: 11.7
* Lightning: - lightning: 2.0.1 - lightning-cloud: 0.5.32 - lightning-utilities: 0.7.0 - pytorch-lightning: 1.9.3 - torch: 2.0.0 - torchaudio: 0.13.1 - torchmetrics: 0.11.1 - torchvision: 0.14.1
* Packages: - absl-py: 1.4.0 - aiohttp: 3.8.4 - aiosignal: 1.3.1 - altair: 4.2.2 - anyio: 3.6.2 - appdirs: 1.4.4 - arrow: 1.2.3 - asttokens: 2.2.1 - astunparse: 1.6.3 - async-timeout: 4.0.2 - attrs: 22.2.0 - backcall: 0.2.0 - backports.functools-lru-cache: 1.6.4 - beautifulsoup4: 4.12.0 - black: 23.3.0 - blessed: 1.20.0 - brotlipy: 0.7.0 - cachetools: 5.3.0 - certifi: 2022.12.7 - cffi: 1.15.1 - charset-normalizer: 2.0.4 - click: 8.1.3 - cmake: 3.26.1 - colorama: 0.4.6 - contourpy: 1.0.7 - croniter: 1.3.8 - cryptography: 38.0.4 - cycler: 0.11.0 - dateutils: 0.6.12 - debugpy: 1.5.1 - decorator: 5.1.1 - deepdiff: 6.3.0 - deepxde: 1.8.0 - dnspython: 2.3.0 - docker-pycreds: 0.4.0 - email-validator: 1.3.1 - entrypoints: 0.4 - exceptiongroup: 1.1.0 - executing: 1.2.0 - fastapi: 0.88.0 - filelock: 3.10.7 - flatbuffers: 23.1.21 - flit-core: 3.6.0 - fonttools: 4.38.0 - frozenlist: 1.3.3 - fsspec: 2023.1.0 - gast: 0.4.0 - gitdb: 4.0.10 - gitpython: 3.1.31 - google-auth: 2.16.1 - google-auth-oauthlib: 0.4.6 - google-pasta: 0.2.0 - gpustat: 1.0.0 - grpcio: 1.51.1 - h11: 0.14.0 - h5py: 3.8.0 - hcpdenn: 0.0.1 - httpcore: 0.16.3 - httptools: 0.5.0 - httpx: 0.23.3 - idna: 3.4 - importlib-metadata: 6.0.0 - importlib-resources: 5.12.0 - iniconfig: 2.0.0 - inquirer: 3.1.3 - ipykernel: 6.15.0 - ipython: 8.10.0 - itsdangerous: 2.1.2 - jax: 0.3.25 - jaxlib: 0.3.25+cuda11.cudnn82 - jedi: 0.18.2 - jinja2: 3.1.2 - joblib: 1.2.0 - jsonschema: 4.17.3 - jupyter-client: 7.0.6 - jupyter-core: 4.12.0 - keras: 2.11.0 - kiwisolver: 1.4.4 - libclang: 15.0.6.1 - lightning: 2.0.1 - lightning-cloud: 0.5.32 - lightning-utilities: 0.7.0 - lit: 16.0.0 - markdown: 3.4.1 - markdown-it-py: 2.2.0 - markupsafe: 2.1.2 - matplotlib: 3.7.0 - matplotlib-inline: 0.1.6 - mdurl: 0.1.2 - mkl-fft: 1.3.1 - mkl-random: 1.2.2 - mkl-service: 2.4.0 - ml-dtypes: 0.0.4 - mpmath: 1.3.0 - multidict: 6.0.4 - mypy-extensions: 1.0.0 - nest-asyncio: 1.5.6 - networkx: 3.0 - numpy: 1.23.5 - nvidia-cublas-cu11: 11.10.3.66 - nvidia-cuda-cupti-cu11: 11.7.101 - nvidia-cuda-nvrtc-cu11: 11.7.99 - nvidia-cuda-runtime-cu11: 11.7.99 - nvidia-cudnn-cu11: 8.5.0.96 - nvidia-cufft-cu11: 10.9.0.58 - nvidia-curand-cu11: 10.2.10.91 - nvidia-cusolver-cu11: 11.4.0.1 - nvidia-cusparse-cu11: 11.7.4.91 - nvidia-ml-py: 11.495.46 - nvidia-nccl-cu11: 2.14.3 - nvidia-nvtx-cu11: 11.7.91 - oauthlib: 3.2.2 - opt-einsum: 3.3.0 - ordered-set: 4.1.0 - orjson: 3.8.9 - packaging: 23.0 - pandas: 1.5.3 - parso: 0.8.3 - pathspec: 0.11.1 - pathtools: 0.1.2 - pexpect: 4.8.0 - pickleshare: 0.7.5 - pillow: 9.3.0 - pip: 22.3.1 - platformdirs: 3.2.0 - pluggy: 1.0.0 - pooch: 1.6.0 - prompt-toolkit: 3.0.36 - protobuf: 3.19.6 - psutil: 5.9.4 - ptyprocess: 0.7.0 - pure-eval: 0.2.2 - pyaml: 21.10.1 - pyasn1: 0.4.8 - pyasn1-modules: 0.2.8 - pybind11: 2.10.3 - pycparser: 2.21 - pydantic: 1.10.7 - pygments: 2.14.0 - pyjwt: 2.6.0 - pyopenssl: 22.0.0 - pyparsing: 3.0.9 - pyrsistent: 0.19.3 - pysocks: 1.7.1 - pytest: 7.2.1 - python-dateutil: 2.8.2 - python-dotenv: 1.0.0 - python-editor: 1.0.4 - python-multipart: 0.0.6 - pytorch-lightning: 1.9.3 - pytz: 2022.7.1 - pyyaml: 6.0 - pyzmq: 19.0.2 - readchar: 4.0.5 - requests: 2.28.1 - requests-oauthlib: 1.3.1 - rfc3986: 1.5.0 - rich: 13.3.3 - rsa: 4.9 - scienceplots: 2.0.1 - scikit-learn: 1.2.1 - scikit-optimize: 0.9.0 - scikit-sparse: 0.4.8 - scipy: 1.10.1 - seaborn: 0.12.2 - sentry-sdk: 1.16.0 - setproctitle: 1.3.2 - setuptools: 65.6.3 - six: 1.16.0 - sklearn: 0.0.post1 - smmap: 5.0.0 - sniffio: 1.3.0 - soupsieve: 2.4 - stack-data: 0.6.2 - starlette: 0.22.0 - starsessions: 1.3.0 - sympy: 1.11.1 - tensorboard: 2.11.2 - tensorboard-data-server: 0.6.1 - tensorboard-plugin-wit: 1.8.1 - tensorflow: 2.11.0 - tensorflow-addons: 0.19.0 - tensorflow-estimator: 2.11.0 - tensorflow-io-gcs-filesystem: 0.30.0 - termcolor: 2.2.0 - theseus-ai: 0.1.4 - threadpoolctl: 3.1.0 - tomli: 2.0.1 - toolz: 0.12.0 - torch: 2.0.0 - torchaudio: 0.13.1 - torchmetrics: 0.11.1 - torchvision: 0.14.1 - tornado: 6.2 - tqdm: 4.64.1 - traitlets: 5.9.0 - triton: 2.0.0 - typeguard: 2.13.3 - typing-extensions: 4.4.0 - ujson: 5.7.0 - urllib3: 1.26.14 - uvicorn: 0.21.1 - uvloop: 0.17.0 - wandb: 0.13.10 - watchfiles: 0.19.0 - wcwidth: 0.2.6 - websocket-client: 1.5.1 - websockets: 10.4 - werkzeug: 2.2.3 - wheel: 0.38.4 - wrapt: 1.14.1 - yarl: 1.8.2 - zipp: 3.14.0
* System: - OS: Linux - architecture: - 64bit - ELF - processor: x86_64 - python: 3.9.16 - version: #76-Ubuntu SMP Fri Mar 17 17:19:29 UTC 2023
More info
No response
cc @justusschock @awaelchli