Closed amansingh427 closed 1 year ago
Hey @amansingh427
In the latest Lightning versions, the backend can no longer be set through the environment variable PL_TORCH_DISTRIBUTED_BACKEND. You can set it like so:
from lightning.pytorch import Trainer
from lightning.pytorch.strategies import DDPStrategy
trainer = Trainer(strategy=DDPStrategy(process_group_backend="gloo"), ...)
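If you would rather not hard-code the backend, something along these lines might work. This is only a sketch: `pick_backend` and `build_trainer` are hypothetical helper names, not part of Lightning, and the policy of preferring NCCL with a fallback to gloo is an assumption about what you want. The availability checks use `torch.distributed.is_nccl_available()` and `torch.distributed.is_gloo_available()`.

```python
def pick_backend(nccl_available: bool, gloo_available: bool) -> str:
    """Hypothetical helper: prefer NCCL when present, otherwise fall back to gloo."""
    if nccl_available:
        return "nccl"
    if gloo_available:
        return "gloo"
    raise RuntimeError("neither NCCL nor gloo is available in this PyTorch build")


def build_trainer(**trainer_kwargs):
    """Hypothetical helper: build a Trainer whose DDP backend matches the local build."""
    # Imported here so pick_backend can be exercised without torch installed.
    import torch.distributed as dist
    from lightning.pytorch import Trainer
    from lightning.pytorch.strategies import DDPStrategy

    backend = pick_backend(
        nccl_available=dist.is_nccl_available(),
        gloo_available=dist.is_gloo_available(),
    )
    # The backend is passed to the strategy explicitly, not via an environment variable.
    return Trainer(
        strategy=DDPStrategy(process_group_backend=backend),
        **trainer_kwargs,
    )
```

You would then call it with your usual Trainer arguments, for example `trainer = build_trainer(accelerator="gpu", devices=2)`. On a build without NCCL support this should end up selecting gloo.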
Fixed. Thanks!
Bug description
I am trying to run a training module with CUDA using PyTorch Lightning, but Lightning keeps trying to use NCCL. I have tried every solution I have found online, from specifying the backend in the code to prepending PL_TORCH_DISTRIBUTED_BACKEND=gloo to the launch command in the terminal, but Lightning still seems to try to use NCCL. I have verified that gloo is available for use on my system. Any help would be greatly appreciated.
What version are you seeing the problem on?
master
How to reproduce the bug
Error messages and logs
Environment
Current environment
* CUDA:
  - GPU:
    - NVIDIA TITAN X (Pascal)
    - NVIDIA GeForce GTX 970
  - available: True
  - version: 11.8
* Lightning:
  - lightning: 2.0.8
  - lightning-cloud: 0.5.37
  - lightning-utilities: 0.9.0
  - pytorch-lightning: 2.0.8
  - pytorchvideo: 0.1.5
  - torch: 2.0.1
  - torchaudio: 2.0.2
  - torchmetrics: 1.1.1
  - torchvision: 0.15.2
* Packages:
  - aiofiles: 22.1.0
  - aiohttp: 3.8.5
  - aiosignal: 1.3.1
  - aiosqlite: 0.18.0
  - annotated-types: 0.5.0
  - ansicon: 1.89.0
  - anyio: 3.5.0
  - argon2-cffi: 21.3.0
  - argon2-cffi-bindings: 21.2.0
  - arrow: 1.2.3
  - asttokens: 2.0.5
  - async-timeout: 4.0.3
  - attrs: 22.1.0
  - av: 10.0.0
  - babel: 2.11.0
  - backcall: 0.2.0
  - backoff: 2.2.1
  - beautifulsoup4: 4.12.2
  - bleach: 4.1.0
  - blessed: 1.20.0
  - boto3: 1.28.42
  - botocore: 1.31.42
  - bottleneck: 1.3.5
  - brotlipy: 0.7.0
  - certifi: 2023.7.22
  - cffi: 1.15.1
  - charset-normalizer: 2.0.4
  - click: 8.1.7
  - colorama: 0.4.6
  - comm: 0.1.2
  - contourpy: 1.0.5
  - croniter: 1.4.1
  - cryptography: 41.0.2
  - cycler: 0.11.0
  - dateutils: 0.6.12
  - debugpy: 1.6.7
  - decorator: 5.1.1
  - deepdiff: 6.4.1
  - deeplake: 3.6.23
  - defusedxml: 0.7.1
  - dill: 0.3.7
  - einops: 0.6.1
  - entrypoints: 0.4
  - executing: 0.8.3
  - fastapi: 0.103.1
  - fastjsonschema: 2.16.2
  - filelock: 3.12.3
  - fonttools: 4.25.0
  - frozenlist: 1.4.0
  - fsspec: 2023.9.0
  - fvcore: 0.1.5.post20221221
  - h11: 0.14.0
  - huggingface-hub: 0.17.1
  - humbug: 0.3.2
  - idna: 3.4
  - inquirer: 3.1.3
  - iopath: 0.1.10
  - ipykernel: 6.25.0
  - ipython: 8.12.2
  - ipython-genutils: 0.2.0
  - ipywidgets: 8.0.4
  - itsdangerous: 2.1.2
  - jedi: 0.18.1
  - jinja2: 3.1.2
  - jinxed: 1.2.0
  - jmespath: 1.0.1
  - joblib: 1.2.0
  - json5: 0.9.6
  - jsonschema: 4.17.3
  - jupyter: 1.0.0
  - jupyter-client: 7.4.9
  - jupyter-console: 6.6.3
  - jupyter-core: 5.3.0
  - jupyter-events: 0.6.3
  - jupyter-server: 1.23.4
  - jupyter-server-fileid: 0.9.0
  - jupyter-server-ydoc: 0.8.0
  - jupyter-ydoc: 0.2.4
  - jupyterlab: 3.6.3
  - jupyterlab-pygments: 0.1.2
  - jupyterlab-server: 2.22.0
  - jupyterlab-widgets: 3.0.5
  - kiwisolver: 1.4.4
  - lightning: 2.0.8
  - lightning-cloud: 0.5.37
  - lightning-utilities: 0.9.0
  - llvmlite: 0.40.0
  - lxml: 4.9.2
  - markdown-it-py: 3.0.0
  - markupsafe: 2.1.1
  - matplotlib: 3.7.2
  - matplotlib-inline: 0.1.6
  - mdurl: 0.1.2
  - mistune: 0.8.4
  - mkl-fft: 1.3.6
  - mkl-random: 1.2.2
  - mkl-service: 2.4.0
  - mpmath: 1.3.0
  - multidict: 6.0.4
  - multiprocess: 0.70.15
  - munkres: 1.1.4
  - nbclassic: 0.5.5
  - nbclient: 0.5.13
  - nbconvert: 6.5.4
  - nbformat: 5.7.0
  - nest-asyncio: 1.5.6
  - networkx: 3.1
  - notebook: 6.5.4
  - notebook-shim: 0.2.2
  - numba: 0.57.0
  - numcodecs: 0.11.0
  - numexpr: 2.8.4
  - numpy: 1.24.3
  - ordered-set: 4.1.0
  - packaging: 23.1
  - pandas: 2.0.3
  - pandocfilters: 1.5.0
  - parameterized: 0.9.0
  - parso: 0.8.3
  - pathos: 0.3.1
  - pickleshare: 0.7.5
  - pillow: 9.4.0
  - pip: 23.2.1
  - platformdirs: 3.10.0
  - ply: 3.11
  - portalocker: 2.7.0
  - pox: 0.3.3
  - ppft: 1.7.6.7
  - pretty-errors: 1.2.25
  - prometheus-client: 0.14.1
  - prompt-toolkit: 3.0.36
  - psutil: 5.9.0
  - pure-eval: 0.2.2
  - pycparser: 2.21
  - pydantic: 2.1.1
  - pydantic-core: 2.4.0
  - pygments: 2.15.1
  - pyjwt: 2.8.0
  - pyopenssl: 23.2.0
  - pyparsing: 3.0.9
  - pyqt5: 5.15.7
  - pyqt5-sip: 12.11.0
  - pyrsistent: 0.18.0
  - pysocks: 1.7.1
  - python-dateutil: 2.8.2
  - python-editor: 1.0.4
  - python-json-logger: 2.0.7
  - python-multipart: 0.0.6
  - pytorch-lightning: 2.0.8
  - pytorchvideo: 0.1.5
  - pytz: 2022.7
  - pywin32: 305.1
  - pywinpty: 2.0.10
  - pyyaml: 6.0
  - pyzmq: 23.2.0
  - qtconsole: 5.4.2
  - qtpy: 2.2.0
  - readchar: 4.0.5
  - regex: 2023.8.8
  - requests: 2.31.0
  - rfc3339-validator: 0.1.4
  - rfc3986-validator: 0.1.1
  - rich: 13.5.2
  - s3transfer: 0.6.2
  - safetensors: 0.3.3
  - scikit-learn: 1.2.2
  - scipy: 1.11.1
  - send2trash: 1.8.0
  - setuptools: 68.0.0
  - sip: 6.6.2
  - six: 1.16.0
  - sniffio: 1.2.0
  - soupsieve: 2.4
  - stack-data: 0.2.0
  - starlette: 0.27.0
  - starsessions: 1.3.0
  - sympy: 1.11.1
  - tabulate: 0.9.0
  - termcolor: 2.3.0
  - terminado: 0.17.1
  - threadpoolctl: 2.2.0
  - tinycss2: 1.2.1
  - tokenizers: 0.13.3
  - toml: 0.10.2
  - torch: 2.0.1
  - torchaudio: 2.0.2
  - torchmetrics: 1.1.1
  - torchvision: 0.15.2
  - tornado: 6.3.2
  - tqdm: 4.65.0
  - traitlets: 5.7.1
  - transformers: 4.33.1
  - typing-extensions: 4.7.1
  - tzdata: 2023.3
  - urllib3: 1.26.16
  - uvicorn: 0.23.2
  - wcwidth: 0.2.5
  - webencodings: 0.5.1
  - websocket-client: 0.58.0
  - websockets: 11.0.3
  - wheel: 0.38.4
  - widgetsnbextension: 4.0.5
  - win-inet-pton: 1.1.0
  - y-py: 0.5.9
  - yacs: 0.1.8
  - yarl: 1.9.2
  - ypy-websocket: 0.8.2
* System:
  - OS: Windows
  - architecture:
    - 64bit
    - WindowsPE
  - processor: Intel64 Family 6 Model 85 Stepping 4, GenuineIntel
  - python: 3.11.4
  - release: 10
  - version: 10.0.19041

cc @justusschock @awaelchli