Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Can't seem to change distributed backend to gloo on Windows #18589

Closed amansingh427 closed 1 year ago

amansingh427 commented 1 year ago

Bug description

I am trying to run a training module with CUDA using PyTorch Lightning, but Lightning keeps trying to use NCCL. I have tried every solution I have found online, from specifying the backend in code to prepending PL_TORCH_DISTRIBUTED_BACKEND=gloo to the launch command in the terminal, but Lightning still tries to use NCCL. I have verified that gloo is available on my system. Any help would be greatly appreciated.
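For reference, one quick way to confirm which backends the local PyTorch build supports is a minimal check using torch.distributed's availability helpers (Windows wheels of PyTorch are typically built without NCCL, so only gloo is expected to report True):

import torch.distributed as dist

# Report which process-group backends this PyTorch build was compiled with.
print("gloo available:", dist.is_gloo_available())
print("nccl available:", dist.is_nccl_available())
print("mpi available:", dist.is_mpi_available())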

What version are you seeing the problem on?

master

How to reproduce the bug

os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"
my_data = MyDataModule(args...)
my_model = MyModel(args...)
trainer = Trainer()
trainer.fit(my_model, my_data.train_dataloader, my_data.val_dataloader)

# the same error also appears when launching with: PL_TORCH_DISTRIBUTED_BACKEND=gloo python train.py

Error messages and logs

$ PL_TORCH_DISTRIBUTED_BACKEND=gloo python train.py
C:\Users\user\AppData\Local\anaconda3\envs\env\Lib\site-packages\torchaudio\backend\utils.py:74: UserWarning: No audio backend is available.
  warnings.warn("No audio backend is available.")
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
C:\Users\user\AppData\Local\anaconda3\envs\env\Lib\site-packages\pytorch_lightning\trainer\connectors\logger_connector\logger_connector.py:67: UserWarning: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
  warning_cache.warn(
C:\Users\user\AppData\Local\anaconda3\envs\env\Lib\site-packages\pytorch_lightning\loops\utilities.py:72: PossibleUserWarning: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
  rank_zero_warn(
C:\Users\user\AppData\Local\anaconda3\envs\env\Lib\site-packages\pytorch_lightning\trainer\configuration_validator.py:69: UserWarning: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
  rank_zero_warn("You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.")
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [system.intranet.company.net]:52432 (system error: 10049 - The requested address is not valid in its context.).
C:\Users\user\AppData\Local\anaconda3\envs\env\Lib\site-packages\torchaudio\backend\utils.py:74: UserWarning: No audio backend is available.
  warnings.warn("No audio backend is available.")
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [system.intranet.company.net]:52432 (system error: 10049 - The requested address is not valid in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [system.intranet.company.net]:52432 (system error: 10049 - The requested address is not valid in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [system.intranet.company.net]:52432 (system error: 10049 - The requested address is not valid in its context.).

-------------------------------------------------------------------------------------------------------------------------------------------------------------------
train.py 62 <module>
trainer.fit(my_model, my_data.train_dataloader, my_data.val_dataloader)

trainer.py 532 fit
call._call_and_handle_interrupt(

call.py 42 _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)

subprocess_script.py 93 launch
return function(*args, **kwargs)

trainer.py 571 _fit_impl
self._run(model, ckpt_path=ckpt_path)

trainer.py 938 _run
self.strategy.setup_environment()

ddp.py 143 setup_environment
self.setup_distributed()

ddp.py 191 setup_distributed
_init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)

distributed.py 258 _init_dist_connection
torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)

distributed_c10d.py 907 init_process_group
default_pg = _new_process_group_helper(

distributed_c10d.py 1013 _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL " "built in")

RuntimeError:
Distributed package doesn't have NCCL built in

(The same traceback is printed once per spawned process; it is shown once here.)

Environment

Current environment

* CUDA:
  - GPU:
    - NVIDIA TITAN X (Pascal)
    - NVIDIA GeForce GTX 970
  - available: True
  - version: 11.8
* Lightning:
  - lightning: 2.0.8
  - lightning-cloud: 0.5.37
  - lightning-utilities: 0.9.0
  - pytorch-lightning: 2.0.8
  - pytorchvideo: 0.1.5
  - torch: 2.0.1
  - torchaudio: 2.0.2
  - torchmetrics: 1.1.1
  - torchvision: 0.15.2
* Packages: aiofiles: 22.1.0 - aiohttp: 3.8.5 - aiosignal: 1.3.1 - aiosqlite: 0.18.0 - annotated-types: 0.5.0 - ansicon: 1.89.0 - anyio: 3.5.0 - argon2-cffi: 21.3.0 - argon2-cffi-bindings: 21.2.0 - arrow: 1.2.3 - asttokens: 2.0.5 - async-timeout: 4.0.3 - attrs: 22.1.0 - av: 10.0.0 - babel: 2.11.0 - backcall: 0.2.0 - backoff: 2.2.1 - beautifulsoup4: 4.12.2 - bleach: 4.1.0 - blessed: 1.20.0 - boto3: 1.28.42 - botocore: 1.31.42 - bottleneck: 1.3.5 - brotlipy: 0.7.0 - certifi: 2023.7.22 - cffi: 1.15.1 - charset-normalizer: 2.0.4 - click: 8.1.7 - colorama: 0.4.6 - comm: 0.1.2 - contourpy: 1.0.5 - croniter: 1.4.1 - cryptography: 41.0.2 - cycler: 0.11.0 - dateutils: 0.6.12 - debugpy: 1.6.7 - decorator: 5.1.1 - deepdiff: 6.4.1 - deeplake: 3.6.23 - defusedxml: 0.7.1 - dill: 0.3.7 - einops: 0.6.1 - entrypoints: 0.4 - executing: 0.8.3 - fastapi: 0.103.1 - fastjsonschema: 2.16.2 - filelock: 3.12.3 - fonttools: 4.25.0 - frozenlist: 1.4.0 - fsspec: 2023.9.0 - fvcore: 0.1.5.post20221221 - h11: 0.14.0 - huggingface-hub: 0.17.1 - humbug: 0.3.2 - idna: 3.4 - inquirer: 3.1.3 - iopath: 0.1.10 - ipykernel: 6.25.0 - ipython: 8.12.2 - ipython-genutils: 0.2.0 - ipywidgets: 8.0.4 - itsdangerous: 2.1.2 - jedi: 0.18.1 - jinja2: 3.1.2 - jinxed: 1.2.0 - jmespath: 1.0.1 - joblib: 1.2.0 - json5: 0.9.6 - jsonschema: 4.17.3 - jupyter: 1.0.0 - jupyter-client: 7.4.9 - jupyter-console: 6.6.3 - jupyter-core: 5.3.0 - jupyter-events: 0.6.3 - jupyter-server: 1.23.4 - jupyter-server-fileid: 0.9.0 - jupyter-server-ydoc: 0.8.0 - jupyter-ydoc: 0.2.4 - jupyterlab: 3.6.3 - jupyterlab-pygments: 0.1.2 - jupyterlab-server: 2.22.0 - jupyterlab-widgets: 3.0.5 - kiwisolver: 1.4.4 - lightning: 2.0.8 - lightning-cloud: 0.5.37 - lightning-utilities: 0.9.0 - llvmlite: 0.40.0 - lxml: 4.9.2 - markdown-it-py: 3.0.0 - markupsafe: 2.1.1 - matplotlib: 3.7.2 - matplotlib-inline: 0.1.6 - mdurl: 0.1.2 - mistune: 0.8.4 - mkl-fft: 1.3.6 - mkl-random: 1.2.2 - mkl-service: 2.4.0 - mpmath: 1.3.0 - multidict: 6.0.4 - multiprocess: 0.70.15 - munkres: 1.1.4 - nbclassic: 0.5.5 - nbclient: 0.5.13 - nbconvert: 6.5.4 - nbformat: 5.7.0 - nest-asyncio: 1.5.6 - networkx: 3.1 - notebook: 6.5.4 - notebook-shim: 0.2.2 - numba: 0.57.0 - numcodecs: 0.11.0 - numexpr: 2.8.4 - numpy: 1.24.3 - ordered-set: 4.1.0 - packaging: 23.1 - pandas: 2.0.3 - pandocfilters: 1.5.0 - parameterized: 0.9.0 - parso: 0.8.3 - pathos: 0.3.1 - pickleshare: 0.7.5 - pillow: 9.4.0 - pip: 23.2.1 - platformdirs: 3.10.0 - ply: 3.11 - portalocker: 2.7.0 - pox: 0.3.3 - ppft: 1.7.6.7 - pretty-errors: 1.2.25 - prometheus-client: 0.14.1 - prompt-toolkit: 3.0.36 - psutil: 5.9.0 - pure-eval: 0.2.2 - pycparser: 2.21 - pydantic: 2.1.1 - pydantic-core: 2.4.0 - pygments: 2.15.1 - pyjwt: 2.8.0 - pyopenssl: 23.2.0 - pyparsing: 3.0.9 - pyqt5: 5.15.7 - pyqt5-sip: 12.11.0 - pyrsistent: 0.18.0 - pysocks: 1.7.1 - python-dateutil: 2.8.2 - python-editor: 1.0.4 - python-json-logger: 2.0.7 - python-multipart: 0.0.6 - pytorch-lightning: 2.0.8 - pytorchvideo: 0.1.5 - pytz: 2022.7 - pywin32: 305.1 - pywinpty: 2.0.10 - pyyaml: 6.0 - pyzmq: 23.2.0 - qtconsole: 5.4.2 - qtpy: 2.2.0 - readchar: 4.0.5 - regex: 2023.8.8 - requests: 2.31.0 - rfc3339-validator: 0.1.4 - rfc3986-validator: 0.1.1 - rich: 13.5.2 - s3transfer: 0.6.2 - safetensors: 0.3.3 - scikit-learn: 1.2.2 - scipy: 1.11.1 - send2trash: 1.8.0 - setuptools: 68.0.0 - sip: 6.6.2 - six: 1.16.0 - sniffio: 1.2.0 - soupsieve: 2.4 - stack-data: 0.2.0 - starlette: 0.27.0 - starsessions: 1.3.0 - sympy: 1.11.1 - tabulate: 0.9.0 - termcolor: 2.3.0 - terminado: 0.17.1 - threadpoolctl: 2.2.0 - tinycss2: 1.2.1 - tokenizers: 0.13.3 - toml: 0.10.2 - torch: 2.0.1 - torchaudio: 2.0.2 - torchmetrics: 1.1.1 - torchvision: 0.15.2 - tornado: 6.3.2 - tqdm: 4.65.0 - traitlets: 5.7.1 - transformers: 4.33.1 - typing-extensions: 4.7.1 - tzdata: 2023.3 - urllib3: 1.26.16 - uvicorn: 0.23.2 - wcwidth: 0.2.5 - webencodings: 0.5.1 - websocket-client: 0.58.0 - websockets: 11.0.3 - wheel: 0.38.4 - widgetsnbextension: 4.0.5 - win-inet-pton: 1.1.0 - y-py: 0.5.9 - yacs: 0.1.8 - yarl: 1.9.2 - ypy-websocket: 0.8.2
* System:
  - OS: Windows
  - architecture: 64bit, WindowsPE
  - processor: Intel64 Family 6 Model 85 Stepping 4, GenuineIntel
  - python: 3.11.4
  - release: 10
  - version: 10.0.19041

cc @justusschock @awaelchli

awaelchli commented 1 year ago

Hey @amansingh427! In the latest Lightning versions, the backend can no longer be set through the environment variable PL_TORCH_DISTRIBUTED_BACKEND. You can set it like so:

from lightning.pytorch.strategies import DDPStrategy

trainer = Trainer(strategy=DDPStrategy(process_group_backend="gloo"), ...)
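A minimal end-to-end sketch of this fix, assuming the two local GPUs visible in the log above and the MyModel/MyDataModule classes from the reproduction (passing the datamodule to fit lets Lightning call its train_dataloader/val_dataloader hooks itself):

from lightning.pytorch import Trainer
from lightning.pytorch.strategies import DDPStrategy

# Request the gloo process-group backend explicitly; Windows builds
# of PyTorch do not ship with NCCL, which is what triggered the
# "Distributed package doesn't have NCCL built in" error above.
trainer = Trainer(
    accelerator="gpu",
    devices=2,  # assumption: two local GPUs, as in the reporter's log
    strategy=DDPStrategy(process_group_backend="gloo"),
)
trainer.fit(my_model, datamodule=my_data)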
amansingh427 commented 1 year ago

Fixed. Thanks!