Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

AttributeError: module 'torch.distributed' has no attribute 'ProcessGroup' #12725

carmocca closed this issue 2 years ago

carmocca commented 2 years ago

@carmocca This is the stack trace I get when I import PyTorch Lightning with the following environment.yml:

name: pl_error
channels:
  - defaults
  - pytorch
  - conda-forge

dependencies:
  - python=3.9
  - numpy=1.21.2
  - pytorch=1.11
  - pytorch-lightning=1.6

Import error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/__init__.py", line 30, in <module>
    from pytorch_lightning.callbacks import Callback  # noqa: E402
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/callbacks/__init__.py", line 26, in <module>
    from pytorch_lightning.callbacks.pruning import ModelPruning
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/callbacks/pruning.py", line 31, in <module>
    from pytorch_lightning.core.lightning import LightningModule
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/core/__init__.py", line 16, in <module>
    from pytorch_lightning.core.lightning import LightningModule
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 41, in <module>
    from pytorch_lightning.trainer.connectors.data_connector import _DataHookSelector
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/trainer/__init__.py", line 16, in <module>
    from pytorch_lightning.trainer.trainer import Trainer
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 34, in <module>
    from pytorch_lightning.accelerators import Accelerator, GPUAccelerator, HPUAccelerator, IPUAccelerator, TPUAccelerator
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/accelerators/__init__.py", line 14, in <module>
    from pytorch_lightning.accelerators.cpu import CPUAccelerator  # noqa: F401
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/accelerators/cpu.py", line 19, in <module>
    from pytorch_lightning.utilities import device_parser
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/utilities/device_parser.py", line 18, in <module>
    from pytorch_lightning.plugins.environments import TorchElasticEnvironment
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/plugins/__init__.py", line 20, in <module>
    from pytorch_lightning.plugins.training_type.ddp import DDPPlugin
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/__init__.py", line 1, in <module>
    from pytorch_lightning.plugins.training_type.ddp import DDPPlugin  # noqa: F401
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 14, in <module>
    from pytorch_lightning.strategies import DDPStrategy
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/strategies/__init__.py", line 14, in <module>
    from pytorch_lightning.strategies.bagua import BaguaStrategy  # noqa: F401
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/strategies/bagua.py", line 17, in <module>
    from pytorch_lightning.strategies.ddp import DDPStrategy
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 66, in <module>
    from torch.distributed.algorithms.model_averaging.averagers import ModelAverager
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/torch/distributed/algorithms/model_averaging/averagers.py", line 5, in <module>
    import torch.distributed.algorithms.model_averaging.utils as utils
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/torch/distributed/algorithms/model_averaging/utils.py", line 10, in <module>
    params: Iterator[torch.nn.Parameter], process_group: dist.ProcessGroup
AttributeError: module 'torch.distributed' has no attribute 'ProcessGroup'

Originally posted by @schiegl in https://github.com/PyTorchLightning/pytorch-lightning/issues/10348#issuecomment-1095287462

carmocca commented 2 years ago

@krshrimali can you take care of this?

This is a bug inside PyTorch, as seen in the stack trace:

  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/torch/distributed/algorithms/model_averaging/utils.py", line 10, in <module>
    params: Iterator[torch.nn.Parameter], process_group: dist.ProcessGroup

but we can still gate our import behind a distributed_available check here:

  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 66, in <module>
    from torch.distributed.algorithms.model_averaging.averagers import ModelAverager

So this will require one PR to this repo and another one to PyTorch.
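
For illustration, the gated import mentioned above could look roughly like the sketch below. It is not the actual patch: the helper name distributed_available and the None fallback are assumptions, and only torch.distributed.is_available() is an existing PyTorch call. The idea is to skip the ModelAverager import whenever the installed PyTorch build has no distributed support, so that importing the package no longer reaches the failing annotation.

```python
import importlib.util


def distributed_available() -> bool:
    """Hypothetical helper: True only if torch is installed with distributed support."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch.distributed as dist

    # is_available() returns False on builds compiled without distributed support,
    # which is where dist.ProcessGroup is missing.
    return dist.is_available()


# Fallback used when the import is skipped or fails (an assumption of this sketch).
ModelAverager = None

if distributed_available():
    try:
        from torch.distributed.algorithms.model_averaging.averagers import ModelAverager
    except (ImportError, AttributeError):
        # Older PyTorch versions may not ship this module at all.
        ModelAverager = None
```

Code that uses ModelAverager would then need to check for None before referencing it, for example before an isinstance check on an optimizer or averager object.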