Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Module 'torch.distributed' has no attribute 'ProcessGroup' when importing PyTorch Lightning #10348

Closed: Riccorl closed this issue 2 years ago

Riccorl commented 3 years ago

🐛 Bug

When importing PyTorch Lightning, it throws an AttributeError: module 'torch.distributed' has no attribute 'ProcessGroup'. I guess it comes from the fact that I am on macOS (M1) and PyTorch does not provide torch.distributed with its pre-built package. Indeed, torch.distributed.is_available() is False.
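A quick check confirms whether a given torch build ships distributed support (this snippet is just a diagnostic, not part of the fix):

import torch

# Prints False on builds compiled without distributed support,
# such as some macOS/ARM packages.
print(torch.distributed.is_available())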

To Reproduce

import pytorch_lightning

Environment

awaelchli commented 3 years ago

Hello @Riccorl, thanks for reporting. Can you please show us the full error so we can check in which module it occurs?

awaelchli commented 3 years ago

I'm not sure why you get torch.distributed.is_available() = False on macOS; it should be True. It is for me.

Riccorl commented 3 years ago

Hello @Riccorl, thanks for reporting. Can you please show us the full error so we can check in which module it occurs?

This is the stack trace

Traceback (most recent call last):
  File "/Users/ric/Documents/PhD/Projects/invero-xl/invero_xl/train.py", line 7, in <module>
    import pytorch_lightning as pl
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/__init__.py", line 20, in <module>
    from pytorch_lightning.callbacks import Callback  # noqa: E402
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/callbacks/__init__.py", line 26, in <module>
    from pytorch_lightning.callbacks.pruning import ModelPruning
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/callbacks/pruning.py", line 31, in <module>
    from pytorch_lightning.core.lightning import LightningModule
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/core/__init__.py", line 16, in <module>
    from pytorch_lightning.core.lightning import LightningModule
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 39, in <module>
    from pytorch_lightning.trainer.connectors.logger_connector.fx_validator import _FxValidator
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/trainer/__init__.py", line 16, in <module>
    from pytorch_lightning.trainer.trainer import Trainer
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 30, in <module>
    from pytorch_lightning.accelerators import Accelerator, IPUAccelerator
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/accelerators/__init__.py", line 13, in <module>
    from pytorch_lightning.accelerators.accelerator import Accelerator  # noqa: F401
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 26, in <module>
    from pytorch_lightning.plugins.precision import ApexMixedPrecisionPlugin, NativeMixedPrecisionPlugin, PrecisionPlugin
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/plugins/__init__.py", line 8, in <module>
    from pytorch_lightning.plugins.plugins_registry import (  # noqa: F401
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/plugins/plugins_registry.py", line 20, in <module>
    from pytorch_lightning.plugins.training_type.training_type_plugin import TrainingTypePlugin
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/__init__.py", line 1, in <module>
    from pytorch_lightning.plugins.training_type.ddp import DDPPlugin  # noqa: F401
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 68, in <module>
    from torch.distributed.optim import DistributedOptimizer
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/torch/distributed/optim/__init__.py", line 37, in <module>
    from .post_localSGD_optimizer import PostLocalSGDOptimizer
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/torch/distributed/optim/post_localSGD_optimizer.py", line 2, in <module>
    import torch.distributed.algorithms.model_averaging.averagers as averagers
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/torch/distributed/algorithms/model_averaging/averagers.py", line 5, in <module>
    import torch.distributed.algorithms.model_averaging.utils as utils
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/torch/distributed/algorithms/model_averaging/utils.py", line 10, in <module>
    params: Iterator[torch.nn.Parameter], process_group: dist.ProcessGroup
AttributeError: module 'torch.distributed' has no attribute 'ProcessGroup'

I'm not sure why you get torch.distributed.is_available() = False on macOS; it should be True. It is for me.

I installed PyTorch like this:

conda install pytorch -c pytorch

But I guess that the problem is the ARM build (I'm on an M1 cpu).

carmocca commented 3 years ago

We can fix this easily, as the error comes from a typing annotation, but we'll also have to add an M1 CI job when it becomes available.
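For illustration, one common pattern for that kind of fix looks like this (a sketch only; broadcast_if_initialized is a made-up example, not the actual patch):

from typing import TYPE_CHECKING, Optional

import torch
import torch.distributed as dist

if TYPE_CHECKING:
    # Evaluated only by static type checkers, never at runtime, so the module
    # imports cleanly even on torch builds without distributed support.
    from torch.distributed import ProcessGroup

def broadcast_if_initialized(tensor: torch.Tensor, group: Optional["ProcessGroup"] = None) -> torch.Tensor:
    # The quoted annotation above never touches dist.ProcessGroup at runtime.
    if not dist.is_available() or not dist.is_initialized():
        return tensor
    dist.broadcast(tensor, src=0, group=group)
    return tensor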

four4fish commented 3 years ago

Currently, init_dist_connection() does nothing if torch.distributed.is_available() is False. To wrap a model with DistributedDataParallel(), something like torch.distributed.init_process_group(backend='nccl', world_size=N, init_method='...') is required, right? Should we raise an exception in setup_distributed() in ddp if torch.distributed.is_available() is False?
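For reference, the minimal initialization that DistributedDataParallel relies on looks roughly like this (illustrative values; real jobs get the rank and world size from the launcher):

import torch.distributed as dist

# Illustrative single-process setup; "gloo" works on CPU, "nccl" needs GPUs.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500",
    world_size=1,
    rank=0,
)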

awaelchli commented 3 years ago

@four4fish Yes, you are right. I tested it and we get this message:

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

I agree; for users like @Riccorl, we could inform them directly that DDP is not available when torch.distributed.is_available() is False.

Btw, doesn't your PR directly solve the import problem of this issue? I couldn't find any other affected places.

anuragsingh31 commented 3 years ago

@Riccorl I had a similar issue on an M1 MacBook; it only happens with pytorch=1.10. Downgrading torch to '1.9.1.post3' resolved the issue for me.
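(If you want to try that workaround, something like conda install pytorch=1.9.1 -c pytorch should pin the older build; the exact version string available depends on your channel.)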

four4fish commented 3 years ago

@awaelchli I think that after the import PR, importing Lightning no longer fails. But when the Trainer calls DDP's setup_distributed(), which calls init_dist_connection(), it checks torch.distributed.is_available() before creating the process group. Because torch.distributed.is_available() is False, no process group is created, so DDP fails later on. Where exactly does this runtime error happen? When wrapping the model?

I was proposing: should we throw an exception in init_dist_connection() if torch.distributed.is_available() is False?
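In code, the proposed guard might look roughly like this (a sketch of the idea, not Lightning's actual init_dist_connection()):

import torch.distributed as dist

def init_dist_connection_sketch(backend: str, world_size: int, rank: int) -> None:
    # Fail fast with an actionable message instead of silently skipping
    # process-group creation on builds without distributed support.
    if not dist.is_available():
        raise RuntimeError(
            "torch.distributed is not available; DDP requires a PyTorch build "
            "with distributed support."
        )
    if not dist.is_initialized():
        # init_method defaults to env://, i.e. MASTER_ADDR/MASTER_PORT env vars
        dist.init_process_group(backend=backend, world_size=world_size, rank=rank)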

adamjstewart commented 3 years ago

I encountered this same issue. I'm building PyTorch Lightning 1.5.0 and PyTorch 1.10.0 from source using the Spack package manager on macOS 10.15.7. Unfortunately, PyTorch distributed doesn't seem to build for me on macOS: https://github.com/pytorch/pytorch/issues/68002

It sounds like requiring distributed support was an accident and will be removed in future releases. Let me know which PR solves this and I'll add a patch to the 1.5.0 release in Spack.

carmocca commented 3 years ago

@four4fish your PR (#10418) says "partially fixes".

Do we need to re-open this? What's left for us to do here?

adamjstewart commented 2 years ago

I just tried again with PyTorch Lightning 1.5.2 and I'm still seeing numerous issues if PyTorch isn't installed with distributed support.

justusschock commented 2 years ago

@adamjstewart I also tested this with PL 1.5.2 and I had no issues. Can you give us your torch version and a reproducible script?

adamjstewart commented 2 years ago

@justusschock sure, my environment looks like:

In order to reproduce this issue, PyTorch must be installed without distributed support:

$ python
>>> import torch
>>> torch.distributed.is_available()
False

This is commonly the case on macOS. Then, the issue (which now looks different than it did in 1.5.0) can be reproduced like so:

$ python
>>> from pytorch_lightning.core.lightning import LightningModule
>>> LightningModule()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 122, in __init__
    self._register_sharded_tensor_state_dict_hooks_if_available()
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 2065, in _register_sharded_tensor_state_dict_hooks_if_available
    from torch.distributed._sharded_tensor import pre_load_state_dict_hook, state_dict_hook
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/distributed/_sharded_tensor/__init__.py", line 5, in <module>
    from torch.distributed._sharding_spec import (
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/distributed/_sharding_spec/__init__.py", line 1, in <module>
    from .api import (
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/distributed/_sharding_spec/api.py", line 21, in <module>
    class DevicePlacementSpec(PlacementSpec):
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/distributed/_sharding_spec/api.py", line 29, in DevicePlacementSpec
    device: torch.distributed._remote_device
AttributeError: module 'torch.distributed' has no attribute '_remote_device'

ananthsub commented 2 years ago

That error arises due to the automatic registration support for sharded tensors here: https://github.com/PyTorchLightning/pytorch-lightning/blob/2c7c4aab8087d4c1c99c57c7acc66ef9a8e815d4/pytorch_lightning/core/lightning.py#L1988-L1994

We should check whether torch.distributed is available before importing in that function's implementation.
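Sketched out, that guard could look something like this (a hypothetical simplification, mirroring the private import from the failing frame above):

import torch.distributed as dist

def _register_sharded_tensor_state_dict_hooks_if_available() -> None:
    # Bail out before touching any torch.distributed submodule when the
    # build has no distributed support, so the failing import never runs.
    if not dist.is_available():
        return
    from torch.distributed._sharded_tensor import (  # private, torch 1.10-era API
        pre_load_state_dict_hook,
        state_dict_hook,
    )
    # ... hook registration would follow here, as in lightning.py ...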

adamjstewart commented 2 years ago

Just wanted to follow up on this and say that all issues I was encountering with non-distributed PyTorch seem to be fixed in 1.5.3. Thanks @ananthsub @four4fish and everyone else involved in fixing these!

AdirRahamim commented 2 years ago

@adamjstewart I'm using a Mac with an M1 and version 1.5.3 and still get ImportError: cannot import name 'ProcessGroup' from 'torch.distributed' when trying to import pytorch_lightning. Have you done anything else to solve this?

adamjstewart commented 2 years ago

Hmm, 1.5.3 just worked for me, no hacks required. Are you sure you're using 1.5.3? You might be hitting a different part of the code than me. Can you share the full stack trace?

AdirRahamim commented 2 years ago

Yes, I'm sure I'm using 1.5.3. This is the stack trace from trying to import pytorch_lightning:

Traceback (most recent call last):
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3444, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<...>", line 1, in <module>
    import pytorch_lightning as pl
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/pytorch_lightning/__init__.py", line 20, in <module>
    from pytorch_lightning.callbacks import Callback  # noqa: E402
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/pytorch_lightning/callbacks/__init__.py", line 14, in <module>
    from pytorch_lightning.callbacks.base import Callback
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/pytorch_lightning/callbacks/base.py", line 26, in <module>
    from pytorch_lightning.utilities.types import STEP_OUTPUT
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/pytorch_lightning/utilities/__init__.py", line 18, in <module>
    from pytorch_lightning.utilities.apply_func import move_data_to_device  # noqa: F401
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 26, in <module>
    from pytorch_lightning.utilities.imports import _compare_version, _TORCHTEXT_AVAILABLE
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/pytorch_lightning/utilities/imports.py", line 82, in <module>
    _FAIRSCALE_AVAILABLE = not _IS_WINDOWS and _module_available("fairscale.nn")
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/pytorch_lightning/utilities/imports.py", line 38, in _module_available
    return find_spec(module_path) is not None
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/importlib/util.py", line 94, in find_spec
    parent = __import__(parent_name, fromlist=['__path__'])
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/fairscale/__init__.py", line 15, in <module>
    from . import nn
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/fairscale/nn/__init__.py", line 9, in <module>
    from .data_parallel import FullyShardedDataParallel, ShardedDataParallel
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/fairscale/nn/data_parallel/__init__.py", line 8, in <module>
    from .fully_sharded_data_parallel import FullyShardedDataParallel, TrainingState, auto_wrap_bn
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 34, in <module>
    from torch.distributed import ProcessGroup
ImportError: cannot import name 'ProcessGroup' from 'torch.distributed' (/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/torch/distributed/__init__.py)

carmocca commented 2 years ago

@AdirRahamim that's caused by the same problem described in this issue but for the fairscale repository: https://github.com/facebookresearch/fairscale

You can raise this issue on their repository. You can also uninstall the dependency, assuming you are not using it. Uninstalling it means it will not get imported, so you won't hit the failure.

pip uninstall fairscale

AdirRahamim commented 2 years ago

@carmocca Thanks! Indeed, uninstalling the package solved the problem.

schiegl commented 2 years ago

I'm still experiencing this issue on PyTorch Lightning v1.6.0 and PyTorch v1.11.0. Furthermore, torch.distributed.is_available() evaluates to False. Does this have something to do with the fact that I installed the dependencies with miniforge and therefore from conda-forge?

carmocca commented 2 years ago

@schiegl can you share the full error stacktrace?

schiegl commented 2 years ago

@carmocca This is the stack trace I get when I import PyTorch Lightning with the following environment.yml:

name: pl_error
channels:
  - defaults
  - pytorch
  - conda-forge

dependencies:
  - python=3.9
  - numpy=1.21.2
  - pytorch=1.11
  - pytorch-lightning=1.6

Import error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/__init__.py", line 30, in <module>
    from pytorch_lightning.callbacks import Callback  # noqa: E402
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/callbacks/__init__.py", line 26, in <module>
    from pytorch_lightning.callbacks.pruning import ModelPruning
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/callbacks/pruning.py", line 31, in <module>
    from pytorch_lightning.core.lightning import LightningModule
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/core/__init__.py", line 16, in <module>
    from pytorch_lightning.core.lightning import LightningModule
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 41, in <module>
    from pytorch_lightning.trainer.connectors.data_connector import _DataHookSelector
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/trainer/__init__.py", line 16, in <module>
    from pytorch_lightning.trainer.trainer import Trainer
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 34, in <module>
    from pytorch_lightning.accelerators import Accelerator, GPUAccelerator, HPUAccelerator, IPUAccelerator, TPUAccelerator
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/accelerators/__init__.py", line 14, in <module>
    from pytorch_lightning.accelerators.cpu import CPUAccelerator  # noqa: F401
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/accelerators/cpu.py", line 19, in <module>
    from pytorch_lightning.utilities import device_parser
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/utilities/device_parser.py", line 18, in <module>
    from pytorch_lightning.plugins.environments import TorchElasticEnvironment
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/plugins/__init__.py", line 20, in <module>
    from pytorch_lightning.plugins.training_type.ddp import DDPPlugin
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/__init__.py", line 1, in <module>
    from pytorch_lightning.plugins.training_type.ddp import DDPPlugin  # noqa: F401
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 14, in <module>
    from pytorch_lightning.strategies import DDPStrategy
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/strategies/__init__.py", line 14, in <module>
    from pytorch_lightning.strategies.bagua import BaguaStrategy  # noqa: F401
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/strategies/bagua.py", line 17, in <module>
    from pytorch_lightning.strategies.ddp import DDPStrategy
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 66, in <module>
    from torch.distributed.algorithms.model_averaging.averagers import ModelAverager
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/torch/distributed/algorithms/model_averaging/averagers.py", line 5, in <module>
    import torch.distributed.algorithms.model_averaging.utils as utils
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/torch/distributed/algorithms/model_averaging/utils.py", line 10, in <module>
    params: Iterator[torch.nn.Parameter], process_group: dist.ProcessGroup
AttributeError: module 'torch.distributed' has no attribute 'ProcessGroup'

JasonTam commented 2 years ago

@schiegl @carmocca FWIW, I was also facing this issue on 1.6.0. Downgrading to 1.5.3 fixed it for me, though.
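(For example, pip install "pytorch-lightning==1.5.3" pins that release; adjust accordingly for conda-managed environments.)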