Closed: Riccorl closed this issue 2 years ago.
Hello @Riccorl Thanks for reporting. Can you please show us the full error so we can check in which module it occurs?
This is the stack trace
Traceback (most recent call last):
File "/Users/ric/Documents/PhD/Projects/invero-xl/invero_xl/train.py", line 7, in <module>
import pytorch_lightning as pl
File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/__init__.py", line 20, in <module>
from pytorch_lightning.callbacks import Callback # noqa: E402
File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/callbacks/__init__.py", line 26, in <module>
from pytorch_lightning.callbacks.pruning import ModelPruning
File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/callbacks/pruning.py", line 31, in <module>
from pytorch_lightning.core.lightning import LightningModule
File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/core/__init__.py", line 16, in <module>
from pytorch_lightning.core.lightning import LightningModule
File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 39, in <module>
from pytorch_lightning.trainer.connectors.logger_connector.fx_validator import _FxValidator
File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/trainer/__init__.py", line 16, in <module>
from pytorch_lightning.trainer.trainer import Trainer
File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 30, in <module>
from pytorch_lightning.accelerators import Accelerator, IPUAccelerator
File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/accelerators/__init__.py", line 13, in <module>
from pytorch_lightning.accelerators.accelerator import Accelerator # noqa: F401
File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 26, in <module>
from pytorch_lightning.plugins.precision import ApexMixedPrecisionPlugin, NativeMixedPrecisionPlugin, PrecisionPlugin
File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/plugins/__init__.py", line 8, in <module>
from pytorch_lightning.plugins.plugins_registry import ( # noqa: F401
File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/plugins/plugins_registry.py", line 20, in <module>
from pytorch_lightning.plugins.training_type.training_type_plugin import TrainingTypePlugin
File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/__init__.py", line 1, in <module>
from pytorch_lightning.plugins.training_type.ddp import DDPPlugin # noqa: F401
File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 68, in <module>
from torch.distributed.optim import DistributedOptimizer
File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/torch/distributed/optim/__init__.py", line 37, in <module>
from .post_localSGD_optimizer import PostLocalSGDOptimizer
File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/torch/distributed/optim/post_localSGD_optimizer.py", line 2, in <module>
import torch.distributed.algorithms.model_averaging.averagers as averagers
File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/torch/distributed/algorithms/model_averaging/averagers.py", line 5, in <module>
import torch.distributed.algorithms.model_averaging.utils as utils
File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/torch/distributed/algorithms/model_averaging/utils.py", line 10, in <module>
params: Iterator[torch.nn.Parameter], process_group: dist.ProcessGroup
AttributeError: module 'torch.distributed' has no attribute 'ProcessGroup'
I'm not sure why you get torch.distributed.is_available() = False on macOS, it should be True. It is for me.
I installed PyTorch like this:
conda install pytorch -c pytorch
But I guess that the problem is the ARM build (I'm on an M1 CPU).
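For what it's worth, this is the quick check to confirm that the installed build ships no distributed support (nothing Lightning-specific):
$ python -c "import torch; print(torch.distributed.is_available())"
False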
We can fix this easily as the error comes from a typing annotation, but we'll also have to add an M1 CI job when it becomes available.
The current init_dist_connection() does nothing if torch.distributed.is_available() is False. But to wrap a model with DistributedDataParallel(), something like torch.distributed.init_process_group(backend='nccl', world_size=N, init_method='...') is required, right?
Should we raise an exception in setup_distributed() in DDP if torch.distributed.is_available() is False?
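For context, this is roughly the bare-PyTorch sequence I mean; a minimal sketch with placeholder backend/init_method/world_size values, not what Lightning actually passes:
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Without an initialized process group, the DDP wrapper below raises
# "Default process group has not been initialized".
dist.init_process_group(backend="gloo", init_method="tcp://127.0.0.1:23456",
                        world_size=1, rank=0)
model = DistributedDataParallel(torch.nn.Linear(4, 4))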
@four4fish Yes you are right. I tested it and we get this message:
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
I agree, for users like @Riccorl we could inform them directly that DDP is not available when torch.distributed.is_available() is False.
Btw, doesn't your PR directly solve the import problem of this issue? I couldn't find other places.
@Riccorl I had a similar issue on an M1 MacBook; it only happens with pytorch=1.10. Downgrading torch to '1.9.1.post3' resolved the issue for me.
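For reference, the downgrade can be done with something like this (assuming you also install from the conda-forge channel; adjust to your setup):
conda install -c conda-forge "pytorch=1.9.1"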
@awaelchli I think after the import PR, importing Lightning no longer fails. But when the Trainer calls DDP's setup_distributed(), which calls init_dist_connection(), it checks torch.distributed.is_available() before creating the process group. Because torch.distributed.is_available() is False, no process group is created, so DDP fails later on. Where exactly does this runtime error happen? When the model is wrapped?
I was proposing: should we throw an exception in init_dist_connection() if torch.distributed.is_available() is False?
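Concretely, something like this hypothetical guard; a simplified sketch, not the real init_dist_connection() signature:
import torch.distributed as dist

def init_dist_connection(torch_distributed_backend, global_rank, world_size):
    # Hypothetical early check: fail loudly instead of silently skipping
    # process group creation on torch builds without distributed support.
    if not dist.is_available():
        raise RuntimeError(
            "torch.distributed is not available, so DDP-based strategies cannot be used"
        )
    if not dist.is_initialized():
        # Uses the default env:// init method, i.e. MASTER_ADDR/MASTER_PORT must be set.
        dist.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size)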
I encountered this same issue. I'm building PyTorch Lightning 1.5.0 and PyTorch 1.10.0 from source using the Spack package manager on macOS 10.15.7. Unfortunately, PyTorch distributed doesn't seem to build for me on macOS: https://github.com/pytorch/pytorch/issues/68002
It sounds like requiring distributed support was an accident and will be removed in future releases. Let me know which PR solves this and I'll add a patch to the 1.5.0 release in Spack.
@four4fish your PR (#10418) says "partially fixes".
Do we need to re-open this? What's left for us to do here?
I just tried again with PyTorch Lightning 1.5.2 and I'm still seeing numerous issues if PyTorch isn't installed with distributed support.
@adamjstewart I also tested this with PL 1.5.2 and I had no issues. Can you give us your torch version and a reproducible script?
@justusschock sure, my environment looks like:
In order to reproduce this issue, PyTorch must be installed without distributed support:
$ python
>>> import torch
>>> torch.distributed.is_available()
False
This is commonly the case on macOS. Then, the issue (which now looks different than it did in 1.5.0) can be reproduced like so:
$ python
>>> from pytorch_lightning.core.lightning import LightningModule
>>> LightningModule()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 122, in __init__
self._register_sharded_tensor_state_dict_hooks_if_available()
File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 2065, in _register_sharded_tensor_state_dict_hooks_if_available
from torch.distributed._sharded_tensor import pre_load_state_dict_hook, state_dict_hook
File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/distributed/_sharded_tensor/__init__.py", line 5, in <module>
from torch.distributed._sharding_spec import (
File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/distributed/_sharding_spec/__init__.py", line 1, in <module>
from .api import (
File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/distributed/_sharding_spec/api.py", line 21, in <module>
class DevicePlacementSpec(PlacementSpec):
File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/distributed/_sharding_spec/api.py", line 29, in DevicePlacementSpec
device: torch.distributed._remote_device
AttributeError: module 'torch.distributed' has no attribute '_remote_device'
That error arises due to the automatic registration support for sharded tensors here: https://github.com/PyTorchLightning/pytorch-lightning/blob/2c7c4aab8087d4c1c99c57c7acc66ef9a8e815d4/pytorch_lightning/core/lightning.py#L1988-L1994
We should check whether torch.distributed is available before importing inside that function's implementation.
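Roughly along these lines; a sketch of the guard only, with the hook registration mirroring what the linked code already does:
import torch.distributed

def _register_sharded_tensor_state_dict_hooks_if_available(self) -> None:
    # Bail out on torch builds without distributed support, so the
    # torch.distributed._sharded_tensor import below is never attempted.
    if not torch.distributed.is_available():
        return
    from torch.distributed._sharded_tensor import pre_load_state_dict_hook, state_dict_hook

    self._register_state_dict_hook(state_dict_hook)
    self._register_load_state_dict_pre_hook(pre_load_state_dict_hook, True)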
Just wanted to follow up on this and say that all issues I was encountering with non-distributed PyTorch seem to be fixed in 1.5.3. Thanks @ananthsub @four4fish and everyone else involved in fixing these!
@adamjstewart I'm using a Mac with M1 and version 1.5.3 and still get the error ImportError: cannot import name 'ProcessGroup' from 'torch.distributed' when trying to import pytorch_lightning. Have you done anything else to solve this?
Hmm, 1.5.3 just worked for me, no hacks required. Are you sure you're using 1.5.3? You might be hitting a different part of the code than me. Can you share the error message stack trace?
Yes, I'm sure I use 1.5.3. This is the stack trace from trying to import pytorch_lightning:
Traceback (most recent call last):
File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3444, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in <module>
import pytorch_lightning as pl
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
module = self._system_import(name, *args, **kwargs)
File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/pytorch_lightning/__init__.py", line 20, in <module>
from pytorch_lightning.callbacks import Callback # noqa: E402
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
module = self._system_import(name, *args, **kwargs)
File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/pytorch_lightning/callbacks/__init__.py", line 14, in <module>
from pytorch_lightning.callbacks.base import Callback
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
module = self._system_import(name, *args, **kwargs)
File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/pytorch_lightning/callbacks/base.py", line 26, in <module>
from pytorch_lightning.utilities.types import STEP_OUTPUT
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
module = self._system_import(name, *args, **kwargs)
File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/pytorch_lightning/utilities/__init__.py", line 18, in <module>
from pytorch_lightning.utilities.apply_func import move_data_to_device # noqa: F401
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
module = self._system_import(name, *args, **kwargs)
File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 26, in <module>
from pytorch_lightning.utilities.imports import _compare_version, _TORCHTEXT_AVAILABLE
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
module = self._system_import(name, *args, **kwargs)
File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/pytorch_lightning/utilities/imports.py", line 82, in <module>
_FAIRSCALE_AVAILABLE = not _IS_WINDOWS and _module_available("fairscale.nn")
File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/pytorch_lightning/utilities/imports.py", line 38, in _module_available
return find_spec(module_path) is not None
File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/importlib/util.py", line 94, in find_spec
parent = __import__(parent_name, fromlist=['__path__'])
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
module = self._system_import(name, *args, **kwargs)
File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/fairscale/__init__.py", line 15, in <module>
from . import nn
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
module = self._system_import(name, *args, **kwargs)
File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/fairscale/nn/__init__.py", line 9, in <module>
from .data_parallel import FullyShardedDataParallel, ShardedDataParallel
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
module = self._system_import(name, *args, **kwargs)
File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/fairscale/nn/data_parallel/__init__.py", line 8, in <module>
from .fully_sharded_data_parallel import FullyShardedDataParallel, TrainingState, auto_wrap_bn
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
module = self._system_import(name, *args, **kwargs)
File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 34, in <module>
from torch.distributed import ProcessGroup
ImportError: cannot import name 'ProcessGroup' from 'torch.distributed' (/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/torch/distributed/__init__.py)
@AdirRahamim that's caused by the same problem described in this issue, but for the fairscale repository: https://github.com/facebookresearch/fairscale
You can raise this issue on their repository. You can also uninstall the dependency assuming you are not using it. Uninstalling it means it will not get imported so you won't get the failure.
pip uninstall fairscale
@carmocca Thanks! indeed uninstalling the package solved the problem.
I'm still experiencing this issue on PyTorch Lightning v1.6.0 and PyTorch v1.11.0. Furthermore, torch.distributed.is_available() evaluates to False. Does this have something to do with the fact that I installed the dependencies with miniforge and therefore from conda-forge?
@schiegl can you share the full error stacktrace?
@carmocca This is the stack trace I get when I import PyTorch Lightning with the following environment.yml:
name: pl_error
channels:
- defaults
- pytorch
- conda-forge
dependencies:
- python=3.9
- numpy=1.21.2
- pytorch=1.11
- pytorch-lightning=1.6
Import error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/__init__.py", line 30, in <module>
from pytorch_lightning.callbacks import Callback # noqa: E402
File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/callbacks/__init__.py", line 26, in <module>
from pytorch_lightning.callbacks.pruning import ModelPruning
File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/callbacks/pruning.py", line 31, in <module>
from pytorch_lightning.core.lightning import LightningModule
File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/core/__init__.py", line 16, in <module>
from pytorch_lightning.core.lightning import LightningModule
File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 41, in <module>
from pytorch_lightning.trainer.connectors.data_connector import _DataHookSelector
File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/trainer/__init__.py", line 16, in <module>
from pytorch_lightning.trainer.trainer import Trainer
File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 34, in <module>
from pytorch_lightning.accelerators import Accelerator, GPUAccelerator, HPUAccelerator, IPUAccelerator, TPUAccelerator
File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/accelerators/__init__.py", line 14, in <module>
from pytorch_lightning.accelerators.cpu import CPUAccelerator # noqa: F401
File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/accelerators/cpu.py", line 19, in <module>
from pytorch_lightning.utilities import device_parser
File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/utilities/device_parser.py", line 18, in <module>
from pytorch_lightning.plugins.environments import TorchElasticEnvironment
File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/plugins/__init__.py", line 20, in <module>
from pytorch_lightning.plugins.training_type.ddp import DDPPlugin
File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/__init__.py", line 1, in <module>
from pytorch_lightning.plugins.training_type.ddp import DDPPlugin # noqa: F401
File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 14, in <module>
from pytorch_lightning.strategies import DDPStrategy
File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/strategies/__init__.py", line 14, in <module>
from pytorch_lightning.strategies.bagua import BaguaStrategy # noqa: F401
File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/strategies/bagua.py", line 17, in <module>
from pytorch_lightning.strategies.ddp import DDPStrategy
File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 66, in <module>
from torch.distributed.algorithms.model_averaging.averagers import ModelAverager
File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/torch/distributed/algorithms/model_averaging/averagers.py", line 5, in <module>
import torch.distributed.algorithms.model_averaging.utils as utils
File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/torch/distributed/algorithms/model_averaging/utils.py", line 10, in <module>
params: Iterator[torch.nn.Parameter], process_group: dist.ProcessGroup
AttributeError: module 'torch.distributed' has no attribute 'ProcessGroup'
@schiegl @carmocca FWIW, I was also facing this issue on 1.6.0. Downgrading to 1.5.3 fixed it for me though.
🐛 Bug
When importing PyTorch Lightning, it throws an AttributeError: module 'torch.distributed' has no attribute 'ProcessGroup'. I guess it comes from the fact that I am on macOS (M1) and PyTorch does not provide torch.distributed with its pre-built package. Indeed, torch.distributed.is_available() is False.
To Reproduce
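A minimal reproduction, assuming a PyTorch build where torch.distributed is unavailable:
$ python
>>> import torch
>>> torch.distributed.is_available()
False
>>> import pytorch_lightning as pl
Traceback (most recent call last):
  ...
AttributeError: module 'torch.distributed' has no attribute 'ProcessGroup'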
Environment
conda