NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

Running apex with error: AttributeError: module 'torch.distributed' has no attribute '_reduce_scatter_base' #1773

Open cs-wangfeng opened 5 months ago

cs-wangfeng commented 5 months ago

Describe the Bug

I'm running a program that uses apex in my anaconda3 environment, but I get the following error:

...
  File ".../anaconda3/envs/valor/lib/python3.9/site-packages/apex/transformer/pipeline_parallel/schedules/common.py", line 14, in <module>
    from apex.transformer.tensor_parallel.layers import (
  File ".../anaconda3/envs/valor/lib/python3.9/site-packages/apex/transformer/tensor_parallel/__init__.py", line 21, in <module>
    from apex.transformer.tensor_parallel.layers import (
  File ".../anaconda3/envs/valor/lib/python3.9/site-packages/apex/transformer/tensor_parallel/layers.py", line 32, in <module>
    from apex.transformer.tensor_parallel.mappings import (
  File ".../anaconda3/envs/valor/lib/python3.9/site-packages/apex/transformer/tensor_parallel/mappings.py", line 29, in <module>
    torch.distributed.reduce_scatter_tensor = torch.distributed._reduce_scatter_base
AttributeError: module 'torch.distributed' has no attribute '_reduce_scatter_base'
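The traceback shows that line 29 of mappings.py reads `torch.distributed._reduce_scatter_base`, which the installed PyTorch build (1.9.0 in this report) does not expose. A minimal sketch for confirming this without importing apex is below. The helper names `check_attributes` and `report_torch_distributed` are hypothetical and not part of apex or PyTorch. The script only reports which of the two names from the traceback exist and does not attempt a fix.

```python
import importlib


def check_attributes(obj, names):
    """Return {name: bool} saying which of the given attributes obj exposes."""
    return {name: hasattr(obj, name) for name in names}


def report_torch_distributed():
    """Return (torch version, attribute report), or None if torch is not installed."""
    try:
        torch = importlib.import_module("torch")
        dist = importlib.import_module("torch.distributed")
    except ImportError:
        return None
    # Only the two names that appear on the failing line of the traceback.
    names = ["reduce_scatter_tensor", "_reduce_scatter_base"]
    return torch.__version__, check_attributes(dist, names)


if __name__ == "__main__":
    result = report_torch_distributed()
    if result is None:
        print("torch is not importable in this environment")
    else:
        version, found = result
        print(f"torch {version}")
        for name, present in found.items():
            print(f"  torch.distributed.{name}: {'present' if present else 'missing'}")
```

If both names are reported as missing, the installed PyTorch and the checked-out apex revision are probably not compatible, and reinstalling apex with different build flags is unlikely to change the result.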

Minimal Steps/Code to Reproduce the Bug

I installed apex with the following steps:

git clone https://github.com/NVIDIA/apex.git
cd apex
pip install -v --disable-pip-version-check --no-build-isolation --no-cache-dir ./

I also tried with the following steps:

git clone https://github.com/NVIDIA/apex.git
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./

or

git clone https://github.com/NVIDIA/apex.git
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

None of these methods work.

Environment

Here is my environment info:

Python 3.9.12
pip 23.3.1
PyTorch 1.9.0
CUDA 11.1

I installed PyTorch with:

pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

cs-wangfeng commented 5 months ago

The issue was resolved by rolling back the Python version to 3.7.
The pip version doesn't affect the apex installation.

AlaaAlmutawa commented 3 months ago

I am having the same issue. How did you fix it? Unfortunately, nothing is working for me. My Python version is 3.7.

JeremySun1224 commented 1 month ago

May I ask why this bug has not been fixed yet?