BaguaSys / bagua

Bagua Speeds up PyTorch
https://tutorials-8ro.pages.dev/
MIT License

AttributeError (module 'setuptools._distutils' has no attribute 'version') when running the MNIST example #576

Closed silverCore97 closed 2 years ago

silverCore97 commented 2 years ago

I ran the MNIST example and got the following error:

```
[kqian@eu-login-04 testrun]$ python3 -m bagua.distributed.launch --nproc_per_node=8 main.py --arch resnet50 --algorithm gradient_allreduce [imagenet-folder with train and val folders]
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
  File "main.py", line 22, in <module>
    import bagua.torch_api as bagua
  File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/__init__.py", line 52, in <module>
    from .tensor import BaguaTensor  # noqa: F401
  File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/tensor.py", line 10, in <module>
    LooseVersion = distutils.version.LooseVersion
AttributeError: module 'setuptools._distutils' has no attribute 'version'
Killing subprocess 26136
Killing subprocess 26137
Killing subprocess 26138
Killing subprocess 26140
Killing subprocess 26142
Killing subprocess 26144
Killing subprocess 26145
Killing subprocess 26146
Traceback (most recent call last):
  File "/cluster/apps/nss/python/3.7.4/x86_64/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/cluster/apps/nss/python/3.7.4/x86_64/lib64/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/distributed/launch.py", line 342, in <module>
    main()
  File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/distributed/launch.py", line 327, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/distributed/launch.py", line 291, in sigkill_handler
    returncode=last_return_code, cmd=cmd
subprocess.CalledProcessError: Command '['/cluster/apps/nss/python/3.7.4/x86_64/bin/python3', '-u', 'main.py', '--arch', 'resnet50', '--algorithm', 'gradient_allreduce', '[imagenet-folder', 'with', 'train', 'and', 'val', 'folders]']' returned non-zero exit status 1.
```

(Each of the 8 worker processes printed the same first traceback, interleaved in the terminal; one copy is shown.)

NOBLES5E commented 2 years ago

It seems that the LooseVersion is deprecated: https://github.com/pydata/xarray/issues/6092

We have removed it from the master branch. Just install the latest master branch version to see if it works.
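
For anyone who needs a version check that does not touch `distutils` at all, here is a minimal standard-library sketch. It is not the change made in Bagua's master branch; `parse_version` and `version_at_least` are hypothetical helper names, and the parser only looks at the leading numeric components (so a suffix such as `.dev187` or `+cu113` is ignored). The usually recommended replacement for `LooseVersion` is the third-party `packaging.version` module, which handles pre-release and local versions properly.

```python
import re


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple.

    Hypothetical helper: '1.10.0+cu113' -> (1, 10, 0), '0.8.3.dev187' -> (0, 8, 3).
    """
    parts = []
    # Drop any local version label after '+', then walk the dotted pieces.
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric piece such as 'dev187'
        parts.append(int(match.group()))
    return tuple(parts)


def version_at_least(current, minimum):
    """Compare two version strings numerically, padding the shorter with zeros."""
    a, b = parse_version(current), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    # Numeric comparison gets this right, whereas comparing the raw strings would not.
    print(version_at_least("1.10.0", "1.9.1"))
```

Comparing the raw strings would rank `"1.10.0"` below `"1.9.1"`, which is why the pieces are converted to integers first.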

silverCore97 commented 2 years ago

Is it correct to use "pip install bagua-cuda113" to install the latest master branch version? I tried it, but the problem still exists.

NOBLES5E commented 2 years ago

Try `python3 -m pip install --pre bagua-cuda113==0.8.3.dev187 --upgrade`.

silverCore97 commented 2 years ago

Now I get a different error message:

Traceback (most recent call last):
  File "main.py", line 22, in <module>
    import bagua.torch_api as bagua
  File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/__init__.py", line 51, in <module>
    from .distributed import BaguaModule  # noqa: F401
  File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/distributed.py", line 21, in <module>
    @gorilla.patches(torch.nn.Module, filter=lambda name, obj: "bagua" in name)
  File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/distributed.py", line 117, in BaguaModule
    @property
  File "/cluster/home/kqian/.local/lib/python3.7/site-packages/torch/_jit_internal.py", line 373, in unused
    fn._torchscript_modifier = FunctionModifiers.UNUSED
AttributeError: 'property' object has no attribute '_torchscript_modifier'
Killing subprocess 10939
Traceback (most recent call last):
  File "/cluster/apps/nss/python/3.7.4/x86_64/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/cluster/apps/nss/python/3.7.4/x86_64/lib64/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/distributed/launch.py", line 343, in <module>
    main()
  File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/distributed/launch.py", line 328, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/distributed/launch.py", line 292, in sigkill_handler
    returncode=last_return_code, cmd=cmd
subprocess.CalledProcessError: Command '['/cluster/apps/nss/python/3.7.4/x86_64/bin/python3', '-u', 'main.py', '--algorithm', 'gradient_allreduce']' returned non-zero exit status 1.
NOBLES5E commented 2 years ago

That looks weird. The CI runs the examples just fine: https://buildkite.com/bagua/bagua-gpu-test/builds/2220

Which PyTorch version are you using?
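
If it helps, a small script along these lines can collect the relevant versions in one go. This is only a sketch: `installed_version` and `environment_report` are hypothetical helper names, the distribution names are the ones mentioned in this thread, and `importlib.metadata` is in the standard library only from Python 3.8 onward, so the Python 3.7 environment in the traceback above would need a newer interpreter or a backport.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def environment_report(distributions=("torch", "setuptools", "bagua-cuda113")):
    """Map each distribution name to its installed version (None if not installed)."""
    return {name: installed_version(name) for name in distributions}


if __name__ == "__main__":
    for name, version in environment_report().items():
        print(f"{name}: {version or 'not installed'}")
```

Running it with the same interpreter that launches `main.py` matters here, since a cluster can have several Python installations with different packages.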

silverCore97 commented 2 years ago

It was apparently a problem with the cluster I was using.

Godricly commented 2 years ago

I got the same problem in gitlab-runner with bagua-cuda113. Could you please also upgrade the cuda113 release to 0.9.1? The release history on PyPI is so weird.