BaguaSys / bagua

Bagua Speeds up PyTorch
https://tutorials-8ro.pages.dev/
MIT License
875 stars, 83 forks

cannot find libnccl.so.2 #331

Closed: lixiangMindSpore closed this issue 3 years ago

lixiangMindSpore commented 3 years ago

Describe the bug

[image]


NOBLES5E commented 3 years ago

Thanks for opening the issue. In this case, Bagua cannot find an NCCL installation on your system. Have you tried following the instruction in the error message, i.e. running import bagua_core; bagua_core.install_deps() in your Python interpreter? It will install the system libraries that are needed.

lixiangMindSpore commented 3 years ago

I ran bagua_install_deps.py and that solved the problem. Thank you so much!

NOBLES5E commented 3 years ago

You're welcome :)

Godricly commented 2 years ago

Python 3.8.0 (default, Feb 25 2021, 22:10:10)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.22.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import bagua
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-b6bb5bf6d045> in <module>
----> 1 import bagua

~/python38/lib/python3.8/site-packages/bagua/__init__.py in <module>
     10 """
     11 
---> 12 import bagua_core  # noqa: F401
     13 from .version import __version__  # noqa: F401

~/python38/lib/python3.8/site-packages/bagua_core/__init__.py in <module>
      2 
      3 _environment._preload_libraries()
----> 4 from .bagua_core import *  # noqa: F401,E402,F403
      5 from .bagua_install_deps import install_deps  # noqa: F401,E402,F403

ImportError: libnccl.so.2: cannot open shared object file: No such file or directory

I got the same error with bagua-cuda116 in a virtualenv. Running bagua_install_deps.py failed for me:

bagua_install_deps.py 
import-im6.q16: not authorized `os' @ error/constitute.c/WriteImage/1037.
import-im6.q16: not authorized `platform' @ error/constitute.c/WriteImage/1037.
import-im6.q16: not authorized `shutil' @ error/constitute.c/WriteImage/1037.
import-im6.q16: not authorized `tempfile' @ error/constitute.c/WriteImage/1037.
import-im6.q16: not authorized `pathlib' @ error/constitute.c/WriteImage/1037.
from: too many arguments
/home/xxx/python38/bin/bagua_install_deps.py: line 10: _nccl_records: command not found
/home/xxx/python38/bin/bagua_install_deps.py: line 11: library_records: command not found
/home/xxx/python38/bin/bagua_install_deps.py: line 14: syntax error near unexpected token `('
/home/xxx/python38/bin/bagua_install_deps.py: line 14: `class DownloadProgressBar(tqdm):'

Godricly commented 2 years ago

bagua-cuda116 was built differently from the other CUDA releases:

bagua-cuda116                 0.8.3.dev215
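
When comparing builds across machines, it can help to list the installed Bagua distributions and their versions from Python. The following is only a sketch: it relies on the standard-library importlib.metadata module (Python 3.8+), and the helper name installed_bagua_packages is hypothetical, not part of Bagua.

```python
from importlib import metadata


def installed_bagua_packages(prefix="bagua"):
    """Return {name: version} for installed distributions whose name starts with prefix."""
    found = {}
    for dist in metadata.distributions():
        # Some broken installs have no readable name; skip those.
        name = dist.metadata.get("Name")
        if name and name.lower().startswith(prefix.lower()):
            found[name] = dist.version
    return found


if __name__ == "__main__":
    for name, version in sorted(installed_bagua_packages().items()):
        print(name, version)
```

Run inside the same virtualenv that raises the ImportError, so that the listing reflects the environment actually in use.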

woqidaideshi commented 2 years ago

@Godricly Which Python version did you use to run bagua_install_deps.py?

Maybe you can try: python3 bagua_install_deps.py?
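
For context, the messages in the log above (import-im6.q16, "from: too many arguments", a syntax error at the class statement) are what typically appears when a Python file is executed by the shell instead of by Python: the shell resolves import to ImageMagick's import command. One way to avoid this, sketched below, is to hand the script to the interpreter of the active virtualenv explicitly. The helper name build_install_command is hypothetical, and the sketch assumes the script is an executable on PATH, as the paths in the log suggest.

```python
import shutil
import subprocess
import sys


def build_install_command(script_name="bagua_install_deps.py"):
    """Build a command that runs the script with the current Python interpreter."""
    script_path = shutil.which(script_name)
    if script_path is None:
        raise FileNotFoundError(f"{script_name} not found on PATH")
    # sys.executable is the interpreter of the active environment.
    return [sys.executable, script_path]


if __name__ == "__main__":
    try:
        subprocess.run(build_install_command(), check=True)
    except FileNotFoundError as exc:
        print(exc)
```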

Godricly commented 2 years ago

I tried on another machine with cuda113 and NCCL installed, and it works well for me. I think the problem is that NCCL is not installed. Also, the bagua-cuda116 version should be updated.
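
A quick way to test the "NCCL is not installed" hypothesis is to ask the dynamic loader directly whether it can open libnccl.so.2. The sketch below uses only the standard-library ctypes module; the helper name can_load_library is hypothetical. It checks only the loader's default search path, so it may not match every location Bagua itself looks in.

```python
import ctypes


def can_load_library(name="libnccl.so.2"):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library is missing or cannot be loaded.
        return False
    return True


if __name__ == "__main__":
    print("libnccl.so.2 loadable:", can_load_library())
```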