PaddlePaddle / Paddle

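PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of PaddlePaddle 『飞桨』: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)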
http://www.paddlepaddle.org/
Apache License 2.0

Unable to configure a Paddle multi-GPU environment #64856

Open ignorejjj opened 5 months ago

ignorejjj commented 5 months ago

Issue Description

After following the installation steps on the official website, I get the following error:

Traceback (most recent call last):
  File "/fs/fast/u20238046/envs/paddle/lib/python3.9/site-packages/paddle/distributed/spawn.py", line 372, in _func_wrapper
    result = func(*args)
  File "/fs/fast/u20238046/envs/paddle/lib/python3.9/site-packages/paddle/utils/install_check.py", line 184, in train_for_run_parallel
    dp_layer = paddle.DataParallel(layer)
  File "/fs/fast/u20238046/envs/paddle/lib/python3.9/site-packages/paddle/distributed/parallel.py", line 398, in __init__
    sync_params_buffers(self._layers, fuse_params=False)
  File "/fs/fast/u20238046/envs/paddle/lib/python3.9/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/fs/fast/u20238046/envs/paddle/lib/python3.9/site-packages/paddle/base/dygraph/base.py", line 340, in __impl__
    return func(*args, **kwargs)
  File "/fs/fast/u20238046/envs/paddle/lib/python3.9/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/fs/fast/u20238046/envs/paddle/lib/python3.9/site-packages/paddle/base/wrapped_decorator.py", line 26, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/fs/fast/u20238046/envs/paddle/lib/python3.9/site-packages/paddle/base/framework.py", line 593, in __impl__
    return func(*args, **kwargs)
  File "/fs/fast/u20238046/envs/paddle/lib/python3.9/site-packages/paddle/distributed/parallel.py", line 197, in sync_params_buffers
    paddle.distributed.broadcast(
  File "/fs/fast/u20238046/envs/paddle/lib/python3.9/site-packages/paddle/distributed/communication/broadcast.py", line 64, in broadcast
    return stream.broadcast(
  File "/fs/fast/u20238046/envs/paddle/lib/python3.9/site-packages/paddle/distributed/communication/stream/broadcast.py", line 124, in broadcast
    return _broadcast_in_dygraph(
  File "/fs/fast/u20238046/envs/paddle/lib/python3.9/site-packages/paddle/distributed/communication/stream/broadcast.py", line 32, in _broadcast_in_dygraph
    task = group.process_group.broadcast(tensor, src_rank_in_group, sync_op)
RuntimeError: (PreconditionNotMet) The third-party dynamic library (libnccl.so) that Paddle depends on is not configured correctly. (error code is libnccl.so: cannot open shared object file: No such file or directory)
  Suggestions:
  1. Check if the third-party dynamic library (e.g. CUDA, CUDNN) is installed correctly and its version is matched with paddlepaddle you installed.
  2. Configure third-party dynamic library environment variables as follows:
  - Linux: set LD_LIBRARY_PATH by `export LD_LIBRARY_PATH=...`
  - Windows: set PATH by `set PATH=XXX; (at /paddle/paddle/phi/backends/dynload/dynamic_loader.cc:312)

Because of an earlier, unrelated bug, I had set this environment variable before running the check:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
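
One way to confirm what the error above reports is to ask the dynamic loader directly whether it can resolve `libnccl.so` under the current `LD_LIBRARY_PATH`. The sketch below uses only Python's standard `ctypes` module. The helper name `can_load` and the list of library names are illustrative choices, not part of Paddle.

```python
import ctypes


def can_load(library_name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        # Raised when the library cannot be found or cannot be loaded.
        return False


if __name__ == "__main__":
    # Example names only; libnccl.so is the one named in the error above.
    for name in ("libnccl.so", "libnccl.so.2", "libstdc++.so.6"):
        status = "found" if can_load(name) else "NOT found"
        print(f"{name}: {status}")
```

If `libnccl.so` is reported as not found here, the loader cannot see NCCL in any directory it searches, which is consistent with the `RuntimeError` in the traceback.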

Version & Environment Information

Paddle version: 2.6.1
Paddle With CUDA: True

OS: centos 7
GCC version: (Spack GCC) 9.5.0
Clang version: N/A
CMake version: N/A
Libc version: glibc 2.17
Python version: 3.9.19

CUDA version: 11.7.99 Build cuda_11.7.r11.7/compiler.31442593_0
cuDNN version: N/A
Nvidia driver version: 525.60.13
Nvidia driver List:
GPU 0: NVIDIA A800-SXM4-80GB
GPU 1: NVIDIA A800-SXM4-80GB
GPU 2: NVIDIA A800-SXM4-80GB
GPU 3: NVIDIA A800-SXM4-80GB
GPU 4: NVIDIA A800-SXM4-80GB
GPU 5: NVIDIA A800-SXM4-80GB
GPU 6: NVIDIA A800-SXM4-80GB
GPU 7: NVIDIA A800-SXM4-80GB

risemeup1 commented 5 months ago

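What happens if you unset the variable you configured with export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/? Also, could you show me your pip list so I can check whether it includes the NCCL library provided by NVIDIA?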

ignorejjj commented 5 months ago

If I don't set that variable, importing paddle fails immediately with the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/fs/fast/u20238046/envs/paddle/lib/python3.9/site-packages/paddle/__init__.py", line 28, in <module>
    from .base import core  # noqa: F401
  File "/fs/fast/u20238046/envs/paddle/lib/python3.9/site-packages/paddle/base/__init__.py", line 36, in <module>
    from . import core
  File "/fs/fast/u20238046/envs/paddle/lib/python3.9/site-packages/paddle/base/core.py", line 380, in <module>
    raise e
  File "/fs/fast/u20238046/envs/paddle/lib/python3.9/site-packages/paddle/base/core.py", line 268, in <module>
    from . import libpaddle
ImportError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /fs/fast/u20238046/envs/paddle/lib/python3.9/site-packages/paddle/base/libpaddle.so)
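
The `ImportError` above says the system `libstdc++.so.6` does not export the `GLIBCXX_3.4.20` version tag that `libpaddle.so` needs. A rough way to see which tags a given library does carry is to scan its bytes for `GLIBCXX_x.y.z` strings, as sketched below. The function name `glibcxx_versions` and the sample bytes are made up for illustration, and a byte scan is only a heuristic, not a substitute for inspecting the library's symbol-version table.

```python
import re


def glibcxx_versions(data: bytes) -> list:
    """Return the GLIBCXX version tags found in raw library bytes, sorted numerically."""
    tags = set(re.findall(rb"GLIBCXX_(\d+(?:\.\d+)*)", data))
    # Sort by numeric components so that 3.4.19 comes after 3.4.9.
    return sorted(
        (tag.decode() for tag in tags),
        key=lambda version: tuple(int(part) for part in version.split(".")),
    )


# Made-up sample standing in for the contents of a libstdc++ binary.
sample = b"GLIBCXX_3.4\x00GLIBCXX_3.4.9\x00GLIBCXX_3.4.19\x00"
versions = glibcxx_versions(sample)
print(versions)
print("GLIBCXX_3.4.20 present:", "3.4.20" in versions)
```

To inspect a real library, pass the function the bytes of the file, for example the contents of `/lib64/libstdc++.so.6` read in binary mode, and compare the result with the same scan of the `libstdc++.so.6` under `$CONDA_PREFIX/lib/`.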

My pip list is as follows:

aiohttp==3.9.5
aiosignal==1.3.1
aistudio-sdk==0.2.4
annotated-types==0.7.0
anyio @ file:///home/conda/feedstock_root/build_artifacts/anyio_1708355285029/work
astor @ file:///home/conda/feedstock_root/build_artifacts/astor_1593610464257/work
async-timeout==4.0.3
attrs==23.2.0
Babel==2.15.0
bce-python-sdk==0.9.11
blinker==1.8.2
certifi @ file:///home/conda/feedstock_root/build_artifacts/certifi_1707022139797/work/certifi
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
coloredlogs==15.0.1
colorlog==6.8.2
contourpy==1.2.1
cycler==0.12.1
datasets==2.19.1
decorator @ file:///home/conda/feedstock_root/build_artifacts/decorator_1641555617451/work
dill==0.3.4
distro==1.9.0
dnspython==2.6.1
email_validator==2.1.1
exceptiongroup @ file:///home/conda/feedstock_root/build_artifacts/exceptiongroup_1704921103267/work
fastapi==0.111.0
fastapi-cli==0.0.4
filelock==3.14.0
Flask==3.0.3
flask-babel==4.0.0
flatbuffers==24.3.25
fonttools==4.52.1
frozenlist==1.4.1
fsspec==2024.3.1
future==1.0.0
h11 @ file:///home/conda/feedstock_root/build_artifacts/h11_1664132893548/work
h2 @ file:///home/conda/feedstock_root/build_artifacts/h2_1633502706969/work
hpack==4.0.0
httpcore @ file:///home/conda/feedstock_root/build_artifacts/httpcore_1711596990900/work
httptools==0.6.1
httpx @ file:///home/conda/feedstock_root/build_artifacts/httpx_1708530890843/work
huggingface-hub==0.23.2
humanfriendly==10.0
hyperframe @ file:///home/conda/feedstock_root/build_artifacts/hyperframe_1619110129307/work
idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1713279365350/work
importlib_metadata==7.1.0
importlib_resources==6.4.0
itsdangerous==2.2.0
jieba==0.42.1
Jinja2==3.1.4
joblib==1.4.2
kiwisolver==1.4.5
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.0
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.12.2
numpy @ file:///home/conda/feedstock_root/build_artifacts/numpy_1707225342954/work/dist/numpy-1.26.4-cp39-cp39-linux_x86_64.whl#sha256=c799942b5898f6e6c60264d1663a6469a475290e758c654aeeb78e2596463abd
onnx==1.16.1
onnxruntime==1.16.3
opt-einsum @ file:///home/conda/feedstock_root/build_artifacts/opt_einsum_1696448916724/work
orjson==3.10.3
packaging==24.0
paddle2onnx==1.2.3
paddlefsl==1.1.0
paddlenlp==2.8.0.post0
paddlepaddle-gpu==2.6.1
pandas==2.2.2
pillow @ file:///home/conda/feedstock_root/build_artifacts/pillow_1712154461189/work
prettytable==3.10.0
protobuf==4.25.3
psutil==5.9.8
pyarrow==16.1.0
pyarrow-hotfix==0.6
pybind11==2.12.0
pycryptodome==3.20.0
pydantic==2.7.1
pydantic_core==2.18.2
Pygments==2.18.0
pyparsing==3.1.2
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.9
pytz==2024.1
PyYAML==6.0.1
rarfile==4.2
regex==2024.5.15
requests==2.32.2
rich==13.7.1
safetensors==0.4.3
scikit-learn==1.5.0
scipy==1.13.1
sentencepiece==0.2.0
seqeval==1.2.2
shellingham==1.5.4
six==1.16.0
sniffio @ file:///home/conda/feedstock_root/build_artifacts/sniffio_1708952932303/work
starlette==0.37.2
sympy==1.12.1rc1
threadpoolctl==3.5.0
tool-helpers==0.1.1
tqdm==4.66.4
typer==0.12.3
typing_extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1712329955671/work
tzdata==2024.1
ujson==5.10.0
urllib3==2.2.1
uvicorn==0.29.0
uvloop==0.19.0
visualdl==2.5.3
watchfiles==0.22.0
wcwidth==0.2.13
websockets==12.0
Werkzeug==3.0.3
xxhash==3.4.1
yarl==1.9.4
zipp==3.19.0

There doesn't seem to be any nccl package.
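
Reading a long pip list by eye is error-prone, so the same check can be scripted with the standard `importlib.metadata` module. This is a sketch under one stated assumption: that an NCCL wheel, if present, has "nccl" somewhere in its distribution name (NVIDIA's PyPI wheels such as `nvidia-nccl-cu12` follow that pattern). The helper names `nccl_packages` and `installed_names` are hypothetical.

```python
from importlib import metadata


def nccl_packages(names):
    """Return the package names that look like an NCCL distribution."""
    # Assumption: an NCCL wheel has "nccl" in its distribution name.
    return sorted(name for name in names if "nccl" in name.lower())


def installed_names():
    """Names of all distributions visible to the current interpreter."""
    found = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # Skip entries with broken or missing metadata.
            found.append(name)
    return found


if __name__ == "__main__":
    matches = nccl_packages(installed_names())
    print(matches if matches else "no NCCL package installed via pip")
```

Run with the interpreter of the affected environment, an empty result matches the observation above that the list contains no nccl package.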

risemeup1 commented 5 months ago

Isn't the package you installed the latest one? The latest packages no longer need this kind of complicated environment setup.

ignorejjj commented 5 months ago

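I installed 2.6.1, the latest version listed on the official website. If there is a newer version, could you please share where to install it from?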

risemeup1 commented 5 months ago

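This package bundles all of its environment dependencies and does not depend on your local CUDA version. In other words, you can use it even if your local CUDA is neither CUDA 11.8 nor CUDA 12. Give it a try.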

risemeup1 commented 5 months ago

Is it solved? This package is about to be added to the official documentation.

ignorejjj commented 5 months ago

One moment, let me test it.

ignorejjj commented 5 months ago

During installation, pip keeps downloading Paddle wheel files over and over, as if it can never find a suitable version. Is this normal?

Downloading https://paddle-whl.bj.bcebos.com/nightly/cu120/paddlepaddle-gpu/paddlepaddle_gpu-3.0.0.dev20240527-cp39-cp39-linux_x86_64.whl (736.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 736.9/736.9 MB 4.7 MB/s eta 0:00:00
  Downloading https://paddle-whl.bj.bcebos.com/nightly/cu120/paddlepaddle-gpu/paddlepaddle_gpu-3.0.0.dev20240525-cp39-cp39-linux_x86_64.whl (736.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 736.1/736.1 MB 6.1 MB/s eta 0:00:00
  Downloading https://paddle-whl.bj.bcebos.com/nightly/cu120/paddlepaddle-gpu/paddlepaddle_gpu-3.0.0.dev20240524-cp39-cp39-linux_x86_64.whl (736.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 736.1/736.1 MB 6.4 MB/s eta 0:00:00
  Downloading https://paddle-whl.bj.bcebos.com/nightly/cu120/paddlepaddle-gpu/paddlepaddle_gpu-3.0.0.dev20240523-cp39-cp39-linux_x86_64.whl (733.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 733.6/733.6 MB 2.5 MB/s eta 0:00:00
  Downloading https://paddle-whl.bj.bcebos.com/nightly/cu120/paddlepaddle-gpu/paddlepaddle_gpu-3.0.0.dev20240522-cp39-cp39-linux_x86_64.whl (733.6 MB)

risemeup1 commented 5 months ago

One moment, let me take a look.

risemeup1 commented 5 months ago

Could you try the CUDA 11.8 build? Does it have the same problem?

risemeup1 commented 5 months ago

Hold on, I've found the problem. I'll push an update on my side and it should be fixed.

ignorejjj commented 5 months ago

OK, I'll try again once the update is done.

risemeup1 commented 5 months ago

It's updated. Please try again:

python -m pip install --pre paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/cu120/

ignorejjj commented 5 months ago

It still seems to have the same problem.

risemeup1 commented 5 months ago

That's odd, it works on my machine. Is pip still installing one Paddle version after another?

risemeup1 commented 5 months ago

[image]

It works fine on my side. Could you try again?

ignorejjj commented 5 months ago

I cleared the previously downloaded cache and reinstalled, but the same problem still occurs: [image]

risemeup1 commented 5 months ago

How about upgrading your pip?

ignorejjj commented 5 months ago

pip is already at the latest version.

risemeup1 commented 5 months ago

Could you try adding the --no-cache-dir option? It works without problems on both of my machines.

ignorejjj commented 5 months ago

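After switching to a different conda environment, the installation succeeded. On one machine Paddle's run_check now passes, but on another machine I get the following error: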

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/fs/fast/u20238046/envs/paddle/lib/python3.9/site-packages/paddle/__init__.py", line 33, in <module>
    from .base import core  # noqa: F401
  File "/fs/fast/u20238046/envs/paddle/lib/python3.9/site-packages/paddle/base/__init__.py", line 38, in <module>
    from . import (  # noqa: F401
  File "/fs/fast/u20238046/envs/paddle/lib/python3.9/site-packages/paddle/base/backward.py", line 25, in <module>
    from . import core, framework, log_helper, unique_name
  File "/fs/fast/u20238046/envs/paddle/lib/python3.9/site-packages/paddle/base/core.py", line 384, in <module>
    raise e
  File "/fs/fast/u20238046/envs/paddle/lib/python3.9/site-packages/paddle/base/core.py", line 267, in <module>
    from . import libpaddle
ImportError: libpython3.9.so.1.0: cannot open shared object file: No such file or directory

risemeup1 commented 5 months ago

This looks like a problem with your local Python environment.

risemeup1 commented 5 months ago

Are you using our image?

ignorejjj commented 5 months ago

I installed it with the command you just posted.

ignorejjj commented 5 months ago

I set an environment variable and it now runs normally. Thank you very much!

risemeup1 commented 5 months ago

Which environment variable?

ignorejjj commented 5 months ago

I set this one:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
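
The fix above works because the loader could not find `libpython3.9.so.1.0` until the environment's `lib/` directory was on `LD_LIBRARY_PATH`. The sketch below shows how to ask the interpreter where its own library directory is and how to build the new path value without duplicating entries. It assumes a Linux conda-style layout in which `sysconfig.get_config_var("LIBDIR")` points at that `lib/` directory. The helper name `with_lib_dir` is hypothetical.

```python
import os
import sysconfig


def with_lib_dir(current: str, lib_dir: str) -> str:
    """Append lib_dir to a colon-separated path list unless it is already present."""
    parts = [part for part in current.split(":") if part]
    already_present = lib_dir.rstrip("/") in (part.rstrip("/") for part in parts)
    if lib_dir and not already_present:
        parts.append(lib_dir)
    return ":".join(parts)


if __name__ == "__main__":
    # LIBDIR may be None on some platforms, so fall back to an empty string.
    lib_dir = sysconfig.get_config_var("LIBDIR") or ""
    current = os.environ.get("LD_LIBRARY_PATH", "")
    print("interpreter LIBDIR:", lib_dir)
    print("suggested LD_LIBRARY_PATH:", with_lib_dir(current, lib_dir))
```

The printed value still has to be exported in the shell before Python starts, as in the `export` line above, because the loader reads `LD_LIBRARY_PATH` when the process is launched.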