PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Docker paddle cannot pass the run_check() test #65994

Open yidu0924 opened 1 month ago

yidu0924 commented 1 month ago

Please ask your question

The error is as follows:

    [2024-07-12 08:34:51,881] [ WARNING] install_check.py:289 - PaddlePaddle meets some problem with 8 GPUs. This may be caused by:
      1. There is not enough GPUs visible on your system
      2. Some GPUs are occupied by other process now
      3. NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
    [2024-07-12 08:34:51,881] [ WARNING] install_check.py:297 - Original Error is: Process 6 terminated with exit code 1.
    PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
    Traceback (most recent call last):
      File "/home/test.py", line 2, in <module>
        paddle.utils.run_check()
      File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 302, in run_check
        raise e
      File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 283, in run_check
        _run_parallel(device_list)
      File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 210, in _run_parallel
        paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
      File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 614, in spawn
        while not context.join():
      File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 423, in join
        self._throw_exception(error_index)
      File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 435, in _throw_exception
        raise Exception(
    Exception: Process 6 terminated with exit code 1.
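One way to narrow down the first two causes in the warning (GPU visibility and occupancy) is to restrict run_check() to a subset of GPUs via CUDA_VISIBLE_DEVICES: if a two-GPU check passes while the eight-GPU one fails, the install itself is fine and NCCL or the container setup is the more likely culprit. A minimal diagnostic sketch (the device list "0,1" is an arbitrary choice, not from the thread):

```python
import os

# Must be set before paddle is imported; restricts which GPUs paddle sees.
# "0,1" is an arbitrary two-GPU subset chosen for this diagnostic.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

try:
    import paddle
    paddle.utils.run_check()  # same call as in the failing /home/test.py
except ImportError:
    print("paddlepaddle is not installed in this environment")
except Exception as exc:  # run_check re-raises the worker failure
    print(f"run_check failed: {exc}")
```

If this passes, repeating with progressively larger device lists can pinpoint whether a specific GPU or the full 8-GPU NCCL ring is the problem.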

Local environment:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A800-SXM...  On   | 00000000:3D:00.0 Off |                    0 |
    | N/A   35C    P0    68W / 400W |   6553MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA A800-SXM...  On   | 00000000:42:00.0 Off |                    0 |
    | N/A   30C    P0    62W / 400W |      3MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   2  NVIDIA A800-SXM...  On   | 00000000:61:00.0 Off |                    0 |
    | N/A   30C    P0    60W / 400W |      3MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   3  NVIDIA A800-SXM...  On   | 00000000:67:00.0 Off |                    0 |
    | N/A   35C    P0    61W / 400W |      3MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   4  NVIDIA A800-SXM...  On   | 00000000:AD:00.0 Off |                    0 |
    | N/A   34C    P0    60W / 400W |      3MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   5  NVIDIA A800-SXM...  On   | 00000000:B1:00.0 Off |                    0 |
    | N/A   30C    P0    61W / 400W |      3MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   6  NVIDIA A800-SXM...  On   | 00000000:D0:00.0 Off |                    0 |
    | N/A   30C    P0    61W / 400W |      3MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   7  NVIDIA A800-SXM...  On   | 00000000:D3:00.0 Off |                    0 |
    | N/A   34C    P0    65W / 400W |      3MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+
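Since the warning's third cause points at NCCL, the nccl-tests check it links to can be run inside the same container. A sketch, assuming the default nccl-tests build layout (flags from the nccl-tests README; -g 8 matches the 8 GPUs visible above):

```shell
# Build and run the official NCCL all-reduce test inside the container.
# If this hangs or errors, the 8-GPU run_check failure is an NCCL/container
# problem rather than a PaddlePaddle install problem.
git clone https://github.com/NVIDIA/nccl-tests.git
make -C nccl-tests
./nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
```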

Repository: registry.baidubce.com/paddlepaddle/paddle and paddlepaddle/paddle.
I tried both, and the error is the same.

yidu0924 commented 1 month ago

I first tried registry.baidubce.com/paddlepaddle/paddle:3.0.0b0-gpu-cuda11.8-cudnn8.6-trt8.5, which also failed on 8 GPUs, so I switched to the two cuda=12.0 versions, and those fail as well.

risemeup1 commented 1 month ago

Which whl package are you using?

risemeup1 commented 1 month ago

Try this:

    python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/

yidu0924 commented 1 month ago

> python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/ Try this

I'm using Docker directly, because I don't want to set up the local environment all over again.

yidu0924 commented 1 month ago

The Docker images I used are the official 3.0 one and the two 2.6 ones. I started with 3.0 and it would not run; suspecting a CUDA version mismatch, I switched to the two 2.6 images, but run_check still does not pass.

yidu0924 commented 1 month ago

Now I want to run multi-GPU SFT, but after

    I0715 07:28:13.692878 490 tcp_utils.cc:181] The server starts to listen on IP_ANY:58265

the attempt to start distributed training hangs with no response.
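When the launcher prints the "starts to listen" line and then hangs, one basic sanity check is whether TCP connections on the host loopback work at all inside the container (the workers rendezvous with the master over TCP). A small self-contained probe, independent of Paddle (port 58265 is the one from the log; the sketch below lets the OS pick any free port):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Bind a listener the way the rendezvous master does, then probe it.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("", 0))          # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]
print(can_connect("127.0.0.1", port))  # True if loopback TCP is healthy
server.close()
```

If this prints False, the container's networking (not Paddle) is broken; if it prints True, the hang is more likely in NCCL initialization between the spawned workers.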

risemeup1 commented 1 month ago

Inside your Docker container you can uninstall that package and then install the one I sent you.
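Concretely, combining that suggestion with the install command given earlier in the thread, a sketch of the steps to run inside the container:

```shell
# Remove the preinstalled build, install the suggested wheel, re-run the check.
python -m pip uninstall -y paddlepaddle-gpu
python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
python -c "import paddle; paddle.utils.run_check()"
```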

risemeup1 commented 1 month ago

> Now I want to run multi-GPU SFT, but after I0715 07:28:13.692878 490 tcp_utils.cc:181] The server starts to listen on IP_ANY:58265 the attempt to start distributed training hangs with no response.

This probably needs to be looked at by an engineer from the distributed-training team.