PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Docker paddle cannot pass the run_check() test #65994

Open yidu0924 opened 1 month ago

yidu0924 commented 1 month ago

Please ask your question

The error is as follows:

    [2024-07-12 08:34:51,881] [ WARNING] install_check.py:289 - PaddlePaddle meets some problem with 8 GPUs. This may be caused by:
      1. There is not enough GPUs visible on your system
      2. Some GPUs are occupied by other process now
      3. NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
    [2024-07-12 08:34:51,881] [ WARNING] install_check.py:297 - Original Error is: Process 6 terminated with exit code 1.
    PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
    Traceback (most recent call last):
      File "/home/test.py", line 2, in <module>
        paddle.utils.run_check()
      File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 302, in run_check
        raise e
      File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 283, in run_check
        _run_parallel(device_list)
      File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 210, in _run_parallel
        paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
      File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 614, in spawn
        while not context.join():
      File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 423, in join
        self._throw_exception(error_index)
      File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 435, in _throw_exception
        raise Exception(
    Exception: Process 6 terminated with exit code 1.
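One way to narrow down the first two causes in the warning (GPU visibility and occupancy) is to restrict run_check() to a subset of GPUs via CUDA_VISIBLE_DEVICES: if a two-GPU check passes while the eight-GPU one fails, the install itself is fine and NCCL or the container setup is the more likely culprit. A minimal diagnostic sketch (the device list "0,1" is an arbitrary choice, not from the thread):

```python
import os

# Must be set before paddle is imported; restricts which GPUs paddle sees.
# "0,1" is an arbitrary two-GPU subset chosen for this diagnostic.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

try:
    import paddle
    paddle.utils.run_check()  # same call as in the failing /home/test.py
except ImportError:
    print("paddlepaddle is not installed in this environment")
except Exception as exc:  # run_check re-raises the worker failure
    print(f"run_check failed: {exc}")
```

If this passes, repeating with progressively larger device lists can pinpoint whether a specific GPU or the full 8-GPU NCCL ring is the problem.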

Local environment:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A800-SXM...  On   | 00000000:3D:00.0 Off |                    0 |
    | N/A   35C    P0    68W / 400W |   6553MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA A800-SXM...  On   | 00000000:42:00.0 Off |                    0 |
    | N/A   30C    P0    62W / 400W |      3MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   2  NVIDIA A800-SXM...  On   | 00000000:61:00.0 Off |                    0 |
    | N/A   30C    P0    60W / 400W |      3MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   3  NVIDIA A800-SXM...  On   | 00000000:67:00.0 Off |                    0 |
    | N/A   35C    P0    61W / 400W |      3MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   4  NVIDIA A800-SXM...  On   | 00000000:AD:00.0 Off |                    0 |
    | N/A   34C    P0    60W / 400W |      3MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   5  NVIDIA A800-SXM...  On   | 00000000:B1:00.0 Off |                    0 |
    | N/A   30C    P0    61W / 400W |      3MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   6  NVIDIA A800-SXM...  On   | 00000000:D0:00.0 Off |                    0 |
    | N/A   30C    P0    61W / 400W |      3MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   7  NVIDIA A800-SXM...  On   | 00000000:D3:00.0 Off |                    0 |
    | N/A   34C    P0    65W / 400W |      3MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+
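Since the warning's third cause points at NCCL, the nccl-tests check it links to can be run inside the same container. A sketch, assuming the default nccl-tests build layout (flags from the nccl-tests README; -g 8 matches the 8 GPUs visible above):

```shell
# Build and run the official NCCL all-reduce test inside the container.
# If this hangs or errors, the 8-GPU run_check failure is an NCCL/container
# problem rather than a PaddlePaddle install problem.
git clone https://github.com/NVIDIA/nccl-tests.git
make -C nccl-tests
./nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
```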

Repository: registry.baidubce.com/paddlepaddle/paddle and paddlepaddle/paddle.
I tried both, and the error is the same.

yidu0924 commented 1 month ago

I first tried registry.baidubce.com/paddlepaddle/paddle:3.0.0b0-gpu-cuda11.8-cudnn8.6-trt8.5, which also failed on 8 GPUs, so I switched to the two cuda=12.0 versions, and those fail as well.

risemeup1 commented 1 month ago

Which whl package are you using?

risemeup1 commented 1 month ago

Try this:

    python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/

yidu0924 commented 1 month ago

> python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/ Try this

I'm using Docker directly, because I don't want to set up the local environment all over again.

yidu0924 commented 1 month ago

The Docker images I used are the official 3.0 one and the two 2.6 ones. I started with 3.0 and it would not run; suspecting a CUDA version mismatch, I switched to the two 2.6 images, but run_check still does not pass.

yidu0924 commented 1 month ago

Now I want to run multi-GPU SFT, but after

    I0715 07:28:13.692878 490 tcp_utils.cc:181] The server starts to listen on IP_ANY:58265

the attempt to start distributed training hangs with no response.
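When the launcher prints the "starts to listen" line and then hangs, one basic sanity check is whether TCP connections on the host loopback work at all inside the container (the workers rendezvous with the master over TCP). A small self-contained probe, independent of Paddle (port 58265 is the one from the log; the sketch below lets the OS pick any free port):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Bind a listener the way the rendezvous master does, then probe it.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("", 0))          # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]
print(can_connect("127.0.0.1", port))  # True if loopback TCP is healthy
server.close()
```

If this prints False, the container's networking (not Paddle) is broken; if it prints True, the hang is more likely in NCCL initialization between the spawned workers.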

risemeup1 commented 1 month ago

Inside your Docker container you can uninstall that package and then install the one I sent you.
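Concretely, combining that suggestion with the install command given earlier in the thread, a sketch of the steps to run inside the container:

```shell
# Remove the preinstalled build, install the suggested wheel, re-run the check.
python -m pip uninstall -y paddlepaddle-gpu
python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
python -c "import paddle; paddle.utils.run_check()"
```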

risemeup1 commented 1 month ago

> Now I want to run multi-GPU SFT, but after I0715 07:28:13.692878 490 tcp_utils.cc:181] The server starts to listen on IP_ANY:58265 the attempt to start distributed training hangs with no response.

This probably needs to be looked at by an engineer from the distributed-training team.