Open yidu0924 opened 1 month ago
At first I tried registry.baidubce.com/paddlepaddle/paddle:3.0.0b0-gpu-cuda11.8-cudnn8.6-trt8.5, which also failed with 8 GPUs, so I switched to two versions with cuda=12.0, and those still fail.
Which whl package are you using?
Try this one: python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
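I am using docker directly, because I do not want to set up the local environment again.
The docker images I used are the official 3.0 image and two 2.6 images. I started with the 3.0 image; it would not run, so I suspected a CUDA version mismatch and switched to the two 2.6 images, but they still fail run check.
I now want to run multi-GPU SFT, but after the log line I0715 07:28:13.692878 490 tcp_utils.cc:181] The server starts to listen on IP_ANY:58265, when it tries to start distributed training, there is no further response.
Inside your docker container you can uninstall the existing package and then install the one I sent you.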
This probably needs an RD from the distributed training team to take a look.
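While waiting for the distributed team, one thing that can be ruled out quickly is basic TCP trouble inside the container, since the hang appears right after the "starts to listen" log line. The following is only a minimal sketch using the Python standard library; the function name can_listen_and_connect is hypothetical, and it checks loopback connectivity only. It does not exercise NCCL or Paddle's own rendezvous logic, so a passing result does not prove the distributed launch is healthy.

```python
import socket


def can_listen_and_connect(port: int = 0, host: str = "127.0.0.1",
                           timeout: float = 3.0) -> bool:
    """Bind a listener on host:port, then connect to it from the same machine.

    port=0 lets the OS pick a free port. Pass a specific port (for example
    the one printed in the log) to test that exact port instead.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(1)
        bound_port = server.getsockname()[1]
        # The kernel completes the handshake for a listening socket,
        # so no explicit accept() is needed for this check.
        client = socket.create_connection((host, bound_port), timeout=timeout)
        client.close()
        return True
    except OSError:
        # Covers "address in use", refused connections, and timeouts.
        return False
    finally:
        server.close()
```

If this returns False inside the container, the problem is likely in the container's network setup rather than in Paddle itself; if it returns True, the hang is more likely in the collective communication layer.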
Please ask your question
The error is as follows: [2024-07-12 08:34:51,881] [ WARNING] install_check.py:289 - PaddlePaddle meets some problem with 8 GPUs. This may be caused by:
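To narrow down whether the failure is specific to the 8-GPU case, the check can be repeated on smaller sets of GPUs by restricting CUDA_VISIBLE_DEVICES. The sketch below is an illustration, not an official procedure: device_subsets and run_check_on are hypothetical helper names, and it assumes paddle.utils.run_check() is the check that produced the warning above. Because the warning is logged rather than raised, the helper returns the captured output for manual inspection instead of relying on the exit code.

```python
import os
import subprocess
import sys
from typing import List


def device_subsets(num_gpus: int) -> List[str]:
    """Return CUDA_VISIBLE_DEVICES values for 1, 2, 4, ... GPUs, ending with all."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    subsets = []
    count = 1
    while count < num_gpus:
        subsets.append(",".join(str(i) for i in range(count)))
        count *= 2
    subsets.append(",".join(str(i) for i in range(num_gpus)))
    return subsets


def run_check_on(devices: str) -> str:
    """Run paddle.utils.run_check() in a child process limited to `devices`.

    Returns the combined stdout/stderr text so it can be read manually.
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=devices)
    cmd = [sys.executable, "-c", "import paddle; paddle.utils.run_check()"]
    proc = subprocess.run(cmd, env=env, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True)
    return proc.stdout
```

For example, looping over device_subsets(8) and printing run_check_on(d) for each value shows whether the check already fails with 2 GPUs (pointing at inter-GPU communication in general) or only with all 8.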
Local environment:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800-SXM...  On   | 00000000:3D:00.0 Off |                    0 |
| N/A   35C    P0    68W / 400W |   6553MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800-SXM...  On   | 00000000:42:00.0 Off |                    0 |
| N/A   30C    P0    62W / 400W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A800-SXM...  On   | 00000000:61:00.0 Off |                    0 |
| N/A   30C    P0    60W / 400W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A800-SXM...  On   | 00000000:67:00.0 Off |                    0 |
| N/A   35C    P0    61W / 400W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A800-SXM...  On   | 00000000:AD:00.0 Off |                    0 |
| N/A   34C    P0    60W / 400W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A800-SXM...  On   | 00000000:B1:00.0 Off |                    0 |
| N/A   30C    P0    61W / 400W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A800-SXM...  On   | 00000000:D0:00.0 Off |                    0 |
| N/A   30C    P0    61W / 400W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A800-SXM...  On   | 00000000:D3:00.0 Off |                    0 |
| N/A   34C    P0    65W / 400W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
Repository: registry.baidubce.com/paddlepaddle/paddle paddlepaddle/paddle
I tried both; the error is the same in each case.