PaddlePaddle / Paddle

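PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 『飞桨』 PaddlePaddle: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)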
http://www.paddlepaddle.org/
Apache License 2.0
22.3k stars 5.62k forks

Distributed training error #52637

Closed zxcd closed 7 months ago

zxcd commented 1 year ago

Describe the Bug

Training on a single machine with multiple GPUs fails. The worklog.0 log is normal, but worklog.1 through worklog.3 show the following error:

NotImplementedError: (Unimplemented) Place Place(gpu:0) is not supported. Please check that your paddle compiles with WITH_GPU, WITH_XPU, WITH_IPU, WITH_MLU or WITH_ASCEND_CL option or check that your train process set the correct device id if you use Executor.
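To see at a glance which workers hit this error, the per-worker logs can be scanned. This is only a minimal sketch: the `worklog.*` file pattern follows the names used in this report, the `log` directory is an assumed location, and `scan_worker_logs` is a hypothetical helper, not part of Paddle.

```python
import re
from pathlib import Path

# Pattern taken from the error message quoted in this issue.
ERROR_PATTERN = re.compile(r"Place Place\((\w+:\d+)\) is not supported")


def scan_worker_logs(log_dir, pattern="worklog.*"):
    """Map each worker log file name to the places it reports as unsupported."""
    report = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        report[path.name] = ERROR_PATTERN.findall(text)
    return report


if __name__ == "__main__":
    # "log" is an assumed directory; use whatever directory the launcher wrote to.
    for name, places in scan_worker_logs("log").items():
        status = "OK" if not places else f"unsupported place(s): {places}"
        print(name, status)
```

If every failing worker reports the same place (for example `gpu:0`), that hints that those processes are all trying to use one device rather than their own.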

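The launch code is as follows: [image]

I have verified that this is not an environment problem: after switching to a different dataset, all 4 GPUs run normally. What could be causing this?

I tried Paddle 2.4.0, 2.4.1, and develop, all installed via pip.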

Additional Supplementary Information

No response

ForFishes commented 1 year ago

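Hello, there are a couple of possible causes here: 1. The installed Paddle build is a non-distributed version. 2. The distributed environment is initialized after the network has been built, in which case the initialization order needs to be adjusted.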
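Both suggestions can be checked with a few lines of Python. The following is a rough sketch, not the maintainers' procedure: `check_paddle_build`, `build_data_parallel_model`, and the `make_model` argument are hypothetical names, while `paddle.is_compiled_with_cuda()`, `paddle.distributed.init_parallel_env()`, and `paddle.DataParallel` are existing Paddle APIs.

```python
def check_paddle_build():
    """Report whether Paddle is installed and whether the build has CUDA support."""
    try:
        import paddle
    except ImportError:
        return {"installed": False}
    return {
        "installed": True,
        "version": paddle.__version__,
        "compiled_with_cuda": paddle.is_compiled_with_cuda(),
    }


def build_data_parallel_model(make_model):
    """Initialize the parallel environment first, then build and wrap the model.

    make_model is a hypothetical zero-argument callable returning a paddle.nn.Layer.
    """
    import paddle
    import paddle.distributed as dist

    dist.init_parallel_env()           # 1. set up the distributed environment
    model = make_model()               # 2. only then build the network
    return paddle.DataParallel(model)  # 3. wrap it for multi-GPU training


if __name__ == "__main__":
    print(check_paddle_build())
```

The point of the second function is the ordering only: anything that creates parameters or tensors runs after `init_parallel_env()`.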

zxcd commented 1 year ago

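Hello, I installed it with pip install paddlepaddle-gpu==2.4.1, which should be a build that supports distributed training. In the training code, the initialization also happens fairly early: https://github.com/PaddlePaddle/PaddleSpeech/blob/df37798598e8f32475892af819377101ace6d0a5/paddlespeech/s2t/training/trainer.py#L22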

lemondy commented 1 year ago

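I have the same problem. I installed the develop 2.5 version. Training works on a single machine with a single GPU, but as soon as I use multiple GPUs it hangs and never starts. Only the log for GPU 0 is normal, the other GPUs show 0% utilization, and the worker logs contain the same error: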

NotImplementedError: (Unimplemented) Place Place(gpu:0) is not supported. Please check that your paddle compiles with WITH_GPU, WITH_XPU, WITH_IPU, WITH_MLU or WITH_ASCEND_CL option or check that your train process set the correct device id if you use Executor.
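The error message itself asks to check that each training process sets the correct device id. One way to inspect this is to print, at the top of the training script, the environment each worker was started with. This is a sketch under assumptions: the variable names below are the ones `paddle.distributed.launch` is assumed to export and may differ between Paddle versions, and `describe_worker_env` is a hypothetical helper.

```python
import os

# Variable names assumed to be set for each worker by the launcher;
# treat this list as an assumption and adjust it for your Paddle version.
LAUNCH_ENV_VARS = (
    "PADDLE_TRAINER_ID",
    "PADDLE_TRAINERS_NUM",
    "FLAGS_selected_gpus",
    "CUDA_VISIBLE_DEVICES",
)


def describe_worker_env(environ=None):
    """Return the launcher-related variables seen by this process."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name, "<unset>") for name in LAUNCH_ENV_VARS}


if __name__ == "__main__":
    print(describe_worker_env())
```

Comparing this output across the worker logs shows whether every process was assigned a distinct GPU.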

paddle-bot[bot] commented 7 months ago

Since you haven't replied for more than a year, we have closed this issue/pr. If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up.