Closed: zxcd closed this issue 7 months ago
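Hello, there may be two issues here: 1. The installed paddle is a non-distributed build. 2. The distributed environment is being initialized after the network is built, so the initialization order needs to be adjusted.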
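Both points can be checked with a short script. The following is a minimal sketch, not the PaddleSpeech trainer code: `describe_paddle_build` and `build_parallel_model` are hypothetical helper names, and the `Linear` layer is only a placeholder network. The first helper reports whether the installed build was compiled with CUDA; the second shows the ordering suggested above, where the parallel environment is initialized before the network is created.

```python
import importlib


def describe_paddle_build():
    """Report basic facts about the installed paddle build (hypothetical helper)."""
    try:
        paddle = importlib.import_module("paddle")
    except ImportError:
        # paddle is not installed in this interpreter at all.
        return {"installed": False}
    return {
        "installed": True,
        "version": paddle.__version__,
        # False here means a CPU-only wheel, which cannot place tensors on gpu:N.
        "compiled_with_cuda": paddle.is_compiled_with_cuda(),
    }


def build_parallel_model():
    """Initialize the parallel environment first, then build the network."""
    import paddle
    import paddle.distributed as dist

    # Step 1: set up the distributed context before any layer is created.
    dist.init_parallel_env()

    # Step 2: build the network (placeholder layer, an assumption for illustration).
    model = paddle.nn.Linear(10, 1)

    # Step 3: wrap for data parallelism only when more than one process is running.
    if dist.get_world_size() > 1:
        model = paddle.DataParallel(model)
    return model


if __name__ == "__main__":
    print(describe_paddle_build())
```

Running the script inside each worker process, rather than only on the login shell, is what makes the check meaningful, since the failing ranks may see a different environment than rank 0.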
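Hello, I installed it with pip install paddlepaddle-gpu==2.4.1, which should be a build that supports distributed training. In the training code, the initialization also happens fairly early: https://github.com/PaddlePaddle/PaddleSpeech/blob/df37798598e8f32475892af819377101ace6d0a5/paddlespeech/s2t/training/trainer.py#L22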
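I have the same problem. I installed the develop 2.5 version. Training works on a single machine with a single GPU, but as soon as I switch to multiple GPUs it hangs and training never starts. Only the log for GPU 0 is normal; the other GPUs show 0% utilization, and the worker logs contain the same error: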
NotImplementedError: (Unimplemented) Place Place(gpu:0) is not supported. Please check that your paddle compiles with WITH_GPU, WITH_XPU, WITH_IPU, WITH_MLU or WITH_ASCEND_CL option or check that your train process set the correct device id if you use Executor.
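The error text suggests checking that each training process sets the correct device id. One way to see what each worker actually received is to log its launch-related environment variables at startup. The sketch below uses only the standard library; the variable names listed are an assumption about what the launcher exports and should be adjusted to match the environment, and `collect_launch_env` and `format_launch_env` are hypothetical helpers.

```python
import os

# Assumed names of variables a distributed launcher may export per worker.
LAUNCH_ENV_VARS = (
    "PADDLE_TRAINER_ID",
    "PADDLE_TRAINERS_NUM",
    "FLAGS_selected_gpus",
    "CUDA_VISIBLE_DEVICES",
)


def collect_launch_env(environ=None):
    """Return the launch-related variables, with None for any that are unset."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in LAUNCH_ENV_VARS}


def format_launch_env(environ=None):
    """Format the variables as one log line prefixed with the worker rank."""
    env = collect_launch_env(environ)
    rank = env["PADDLE_TRAINER_ID"] or "?"
    parts = [
        f"{name}={value if value is not None else '<unset>'}"
        for name, value in env.items()
    ]
    return f"[rank {rank}] " + " ".join(parts)


if __name__ == "__main__":
    # Call this at the very top of the training entry point in every worker.
    print(format_launch_env())
```

Comparing the lines printed by the workers that fail against the one that succeeds may show whether the non-zero ranks were handed a device id that is not visible to them.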
Since you haven't replied for more than a year, we have closed this issue/pr. If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up.
Describe the Bug
Multi-GPU training on a single machine fails. The worklog.0 log is normal, but worklog.1 through worklog.3 show the following error:
NotImplementedError: (Unimplemented) Place Place(gpu:0) is not supported. Please check that your paddle compiles with WITH_GPU, WITH_XPU, WITH_IPU, WITH_MLU or WITH_ASCEND_CL option or check that your train process set the correct device id if you use Executor.
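The launch code is as follows:
I have verified that this is not an environment problem: after switching to a different dataset, all 4 GPUs run normally. What could be causing this?
I have tried paddle 2.4.0, 2.4.1, and develop, all installed via pip.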
Additional Supplementary Information
No response