PaddlePaddle / awesome-DeepLearning

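Introductory deep learning courses, advanced courses, featured courses, academic case studies, industry practice cases, a deep learning knowledge encyclopedia, and an interview question bank. The course, case and knowledge of Deep Learning and AI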
Apache License 2.0

Multi-GPU training error - paddle2.1.2 #709

Open lwnyls opened 3 years ago

lwnyls commented 3 years ago

Training on a single GPU works fine:

```
python -m paddle.distributed.launch --gpus '0' train.py
```

But when I train on multiple GPUs:

```
python -m paddle.distributed.launch --gpus '0,1' train.py
```

the following error occurs:

```
NotImplementedError: (Unimplemented) Place CUDAPlace(0) is not supported. Please check that your paddle compiles with WITH_GPU, WITH_XPU or WITH_ASCEND_CL option or check that your train process set the correct device id if you use Executor. (at /paddle/paddle/fluid/platform/device_context.cc:88) [operator < gaussian_random > error]
Traceback (most recent call last):
  File "train.py", line 204, in <module>
    model = MNIST()
  File "train.py", line 94, in __init__
    self.conv1 = Conv2D(in_channels=1, out_channels=20, kernel_size=5, stride=1, padding=2)
  File "/root/anaconda3/lib/python3.8/site-packages/paddle/nn/layer/conv.py", line 633, in __init__
    super(Conv2D, self).__init__(
  File "/root/anaconda3/lib/python3.8/site-packages/paddle/nn/layer/conv.py", line 132, in __init__
    self.weight = self.create_parameter(
  File "/root/anaconda3/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 411, in create_parameter
    return self._helper.create_parameter(temp_attr, shape, dtype, is_bias,
  File "/root/anaconda3/lib/python3.8/site-packages/paddle/fluid/layer_helper_base.py", line 369, in create_parameter
    return self.main_program.global_block().create_parameter(
  File "/root/anaconda3/lib/python3.8/site-packages/paddle/fluid/framework.py", line 2895, in create_parameter
    initializer(param, self)
  File "/root/anaconda3/lib/python3.8/site-packages/paddle/fluid/initializer.py", line 355, in __call__
    op = block.append_op(
  File "/root/anaconda3/lib/python3.8/site-packages/paddle/fluid/framework.py", line 2921, in append_op
    _dygraph_tracer().trace_op(type,
  File "/root/anaconda3/lib/python3.8/site-packages/paddle/fluid/dygraph/tracer.py", line 43, in trace_op
    self.trace(type, inputs, outputs, attrs,
NotImplementedError: (Unimplemented) Place CUDAPlace(0) is not supported.
Please check that your paddle compiles with WITH_GPU, WITH_XPU or WITH_ASCEND_CL option or check that your train process set the correct device id if you use Executor. (at /paddle/paddle/fluid/platform/device_context.cc:88) [operator < gaussian_random > error]
```
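The error text asks whether each training process set the correct device id. As a debugging aid, the sketch below shows how a worker could report which GPU it was assigned before the model is built. It assumes that `paddle.distributed.launch` exports `FLAGS_selected_gpus` and `PADDLE_TRAINER_ID` to every worker process (this is an assumption about the launcher, not something stated in the thread); the helper names `worker_device` and `describe_worker` are hypothetical, and the code deliberately does not import Paddle.

```python
import os
from typing import Mapping, Optional


def worker_device(env: Optional[Mapping[str, str]] = None) -> str:
    """Return a device string such as "gpu:1" for the current worker.

    Assumption: the launcher exports FLAGS_selected_gpus (e.g. "1") to each
    worker. If it is missing, return "cpu" so the misconfiguration is visible.
    """
    env = os.environ if env is None else env
    selected = env.get("FLAGS_selected_gpus", "").strip()
    if not selected:
        return "cpu"
    # With one GPU per process the value is a single id; if a
    # comma-separated list is given, take the first entry.
    gpu_id = int(selected.split(",")[0])
    return f"gpu:{gpu_id}"


def describe_worker(env: Optional[Mapping[str, str]] = None) -> str:
    """Summarize rank, chosen device and CUDA_VISIBLE_DEVICES in one line."""
    env = os.environ if env is None else env
    rank = env.get("PADDLE_TRAINER_ID", "?")
    visible = env.get("CUDA_VISIBLE_DEVICES", "<unset>")
    return f"rank={rank} device={worker_device(env)} CUDA_VISIBLE_DEVICES={visible}"


if __name__ == "__main__":
    # Printed at the top of train.py, before MNIST() is created, this line
    # ends up in each worker's log and shows which device it thinks it owns.
    print(describe_worker())
    # In the real script the string could then be passed to
    # paddle.device.set_device(worker_device()) before any layer is built.
```

If one rank reports `cpu` or an unexpected `CUDA_VISIBLE_DEVICES`, that worker's environment is the first place to look.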

XYZ-916 commented 3 years ago

Try adding `paddle.device.set_device("gpu")` to your code? Also, take a look at what the `workerlog.*` files contain.
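To follow the suggestion of reading the `workerlog.*` files, a small script can print the tail of each worker's log so the errors from different ranks can be compared side by side. This is a sketch: it assumes the logs are in a directory named `log` (pass another path as the first argument if `--log_dir` was set differently), and `tail_worker_logs` is a hypothetical helper, not part of Paddle.

```python
import glob
import os
import sys
from typing import Dict, List


def tail_worker_logs(log_dir: str = "log", n_lines: int = 20) -> Dict[str, List[str]]:
    """Map each workerlog.* file name (sorted) to its last n_lines lines."""
    tails: Dict[str, List[str]] = {}
    for path in sorted(glob.glob(os.path.join(log_dir, "workerlog.*"))):
        # errors="replace" keeps the scan going if a log has odd bytes.
        with open(path, encoding="utf-8", errors="replace") as f:
            lines = f.read().splitlines()
        tails[os.path.basename(path)] = lines[-n_lines:]
    return tails


if __name__ == "__main__":
    directory = sys.argv[1] if len(sys.argv) > 1 else "log"
    for name, lines in tail_worker_logs(directory).items():
        print(f"===== {name} =====")
        print("\n".join(lines))
```

Comparing `workerlog.0` with `workerlog.1` shows whether only the second process fails, which narrows the problem to how that worker's device was set up.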