brbheart closed this issue 4 years ago
Traceback (most recent call last):
  File "train.py", line 777, in <module>
    train(args)
  File "train.py", line 164, in train
    place = fluid.CUDAPlace(0)
paddle.fluid.core_avx.EnforceNotMet:

C++ Call Stacks (More useful to developers):
0   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2   paddle::platform::GetCUDADeviceCount()

Error Message Summary:
PaddleCheckError: cudaGetDeviceCount failed in paddle::platform::GetCUDADeviceCountImpl, error code : 3, Please see detail in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038: initialization error at [/paddle/paddle/fluid/platform/gpu_info.cc:67]
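After logging in to the user's PaddleCloud machine and comparing the logs of the two training jobs, we found that the two jobs did not run on the same GPU devices. Our preliminary judgment is therefore that the GPU environment is causing the problem: the machine used by the second job may have no GPU, or its CUDA and cuDNN environment may not meet Paddle's requirements of CUDA 10 and cuDNN 7.6. @任文彬 from PaddleCloud is currently following up on this issue.
The user resolved the issue by increasing the number of trainer GPU cards.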
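For anyone hitting the same `cudaGetDeviceCount` initialization error, a quick way to compare the GPU environment of two jobs is to log what each machine actually exposes before `fluid.CUDAPlace(0)` is called. The snippet below is only a rough sketch and not part of Paddle: the helper names `parse_gpu_list`, `list_gpus` and `describe_gpu_environment` are made up for illustration, and it assumes the `nvidia-smi` tool is installed on the node (it returns an empty list if it is not).

```python
import os
import shutil
import subprocess


def parse_gpu_list(text):
    """Keep only the lines of `nvidia-smi -L` output that describe a GPU."""
    return [
        line.strip()
        for line in text.splitlines()
        if line.strip().startswith("GPU ")
    ]


def list_gpus():
    """Return the GPUs the driver reports, or [] if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],  # -L lists the GPUs visible to the driver
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return parse_gpu_list(result.stdout)


def describe_gpu_environment():
    """Collect the values worth comparing between two training jobs."""
    return {
        # None means the variable is unset; an empty string hides all GPUs.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "gpus": list_gpus(),
    }


if __name__ == "__main__":
    info = describe_gpu_environment()
    print("CUDA_VISIBLE_DEVICES =", info["CUDA_VISIBLE_DEVICES"])
    if not info["gpus"]:
        print("No GPU reported by nvidia-smi; fluid.CUDAPlace(0) is likely to fail.")
    for gpu in info["gpus"]:
        print(gpu)
```

Running this at the start of both jobs and diffing the output should show whether one node has no usable GPU. It does not check the CUDA or cuDNN versions, so a mismatch with the CUDA 10 / cuDNN 7.6 requirement mentioned above would still need to be checked separately.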