PaddlePaddle / Paddle

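PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the 『飞桨』 (PaddlePaddle) core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)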
http://www.paddlepaddle.org/
Apache License 2.0
22.14k stars, 5.56k forks

PaddleCheckError: cudaGetDeviceCount failed in paddle::platform::GetCUDADeviceCountImpl, error code : 3 #22467

Closed: brbheart closed this issue 4 years ago

brbheart commented 4 years ago

Traceback (most recent call last):
  File "train.py", line 777, in <module>
    train(args)
  File "train.py", line 164, in train
    place = fluid.CUDAPlace(0)
paddle.fluid.core_avx.EnforceNotMet:


C++ Call Stacks (More useful to developers):

0   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2   paddle::platform::GetCUDADeviceCount()


Error Message Summary:

PaddleCheckError: cudaGetDeviceCount failed in paddle::platform::GetCUDADeviceCountImpl, error code : 3, Please see detail in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038: initialization error at [/paddle/paddle/fluid/platform/gpu_info.cc:67]

hong19860320 commented 4 years ago

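After logging into the user's PaddleCloud machine and comparing the logs of the two training jobs, I found that the two jobs did not run on the same GPU devices. My initial assessment is therefore that the GPU environment is causing the problem: the machine used by the second job may have no GPU, or its CUDA and cuDNN environment may not meet Paddle's requirement of CUDA 10 and cuDNN 7.6. @任文彬 from PaddleCloud is currently following up on this issue.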
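To tell these cases apart before `fluid.CUDAPlace(0)` is reached, a pre-flight check along the following lines might help. This is only a sketch using the Python standard library: the helper names (`visible_gpu_ids`, `driver_gpu_count`, `can_use_gpu`) are hypothetical and not part of Paddle, and it assumes that `nvidia-smi` is on the `PATH` of a machine with a working driver. It does not verify the CUDA or cuDNN versions.

```python
import os
import shutil
import subprocess


def visible_gpu_ids(env=None):
    """Return the entries of CUDA_VISIBLE_DEVICES as a list, or None if it is unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # unset: the CUDA runtime is not restricted to a subset
    return [d.strip() for d in raw.split(",") if d.strip()]


def driver_gpu_count():
    """Count GPUs listed by `nvidia-smi -L`; return 0 if the tool is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return 0
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return 0
    if result.returncode != 0:
        return 0
    # Each physical GPU is listed on a line that starts with "GPU <index>:".
    return sum(1 for line in result.stdout.splitlines() if line.startswith("GPU "))


def can_use_gpu(env=None, gpu_count=None):
    """Hypothetical pre-flight check: is at least one GPU visible to this process?"""
    ids = visible_gpu_ids(env)
    if ids is not None and not ids:
        return False  # CUDA_VISIBLE_DEVICES is set but empty, so every GPU is hidden
    count = driver_gpu_count() if gpu_count is None else gpu_count
    return count > 0
```

In `train.py`, the result of `can_use_gpu()` could be used to choose between `fluid.CUDAPlace(0)` and `fluid.CPUPlace()`, or to exit early with a clearer message than the `EnforceNotMet` stack above.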

hong19860320 commented 4 years ago

The user resolved the problem by increasing the number of trainer cards (GPUs).