PaddlePaddle / Paddle

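PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 『飞桨』 (PaddlePaddle): high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)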
http://www.paddlepaddle.org/
Apache License 2.0
22.14k stars 5.56k forks

Training completes, but (External) CUDA error(719), unspecified launch failure occurs during inference. #45742

Open ABC1234-gitup opened 2 years ago

ABC1234-gitup commented 2 years ago

Describe the Bug

(python_env) D:\HYD\PaddleDetection-release-2.3>python -u ./tools/eval.py -c configs/centernet/centernet_r50_140e_coco.yml -o weights=./work/output/centernet_r50_140e_coco/0.pdparams
Warning: import ppdet from source directory without installing, run 'python setup.py install' to install ppdet firstly
W0905 10:25:07.250306 7876 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.7, Runtime API Version: 11.6
W0905 10:25:07.253298 7876 gpu_resources.cc:91] device: 0, cuDNN Version: 8.4.
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
[09/05 10:25:08] ppdet.utils.checkpoint INFO: Finish loading model weights: ./work/output/centernet_r50_140e_coco/0.pdparams
Traceback (most recent call last):
  File "./tools/eval.py", line 152, in <module>
    main()
  File "./tools/eval.py", line 148, in main
    run(FLAGS, cfg)
  File "./tools/eval.py", line 106, in run
    trainer.evaluate()
  File "D:\HYD\PaddleDetection-release-2.3\ppdet\engine\trainer.py", line 503, in evaluate
    self._eval_with_loader(self.loader)
  File "D:\HYD\PaddleDetection-release-2.3\ppdet\engine\trainer.py", line 481, in _eval_with_loader
    outs = self.model(data)
  File "C:\Users\user.conda\envs\python_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "C:\Users\user.conda\envs\python_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "D:\HYD\PaddleDetection-release-2.3\ppdet\modeling\architectures\meta_arch.py", line 56, in forward
    out = self.get_pred()
  File "D:\HYD\PaddleDetection-release-2.3\ppdet\modeling\architectures\centernet.py", line 100, in get_pred
    scale_factor=self.inputs['scale_factor'])
  File "D:\HYD\PaddleDetection-release-2.3\ppdet\modeling\post_process.py", line 464, in __call__
    scores, inds, topk_clses, ys, xs = self._topk(heat)
  File "D:\HYD\PaddleDetection-release-2.3\ppdet\modeling\layers.py", line 815, in _topk
    topk_xs = topk_inds % width
  File "C:\Users\user.conda\envs\python_env\lib\site-packages\paddle\fluid\dygraph\math_op_patch.py", line 299, in __impl__
    return math_op(self, other_var, 'axis', axis)
OSError: (External) CUDA error(719), unspecified launch failure.
  [Hint: 'cudaErrorLaunchFailure'. An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory. Less common cases can be system specific - more information about these cases can be found in the system specific user guide. This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched.] (at ..\paddle\phi\backends\gpu\gpu_context.cc:435)
  [operator < elementwise_mod > error]

Additional Supplementary Information

Environment: python 3.7.13, paddlepaddle-gpu 2.3.2, cudatoolkit 11.6.0, cudnn 8.4.1.50, Windows 10. Besides paddlepaddle-gpu, my virtual environment also contains pytorch 1.12.1.
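
For readers comparing their own setup: the startup log above already reports the driver and runtime API versions, and the CUDA driver version has to be at least as new as the runtime version. A minimal sketch that pulls those fields out of the log line follows; `parse_gpu_log` is a hypothetical helper written for illustration, not part of Paddle.

```python
import re

# Log line copied from the report above.
LOG_LINE = (
    "W0905 10:25:07.250306 7876 gpu_resources.cc:61] Please NOTE: device: 0, "
    "GPU Compute Capability: 7.5, Driver API Version: 11.7, "
    "Runtime API Version: 11.6"
)

# Matches the three "major.minor" fields in the order Paddle printed them here.
_PATTERN = re.compile(
    r"GPU Compute Capability: (?P<capability>\d+\.\d+), "
    r"Driver API Version: (?P<driver>\d+\.\d+), "
    r"Runtime API Version: (?P<runtime>\d+\.\d+)"
)


def parse_gpu_log(line):
    """Hypothetical helper: return the version fields as integer tuples,
    or None if the line does not contain them."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    return {
        name: tuple(int(part) for part in value.split("."))
        for name, value in match.groupdict().items()
    }


info = parse_gpu_log(LOG_LINE)
# Tuples compare element by element, so (11, 7) >= (11, 6) holds.
driver_supports_runtime = info["driver"] >= info["runtime"]
print(info, driver_supports_runtime)
```

In this report the driver (11.7) is newer than the runtime (11.6), so an outdated driver does not appear to be the cause.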

paddle-bot[bot] commented 2 years ago

Hi! We've received your issue; please be patient while waiting for a response. We will arrange for technicians to answer your question as soon as possible. Please double-check that you have provided a clear problem description, reproduction code, environment and version details, and the error message. You may also check the API documentation, the FAQ, past GitHub Issues, and the AI community for an answer. Have a nice day!

wangxinxin08 commented 2 years ago

Could you check whether eval works with the official PaddleDetection weights: https://bj.bcebos.com/v1/paddledet/models/centernet_r50_140e_coco.pdparams ? If eval still fails, I recommend using docker or installing the combination of cuda 11.6.2 and cudnn 8.4.0.

ABC1234-gitup commented 2 years ago

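The same error occurs after changing the weights. I chose to install the cuda 11.6.2 and cudnn 8.4.0 combination. In the environment I installed pytorch first and downloaded cudnn-windows-x86_64-8.4.0.27_cuda11.6-archive. After pytorch was installed I installed paddlepaddle, but the install command from the official website automatically installs cudnn 8.4.1. Is it possible to specify the cudnn version when installing paddlepaddle from the command line?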

wangxinxin08 commented 2 years ago

The cudnn version cannot be specified. Please first check whether the problem still occurs with the current setup.

ABC1234-gitup commented 2 years ago

I tried it and the same problem still occurs.

Traceback (most recent call last):
  File "./tools/eval.py", line 153, in <module>
    main()
  File "./tools/eval.py", line 148, in main
    run(FLAGS, cfg)
  File "./tools/eval.py", line 106, in run
    trainer.evaluate()
  File "D:\HYD\95-99\PaddleDetection-release-2.3\ppdet\engine\trainer.py", line 503, in evaluate
    self._eval_with_loader(self.loader)
  File "D:\HYD\95-99\PaddleDetection-release-2.3\ppdet\engine\trainer.py", line 481, in _eval_with_loader
    outs = self.model(data)
  File "C:\Users\HENGIDEAL.conda\envs\python_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "C:\Users\HENGIDEAL.conda\envs\python_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "D:\HYD\95-99\PaddleDetection-release-2.3\ppdet\modeling\architectures\meta_arch.py", line 56, in forward
    out = self.get_pred()
  File "D:\HYD\95-99\PaddleDetection-release-2.3\ppdet\modeling\architectures\centernet.py", line 100, in get_pred
    scale_factor=self.inputs['scale_factor'])
  File "D:\HYD\95-99\PaddleDetection-release-2.3\ppdet\modeling\post_process.py", line 464, in __call__
    scores, inds, topk_clses, ys, xs = self._topk(heat)
  File "D:\HYD\95-99\PaddleDetection-release-2.3\ppdet\modeling\layers.py", line 815, in _topk
    topk_xs = topk_inds % width
  File "C:\Users\HENGIDEAL.conda\envs\python_env\lib\site-packages\paddle\fluid\dygraph\math_op_patch.py", line 299, in __impl__
    return math_op(self, other_var, 'axis', axis)
OSError: (External) CUDA error(719), unspecified launch failure.
  [Hint: 'cudaErrorLaunchFailure'. An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory. Less common cases can be system specific - more information about these cases can be found in the system specific user guide. This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched.] (at ..\paddle\phi\backends\gpu\gpu_context.cc:435)
  [operator < elementwise_mod > error]

wangxinxin08 commented 2 years ago

I suggest rolling the cuda version back to 11.2 or 10.2; a cuda version that is too new may cause problems.
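
Before and after switching CUDA versions, it can help to confirm which CUDA and cuDNN versions the installed Paddle wheel was built against. The sketch below assumes a Paddle 2.x install; `report_paddle_build` is a hypothetical helper name, and it returns None when Paddle cannot be imported.

```python
def report_paddle_build():
    """Hypothetical helper: collect the build info of the installed Paddle,
    or return None if Paddle is not importable in this environment."""
    try:
        import paddle
    except ImportError:
        return None
    return {
        "paddle": paddle.__version__,
        # True only for GPU builds of the wheel.
        "compiled_with_cuda": paddle.device.is_compiled_with_cuda(),
        # CUDA / cuDNN versions the wheel was compiled against.
        "cuda": paddle.version.cuda(),
        "cudnn": paddle.version.cudnn(),
    }


build_info = report_paddle_build()
print(build_info)
```

On a working GPU install, `paddle.utils.run_check()` can additionally be called to run Paddle's own installation self-test.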

wangxinxin08 commented 2 years ago

Also, please test whether the following code runs normally:

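# Standalone version of the top-k step that fails in the traceback (_topk)
import paddle

scores = paddle.randn((1, 2, 3, 3))  # dummy heatmap
k = 10
shape_fm = paddle.shape(scores)  # shape as a tensor
shape_fm.stop_gradient = True
cat, height, width = shape_fm[1], shape_fm[2], shape_fm[3]
scores_r = paddle.reshape(scores, [cat, -1])  # flatten each category map
topk_scores, topk_inds = paddle.topk(scores_r, k)
topk_ys = topk_inds // width  # row index from the flat index
topk_xs = topk_inds % width  # column index (the elementwise_mod op from the traceback)

ABC1234-gitup commented 2 years ago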

It does not run normally. The following error occurs: Error: ../paddle/phi/kernels/funcs/elementwise_functor.h:545 Assertion b != 0 failed. InvalidArgumentError: Integer division by zero encountered in (floor) divide. Please check the input value.
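
The assertion concerns the divisor of the floor-divide, which in the test snippet is presumably `width`, even though the input was created with a width of 3. For comparison, here is a plain-Python sketch of the arithmetic that step is meant to perform, with an explicit guard on the divisor; `split_flat_indices` is a hypothetical name and the code does not use Paddle or the GPU.

```python
def split_flat_indices(flat_indices, width):
    """Hypothetical helper: convert flat indices into (row, column) pairs
    for a feature map with `width` columns."""
    if width <= 0:
        # Mirrors the "b != 0" check from the error message above.
        raise ValueError(f"width must be positive, got {width}")
    ys = [index // width for index in flat_indices]  # row of each index
    xs = [index % width for index in flat_indices]   # column of each index
    return ys, xs


# Flat positions 0, 4 and 8 in a 3x3 map lie on the main diagonal.
ys, xs = split_flat_indices([0, 4, 8], 3)
print(ys, xs)
```

With a width of 3 the division is well defined on the CPU, so a zero divisor showing up on the GPU would be consistent with the environment problem discussed in this thread rather than with the arithmetic itself.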

wangxinxin08 commented 2 years ago

I suggest rolling back the CUDA version first.

ABC1234-gitup commented 2 years ago

After rolling cuda back to version 10.2, the problem is solved. Thank you very much!