PaddlePaddle / PaddleCustomDevice

PaddlePaddle custom device implementation. (Custom hardware integration for PaddlePaddle 『飞桨』)
Apache License 2.0

Training fails with OSError: (External) ACL error, the error code is : 500002. (at /home/ma-user/work/ascend/PaddleCustomDevice/backends/npu/kernels/funcs/npu_op_runner.cc:626) #1293

Closed wenshuaishuai123 closed 4 months ago

wenshuaishuai123 commented 5 months ago

Versions: paddle 2.6.1, CANN 7.0.0, 910PremiumA.

The environment verification steps all succeeded with no errors (screenshot attached). Source code: https://github.com/lyuwenyu/RT-DETR

However, training with the official RT-DETR code fails. The error log is below:

Traceback (most recent call last):
  File "/home/ma-user/work/rtdetr_paddle/tools/train.py", line 183, in <module>
    main()
  File "/home/ma-user/work/rtdetr_paddle/tools/train.py", line 179, in main
    run(FLAGS, cfg)
  File "/home/ma-user/work/rtdetr_paddle/tools/train.py", line 135, in run
    trainer.train(FLAGS.eval)
  File "/home/ma-user/work/rtdetr_paddle/ppdet/engine/trainer.py", line 377, in train
    outputs = model(data)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/ma-user/work/rtdetr_paddle/ppdet/modeling/architectures/meta_arch.py", line 60, in forward
    out = self.get_loss()
  File "/home/ma-user/work/rtdetr_paddle/ppdet/modeling/architectures/detr.py", line 113, in get_loss
    return self._forward()
  File "/home/ma-user/work/rtdetr_paddle/ppdet/modeling/architectures/detr.py", line 87, in _forward
    out_transformer = self.transformer(body_feats, pad_mask, self.inputs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/ma-user/work/rtdetr_paddle/ppdet/modeling/transformers/rtdetr_transformer.py", line 419, in forward
    get_contrastive_denoising_training_group(gt_meta,
  File "/home/ma-user/work/rtdetr_paddle/ppdet/modeling/transformers/utils.py", line 337, in get_contrastive_denoising_training_group
    attn_mask[num_denoising:, :num_denoising] = True
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/paddle/base/dygraph/tensor_patch_methods.py", line 897, in __setitem__
    return self._setitem_dygraph(item, value)
OSError: (External) ACL error, the error code is : 500002. (at /home/ma-user/work/ascend/PaddleCustomDevice/backends/npu/kernels/funcs/npu_op_runner.cc:626)

qili93 commented 5 months ago

Hello, please first follow the method in the README below to check whether your device is a 910A or a 910B. https://github.com/PaddlePaddle/PaddleCustomDevice/blob/release/2.6/backends/npu/README_cn.md


Next, please provide the Paddle and custom device version information by running the following commands and posting their output:

python -c "import paddle; paddle.version.show()"
python -c "import paddle_custom_device; paddle_custom_device.npu.version()"
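
If it is more convenient to gather both outputs in one go, the two commands above could be wrapped as follows. This is only a minimal sketch: the helper names `run_snippet` and `collect_report` are hypothetical and not part of Paddle or PaddleCustomDevice, and the snippets themselves are copied unchanged from the commands above.

```python
import subprocess
import sys

# The two snippets quoted above, passed to `python -c` unchanged.
VERSION_SNIPPETS = [
    "import paddle; paddle.version.show()",
    "import paddle_custom_device; paddle_custom_device.npu.version()",
]


def run_snippet(code):
    """Run one snippet in a fresh interpreter and return stdout and stderr together."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return result.stdout.strip()


def collect_report(snippets=VERSION_SNIPPETS):
    """Build a plain-text report: each command line followed by its output."""
    sections = []
    for code in snippets:
        sections.append(f"$ python -c \"{code}\"\n{run_snippet(code)}")
    return "\n\n".join(sections)


if __name__ == "__main__":
    # Paste the printed text into the issue.
    print(collect_report())
```

Because stderr is merged into the captured output, an import failure shows up in the report instead of being lost, which is itself useful information for the maintainers.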

According to the error message, the failure occurs inside the Ascend CANN software stack. Please refer to the following document: https://www.hiascend.com/document/detail/zh/canncommercial/700/reference/envvar/envref_07_0105.html

Run the following steps to find out which operator CANN is actually failing on:

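# Suggested steps for locating the error
# 0. First clear all log files under /root/ascend/log/debug/
rm -rf /root/ascend/log/debug/*
# 1. Enable INFO-level logging (or DEBUG level)
export ASCEND_GLOBAL_LOG_LEVEL=1
# 2. Rerun the failing program until the ACL error is raised
python xxx.py
# 3. Inspect the log lines around each ERROR to locate the error message
grep ERROR /root/ascend/log/debug/plog/*.log -C 20
# If there is no obvious hint, you can try:
# - contacting Huawei technical support in the Huawei technical WeChat group
# - asking at https://gitee.com/ascend/modelzoo/issues (Ascend)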
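
For cases where `grep` output is awkward to share, step 3 can also be approximated in Python. The sketch below is an assumption-laden stand-in, not an official tool: `find_error_context` is a hypothetical helper name, the log directory is the one quoted in the steps above, and unlike `grep -C` it does not merge overlapping context windows.

```python
from pathlib import Path


def find_error_context(log_dir, keyword="ERROR", context=20):
    """Yield (file name, 1-based line number, surrounding lines) for each hit."""
    for log_file in sorted(Path(log_dir).glob("*.log")):
        lines = log_file.read_text(errors="replace").splitlines()
        for index, line in enumerate(lines):
            if keyword in line:
                start = max(0, index - context)
                end = min(len(lines), index + context + 1)
                yield log_file.name, index + 1, lines[start:end]


if __name__ == "__main__":
    # Directory taken from the steps above; adjust if logs are written elsewhere.
    for name, line_no, snippet in find_error_context("/root/ascend/log/debug/plog"):
        print(f"--- {name}:{line_no} ---")
        print("\n".join(snippet))
```

Each hit is reported with its file name and line number, so the relevant excerpt can be pasted into the issue without attaching the whole log.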

wenshuaishuai123 commented 5 months ago

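Thanks for the reply. With a 910B3 and CANN 8.0 I can train, and the training loss log shows up normally, but it takes far too long: the displayed estimate says training would need 710 days to finish. I suspect the underlying hardware is the cause. The PyTorch version of RT-DETR is also slow, but still faster than the Paddle one. My main question is whether there are officially adapted models that train normally, for example a document listing which models in PaddleDetection are usable and which are not.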

qili93 commented 5 months ago

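When PaddlePaddle 3.0 beta is released in the near future, official documentation on Ascend 910B support for PaddleDetection will be published on the official website. This is expected at the end of June or in July. For details, you can contact Wang Kai (@onecatcn), the product lead for the PaddlePaddle hardware ecosystem. Thanks!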