Closed wenshuaishuai123 closed 4 months ago
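Hello. First, please follow the method in the README below to check whether your device is a 910A or a 910B: https://github.com/PaddlePaddle/PaddleCustomDevice/blob/release/2.6/backends/npu/README_cn.md
Next, please provide your Paddle version information by running the following commands and sharing their output: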
python -c "import paddle; paddle.version.show()"
python -c "import paddle_custom_device; paddle_custom_device.npu.version()"
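The two commands above can also be combined into one small script, which is handy when pasting the result into an issue. This is only a sketch: `collect_versions` is a hypothetical helper name, and it assumes nothing beyond `paddle.__version__` and the `paddle_custom_device.npu.version()` call quoted above. A missing or broken package is reported instead of raising.

```python
import importlib


def collect_versions():
    """Collect Paddle and NPU plugin status without crashing on a missing package."""
    report = {}

    try:
        paddle = importlib.import_module("paddle")
        report["paddle"] = str(paddle.__version__)
    except Exception as exc:  # ImportError, or a loader error from a broken install
        report["paddle"] = f"unavailable: {exc}"

    try:
        plugin = importlib.import_module("paddle_custom_device")
        # Same call as in the command above; it prints its own report.
        plugin.npu.version()
        report["paddle_custom_device"] = "available (details printed above)"
    except Exception as exc:
        report["paddle_custom_device"] = f"unavailable: {exc}"

    return report


if __name__ == "__main__":
    for name, status in collect_versions().items():
        print(f"{name}: {status}")
```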
According to the error message, the failure occurs inside the Ascend CANN software stack. Please refer to the following document: https://www.hiascend.com/document/detail/zh/canncommercial/700/reference/envvar/envref_07_0105.html
Then run the steps below to find out which operator CANN is failing on.
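# Suggested steps for locating the error
# 0. First clear all log files under /root/ascend/log/debug/
rm -rf /root/ascend/log/debug/*
# 1. Enable INFO-level logging (or DEBUG level)
export ASCEND_GLOBAL_LOG_LEVEL=1
# 2. Re-run the failing program until it throws the ACL error
python xxx.py
# 3. Inspect the lines around each ERROR in the log files to locate the error message
grep ERROR /root/ascend/log/debug/plog/*.log -C 20
# If there is no obvious hint, you can try:
# contacting Huawei technical support in the Huawei technical WeChat group
# asking at Ascend https://gitee.com/ascend/modelzoo/issues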
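For logs that are awkward to read with `grep`, step 3 above can be approximated in Python. This is a minimal sketch, not part of any Ascend tooling: `find_errors` is a hypothetical helper, and it assumes only that the plog directory holds plain-text `*.log` files as in the `grep` command.

```python
from pathlib import Path


def find_errors(log_dir, keyword="ERROR", context=20):
    """Return (file name, 1-based line number, snippet) for each keyword hit.

    Mirrors `grep KEYWORD log_dir/*.log -C context`, except that overlapping
    context windows are reported once per hit instead of being merged.
    """
    hits = []
    for log_file in sorted(Path(log_dir).glob("*.log")):
        lines = log_file.read_text(errors="replace").splitlines()
        for i, line in enumerate(lines):
            if keyword in line:
                start = max(0, i - context)
                end = min(len(lines), i + context + 1)
                snippet = "\n".join(lines[start:end])
                hits.append((log_file.name, i + 1, snippet))
    return hits
```

Calling `find_errors("/root/ascend/log/debug/plog")` after reproducing the failure returns one entry per ERROR line; printing the third element of each entry shows the surrounding context.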
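Thanks for the reply. I'm using a 910B3 with CANN 8.0, and training does run: the loss log appears normally. But it is far too slow; the displayed ETA says training needs 710 days to finish. I suspect the cause is in the underlying hardware. The PyTorch version of RT-DETR is also slow, but faster than the Paddle one. My main question: are there officially adapted models that train properly, for example documentation listing which PaddleDetection models work and which do not?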
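The official documentation for PaddleDetection support on Ascend 910B will be published on the official website with the upcoming PaddlePaddle 3.0 beta release, which is expected at the end of June or in July. For details, please contact Wang Kai @onecatcn, the product lead for the PaddlePaddle hardware ecosystem. Thanks!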
Versions: paddle2.6.1, cann=7.0.0, 910PremiumA
main()
File "/home/ma-user/work/rtdetr_paddle/tools/train.py", line 179, in main
run(FLAGS, cfg)
File "/home/ma-user/work/rtdetr_paddle/tools/train.py", line 135, in run
trainer.train(FLAGS.eval)
File "/home/ma-user/work/rtdetr_paddle/ppdet/engine/trainer.py", line 377, in train
outputs = model(data)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
return self.forward(*inputs, **kwargs)
File "/home/ma-user/work/rtdetr_paddle/ppdet/modeling/architectures/meta_arch.py", line 60, in forward
out = self.get_loss()
File "/home/ma-user/work/rtdetr_paddle/ppdet/modeling/architectures/detr.py", line 113, in get_loss
return self._forward()
File "/home/ma-user/work/rtdetr_paddle/ppdet/modeling/architectures/detr.py", line 87, in _forward
out_transformer = self.transformer(body_feats, pad_mask, self.inputs)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
return self.forward(*inputs, **kwargs)
File "/home/ma-user/work/rtdetr_paddle/ppdet/modeling/transformers/rtdetr_transformer.py", line 419, in forward
get_contrastive_denoising_training_group(gt_meta,
File "/home/ma-user/work/rtdetr_paddle/ppdet/modeling/transformers/utils.py", line 337, in get_contrastive_denoising_training_group
attn_mask[num_denoising:, :num_denoising] = True
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/paddle/base/dygraph/tensor_patch_methods.py", line 897, in __setitem__
return self._setitem_dygraph(item, value)
OSError: (External) ACL error, the error code is : 500002. (at /home/ma-user/work/ascend/PaddleCustomDevice/backends/npu/kernels/funcs/npu_op_runner.cc:626)
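The environment verification steps all passed with no errors. Source code: https://github.com/lyuwenyu/RT-DETR. However, training with the official RT-DETR code fails. The error log is below:
Traceback (most recent call last):
File "/home/ma-user/work/rtdetr_paddle/tools/train.py", line 183, in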