PaddlePaddle / PaddleCustomDevice

PaddlePaddle custom device implementation. (飞桨 custom hardware integration implementation)
Apache License 2.0

[NPU] Training PaddleDetection's ppyolo_r50d_dcn model fails partway through and exits with an error #706

Closed 535205856 closed 5 months ago

535205856 commented 1 year ago

The host is an Ubuntu machine with an Ascend 910 NPU and a Kunpeng 920 ARM CPU. The image is the one from the NPU documentation: registry.baidubce.com/device/paddle-npu:cann601-ubuntu18-aarch64-gcc82. PaddleDetection is version v2.6.0.

The model is ppyolo_r50d_dcn and the dataset is roadsign.

python tools/train.py -c configs/ppyolo/ppyolo_r50d_dcn_roadsign.yml -o use_npu=True

Training fails partway through with an error.

-------------------------------------- Separator: error log from the training window

[07/20 21:41:52] ppdet.engine INFO: Epoch: [1] [19/58] learning_rate: 0.000064 loss_xy: 1.350998 loss_wh: 5.382170 loss_iou: 5.030044 loss_iou_aware: 0.926509 loss_obj: 4009.031738 loss_cls: 4.006392 loss: 4025.727783 eta: 45 days, 12:14:19 batch_cost: 0.6057 data_cost: 0.0006 ips: 19.8111 images/s
[07/20 21:42:33] ppdet.engine INFO: Epoch: [1] [20/58] learning_rate: 0.000065 loss_xy: 3.250648 loss_wh: nan loss_iou: 7.106074 loss_iou_aware: nan loss_obj: nan loss_cls: nan loss: nan eta: 44 days, 22:28:24 batch_cost: 0.7804 data_cost: 0.0005 ips: 15.3774 images/s
Call aclrtSynchronizeStream(reinterpret_cast(stream)) failed : 507015 at file /workspace/PaddleCustomDevice/backends/npu/runtime/runtime.cc line 408
E10404: Output indexed [0] requires a 18446744073709551615 buffer, but 589856 (aligned) are allocated.
Solution: Check whether the data type, dimensions, and shape are correctly set. For details, see the aclGetTensorDescSize API description in AscendCL API Reference.
TraceBack (most recent call last):
[Exec][Op]Execute op failed. ge result = 145000[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
Output indexed [0] requires a 18446744073709551615 buffer, but 36896 (aligned) are allocated.
Output indexed [0] requires a 18446744073709551615 buffer, but 147488 (aligned) are allocated.
Output indexed [0] requires a 18446744073709551615 buffer, but 278816 (aligned) are allocated.
Output indexed [0] requires a 18446744073709551615 buffer, but 17472 (aligned) are allocated.
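
For triaging a training log like the one above, a small script can locate the first step whose metrics turn NaN and count the E10404 messages whose requested size is 2**64 - 1 (the value -1 reinterpreted as an unsigned 64-bit integer, which suggests an invalid size rather than a real allocation request). This is only a sketch: the regular expressions assume the exact line format shown in this log, the sample lines are abbreviated copies of it, and the function names are hypothetical.

```python
import re

# Progress lines as printed by ppdet.engine in the log above (format assumed from that log).
STEP_RE = re.compile(r"Epoch: \[(\d+)\] \[\s*(\d+)/\s*(\d+)\]")
# Metric fields whose value is NaN, e.g. "loss_wh: nan".
NAN_RE = re.compile(r"(\w+): nan\b")
# Requested size in the E10404 messages.
BUF_RE = re.compile(r"requires a (\d+) buffer")

UINT64_MAX = 2**64 - 1  # -1 reinterpreted as an unsigned 64-bit integer


def first_nan_step(lines):
    """Return (epoch, step, nan_fields) for the first progress line with NaN, else None."""
    for line in lines:
        match = STEP_RE.search(line)
        if not match:
            continue
        nan_fields = NAN_RE.findall(line)
        if nan_fields:
            return int(match.group(1)), int(match.group(2)), nan_fields
    return None


def wrapped_buffer_requests(lines):
    """Count buffer requests whose size equals 2**64 - 1."""
    return sum(
        1
        for line in lines
        for size in BUF_RE.findall(line)
        if int(size) == UINT64_MAX
    )


# Abbreviated lines taken from the log in this issue.
sample = [
    "[07/20 21:41:52] ppdet.engine INFO: Epoch: [1] [19/58] learning_rate: 0.000064 "
    "loss_xy: 1.350998 loss_wh: 5.382170 loss: 4025.727783",
    "[07/20 21:42:33] ppdet.engine INFO: Epoch: [1] [20/58] learning_rate: 0.000065 "
    "loss_xy: 3.250648 loss_wh: nan loss_iou: 7.106074 loss_iou_aware: nan "
    "loss_obj: nan loss_cls: nan loss: nan",
    "E10404: Output indexed [0] requires a 18446744073709551615 buffer, "
    "but 589856 (aligned) are allocated.",
]

print(first_nan_step(sample))
print(wrapped_buffer_requests(sample))
```

On the sample, the first function reports epoch 1, step 20, which is the last step logged before the stream synchronization failure.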

---------------------------------------- Separator: CANN error log

[ERROR] ASCENDCL(1082179,python):2023-07-20-21:43:13.555.217 [stream.cpp:104]1082179 aclrtSynchronizeStream: [FINAL][FINAL]synchronize stream failed, runtime result = 507015
[ERROR] RUNTIME(1082179,python):2023-07-20-21:43:13.561.365 [device_msg_handler.cc:156]1082179 HandleMsgInHostBuf:[FINAL][FINAL] DEVICE[0] PID[1082179]:
EXCEPTION STREAM: Exception info:TGID=1421543, model id=65535, stream id=2, stream phase=3 Message info[0]:stream sq's task full(1024), head=714 tail=713 pid=1421543 Other info[0]:time=2023-07-20-11:11:46.148.823, function=put_sq_cmd_to_stream_sq, line=1710, error code=0x94
EXCEPTION STREAM: Exception info:TGID=1421543, model id=65535, stream id=2, stream phase=3 Message info[0]:stream sq's task full(1024), head=714 tail=713 pid=1421543 Other info[0]:time=2023-07-2
[ERROR] RUNTIME(1082179,python):2023-07-20-21:43:13.561.374 [device_msg_handler.cc:156]1082179 HandleMsgInHostBuf:[FINAL][FINAL]0-11:11:46.148.857, function=put_sq_cmd_to_stream_sq, line=1710, error code=0x94
EXCEPTION STREAM: Exception info:TGID=1421543, model id=65535, stream id=2, stream phase=3 Message info[0]:stream sq's task full(1024), head=714 tail=713 pid=1421543 Other info[0]:time=2023-07-20-11:11:46.148.867, function=put_sq_cmd_to_stream_sq, line=1710, error code=0x94
EXCEPTION STREAM: Exception info:TGID=1421543, model id=65535, stream id=2, stream phase=3 Message info[0]:stream sq's task full(1024), head=7

YanhuiDua commented 1 year ago

Hello, we need more error information. Also note that the model produced NaN at training step 20. Here are a few suggestions to try:

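  1. Run export FLAGS_call_stack_level=3 so that the C++ call stack is printed at the point of failure, and check which kernel the error occurs in. If you can identify the failing operator, try export CUSTOM_DEVICE_BLACK_LIST=op1 to fall that operator back to the CPU, and see whether the error still occurs.
  2. For the CANN log, use grep ERROR -C 40 to provide more error context.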
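
The two suggestions above can be sketched in Python, for anyone who prefers to script the rerun. The helper names are hypothetical, `op1` is the placeholder from the comment rather than a real operator name, and joining several operators with commas is an assumption not confirmed in this thread. `grep_context` is only a rough stand-in for `grep ERROR -C 40` and does not print separators between non-adjacent groups.

```python
import os
import re

# Training command quoted from the issue description.
TRAIN_CMD = [
    "python", "tools/train.py",
    "-c", "configs/ppyolo/ppyolo_r50d_dcn_roadsign.yml",
    "-o", "use_npu=True",
]


def build_debug_env(black_list_ops=(), base_env=None):
    """Return an environment with the debugging variables from suggestion 1 set."""
    env = dict(os.environ if base_env is None else base_env)
    # Print the C++ call stack at the point of failure.
    env["FLAGS_call_stack_level"] = "3"
    if black_list_ops:
        # Fall the listed operators back to the CPU.
        # Assumption: several operators are separated by commas.
        env["CUSTOM_DEVICE_BLACK_LIST"] = ",".join(black_list_ops)
    return env


def grep_context(lines, pattern="ERROR", context=40):
    """Return lines matching pattern plus `context` lines before and after each match."""
    regex = re.compile(pattern)
    keep = set()
    for index, line in enumerate(lines):
        if regex.search(line):
            start = max(0, index - context)
            stop = min(len(lines), index + context + 1)
            keep.update(range(start, stop))
    return [lines[index] for index in sorted(keep)]


# Demonstration on an empty base environment and a tiny in-memory log.
demo_env = build_debug_env(["op1"], base_env={})
demo_log = ["a", "b", "[ERROR] x", "c", "d"]
print(demo_env)
print(grep_context(demo_log, context=1))

# To actually rerun training with these variables (not executed here):
#   import subprocess
#   subprocess.run(TRAIN_CMD, env=build_debug_env(["op1"]))
```

Building the environment as a dictionary keeps the variables scoped to the training subprocess instead of the whole shell session.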

qili93 commented 5 months ago

Hello, has this issue been resolved? Thanks!

qili93 commented 5 months ago

Closing, as there have been no more comments for more than two weeks. Please reopen if not resolved, thanks!