--------------------------------------Divider: error log from the training window
[07/20 21:41:52] ppdet.engine INFO: Epoch: [1] [19/58] learning_rate: 0.000064 loss_xy: 1.350998 loss_wh: 5.382170 loss_iou: 5.030044 loss_iou_aware: 0.926509 loss_obj: 4009.031738 loss_cls: 4.006392 loss: 4025.727783 eta: 45 days, 12:14:19 batch_cost: 0.6057 data_cost: 0.0006 ips: 19.8111 images/s
[07/20 21:42:33] ppdet.engine INFO: Epoch: [1] [20/58] learning_rate: 0.000065 loss_xy: 3.250648 loss_wh: nan loss_iou: 7.106074 loss_iou_aware: nan loss_obj: nan loss_cls: nan loss: nan eta: 44 days, 22:28:24 batch_cost: 0.7804 data_cost: 0.0005 ips: 15.3774 images/s
Call aclrtSynchronizeStream(reinterpret_cast(stream)) failed : 507015 at file /workspace/PaddleCustomDevice/backends/npu/runtime/runtime.cc line 408
E10404: Output indexed [0] requires a 18446744073709551615 buffer, but 589856 (aligned) are allocated.
Solution: Check whether the data type, dimensions, and shape are correctly set. For details, see the aclGetTensorDescSize API description in AscendCL API Reference.
TraceBack (most recent call last):
[Exec][Op]Execute op failed. ge result = 145000[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
Output indexed [0] requires a 18446744073709551615 buffer, but 36896 (aligned) are allocated.
Output indexed [0] requires a 18446744073709551615 buffer, but 147488 (aligned) are allocated.
Output indexed [0] requires a 18446744073709551615 buffer, but 278816 (aligned) are allocated.
Output indexed [0] requires a 18446744073709551615 buffer, but 17472 (aligned) are allocated.
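One detail in the log above is worth noting: the "required" buffer size 18446744073709551615 is exactly 2^64 - 1, which is the bit pattern of -1 in a 64-bit integer. A plausible reading, which is an assumption and not something the log confirms, is that the output size computation for the operator returned an invalid or error value rather than a genuine request of that size. The short check below only verifies the arithmetic; the helper name `as_signed_64` is made up for illustration.

```python
# Value reported as the "required" buffer size in the E10404 messages.
required = 18446744073709551615

# Largest value representable in an unsigned 64-bit integer.
UINT64_MAX = 2**64 - 1


def as_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed 64-bit integer."""
    return value - 2**64 if value >= 2**63 else value


print(required == UINT64_MAX)   # True
print(as_signed_64(required))   # -1
```

The allocated sizes in the same messages (589856, 36896, 147488, 278816, 17472) are ordinary values, so only the required size looks suspicious.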
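The host is an Ubuntu machine with an Ascend 910 NPU and a Kunpeng 920 ARM CPU. The image is the one from the NPU documentation, registry.baidubce.com/device/paddle-npu:cann601-ubuntu18-aarch64-gcc82, and PaddleDetection is v2.6.0.
The model is ppyolo_r50d_dcn and the dataset is roadsign. The training command is: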
python tools/train.py -c configs/ppyolo/ppyolo_r50d_dcn_roadsign.yml -o use_npu=True
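Training fails partway through: at epoch 1, step 20/58, several loss terms become nan, and the run then aborts with the errors shown in the logs.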
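To pin down where training first goes wrong, the training log can be scanned for the first step whose loss terms are nan. The following is a minimal sketch that assumes the `ppdet.engine` line format seen in this log; the function name `first_nan_step` and the two regular expressions are illustrative and not part of PaddleDetection.

```python
import math
import re

# "Epoch: [1] [20/58]" -> epoch, step, steps per epoch
STEP_RE = re.compile(r"Epoch: \[(\d+)\] \[(\d+)/(\d+)\]")
# "loss_wh: nan" or "loss: 4025.727783" -> (name, value)
LOSS_RE = re.compile(r"(loss\w*): (nan|[-+]?\d+(?:\.\d+)?)")


def first_nan_step(lines):
    """Return (epoch, step, names_of_nan_losses) for the first log line
    that contains a nan loss term, or None if no such line exists."""
    for line in lines:
        step = STEP_RE.search(line)
        if step is None:
            continue  # not a training progress line
        nan_terms = [
            name
            for name, value in LOSS_RE.findall(line)
            if math.isnan(float(value))
        ]
        if nan_terms:
            return int(step.group(1)), int(step.group(2)), nan_terms
    return None


# Two lines abbreviated from the training log in this report.
sample = [
    "[07/20 21:41:52] ppdet.engine INFO: Epoch: [1] [19/58] loss_xy: 1.350998 "
    "loss_wh: 5.382170 loss_obj: 4009.031738 loss: 4025.727783",
    "[07/20 21:42:33] ppdet.engine INFO: Epoch: [1] [20/58] loss_xy: 3.250648 "
    "loss_wh: nan loss_iou: 7.106074 loss_obj: nan loss: nan",
]
print(first_nan_step(sample))
```

On the full log this points at step 20 of epoch 1. The step before it already shows loss_obj at about 4009, so the loss may have been diverging before the NPU runtime error appeared.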
----------------------------------------Divider: CANN error log
[ERROR] ASCENDCL(1082179,python):2023-07-20-21:43:13.555.217 [stream.cpp:104]1082179 aclrtSynchronizeStream: [FINAL][FINAL]synchronize stream failed, runtime result = 507015
[ERROR] RUNTIME(1082179,python):2023-07-20-21:43:13.561.365 [device_msg_handler.cc:156]1082179 HandleMsgInHostBuf:[FINAL][FINAL] DEVICE[0] PID[1082179]:
EXCEPTION STREAM:
Exception info:TGID=1421543, model id=65535, stream id=2, stream phase=3
Message info[0]:stream sq's task full(1024), head=714 tail=713 pid=1421543
Other info[0]:time=2023-07-20-11:11:46.148.823, function=put_sq_cmd_to_stream_sq, line=1710, error code=0x94
EXCEPTION STREAM:
Exception info:TGID=1421543, model id=65535, stream id=2, stream phase=3
Message info[0]:stream sq's task full(1024), head=714 tail=713 pid=1421543
Other info[0]:time=2023-07-2
[ERROR] RUNTIME(1082179,python):2023-07-20-21:43:13.561.374 [device_msg_handler.cc:156]1082179 HandleMsgInHostBuf:[FINAL][FINAL]0-11:11:46.148.857, function=put_sq_cmd_to_stream_sq, line=1710, error code=0x94
EXCEPTION STREAM:
Exception info:TGID=1421543, model id=65535, stream id=2, stream phase=3
Message info[0]:stream sq's task full(1024), head=714 tail=713 pid=1421543
Other info[0]:time=2023-07-20-11:11:46.148.867, function=put_sq_cmd_to_stream_sq, line=1710, error code=0x94
EXCEPTION STREAM:
Exception info:TGID=1421543, model id=65535, stream id=2, stream phase=3
Message info[0]:stream sq's task full(1024), head=7
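The repeated "stream sq's task full(1024), head=714 tail=713" messages look like a full ring buffer of 1024 task slots. The sketch below extracts those values from the CANN log and checks them against the common convention that a ring is full when advancing the tail would hit the head. That convention is an assumption about how the device driver manages the queue, and the names `QUEUE_RE`, `queue_states` and `is_full` are illustrative only.

```python
import re

# "task full(1024), head=714 tail=713" -> capacity, head, tail
QUEUE_RE = re.compile(r"task full\((\d+)\), head=(\d+) tail=(\d+)")


def queue_states(log_text):
    """Yield (capacity, head, tail) for every complete 'task full' message.
    A message that is cut off before 'tail=' is skipped."""
    for match in QUEUE_RE.finditer(log_text):
        yield tuple(int(group) for group in match.groups())


def is_full(capacity, head, tail):
    # Assumed convention: one slot stays empty, so the ring is full
    # when the slot after the tail is the head.
    return (tail + 1) % capacity == head


log_text = (
    "Message info[0]:stream sq's task full(1024), head=714 tail=713 pid=1421543 "
    "Message info[0]:stream sq's task full(1024), head=7"
)
for capacity, head, tail in queue_states(log_text):
    print(capacity, head, tail, is_full(capacity, head, tail))
```

Under that assumption, head=714 and tail=713 describe a queue with no free slots, which would fit the stream being unable to accept new tasks while the earlier failing operators were still pending.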