PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0

RT-DETR training error #8324

Open hahapt opened 1 year ago

hahapt commented 1 year ago

Search before asking

Please ask your question

While training RT-DETR, errors like the one below occur frequently. The error usually appears after the model has already completed several epochs of training and at least one evaluation pass. The point at which it occurs is different on every run. I have lowered the learning rate, but the error still occurs.

    File "/home/Paddle/PaddleDetection-develop/ppdet/modeling/losses/detr_loss.py", line 290, in _get_prediction_loss
      boxes, logits, gt_bbox, gt_class, masks=masks, gt_mask=gt_mask)
    File "/home/data/anaconda3/envs/py37_paddle2.4_cu11_dev/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
      return self.forward(*inputs, **kwargs)
    File "/home/Paddle/PaddleDetection-develop/ppdet/modeling/transformers/matchers.py", line 180, in forward
      for i, c in enumerate(C)
    File "/home/Paddle/PaddleDetection-develop/ppdet/modeling/transformers/matchers.py", line 180, in <listcomp>
      for i, c in enumerate(C)
    File "/home/data/anaconda3/envs/py37_paddle2.4_cu11_dev/lib/python3.7/site-packages/scipy/optimize/_lsap.py", line 100, in linear_sum_assignment
      return _lsap_module.calculate_assignment(cost_matrix)
    ValueError: matrix contains invalid numeric entries
    I0606 18:50:15.434149 270 tcp_store.cc:257] receive shutdown event and so quit from MasterDaemon run loop

lyuwenyu commented 1 year ago

Are you using your own dataset or the COCO dataset? Does the error reproduce reliably, or is it random?

hahapt commented 1 year ago

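> Are you using your own dataset or the COCO dataset? Does the error reproduce reliably, or is it random?

My own dataset. I have confirmed that training a PP-YOLOE+ model on the same dataset works without problems, but this error shows up every time I train DETR.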

lyuwenyu commented 1 year ago

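> > Are you using your own dataset or the COCO dataset? Does the error reproduce reliably, or is it random?
>
> My own dataset. I have confirmed that training a PP-YOLOE+ model on the same dataset works without problems, but this error shows up every time I train DETR.

    File "/home/data/anaconda3/envs/py37_paddle2.4_cu11_dev/lib/python3.7/site-packages/scipy/optimize/_lsap.py", line 100, in linear_sum_assignment
      return _lsap_module.calculate_assignment(cost_matrix)
    ValueError: matrix contains invalid numeric entries

This says that the matrix contains non-numeric (NaN) elements. I suggest printing the failing case together with the matrix to see which inputs produce the bad values in the cost matrix. Alternatively, check whether the matrix contains np.inf and, if so, replace it with a very large number to mask that case.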
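
The second suggestion above (masking bad entries with a large number) could look roughly like the sketch below. The helper name `sanitize_cost` and the constant `big` are made up for illustration and are not part of PaddleDetection; the sketch only uses NumPy calls and assumes the cost block has already been converted to a NumPy array.

```python
import numpy as np


def sanitize_cost(cost, big=1e6):
    """Return a float64 copy of `cost` with NaN/+inf/-inf replaced by `big`.

    `big` is an arbitrary illustrative value: it only needs to be much
    larger than any legitimate cost so the matcher avoids those cells.
    """
    cost = np.asarray(cost, dtype=np.float64)
    return np.where(np.isfinite(cost), cost, big)


# Small made-up cost matrix with one NaN and one inf entry.
cost = np.array([[1.5, np.nan],
                 [np.inf, 2.0]])
clean = sanitize_cost(cost)
print(clean)
```

The cleaned array can then be passed to `linear_sum_assignment` in place of the raw one. Note that this only hides the symptom: the NaN or inf values are still being produced upstream, so it is worth logging when the mask actually fires.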

vscv commented 1 year ago

Update: after I disabled AMP, the ValueError: matrix contains invalid numeric entries no longer appeared.
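
One plausible explanation, which is not confirmed anywhere in this thread, is that half-precision arithmetic under AMP overflows somewhere in the loss or matching path. The sketch below is NumPy only and does not touch Paddle; it just shows how a value that is harmless in float32 becomes inf in float16, and how a follow-up operation then turns it into NaN. The helper name `to_fp16` is hypothetical.

```python
import numpy as np

# Largest finite float16 value.
FP16_MAX = float(np.finfo(np.float16).max)  # 65504.0


def to_fp16(values):
    """Cast float32 values to float16, silencing the overflow warning."""
    with np.errstate(over="ignore", invalid="ignore"):
        return np.asarray(values, dtype=np.float32).astype(np.float16)


x = to_fp16([1000.0, 70000.0])   # 70000 exceeds FP16_MAX and becomes inf
with np.errstate(invalid="ignore"):
    diff = x - x                 # inf - inf is NaN

print(x, diff)
```

If something like this is happening, a single NaN in the cost matrix is enough for `linear_sum_assignment` to reject it, which would fit the observation that the error disappears once AMP is turned off.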


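I tried printing the failing matrix, and it does contain one negative value. I do not understand what the matrix represents at this point, though, so I cannot confirm whether that negative value is what triggers the error. Looking further up at gt_bbox, I noticed that x2y2 is smaller than x1y1, which looks wrong. However, the dataset was explicitly checked against this rule when it was built, precisely to avoid this situation, and the error only shows up after epoch 53, so I still cannot find the cause.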
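
As far as I know, a finite negative cost is not a problem for SciPy's `linear_sum_assignment`; the "invalid numeric entries" error is raised for NaN (and -inf) entries. A quick way to check this on your own SciPy version is the sketch below. The helper `try_assignment` and the small matrices are made up for the test, with values loosely modeled on the printout.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def try_assignment(cost):
    """Run the solver; return ((rows, cols), None) or (None, error message)."""
    try:
        rows, cols = linear_sum_assignment(cost)
    except ValueError as exc:
        return None, str(exc)
    return (rows.tolist(), cols.tolist()), None


# 3 queries x 2 ground-truth boxes, one negative but finite cost.
negative = np.array([[-0.33, 10.5],
                     [1.76, 11.0],
                     [9.70, 5.4]])

# Same matrix with a single NaN injected.
with_nan = negative.copy()
with_nan[1, 0] = np.nan

ok_result, ok_err = try_assignment(negative)
bad_result, bad_err = try_assignment(with_nan)
print(ok_result, ok_err)
print(bad_result, bad_err)
```

If the first call succeeds and the second fails, the negative entry in the dump is probably a red herring, and the entry to look for is a NaN, possibly in a different batch item than the one that was printed.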

I added these debug prints to matchers.py:

    if hasattr(paddle.Tensor, "contiguous"):
        indices = [
            linear_sum_assignment(c.split(sizes, -1)[i].contiguous().numpy())
            for i, c in enumerate(C)
        ]
    else:
        # Debug output: dump the ground truth, then every per-image cost block
        print(f'***   QQ   ***     gt_bbox: {gt_bbox} \n gt_class: {gt_class}')
        for i, c in enumerate(C):
            print(f'***   OO   ***     \n\t c.split(sizes, -1)[i].numpy():i {i} {c.split(sizes, -1)[i].numpy()}')
        indices = [
            linear_sum_assignment(c.split(sizes, -1)[i].numpy())
            for i, c in enumerate(C)
        ]
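
Dumping every block makes the log very long and the bad entry easy to miss. A narrower variant, sketched here with NumPy only, reports just the blocks that contain NaN or inf along with the offending positions. The function name `find_invalid_blocks` is hypothetical, and the sketch assumes each block has already been converted with `.numpy()` as in the snippet above.

```python
import numpy as np


def find_invalid_blocks(blocks):
    """Return [(block_index, [(row, col), ...]), ...] for non-finite entries."""
    report = []
    for i, block in enumerate(blocks):
        arr = np.asarray(block, dtype=np.float64)
        bad = np.argwhere(~np.isfinite(arr))  # positions of NaN / +inf / -inf
        if bad.size:
            positions = [tuple(int(v) for v in idx) for idx in bad]
            report.append((i, positions))
    return report


# Made-up example: block 0 is fine (negative but finite), block 1 has a NaN.
blocks = [
    np.array([[1.7, 11.0], [-0.3, 10.5]]),
    np.array([[2.0, np.nan], [3.0, 4.0]]),
]
report = find_invalid_blocks(blocks)
print(report)
```

Calling something like this just before the matcher, and printing gt_bbox and gt_class only when the report is non-empty, should narrow down which image and which query/target pair goes bad.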

The ValueError: matrix contains invalid numeric entries error occurred at epoch 53:

***   QQ   ***
gt_bbox: [Tensor(shape=[2, 4], dtype=float32, place=Place(gpu:0), stop_gradient=True,
       [[0.48447210, 0.41217566, 0.42945877, 0.56886226],
        [0.49157053, 0.74950099, 0.09228036, 0.16966069]])] <------------- x2y2 is smaller than x1y1, which looks wrong
 gt_class: [Tensor(shape=[2, 1], dtype=int32, place=Place(gpu:0), stop_gradient=True,
       [[7],
        [6]])]

***   OO   ***
c.split(sizes, -1)[i].numpy():i 0 [[ 1.7599837 11.026568 ]
 [ 2.049833  11.181454 ]
 [ 2.4996424 11.26818  ]
 [ 6.355298  11.365687 ]
 [ 1.692439  11.049711 ]
 [ 2.0194209 11.251497 ]
 [ 2.2688642 11.072475 ]
 [ 2.333468  11.165276 ]
 [ 3.0569003 11.168724 ]
 [ 2.5324612 11.288134 ]
 [ 2.8724706 11.145861 ]
 [ 2.7869587 11.19514  ]
 [ 2.5074916 11.296212 ]
 [ 1.949705  11.093303 ]
 [ 2.307794  11.305828 ]
 [ 2.5112672 11.28352  ]
 [ 2.5825143 11.168924 ]
 [ 2.4173157 11.317762 ]
 [ 2.829239  11.459548 ]
 [ 2.463025  11.292694 ]
 [ 5.9910507 12.661864 ]
 [-0.3339162 10.511591 ] <----------- negative value
 [ 2.15493   11.168022 ]
 [ 3.366632  11.2042055]
 [ 4.110814  11.2512045]
 [ 2.3464606 11.198448 ]
 [ 5.542488  12.517757 ]
 [ 2.8630867 11.139787 ]
 [ 2.7284026 11.309966 ]
 [ 2.6141758 11.299907 ]
 [ 2.4727893 11.152569 ]
 [ 4.20058   11.511411 ]
 [ 1.8966274 10.99998  ]
 [ 9.708992   5.4140797]
 [ 3.5940661 11.568753 ]
 [ 6.281913  12.796988 ]
 [ 2.9845026 11.159372 ]
 [ 3.3426726 11.074744 ]
 [ 3.6937804 11.236153 ]
 [ 3.355887  11.459285 ]
 [ 4.2713823 11.552409 ]
 [ 3.9641528 11.246575 ]
 [ 2.9034853 11.220213 ]
 [ 3.846718  11.395056 ]
 [ 2.8634353 11.211897 ]
 [ 4.0872545 11.33564  ]
 [ 3.473411  11.140634 ]
 [ 7.6876974 13.776929 ]
 [ 3.878459  11.607999 ]
 [ 3.4515805 11.313853 ]
 [ 6.4159074 13.104409 ]
 [ 3.1482599 11.084245 ]
 [ 2.1007001 11.254607 ]
 [ 2.799687  11.298661 ]
 [ 2.5385547 11.490654 ]
 [ 4.802795  12.033941 ]
 [ 7.232189  14.003556 ]
 [ 3.4860215 11.33117  ]
 [ 2.6439528 11.395288 ]
 [ 1.4982738 10.998075 ]
 [ 3.2426548 11.063569 ]
 [ 5.338269  12.270138 ]
 [ 2.6799831 11.384036 ]
 [ 2.9196317 11.101437 ]
 [ 3.73248   11.336153 ]
 [ 3.9087949 11.058043 ]
 [ 3.719605  11.254088 ]
 [ 5.65514   12.694899 ]
 [ 4.037599  11.160446 ]
 [ 2.53822   11.23583  ]
 [ 2.5598032 11.218611 ]
 [10.2023325  5.367098 ]
 [ 3.5717287 11.516316 ]
 [ 3.921412  11.755077 ]
 [ 4.070634  11.204057 ]
 [ 6.3119807 13.291925 ]
 [ 3.1996584 11.241343 ]
 [ 3.260804  11.244022 ]
 [ 2.413741  11.34825  ]
 [ 3.967761  11.155499 ]
 [ 3.8211899 11.310341 ]
 [ 8.287544  14.342397 ]
 [ 3.327565  11.329406 ]
 [ 3.700471  11.146326 ]
 [ 3.6463447 11.352091 ]
 [ 3.2741437 11.269228 ]
 [ 3.9584026 11.106131 ]
 [ 6.7538757 11.3064995]
 [ 2.7563558 11.177328 ]
 [ 4.099855  11.284094 ]
 [ 2.454149  11.213313 ]
 [ 2.9564834 11.105095 ]
 [ 3.314466  11.113383 ]
 [ 1.9595315 11.195273 ]
 [ 3.9491303 11.890509 ]
 [ 4.0707936 11.39089  ]
 [ 2.9190874 10.892869 ]
 [ 3.3545637 11.168913 ]
 [ 4.4103    11.181365 ]
 [ 3.8460164 11.089968 ]
 [ 4.1469483 11.413539 ]
 [ 3.4575093 11.063723 ]
 [ 6.0131817 11.509391 ]
 [ 4.1197085 11.163555 ]
 [ 2.7775195 11.190386 ]
 [ 4.429281  11.460817 ]
 [ 8.390841  14.874726 ]
 [ 3.235461  11.2072315]
 [ 4.235032  11.44532  ]
 [ 2.995945  11.226589 ]
 [ 3.3277357 11.243403 ]
 [ 3.3108225 11.218366 ]
 [ 3.3176475 11.35452  ]
 [ 2.109938  11.236705 ]
 [ 3.002455  11.343922 ]
 [ 3.436967  11.26872  ]
 [ 3.2733667 11.137366 ]
 [ 2.9385054 11.335537 ]
 [ 4.162654  11.348904 ]
 [ 4.5524693 11.283983 ]
 [ 3.338965  11.596802 ]
 [ 3.3358197 10.902061 ]
 [ 3.2375433 11.431944 ]
 [ 3.8851342 11.228032 ]
 [ 3.832893  11.102856 ]
 [ 3.0639284 11.184585 ]
 [ 5.4026403 11.872373 ]
 [ 3.436244  11.339251 ]
 [ 4.1002192 10.538921 ]
 [ 4.8037586 11.938636 ]
 [ 2.4958496 11.115103 ]
 [ 4.387678  11.39821  ]
 [ 6.6449327 11.151266 ]
 [ 3.29677   11.21484  ]
 [ 4.122499  11.154516 ]
 [ 1.7930413 11.0290985]
 [ 5.5800824 10.022876 ]
 [ 9.71553    5.4919987]
 [ 3.5139241 10.894391 ]
 [ 7.632844  13.533806 ]
 [ 6.8210125 12.725449 ]
 [ 7.511883  14.133621 ]
 [ 8.895927  13.728383 ]
 [ 5.1962295 11.429665 ]
 [ 3.903194  11.177962 ]
 [ 3.860395  11.177652 ]
 [ 3.9986382 11.306002 ]
 [ 4.363597  11.834767 ]
 [ 4.7169952 11.365067 ]
 [ 2.5398011 11.24234  ]
 [ 4.3293076 11.04855  ]
 [ 6.6717644 12.498488 ]
 [ 4.467716  11.522428 ]
 [ 4.5951586 11.6574955]
 [ 9.624587   5.764948 ]
 [ 3.0008311 11.074772 ]
 [ 2.8261948 11.079855 ]
 [ 4.657835  11.048415 ]
 [ 9.982653   5.421173 ]
 [ 3.5921469 10.933281 ]
 [ 5.834969  11.811183 ]
 [ 3.6022296 11.387233 ]
 [ 3.9883478 11.266834 ]
 [ 4.1219406 11.881109 ]
 [ 8.280401   7.0490394]
 [ 5.1920485 12.072052 ]
 [ 9.924583  14.969906 ]
LAUNCH INFO 2023-10-07 09:06:01,443 Pod failed
[2023-10-07 09:06:01,443] [    INFO] controller.py:115 - Pod failed
LAUNCH ERROR 2023-10-07 09:06:01,443 Container failed !!!
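
On the gt_bbox values printed above: they only look invalid if the four numbers are read as (x1, y1, x2, y2). If the reader pipeline has already normalized the boxes and converted them to (cx, cy, w, h), which I believe is the usual setup for DETR-style models but which is an assumption here and worth checking against your reader config, the same numbers describe ordinary boxes. A minimal pure-Python check, with hypothetical helper names:

```python
def cxcywh_to_xyxy(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def is_valid_xyxy(box):
    """True if the corner box has positive width and height."""
    x1, y1, x2, y2 = box
    return x2 > x1 and y2 > y1


# The two ground-truth rows copied from the log above.
gt = [
    [0.48447210, 0.41217566, 0.42945877, 0.56886226],
    [0.49157053, 0.74950099, 0.09228036, 0.16966069],
]

converted = [cxcywh_to_xyxy(b) for b in gt]
for raw, xyxy in zip(gt, converted):
    print(raw, "->", xyxy, "valid:", is_valid_xyxy(xyxy))
```

Under that assumption both boxes convert to corner boxes that lie inside the unit square, so the ground truth printed here would not be the source of the invalid entries.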