Open hahapt opened 1 year ago
Are you using your own dataset or the COCO dataset? Does the error reproduce reliably, or does it occur at random?
My own dataset. I have confirmed that training a ppyoloe plus model on this dataset works without problems, while training DETR triggers the problem every time.
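  File "/home/data/anaconda3/envs/py37_paddle2.4_cu11_dev/lib/python3.7/site-packages/scipy/optimize/_lsap.py", line 100, in linear_sum_assignment
    return _lsap_module.calculate_assignment(cost_matrix)
ValueError: matrix contains invalid numeric entries
This says the matrix contains non-numeric (NaN) elements. I suggest printing the failing case together with the matrix, to see which inputs produce the bad values in the cost matrix. Alternatively, check whether the matrix contains np.inf and, if so, assign those entries a very large number to mask the problem.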
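The suggestion above can be sketched roughly as follows. This is a minimal sketch and not PaddleDetection's code: the helper name `safe_assignment` and the replacement value `1e6` are hypothetical, and it assumes the cost matrix has already been converted to a NumPy array.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def safe_assignment(cost, large=1e6):
    """Hypothetical helper: mask non-finite costs, then run the assignment.

    Returns (row_ind, col_ind, bad_positions), where bad_positions lists the
    (row, col) indices that held NaN or +/-inf in the original matrix.
    """
    cost = np.asarray(cost, dtype=np.float64)
    bad_mask = ~np.isfinite(cost)
    bad_positions = [tuple(int(v) for v in idx) for idx in np.argwhere(bad_mask)]
    if bad_positions:
        # Report the offending entries so the inputs behind them can be traced.
        print(f"non-finite cost entries at {bad_positions}")
    # Replace every non-finite entry with a large finite cost (assumed value).
    cleaned = np.where(bad_mask, large, cost)
    row_ind, col_ind = linear_sum_assignment(cleaned)
    return row_ind, col_ind, bad_positions


cost = np.array([[1.0, np.nan], [np.inf, 2.0]])
rows, cols, bad = safe_assignment(cost)
```

Masking only hides the symptom; the reported positions are what point back to the prediction or ground-truth box that produced the bad value.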
Update: after I disabled AMP, the ValueError: matrix contains invalid numeric entries no longer occurred.
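Why disabling AMP helps is not established in this thread. One plausible mechanism, stated here purely as an assumption, is that half precision has a much smaller range than float32: values above 65504 overflow to inf, and later arithmetic on inf can yield NaN. The NumPy sketch below only illustrates that numeric behaviour; it does not use Paddle or reproduce the actual training code.

```python
import numpy as np

# Largest finite value representable in float16.
fp16_max = float(np.finfo(np.float16).max)

with np.errstate(over="ignore", invalid="ignore"):
    values32 = np.array([70000.0, 1.0], dtype=np.float32)
    # Casting to half precision overflows the first entry to inf.
    values16 = values32.astype(np.float16)
    # inf - inf is NaN, so a non-finite value appears after one more operation.
    diff = values16 - values16
```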
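I printed the failing matrix and it does contain one negative value, but I do not understand what this matrix represents, so I cannot confirm whether the negative value is what triggers the error. Looking further up at gt_bbox, I found that x2y2 is smaller than x1y1, which looks wrong. However, this rule was explicitly checked when the dataset was built in order to rule this case out, and the error only appeared after training had reached epoch 53, so I still cannot locate the problem.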
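Whether the printed gt_bbox is really malformed depends on its format. DETR-style pipelines commonly feed the loss normalized (cx, cy, w, h) boxes; if that is the case here, which this thread does not confirm, the third and fourth values are a width and a height rather than x2 and y2. A hypothetical check under that assumption, using the two boxes printed in this issue (`cxcywh_to_xyxy` is an illustrative helper, not a PaddleDetection function):

```python
import numpy as np


def cxcywh_to_xyxy(boxes):
    """Convert (cx, cy, w, h) boxes to (x1, y1, x2, y2). Assumed input format."""
    boxes = np.asarray(boxes, dtype=np.float64)
    cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)


gt_bbox = np.array([
    [0.48447210, 0.41217566, 0.42945877, 0.56886226],
    [0.49157053, 0.74950099, 0.09228036, 0.16966069],
])
xyxy = cxcywh_to_xyxy(gt_bbox)
# A box is well formed when x2 > x1 and y2 > y1.
valid = (xyxy[:, 2] > xyxy[:, 0]) & (xyxy[:, 3] > xyxy[:, 1])
# Normalized coordinates should stay inside [0, 1].
in_range = bool(np.all((xyxy >= 0.0) & (xyxy <= 1.0)))
```

Read this way, both boxes are well formed and lie inside the image, so the ground truth may not be the source of the error.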
Added prints in matchers.py:
if hasattr(paddle.Tensor, "contiguous"):
    indices = [
        linear_sum_assignment(c.split(sizes, -1)[i].contiguous().numpy())
        for i, c in enumerate(C)
    ]
else:
    # Debug output: dump the ground truth and each per-image cost matrix
    print(f'*** QQ *** gt_bbox: {gt_bbox} \n gt_class: {gt_class}')
    for i, c in enumerate(C):
        print(f'*** OO *** \n\t c.split(sizes, -1)[i].numpy():i {i} {c.split(sizes, -1)[i].numpy()}')
    indices = [
        linear_sum_assignment(c.split(sizes, -1)[i].numpy())
        for i, c in enumerate(C)
    ]
The ValueError: matrix contains invalid numeric entries error occurred at epoch 53:
*** QQ ***
gt_bbox: [Tensor(shape=[2, 4], dtype=float32, place=Place(gpu:0), stop_gradient=True,
[[0.48447210, 0.41217566, 0.42945877, 0.56886226],
[0.49157053, 0.74950099, 0.09228036, 0.16966069]])] <------------- x2y2 is smaller than x1y1, which looks wrong
gt_class: [Tensor(shape=[2, 1], dtype=int32, place=Place(gpu:0), stop_gradient=True,
[[7],
[6]])]
*** OO ***
c.split(sizes, -1)[i].numpy():i 0 [[ 1.7599837 11.026568 ]
[ 2.049833 11.181454 ]
[ 2.4996424 11.26818 ]
[ 6.355298 11.365687 ]
[ 1.692439 11.049711 ]
[ 2.0194209 11.251497 ]
[ 2.2688642 11.072475 ]
[ 2.333468 11.165276 ]
[ 3.0569003 11.168724 ]
[ 2.5324612 11.288134 ]
[ 2.8724706 11.145861 ]
[ 2.7869587 11.19514 ]
[ 2.5074916 11.296212 ]
[ 1.949705 11.093303 ]
[ 2.307794 11.305828 ]
[ 2.5112672 11.28352 ]
[ 2.5825143 11.168924 ]
[ 2.4173157 11.317762 ]
[ 2.829239 11.459548 ]
[ 2.463025 11.292694 ]
[ 5.9910507 12.661864 ]
[-0.3339162 10.511591 ] <----------- negative value
[ 2.15493 11.168022 ]
[ 3.366632 11.2042055]
[ 4.110814 11.2512045]
[ 2.3464606 11.198448 ]
[ 5.542488 12.517757 ]
[ 2.8630867 11.139787 ]
[ 2.7284026 11.309966 ]
[ 2.6141758 11.299907 ]
[ 2.4727893 11.152569 ]
[ 4.20058 11.511411 ]
[ 1.8966274 10.99998 ]
[ 9.708992 5.4140797]
[ 3.5940661 11.568753 ]
[ 6.281913 12.796988 ]
[ 2.9845026 11.159372 ]
[ 3.3426726 11.074744 ]
[ 3.6937804 11.236153 ]
[ 3.355887 11.459285 ]
[ 4.2713823 11.552409 ]
[ 3.9641528 11.246575 ]
[ 2.9034853 11.220213 ]
[ 3.846718 11.395056 ]
[ 2.8634353 11.211897 ]
[ 4.0872545 11.33564 ]
[ 3.473411 11.140634 ]
[ 7.6876974 13.776929 ]
[ 3.878459 11.607999 ]
[ 3.4515805 11.313853 ]
[ 6.4159074 13.104409 ]
[ 3.1482599 11.084245 ]
[ 2.1007001 11.254607 ]
[ 2.799687 11.298661 ]
[ 2.5385547 11.490654 ]
[ 4.802795 12.033941 ]
[ 7.232189 14.003556 ]
[ 3.4860215 11.33117 ]
[ 2.6439528 11.395288 ]
[ 1.4982738 10.998075 ]
[ 3.2426548 11.063569 ]
[ 5.338269 12.270138 ]
[ 2.6799831 11.384036 ]
[ 2.9196317 11.101437 ]
[ 3.73248 11.336153 ]
[ 3.9087949 11.058043 ]
[ 3.719605 11.254088 ]
[ 5.65514 12.694899 ]
[ 4.037599 11.160446 ]
[ 2.53822 11.23583 ]
[ 2.5598032 11.218611 ]
[10.2023325 5.367098 ]
[ 3.5717287 11.516316 ]
[ 3.921412 11.755077 ]
[ 4.070634 11.204057 ]
[ 6.3119807 13.291925 ]
[ 3.1996584 11.241343 ]
[ 3.260804 11.244022 ]
[ 2.413741 11.34825 ]
[ 3.967761 11.155499 ]
[ 3.8211899 11.310341 ]
[ 8.287544 14.342397 ]
[ 3.327565 11.329406 ]
[ 3.700471 11.146326 ]
[ 3.6463447 11.352091 ]
[ 3.2741437 11.269228 ]
[ 3.9584026 11.106131 ]
[ 6.7538757 11.3064995]
[ 2.7563558 11.177328 ]
[ 4.099855 11.284094 ]
[ 2.454149 11.213313 ]
[ 2.9564834 11.105095 ]
[ 3.314466 11.113383 ]
[ 1.9595315 11.195273 ]
[ 3.9491303 11.890509 ]
[ 4.0707936 11.39089 ]
[ 2.9190874 10.892869 ]
[ 3.3545637 11.168913 ]
[ 4.4103 11.181365 ]
[ 3.8460164 11.089968 ]
[ 4.1469483 11.413539 ]
[ 3.4575093 11.063723 ]
[ 6.0131817 11.509391 ]
[ 4.1197085 11.163555 ]
[ 2.7775195 11.190386 ]
[ 4.429281 11.460817 ]
[ 8.390841 14.874726 ]
[ 3.235461 11.2072315]
[ 4.235032 11.44532 ]
[ 2.995945 11.226589 ]
[ 3.3277357 11.243403 ]
[ 3.3108225 11.218366 ]
[ 3.3176475 11.35452 ]
[ 2.109938 11.236705 ]
[ 3.002455 11.343922 ]
[ 3.436967 11.26872 ]
[ 3.2733667 11.137366 ]
[ 2.9385054 11.335537 ]
[ 4.162654 11.348904 ]
[ 4.5524693 11.283983 ]
[ 3.338965 11.596802 ]
[ 3.3358197 10.902061 ]
[ 3.2375433 11.431944 ]
[ 3.8851342 11.228032 ]
[ 3.832893 11.102856 ]
[ 3.0639284 11.184585 ]
[ 5.4026403 11.872373 ]
[ 3.436244 11.339251 ]
[ 4.1002192 10.538921 ]
[ 4.8037586 11.938636 ]
[ 2.4958496 11.115103 ]
[ 4.387678 11.39821 ]
[ 6.6449327 11.151266 ]
[ 3.29677 11.21484 ]
[ 4.122499 11.154516 ]
[ 1.7930413 11.0290985]
[ 5.5800824 10.022876 ]
[ 9.71553 5.4919987]
[ 3.5139241 10.894391 ]
[ 7.632844 13.533806 ]
[ 6.8210125 12.725449 ]
[ 7.511883 14.133621 ]
[ 8.895927 13.728383 ]
[ 5.1962295 11.429665 ]
[ 3.903194 11.177962 ]
[ 3.860395 11.177652 ]
[ 3.9986382 11.306002 ]
[ 4.363597 11.834767 ]
[ 4.7169952 11.365067 ]
[ 2.5398011 11.24234 ]
[ 4.3293076 11.04855 ]
[ 6.6717644 12.498488 ]
[ 4.467716 11.522428 ]
[ 4.5951586 11.6574955]
[ 9.624587 5.764948 ]
[ 3.0008311 11.074772 ]
[ 2.8261948 11.079855 ]
[ 4.657835 11.048415 ]
[ 9.982653 5.421173 ]
[ 3.5921469 10.933281 ]
[ 5.834969 11.811183 ]
[ 3.6022296 11.387233 ]
[ 3.9883478 11.266834 ]
[ 4.1219406 11.881109 ]
[ 8.280401 7.0490394]
[ 5.1920485 12.072052 ]
[ 9.924583 14.969906 ]
LAUNCH INFO 2023-10-07 09:06:01,443 Pod failed
[2023-10-07 09:06:01,443] [ INFO] controller.py:115 - Pod failed
LAUNCH ERROR 2023-10-07 09:06:01,443 Container failed !!!
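A negative entry like the one marked in the dump above is not in itself invalid for `scipy.optimize.linear_sum_assignment`, which accepts negative costs; NaN entries are what it rejects. A small check using two rows taken from the printed matrix, with a NaN injected by hand for comparison (the injected value is illustrative and not from the actual run):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Two rows copied from the dump, including the negative entry.
negative = np.array([[-0.3339162, 10.511591], [2.15493, 11.168022]])
rows, cols = linear_sum_assignment(negative)  # succeeds despite the negative cost

# Same matrix with one NaN injected by hand.
with_nan = negative.copy()
with_nan[1, 0] = np.nan
try:
    linear_sum_assignment(with_nan)
    nan_error = None
except ValueError as exc:
    nan_error = str(exc)
```

Since the printed matrix contains only finite numbers, the entry that actually triggered the error is probably in a matrix that was not captured in this dump.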
Please ask your question
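While training RT-DETR, errors like the following occur frequently. The error usually appears after the model has already completed several epochs of training and after evaluation has finished. Each time training is restarted, the error occurs at a different point. I have already lowered the learning rate, but the error still occurs.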
  File "/home/Paddle/PaddleDetection-develop/ppdet/modeling/losses/detr_loss.py", line 290, in _get_prediction_loss
    boxes, logits, gt_bbox, gt_class, masks=masks, gt_mask=gt_mask)
  File "/home/data/anaconda3/envs/py37_paddle2.4_cu11_dev/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/Paddle/PaddleDetection-develop/ppdet/modeling/transformers/matchers.py", line 180, in forward
    for i, c in enumerate(C)
  File "/home/Paddle/PaddleDetection-develop/ppdet/modeling/transformers/matchers.py", line 180, in <listcomp>
    for i, c in enumerate(C)
  File "/home/data/anaconda3/envs/py37_paddle2.4_cu11_dev/lib/python3.7/site-packages/scipy/optimize/_lsap.py", line 100, in linear_sum_assignment
    return _lsap_module.calculate_assignment(cost_matrix)
ValueError: matrix contains invalid numeric entries
I0606 18:50:15.434149 270 tcp_store.cc:257] receive shutdown event and so quit from MasterDaemon run loop