argusswift / YOLOv4-pytorch

This is a pytorch repository of YOLOv4, attentive YOLOv4 and mobilenet YOLOv4 with PASCAL VOC and COCO
1.68k stars 329 forks

Loss becomes NaN during training? #33

Open kunyaoli opened 4 years ago

kunyaoli commented 4 years ago

[2020-10-10 00:46:31,414]-[train.py line:147]: === Epoch:[ 13/120],step:[260/377],img_size:[416],total_loss:nan|loss_ciou:nan|loss_conf:nan|loss_cls:nan|lr:0.0075
INFO:YOLOv4: === Epoch:[ 13/120],step:[260/377],img_size:[416],total_loss:nan|loss_ciou:nan|loss_conf:nan|loss_cls:nan|lr:0.0075
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.

Hello, is this a problem with my dataset?

argusswift commented 4 years ago

Please check that the input images and annotations are correct, or lower the initial learning rate.
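One way to act on the annotation check is to scan the boxes for values that could destabilise the loss. The following is a minimal sketch, not code from this repository: it assumes boxes are given as (xmin, ymin, xmax, ymax) in pixels, and the function name `find_bad_boxes` is hypothetical.

```python
import math


def find_bad_boxes(boxes, img_w, img_h):
    """Return (index, reason) pairs for boxes that look invalid.

    Assumes each box is (xmin, ymin, xmax, ymax) in pixel coordinates.
    """
    problems = []
    for i, (xmin, ymin, xmax, ymax) in enumerate(boxes):
        coords = (xmin, ymin, xmax, ymax)
        if not all(math.isfinite(c) for c in coords):
            problems.append((i, "non-finite coordinate"))
        elif xmax <= xmin or ymax <= ymin:
            problems.append((i, "zero or negative width/height"))
        elif xmin < 0 or ymin < 0 or xmax > img_w or ymax > img_h:
            problems.append((i, "outside image bounds"))
    return problems


# Illustrative boxes only; these are not taken from any dataset in this thread.
boxes = [
    (10, 20, 110, 220),        # valid
    (50, 50, 50, 90),          # zero width
    (-5, 0, 40, 40),           # starts left of the image
    (0, 0, float("nan"), 10),  # corrupted value
]
for index, reason in find_bad_boxes(boxes, img_w=416, img_h=416):
    print(index, reason)
```

Running a check like this over every annotation file before training narrows down whether the data or the learning rate is the more likely cause.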

lizhimll commented 4 years ago

The images and annotations look fine to me, and I also lowered the learning rate, but it still does not work.

lizhimll commented 4 years ago

You could check the pretrained weights.

sakurasakura1996 commented 4 years ago

Hello, my training results look like the following. What might be the problem?

D:\pycharm_project\YOLOv4-pytorch\eval\voc_eval.py:194: RuntimeWarning: invalid value encountered in true_divide
rec = tp / float(npos)
[2020-10-11 16:42:09,302]-[train.py line:168]:boerner --> mAP : nan
[2020-10-11 16:42:09,302]-[train.py line:168]:linnaeus --> mAP : nan
[2020-10-11 16:42:09,302]-[train.py line:168]:armandi --> mAP : 0.37548864193200204
[2020-10-11 16:42:09,303]-[train.py line:168]:coleoptera --> mAP : 0.5702350413321754
[2020-10-11 16:42:09,303]-[train.py line:168]:leconte --> mAP : nan
[2020-10-11 16:42:09,303]-[train.py line:168]:acuminatus --> mAP : 0.0
[2020-10-11 16:42:09,303]-[train.py line:171]:mAP : nan
[2020-10-11 16:42:09,303]-[train.py line:172]:inference time: 27.98 ms
WARNING:root:NaN or Inf found in input tensor.
[2020-10-11 16:42:10,940]-[train.py line:175]:save weights done
INFO:YOLOv4:save weights done
[2020-10-11 16:42:10,940]-[train.py line:176]: ===test mAP:nan
INFO:YOLOv4: ===test mAP:nan
[2020-10-11 16:42:10,940]-[train.py line:190]: ===cost time:367.4231s
INFO:YOLOv4: ===cost time:367.4231s

lizhimll commented 4 years ago

I get the same thing. In my case I had not loaded pretrained weights.

lizhimll commented 4 years ago

The dataset may have too few images: the classes that report nan have too few samples.
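The warning in the log points at `rec = tp / float(npos)`. If a class has no matched ground-truth boxes, `npos` is 0 and NumPy evaluates 0 / 0.0 to nan, which then propagates into the mean. The sketch below reproduces that effect in plain Python under that assumption. It is not the repository's `voc_eval.py`; `recall_curve` and `mean_ap` are hypothetical names, and the AP values are rounded from the log for illustration.

```python
import math


def recall_curve(tp_cumulative, npos):
    """Recall at each detection rank; undefined (nan) when npos is 0."""
    if npos == 0:
        # Mirrors what 0 / 0.0 produces under NumPy: an undefined recall.
        return [math.nan for _ in tp_cumulative]
    return [tp / float(npos) for tp in tp_cumulative]


def mean_ap(per_class_ap):
    """Return (plain mean, mean over classes whose AP is defined)."""
    values = list(per_class_ap.values())
    plain = sum(values) / len(values)  # a single nan makes this nan
    defined = [v for v in values if not math.isnan(v)]
    skipping = sum(defined) / len(defined) if defined else math.nan
    return plain, skipping


# No ground truth for this class, so every recall value is undefined.
print(recall_curve([0, 0, 0], npos=0))

ap = {"boerner": math.nan, "armandi": 0.375, "coleoptera": 0.570}
plain, skipping = mean_ap(ap)
print(plain, skipping)
```

Skipping undefined classes only hides the symptom. A class with `npos` equal to 0 usually means the evaluation found no ground truth under that class name, which is worth investigating first.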

sakurasakura1996 commented 4 years ago

I solved it. The class names in my XML annotation files did not match the class names in my config file. Thanks!
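A mismatch like this can be caught before training by comparing the names in each annotation file with the configured class list. This is a minimal sketch that assumes Pascal VOC style XML, where each `<object>` has a `<name>` child. `CONFIG_CLASSES` and `unknown_class_names` are hypothetical and stand in for whatever the project config actually defines.

```python
import xml.etree.ElementTree as ET

# Hypothetical class list; replace with the names from your own config file.
CONFIG_CLASSES = ["boerner", "linnaeus", "armandi"]


def unknown_class_names(xml_text, allowed):
    """Return the sorted object names in one VOC XML that are not in `allowed`."""
    root = ET.fromstring(xml_text)
    names = {
        obj.findtext("name", default="").strip()
        for obj in root.iter("object")
    }
    return sorted(names - set(allowed))


# Illustrative annotation: the second name differs from the config only in case.
sample = """<annotation>
  <object><name>boerner</name></object>
  <object><name>Linnaeus</name></object>
</annotation>"""

print(unknown_class_names(sample, CONFIG_CLASSES))  # ['Linnaeus']
```

The comparison is case-sensitive on purpose, since a difference in capitalisation or trailing whitespace is enough for a label to be dropped or mapped to the wrong class.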

lizhimll commented 4 years ago

This problem usually comes from the dataset, the format parsing, or the pretrained weights.

sakurasakura1996 commented 4 years ago

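Yes. I ran into another problem: my training data has 6 classes. The accuracy of the first 5 classes is fairly normal, but the accuracy of the last class is absurdly low, even though it does not have much less data than the others.

[2020-10-12 16:45:31,953]-[train.py line:168]:boerner --> mAP : 0.9750386173134524
[2020-10-12 16:45:31,953]-[train.py line:168]:linnaeus --> mAP : 0.8951772740175405
[2020-10-12 16:45:31,953]-[train.py line:168]:armandi --> mAP : 0.8176623972154735
[2020-10-12 16:45:31,953]-[train.py line:168]:coleoptera --> mAP : 0.5992928800293161
[2020-10-12 16:45:31,953]-[train.py line:168]:leconte --> mAP : 0.9597286581948914
[2020-10-12 16:45:31,953]-[train.py line:168]:acuminatus --> mAP : 0.5934933654323262
[2020-10-12 16:45:31,953]-[train.py line:171]:mAP : 0.8067321987005002

These are the results from the latest training run. The last class is somewhat better now, but in the first 10 epochs especially, its accuracy was much lower than that of the other classes. Do you know what might cause this? I will double-check later whether classes 4 and 6 have much less data than the others.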
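Checking whether some classes have far fewer samples can be done by counting object instances per class across the annotation files. A minimal sketch, again assuming Pascal VOC style XML. `count_objects` is a hypothetical helper, and the two inline documents are illustrative rather than real data.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def count_objects(xml_texts):
    """Count annotated object instances per class name across VOC XML strings."""
    counts = Counter()
    for text in xml_texts:
        root = ET.fromstring(text)
        counts.update(
            obj.findtext("name", default="").strip()
            for obj in root.iter("object")
        )
    return counts


# Illustrative annotations only.
docs = [
    "<annotation><object><name>boerner</name></object>"
    "<object><name>acuminatus</name></object></annotation>",
    "<annotation><object><name>boerner</name></object></annotation>",
]

counts = count_objects(docs)
for name, n in counts.most_common():
    print(name, n)
```

Counting instances rather than images matters here, because a class can appear in many images and still contribute few boxes.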

argusswift commented 4 years ago

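The samples of the last class may be harder to learn than those of the other classes, which would explain why its accuracy is lower than the other 5 classes in the first few epochs. It should improve considerably later on.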

cangwang commented 3 years ago

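Most likely the predicted wh values are too small, which makes the IoU come out as nan. You can check whether this happens during training by printing the values yourself.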
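To illustrate the mechanism: when both boxes collapse to zero width and height, the intersection and the union are both 0, and 0 / 0 in floating-point tensor arithmetic is nan. A common guard is a small epsilon in the denominator. The sketch below is plain Python, not the repository's loss code, and assumes boxes in (x1, y1, x2, y2) form. The `eps` value is an assumption.

```python
def iou(box_a, box_b, eps=1e-9):
    """IoU of two (x1, y1, x2, y2) boxes, guarded against a zero union."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Clamp at zero so disjoint boxes give no negative overlap.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    union = area_a + area_b - inter

    # Without eps, two degenerate boxes would give 0 / 0.
    return inter / (union + eps)


degenerate = (5.0, 5.0, 5.0, 5.0)  # zero width and height
print(iou(degenerate, degenerate))             # 0.0 instead of an undefined value
print(iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)))  # about 1/7
```

Printing the minimum predicted width and height per batch, as suggested above, shows whether the predictions actually reach this degenerate case.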

lidanyang916 commented 3 years ago

Hello, did you solve this problem? Did you load pretrained weights when training? I loaded the CSPDarknet53 pretrained weights and ran into the same loss:nan situation.

jcluo1994 commented 3 years ago

May I ask which command you used for training? Why does my training output look different from yours?

021-04-04 11:42:50,025]-[train.py line:231]: === Epoch:[ 0/1],step:[800/3999],img_size:[320],total_loss:139.1539|loss_ciou:12.8198|loss_conf:93.6651|loss_cls:32.6689|lr:0.0000
[2021-04-04 11:42:57,675]-[train.py line:231]: === Epoch:[ 0/1],step:[810/3999],img_size:[512],total_loss:137.8566|loss_ciou:12.7569|loss_conf:92.6608|loss_cls:32.4390|lr:0.0000
[2021-04-04 11:43:05,241]-[train.py line:231]: === Epoch:[ 0/1],step:[820/3999],img_size:[320],total_loss:136.8155|loss_ciou:12.7250|loss_conf:91.7698|loss_cls:32.3207|lr:0.0000
[2021-04-04 11:43:12,338]-[train.py line:231]: === Epoch:[ 0/1],step:[830/3999],img_size:[608],total_loss:135.6349|loss_ciou:12.6716|loss_conf:90.8390|loss_cls:32.1244|lr:0.0000
[2021-04-04 11:43:23,625]-[train.py line:231]: === Epoch:[ 0/1],step:[840/3999],img_size:[512],total_loss:134.6324|loss_ciou:12.6458|loss_conf:89.9963|loss_cls:31.9903|lr:0.0000
[2021-04-04 11:43:31,375]-[train.py line:231]: === Epoch:[ 0/1],step:[850/3999],img_size:[576],total_loss:133.7692|loss_ciou:12.6432|loss_conf:89.2041|loss_cls:31.9220|lr:0.0000
[2021-04-04 11:43:39,431]-[train.py line:231]: === Epoch:[ 0/1],step:[860/3999],img_size:[480],total_loss:133.1054|loss_ciou:12.6745|loss_conf:88.4957|loss_cls:31.9352|lr:0.0000
[2021-04-04 11:43:48,386]-[train.py line:231]: === Epoch:[ 0/1],step:[870/3999],img_size:[416],total_loss:132.0829|loss_ciou:12.6527|loss_conf:87.6465|loss_cls:31.7837|lr:0.0000
[2021-04-04 11:43:55,337]-[train.py line:231]: === Epoch:[ 0/1],step:[880/3999],img_size:[512],total_loss:131.2412|loss_ciou:12.6635|loss_conf:86.8609|loss_cls:31.7168|lr:0.0000
[2021-04-04 11:44:03,249]-[train.py line:231]: === Epoch:[ 0/1],step:[890/3999],img_size:[544],total_loss:130.3938|loss_ciou:12.6486|loss_conf:86.1186|loss_cls:31.6266|lr:0.0000
[2021-04-04 11:44:12,242]-[train.py line:231]: === Epoch:[ 0/1],step:[900/3999],img_size:[416],total_loss:129.5677|loss_ciou:12.6361|loss_conf:85.3967|loss_cls:31.5350|lr:0.0000
[2021-04-04 11:44:19,359]-[train.py line:231]: === Epoch:[ 0/1],step:[910/3999],img_size:[416],total_loss:128.6858|loss_ciou:12.6112|loss_conf:8

This is my training output.