PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0

YOLOv7 training error #7307

Open linhandev opened 1 year ago

linhandev commented 1 year ago

Search before asking

Bug Component

Training

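Describe the Bug

When training with a very small batch size (bs), the YOLOv7 loss function raises an error.

To reproduce: train YOLOv7-tiny on COCO with bs=1.

It looks like the bs is so small that a batch can end up with no bounding boxes (bb) to compute a loss on. After raising bs to 32, training runs normally.

I noticed the README advises against training with a small bs, and I am sure a small bs hurts final accuracy, but not being able to run at all with a small bs feels like a bug. In theory, even with a larger bs, the same problem should occur if several unannotated images in the training data happen to be sampled into the same batch.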

Traceback (most recent call last):
  File "/home/lin/Desktop/git/eye/PaddleYOLO/tools/train.py", line 172, in <module>
    main()
  File "/home/lin/Desktop/git/eye/PaddleYOLO/tools/train.py", line 168, in main
    run(FLAGS, cfg)
  File "/home/lin/Desktop/git/eye/PaddleYOLO/tools/train.py", line 132, in run
    trainer.train(FLAGS.eval)
  File "/home/lin/Desktop/git/eye/PaddleYOLO/ppdet/engine/trainer.py", line 403, in train
    outputs = model(data)
  File "/home/lin/miniconda3/envs/pd/lib/python3.10/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/lin/miniconda3/envs/pd/lib/python3.10/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/lin/Desktop/git/eye/PaddleYOLO/ppdet/modeling/architectures/meta_arch.py", line 59, in forward
    out = self.get_loss()
  File "/home/lin/Desktop/git/eye/PaddleYOLO/ppdet/modeling/architectures/yolov5.py", line 92, in get_loss
    return self._forward()
  File "/home/lin/Desktop/git/eye/PaddleYOLO/ppdet/modeling/architectures/yolov5.py", line 77, in _forward
    yolo_losses = self.yolo_head(neck_feats, self.inputs)
  File "/home/lin/miniconda3/envs/pd/lib/python3.10/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/lin/miniconda3/envs/pd/lib/python3.10/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/lin/Desktop/git/eye/PaddleYOLO/ppdet/modeling/heads/yolov7_head.py", line 208, in forward
    return self.loss(yolo_outputs, targets, self.anchors)
  File "/home/lin/miniconda3/envs/pd/lib/python3.10/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/lin/miniconda3/envs/pd/lib/python3.10/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/lin/Desktop/git/eye/PaddleYOLO/ppdet/modeling/losses/yolov7_loss.py", line 107, in forward
    yolov7_gt_index = paddle.to_tensor(np.concatenate(img_idx), 'float32')
  File "<__array_function__ internals>", line 180, in concatenate
ValueError: need at least one array to concatenate
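
The last frame of the traceback can be reproduced outside PaddleYOLO. The snippet below is a minimal sketch, assuming only that `img_idx` (the name shown in the traceback) is an empty Python list when `np.concatenate` is called. How the list ends up empty is not established by this snippet.

```python
import numpy as np

# Assumption: no per-image index arrays were collected for this batch,
# so the list handed to np.concatenate is empty.
img_idx = []

try:
    np.concatenate(img_idx)
    error_message = None
except ValueError as exc:
    # NumPy refuses to concatenate an empty sequence of arrays.
    error_message = str(exc)

print(error_message)
```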

Environment

Linux, paddle 2.3.2, release/2.5 and develop, py 3.10

Bug description confirmation

Are you willing to submit a PR?

I dug into it for a while, but the code is fairly complex and I don't think I can fix it myself.

nemonameless commented 1 year ago

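I tested single-GPU training on the COCO dataset with bs=1 and it does not raise an error. Did you set allow_empty: true? The default is false. Otherwise img_idx cannot be empty. https://github.com/PaddlePaddle/PaddleYOLO/blob/release/2.5/ppdet/modeling/losses/yolov7_loss.py#L106 YOLOv7 does not support training on images with empty annotations or no boxes, so allow_empty: true must not be set.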

linhandev commented 1 year ago

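allow_empty is not set to true. I came across this while debugging: I printed the allow_empty-related variables in the code at the time, and empty samples were not allowed. Later I filtered the dataset before handing it to ppdet, removing every image without labels, and the problem was still the same.

Earlier I also tested on AI Studio, pulling the code directly and using a COCO dataset from a previous Baidu competition hosted on AI Studio, and got a similar error. I'm a bit busy right now; I can probably try again within a day or so to see whether it reproduces on AI Studio.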
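
For readers who want to apply the same kind of pre-filter, here is a rough sketch. It is not the script used above; `drop_unlabeled_images` is a hypothetical helper that works on an in-memory COCO-style dict and keeps only images referenced by at least one annotation.

```python
import json


def drop_unlabeled_images(coco: dict) -> dict:
    """Return a shallow copy of a COCO-style dict without unannotated images."""
    # Collect the ids of all images that have at least one annotation.
    labeled_ids = {ann["image_id"] for ann in coco.get("annotations", [])}
    filtered = dict(coco)
    filtered["images"] = [
        img for img in coco.get("images", []) if img["id"] in labeled_ids
    ]
    return filtered


# Tiny made-up example: image 2 has no annotation and should be dropped.
example = {
    "images": [{"id": 1, "file_name": "a.jpg"}, {"id": 2, "file_name": "b.jpg"}],
    "annotations": [
        {"id": 10, "image_id": 1, "bbox": [5, 5, 20, 20], "category_id": 1}
    ],
    "categories": [{"id": 1, "name": "object"}],
}
kept = drop_unlabeled_images(example)
print(json.dumps([img["file_name"] for img in kept["images"]]))
```

A filter like this only guarantees that every image has a box when it is loaded; it says nothing about what augmentation does to those boxes afterwards.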

linhandev commented 1 year ago

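The error on COCO did not appear as soon as training started either; it took a while, a few dozen batches, before hitting one. The dataset I was using at the time had just one bb per image.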

nemonameless commented 1 year ago

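Could you share a dataset that reproduces this reliably? An AI Studio project link or a cloud drive link would both work, so that we can investigate. Thanks.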

linhandev commented 1 year ago

OK. I'm rushing an assignment right now; I'll set up an environment on AI Studio within about a day.

Zhw1997 commented 1 year ago

I also hit this with the roadvoc example dataset after about 100 epochs. The run was unattended at the time.

linhandev commented 1 year ago

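Sorry, this took me a long time /laugh-cry

How long it runs before the problem shows up is not consistent; this run on AI Studio failed after a bit more than 600 images.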

Traceback (most recent call last):
  File "tools/train.py", line 172, in <module>
    main()
  File "tools/train.py", line 168, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 132, in run
    trainer.train(FLAGS.eval)
  File "/home/aistudio/PaddleYOLO/ppdet/engine/trainer.py", line 411, in train
    outputs = model(data)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/aistudio/PaddleYOLO/ppdet/modeling/architectures/meta_arch.py", line 59, in forward
    out = self.get_loss()
  File "/home/aistudio/PaddleYOLO/ppdet/modeling/architectures/yolov5.py", line 92, in get_loss
    return self._forward()
  File "/home/aistudio/PaddleYOLO/ppdet/modeling/architectures/yolov5.py", line 77, in _forward
    yolo_losses = self.yolo_head(neck_feats, self.inputs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/aistudio/PaddleYOLO/ppdet/modeling/heads/yolov7_head.py", line 208, in forward
    return self.loss(yolo_outputs, targets, self.anchors)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/aistudio/PaddleYOLO/ppdet/modeling/losses/yolov7_loss.py", line 106, in forward
    yolov7_gt_index = paddle.to_tensor(np.concatenate(img_idx), 'float32')
  File "<__array_function__ internals>", line 6, in concatenate
ValueError: need at least one array to concatenate

https://aistudio.baidu.com/aistudio/projectdetail/5079916

It is the version highlighted in blue:

image

@nemonameless

linhandev commented 1 year ago

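Thinking about it in theory: YOLOv7's default data augmentation includes translate. An image may have a bb when it is loaded, but if that bb is very close to the edge, couldn't translate push it out of the image?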
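
The hypothesis above can be illustrated with a small NumPy sketch. This is not PaddleYOLO's augmentation code; `translate_and_clip` and its `min_size` threshold are hypothetical, and boxes are assumed to be in pixel `xyxy` format.

```python
import numpy as np


def translate_and_clip(boxes, dx, dy, width, height, min_size=2.0):
    """Shift xyxy boxes by (dx, dy), clip to the image, drop degenerate boxes."""
    boxes = np.asarray(boxes, dtype=np.float32).reshape(-1, 4)
    shifted = boxes + np.array([dx, dy, dx, dy], dtype=np.float32)
    # Clip x coordinates to [0, width] and y coordinates to [0, height].
    shifted[:, [0, 2]] = shifted[:, [0, 2]].clip(0, width)
    shifted[:, [1, 3]] = shifted[:, [1, 3]].clip(0, height)
    w = shifted[:, 2] - shifted[:, 0]
    h = shifted[:, 3] - shifted[:, 1]
    # Keep only boxes that still have a usable extent after clipping.
    return shifted[(w >= min_size) & (h >= min_size)]


# One box near the right edge of a 640x640 image.
boxes = [[600, 100, 638, 140]]
remaining = translate_and_clip(boxes, dx=64, dy=0, width=640, height=640)
print(remaining.shape)
```

With a 64-pixel shift to the right the only box leaves the image entirely, so an image that had one label at load time reaches the loss with none.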

nemonameless commented 1 year ago

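Thinking about it in theory: YOLOv7's default data augmentation includes translate. An image may have a bb when it is loaded, but if that bb is very close to the edge, couldn't translate push it out of the image?

That may be the case, but the probability is extremely small. So set bs as large as you can; even if there are empty samples they will be very few, and after the concat they have no effect. Training with a total bs of 1 is pointless anyway. On datasets where each image has only one gt box, the difference between detectors is also small.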
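
The batch-size argument can be made concrete with a back-of-the-envelope calculation. The sketch assumes two things that are not confirmed in this thread: that the error requires every image in a batch to end up without boxes, and that images lose their boxes independently with the same probability. The value 0.01 is an arbitrary illustration, not a measurement.

```python
def all_empty_probability(p_empty: float, batch_size: int) -> float:
    """Probability that every image in a batch has no boxes.

    Assumes each image is empty independently with probability p_empty.
    """
    return p_empty ** batch_size


for bs in (1, 16, 32):
    print(bs, all_empty_probability(0.01, bs))
```

Under these assumptions the chance of an all-empty batch shrinks geometrically with bs, which is the reasoning behind the advice to use a large batch.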

linhandev commented 1 year ago

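I feel it would probably be better to add some kind of special-case check; a low probability is still a possibility. That said, with the dataset I'm using, where many images have only one bb and some of those are near the edge, I haven't hit this again with bs set to 16.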
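
One possible shape for such a special-case check is sketched below. It is not a fix adopted by PaddleYOLO; `concat_or_empty` and the `skip_batch` flag are hypothetical, and any real change would also need the downstream loss terms to handle an empty target set.

```python
import numpy as np


def concat_or_empty(arrays, dtype=np.float32):
    """Concatenate 1-D arrays, returning an empty array for an empty list."""
    if len(arrays) == 0:
        # np.concatenate([]) would raise ValueError, so return a typed empty array.
        return np.zeros((0,), dtype=dtype)
    return np.concatenate(arrays).astype(dtype)


img_idx = []  # assumption: no targets were matched in this batch
gt_index = concat_or_empty(img_idx)

# Hypothetical handling: the caller could skip the target-dependent loss terms.
skip_batch = gt_index.size == 0
print(gt_index.shape, skip_batch)
```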

nemonameless commented 1 year ago

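The original YOLOv7 defaults to a bs of 32 per GPU on 8 GPUs, 256 in total. A total bs far from that number will noticeably hurt training accuracy.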

linhandev commented 1 year ago

So I guess without the budget, v7 is off the table /laugh-cry

linhandev commented 1 year ago

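The original YOLOv7 defaults to a bs of 32 per GPU on 8 GPUs, 256 in total. A total bs far from that number will noticeably hurt training accuracy.

Today a run with bs 128 hit this error too... I'm just having rotten luck. translate was not enabled in MosaicPerspective.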

Traceback (most recent call last):
  File "/scratch/lh3317/git/eye/PaddleYOLO/tools/train.py", line 172, in <module>
    main()
  File "/scratch/lh3317/git/eye/PaddleYOLO/tools/train.py", line 168, in main
    run(FLAGS, cfg)
  File "/scratch/lh3317/git/eye/PaddleYOLO/tools/train.py", line 132, in run
    trainer.train(FLAGS.eval)
  File "/scratch/lh3317/git/eye/PaddleYOLO/ppdet/engine/trainer.py", line 403, in train
    outputs = model(data)
  File "/ext3/miniconda3/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/ext3/miniconda3/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/scratch/lh3317/git/eye/PaddleYOLO/ppdet/modeling/architectures/meta_arch.py", line 59, in forward
    out = self.get_loss()
  File "/scratch/lh3317/git/eye/PaddleYOLO/ppdet/modeling/architectures/yolov5.py", line 94, in get_loss
    return self._forward()
  File "/scratch/lh3317/git/eye/PaddleYOLO/ppdet/modeling/architectures/yolov5.py", line 79, in _forward
    yolo_losses = self.yolo_head(neck_feats, self.inputs)
  File "/ext3/miniconda3/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/ext3/miniconda3/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/scratch/lh3317/git/eye/PaddleYOLO/ppdet/modeling/heads/yolov7_head.py", line 208, in forward
    return self.loss(yolo_outputs, targets, self.anchors)
  File "/ext3/miniconda3/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/ext3/miniconda3/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/scratch/lh3317/git/eye/PaddleYOLO/ppdet/modeling/losses/yolov7_loss.py", line 106, in forward
    yolov7_gt_index = paddle.to_tensor(np.concatenate(img_idx), 'float32')
  File "<__array_function__ internals>", line 180, in concatenate

nemonameless commented 1 year ago

Please also post what you changed in the config file.

linhandev commented 1 year ago

OK, I'll restart this training as well, specifying bs on the command line, and post the details once it reproduces.

linhandev commented 1 year ago

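I also ran into another problem earlier: with a fairly large bs, when GPU memory is nearly full during training, the first training batch after eval reports out of memory. It looks as if the GPU memory used by eval was not released properly. If I hit it again I'll post the config and the command-line output.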