Open wxf764571829 opened 3 years ago
The dataset itself is VOC data that previously trained without any problem on faster-rcnn.
- Which algorithm are you using now — faster rcnn?
You can try:
- increasing the number of warmup iters
- removing warmup
- reducing the learning rate
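In PaddleDetection 2.x the warmup length and the learning rate both live in the `LearningRate` section of the config. A sketch of where each suggestion applies, modeled on the stock `faster_rcnn_r50_fpn_1x_coco` schedule (the exact values here are illustrative, not prescriptive):

```yaml
LearningRate:
  base_lr: 0.0025            # reduce this to lower the learning rate
  schedulers:
  - !PiecewiseDecay
    gamma: 0.1
    milestones: [8, 11]
  - !LinearWarmup            # delete this entry to remove warmup entirely
    start_factor: 0.1
    steps: 1000              # increase this for more warmup iterations
```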
I'm using faster_rcnn_r50_fpn_1x_coco. I've tried reducing the learning rate repeatedly but it still goes to nan. I don't see a warmup iter parameter, only start_factor and steps.
I increased steps from 1000 to 100k and lowered lr from 0.0025 down to 0.00000025, with no noticeable effect. My environment: python3.8, cuda10.1, ubuntu16.04, paddle2.1.
epoch: 12

LearningRate:
  base_lr: 0.0025
  schedulers:

OptimizerBuilder:
  optimizer:
    momentum: 0.9
    type: Momentum
  regularizer:
    factor: 0.0001
    type: L2

Hi, by "step" you mean the steps under Warmup here, right? I have modified that value, increasing it repeatedly, but the loss still ends up as nan.
On my side, faster_rcnn_r50_fpn_1x_coco trains normally on a VOC dataset in a two-card environment. You could try training on a single card first, then set shuffle to false in the reader config file and check whether the training failure is triggered by the same image every time. You could also finetune, i.e. load the model trained on the coco dataset as pretrain_weights.
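The two suggestions above map to config edits roughly like the following sketch (the reader file name is an assumption — check the `_BASE_` list of your main config for the actual file):

```yaml
# reader config, e.g. _base_/faster_fpn_reader.yml (file name assumed)
TrainReader:
  shuffle: false

# main config: initialize from the COCO-trained detector instead of the backbone
pretrain_weights: https://paddledet.bj.bcebos.com/models/faster_rcnn_r50_fpn_1x_coco.pdparams
```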
I've tested single-card too — same result. This is my own dataset, but it runs fine with paddlex on win10 (cpu). With the same dataset on ubuntu, after changing the lr and the steps under Warmup in the config, I tried yolo, rcnn, and others, and all of them show classic gradient explosion.
After setting shuffle to false and training again, loss_bbox_cls, loss_bbox_reg, and loss keep growing with each iteration and eventually become nan. Over multiple runs the nan appears at different positions, all within the first epoch: once at [ 580/3308], once at [ 80/3308], once at [ 20/3308].
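Since the explosion hits loss_bbox_reg at a different iteration each run, one thing worth ruling out is a degenerate box in the VOC annotations (zero-width/height or out-of-image coordinates), a common cause of exploding regression losses. A minimal checking sketch — the annotation directory path is taken from the log above and the helper name is ours, not part of PaddleDetection:

```python
# Sketch: scan Pascal VOC XML annotations for degenerate bounding boxes
# (xmax <= xmin, ymax <= ymin, or coordinates outside the image), which
# can blow up the bbox regression targets during training.
import glob
import os
import xml.etree.ElementTree as ET


def find_bad_boxes(ann_dir):
    """Return (filename, (x1, y1, x2, y2)) pairs with invalid coordinates."""
    bad = []
    for xml_path in glob.glob(os.path.join(ann_dir, "*.xml")):
        root = ET.parse(xml_path).getroot()
        size = root.find("size")
        w = int(size.find("width").text)
        h = int(size.find("height").text)
        for obj in root.iter("object"):
            b = obj.find("bndbox")
            x1 = float(b.find("xmin").text)
            y1 = float(b.find("ymin").text)
            x2 = float(b.find("xmax").text)
            y2 = float(b.find("ymax").text)
            # flag empty boxes and boxes that leave the image
            if x2 <= x1 or y2 <= y1 or x1 < 0 or y1 < 0 or x2 > w or y2 > h:
                bad.append((os.path.basename(xml_path), (x1, y1, x2, y2)))
    return bad


if __name__ == "__main__":
    ann_dir = "dataset/roadsign_voc/VOCdevkit2007/VOC2007/Annotations"
    for name, box in find_bad_boxes(ann_dir):
        print(name, box)
```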
Could you please share a small reproduction dataset and the config files?
Roughly how much data do you need?
Here are part of the data and the config files. I had left out some config files just now — this set is complete. Link: https://pan.baidu.com/share/init?surl=_uPu7T5ZNhOtlt3fm3pPsQ code: myyk
Looking at your data — is there only one class?
Yes, only one class. num_class was originally set to 1, but later I saw someone in an issue say it should include the background class, so I changed it to 2.
Under the dynamic graph the class count never includes the background class; under the static graph you need to add 1. After changing it back to 1, training on the small dataset you provided is normal. The bbox regression loss still looks quite small — possibly, with so little data, the model has not yet learned the positive samples well. I suggest trying a larger dataset, or using the model trained on the coco dataset https://paddledet.bj.bcebos.com/models/faster_rcnn_r50_fpn_1x_coco.pdparams as pretrain_weights.
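The class-count rule above corresponds to a single config line, roughly like this sketch (following PaddleDetection 2.x dynamic-graph configs):

```yaml
num_classes: 1   # dynamic graph: do NOT count the background class
                 # (static-graph configs would instead need 1 class + 1 = 2)
```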
This is single-card training after changing num_class==1, with everything else kept the same as yours, only on the larger dataset — and it still hits gradient explosion. I'm starting to wonder whether this is related to my environment, ubuntu16.04:
(paddle) ubuntu01@ubuntu01-System-Product-Name:/home/code/paddle/PaddleDetection-release-2.1$ python tools/train.py -c configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.yml
/home/ubuntu01/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/tensor/creation.py:125: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if data.dtype == np.object:
/home/ubuntu01/anaconda3/envs/paddle/lib/python3.8/site-packages/paddledet-2.1.0-py3.8.egg/ppdet/data/source/voc.py:92: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
[06/10 10:49:24] ppdet.data.source.voc WARNING: Illegal image file: dataset/roadsign_voc/VOCdevkit2007/VOC2007/JPEGImages/91.jpg, and it will be ignored
[06/10 10:49:24] ppdet.data.source.voc WARNING: Illegal image file: dataset/roadsign_voc/VOCdevkit2007/VOC2007/JPEGImages/61.jpg, and it will be ignored
[06/10 10:49:24] ppdet.data.source.voc WARNING: Illegal image file: dataset/roadsign_voc/VOCdevkit2007/VOC2007/JPEGImages/5.jpg, and it will be ignored
[06/10 10:49:24] ppdet.data.source.voc WARNING: Illegal image file: dataset/roadsign_voc/VOCdevkit2007/VOC2007/JPEGImages/62.jpg, and it will be ignored
[06/10 10:49:24] ppdet.data.source.voc WARNING: Illegal image file: dataset/roadsign_voc/VOCdevkit2007/VOC2007/JPEGImages/931.jpg, and it will be ignored
W0610 10:49:24.785656 9286 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 10.1, Runtime API Version: 10.1
W0610 10:49:24.788290 9286 device_context.cc:422] device: 0, cuDNN Version: 7.6.
/home/ubuntu01/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/tensor/creation.py:125: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if data.dtype == np.object:
[06/10 10:49:27] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/ubuntu01/.cache/paddle/weights/ResNet50_cos_pretrained.pdparams
[06/10 10:49:27] ppdet.engine INFO: Epoch: [0] [ 0/6615] learning_rate: 0.000010 loss_rpn_cls: 0.700178 loss_rpn_reg: 0.012245 loss_bbox_cls: 0.765692 loss_bbox_reg: 0.000132 loss: 1.478247 eta: 4:51:11 batch_cost: 0.2201 data_cost: 0.0002 ips: 4.5435 images/s
[06/10 10:49:32] ppdet.engine INFO: Epoch: [0] [ 20/6615] learning_rate: 0.000011 loss_rpn_cls: 0.697189 loss_rpn_reg: 0.087577 loss_bbox_cls: 0.697171 loss_bbox_reg: 86374406375995382823944978432.000000 loss: 86374406375995382823944978432.000000 eta: 5:11:52 batch_cost: 0.2366 data_cost: 0.0002 ips: 4.2269 images/s
[06/10 10:49:37] ppdet.engine INFO: Epoch: [0] [ 40/6615] learning_rate: 0.000011 loss_rpn_cls: 0.696582 loss_rpn_reg: 0.068122 loss_bbox_cls: 0.500473 loss_bbox_reg: 222237815671985718072440782848.000000 loss: 222237815671985718072440782848.000000 eta: 5:17:14 batch_cost: 0.2442 data_cost: 0.0002 ips: 4.0946 images/s
^CTraceback (most recent call last):
Link: https://pan.baidu.com/s/15XAGG_VIgJG97MOQXMmcLg password: rp0v — this is the data after my debugging. You can put it under the PaddleDetection directory and run: python tools/train.py -c test_data/configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.yml
The result is as follows — still gradient explosion:
(paddle) ubuntu01@ubuntu01-System-Product-Name:/home/code/paddle/PaddleDetection-release-2.1$ python tools/train.py -c test_data/configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.yml
/home/ubuntu01/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/tensor/creation.py:125: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if data.dtype == np.object:
W0610 14:18:15.254403 11173 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 10.1, Runtime API Version: 10.1
W0610 14:18:15.257342 11173 device_context.cc:422] device: 0, cuDNN Version: 7.6.
/home/ubuntu01/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/tensor/creation.py:125: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if data.dtype == np.object:
[06/10 14:18:17] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/ubuntu01/.cache/paddle/weights/ResNet50_cos_pretrained.pdparams
[06/10 14:18:17] ppdet.engine INFO: Epoch: [0] [ 0/30] learning_rate: 0.000010 loss_rpn_cls: 0.695479 loss_rpn_reg: 0.191895 loss_bbox_cls: 0.756262 loss_bbox_reg: 0.000104 loss: 1.643740 eta: 0:01:25 batch_cost: 0.2375 data_cost: 0.0002 ips: 4.2110 images/s
[06/10 14:18:22] ppdet.engine INFO: Epoch: [0] [20/30] learning_rate: 0.000011 loss_rpn_cls: 0.699317 loss_rpn_reg: 0.043817 loss_bbox_cls: 0.656201 loss_bbox_reg: 30036017106796001248673792.000000 loss: 30036017106796001248673792.000000 eta: 0:01:17 batch_cost: 0.2276 data_cost: 0.0001 ips: 4.3931 images/s
[06/10 14:18:28] ppdet.utils.checkpoint INFO: Save checkpoint: output/faster_rcnn_r50_fpn_1x_coco
[06/10 14:18:28] ppdet.engine INFO: Epoch: [1] [ 0/30] learning_rate: 0.000011 loss_rpn_cls: 0.699256 loss_rpn_reg: 0.015225 loss_bbox_cls: 0.527988 loss_bbox_reg: 1070040945795569183057053220864.000000 loss: 1070040945795569183057053220864.000000 eta: 0:01:14 batch_cost: 0.2237 data_cost: 0.0001 ips: 4.4710 images/s
[06/10 14:18:33] ppdet.engine INFO: Epoch: [1] [20/30] learning_rate: 0.000012 loss_rpn_cls: 0.697815 loss_rpn_reg: 0.048037 loss_bbox_cls: 0.304180 loss_bbox_reg: 12224268613829214304034596323328.000000 loss: 12224268613829214304034596323328.000000 eta: 0:01:09 batch_cost: 0.2254 data_cost: 0.0001 ips: 4.4359 images/s
[06/10 14:18:37] ppdet.utils.checkpoint INFO: Save checkpoint: output/faster_rcnn_r50_fpn_1x_coco
Are you sure no other code was modified? If the bbox_reg branch gets this large it is bound to affect the other branches later in training, yet the rpn and bbox_cls losses still look normal.
I confirm that no code was modified — I haven't even opened any of the py files. Here is the final tail of the run:
[06/10 14:22:31] ppdet.engine INFO: Epoch: [10] [ 0/30] learning_rate: 0.000019 loss_rpn_cls: 0.604167 loss_rpn_reg: 0.014151 loss_bbox_cls: 0.053558 loss_bbox_reg: 7307942676923507792863324798976.000000 loss: 7307942676923507792863324798976.000000 eta: 0:00:13 batch_cost: 0.2378 data_cost: 0.0001 ips: 4.2050 images/s
[06/10 14:22:36] ppdet.engine INFO: Epoch: [10] [20/30] learning_rate: 0.000020 loss_rpn_cls: 0.581559 loss_rpn_reg: 0.040997 loss_bbox_cls: 0.042619 loss_bbox_reg: 7754301059115801562119164395520.000000 loss: 7754301059115801562119164395520.000000 eta: 0:00:09 batch_cost: 0.2465 data_cost: 0.0002 ips: 4.0574 images/s
[06/10 14:22:41] ppdet.utils.checkpoint INFO: Save checkpoint: output/faster_rcnn_r50_fpn_1x_coco
[06/10 14:22:41] ppdet.engine INFO: Epoch: [11] [ 0/30] learning_rate: 0.000020 loss_rpn_cls: 0.567019 loss_rpn_reg: 0.023953 loss_bbox_cls: 0.040032 loss_bbox_reg: 7839933506159474397765714313216.000000 loss: 7839933506159474397765714313216.000000 eta: 0:00:06 batch_cost: 0.2382 data_cost: 0.0002 ips: 4.1977 images/s
[06/10 14:22:46] ppdet.engine INFO: Epoch: [11] [20/30] learning_rate: 0.000021 loss_rpn_cls: 0.532371 loss_rpn_reg: 0.036571 loss_bbox_cls: 0.036838 loss_bbox_reg: 9653358503356006598700219498496.000000 loss: 9653358503356006598700219498496.000000 eta: 0:00:02 batch_cost: 0.2492 data_cost: 0.0002 ips: 4.0136 images/s
[06/10 14:22:50] ppdet.utils.checkpoint INFO: Save checkpoint: output/faster_rcnn_r50_fpn_1x_coco
So I still can't find the cause. I set up the 2.1 release fresh on this ubuntu16.04 environment, which is why I'm wondering whether something in my own environment is at fault.
To confirm that no code was modified, I went through the quick installation docs again from scratch and trained with the data you sent back — the gradient explosion still occurs.
May I ask what your full environment is? If this really can't be resolved on my side, I'll try reinstalling my environment.
I'm training with CUDA10.2, cuDNN7.6.5, and Paddle 2.1. You could try training on AI Studio or in docker, where the environment is relatively more stable.
Because of driver limitations I can only install CUDA10.1. With the rest of the environment matched to yours, the problem remains:
[06/08 16:58:10] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/ubuntu01/.cache/paddle/weights/ResNet50_cos_pretrained.pdparams
[06/08 16:58:12] ppdet.engine INFO: Epoch: [0] [ 0/3308] learning_rate: 0.000025 loss_rpn_cls: 0.698832 loss_rpn_reg: 0.319019 loss_bbox_cls: 1.223035 loss_bbox_reg: 0.000154 loss: 2.241040 eta: 13:33:58 batch_cost: 1.2303 data_cost: 0.0002 ips: 0.8128 images/s
[06/08 16:58:17] ppdet.engine INFO: Epoch: [0] [ 20/3308] learning_rate: 0.000030 loss_rpn_cls: 0.695908 loss_rpn_reg: 0.092636 loss_bbox_cls: 1.146842 loss_bbox_reg: 8035386370816811357582655488.000000 loss: 8035386370816811357582655488.000000 eta: 3:18:52 batch_cost: 0.2543 data_cost: 0.0002 ips: 3.9328 images/s
[06/08 16:58:22] ppdet.engine INFO: Epoch: [0] [ 40/3308] learning_rate: 0.000034 loss_rpn_cls: 0.692222 loss_rpn_reg: 0.074917 loss_bbox_cls: 2149442433699935594479616.000000 loss_bbox_reg: 5083389209306981538767680045056.000000 loss: 5083391627158620768026029457408.000000 eta: 3:06:06 batch_cost: 0.2614 data_cost: 0.0001 ips: 3.8251 images/s
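(One knob not tried in this thread: some bundled PaddleDetection configs add a `clip_grad_by_norm` field under `OptimizerBuilder`, which caps the global gradient norm and can keep an exploding loss_bbox_reg from driving the weights to nan. A sketch, assuming that option is available in this version — the value 35.0 is illustrative:)

```yaml
OptimizerBuilder:
  clip_grad_by_norm: 35.0    # assumed option: global-norm gradient clipping
  optimizer:
    momentum: 0.9
    type: Momentum
  regularizer:
    factor: 0.0001
    type: L2
```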