PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0

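Confused about the learning rate: loss keeps growing as training proceeds and eventually becomes NaN (version 2.1, 2 GPUs, lr 0.00025, nothing changed except the number of classes) #3326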

Open wxf764571829 opened 3 years ago

wxf764571829 commented 3 years ago

[06/08 16:58:10] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/ubuntu01/.cache/paddle/weights/ResNet50_cos_pretrained.pdparams
[06/08 16:58:12] ppdet.engine INFO: Epoch: [0] [ 0/3308] learning_rate: 0.000025 loss_rpn_cls: 0.698832 loss_rpn_reg: 0.319019 loss_bbox_cls: 1.223035 loss_bbox_reg: 0.000154 loss: 2.241040 eta: 13:33:58 batch_cost: 1.2303 data_cost: 0.0002 ips: 0.8128 images/s
[06/08 16:58:17] ppdet.engine INFO: Epoch: [0] [ 20/3308] learning_rate: 0.000030 loss_rpn_cls: 0.695908 loss_rpn_reg: 0.092636 loss_bbox_cls: 1.146842 loss_bbox_reg: 8035386370816811357582655488.000000 loss: 8035386370816811357582655488.000000 eta: 3:18:52 batch_cost: 0.2543 data_cost: 0.0002 ips: 3.9328 images/s
[06/08 16:58:22] ppdet.engine INFO: Epoch: [0] [ 40/3308] learning_rate: 0.000034 loss_rpn_cls: 0.692222 loss_rpn_reg: 0.074917 loss_bbox_cls: 2149442433699935594479616.000000 loss_bbox_reg: 5083389209306981538767680045056.000000 loss: 5083391627158620768026029457408.000000 eta: 3:06:06 batch_cost: 0.2614 data_cost: 0.0001 ips: 3.8251 images/s
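For readers who want to locate the first bad iteration in a long log like the one above, here is a minimal sketch. The helper names `parse_line` and `first_divergence` and the threshold of 1e4 are illustrative assumptions, not part of PaddleDetection; the regular expressions are written only against the log format quoted in this thread, and the sample lines are shortened copies of it.

```python
import math
import re

# Matches "loss_xxx: value" pairs as printed in the trainer log lines above.
LOSS_FIELD = re.compile(r"(loss\w*): ([0-9.eE+-]+|nan|inf)")
# Matches the "Epoch: [e] [ i/n]" progress marker.
PROGRESS = re.compile(r"Epoch: \[(\d+)\] \[\s*(\d+)/(\d+)\]")


def parse_line(line):
    """Return (epoch, iteration, {loss_name: value}), or None for non-training lines."""
    progress = PROGRESS.search(line)
    if progress is None:
        return None
    losses = {name: float(value) for name, value in LOSS_FIELD.findall(line)}
    return int(progress.group(1)), int(progress.group(2)), losses


def first_divergence(lines, threshold=1e4):
    """Return (epoch, iteration, loss_name) for the first loss that is
    non-finite or larger than `threshold` (an arbitrary, assumed cut-off)."""
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        epoch, iteration, losses = parsed
        for name, value in losses.items():
            if not math.isfinite(value) or value > threshold:
                return epoch, iteration, name
    return None


# Shortened copies of two log lines from this thread.
sample = [
    "[06/08 16:58:12] ppdet.engine INFO: Epoch: [0] [ 0/3308] learning_rate: 0.000025 "
    "loss_rpn_cls: 0.698832 loss_rpn_reg: 0.319019 loss_bbox_cls: 1.223035 "
    "loss_bbox_reg: 0.000154 loss: 2.241040",
    "[06/08 16:58:17] ppdet.engine INFO: Epoch: [0] [ 20/3308] learning_rate: 0.000030 "
    "loss_rpn_cls: 0.695908 loss_rpn_reg: 0.092636 loss_bbox_cls: 1.146842 "
    "loss_bbox_reg: 8035386370816811357582655488.000000 "
    "loss: 8035386370816811357582655488.000000",
]
print(first_divergence(sample))  # (0, 20, 'loss_bbox_reg')
```

On these lines the first component to blow up is `loss_bbox_reg`, while the RPN losses remain in a normal range.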

wxf764571829 commented 3 years ago

The dataset itself is VOC-format data that previously trained without problems on Faster R-CNN.

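lyuwenyu commented 3 years ago

  • Which algorithm are you using now, Faster R-CNN?
  • You could try:

    • increasing the number of warmup iterations
    • removing warmup
    • reducing the learning rate

wxf764571829 commented 3 years ago

I am using faster_rcnn_r50_fpn_1x_coco. I tried lowering the learning rate repeatedly but still get NaN. I don't see an iteration-count parameter for warmup, only start_factor and steps.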
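As a rough illustration of how the two warmup parameters interact, the sketch below uses a common linear-warmup formula. The function name `linear_warmup_lr` is hypothetical, and the values `start_factor=0.1` and 1000 warmup steps are assumptions (1000 is the starting step count mentioned later in the thread; 0.00025 is the learning rate from the issue title). Under those assumptions the formula gives learning rates close to the ones printed in the first log (0.000025, 0.000030, 0.000034).

```python
def linear_warmup_lr(step, base_lr, start_factor, warmup_steps):
    """Learning rate at `step` under linear warmup.

    The multiplier ramps linearly from `start_factor` at step 0
    to 1.0 at `warmup_steps`, then stays at 1.0.
    """
    if step >= warmup_steps:
        return base_lr
    alpha = step / warmup_steps
    factor = start_factor * (1.0 - alpha) + alpha
    return base_lr * factor


# Assumed values; see the paragraph above.
for step in (0, 20, 40, 1000):
    print(step, linear_warmup_lr(step, base_lr=0.00025, start_factor=0.1, warmup_steps=1000))
```

In this form, the number of warmup iterations is the `steps` value itself: a larger value makes the ramp slower, so the learning rate stays near `base_lr * start_factor` for longer.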

lyuwenyu commented 3 years ago
wxf764571829 commented 3 years ago

Increasing step from 1000 to 100,000 and lowering lr from 0.0025 to 0.00000025 made no noticeable difference. My environment: Python 3.8, CUDA 10.1, Ubuntu 16.04, Paddle 2.1.

lyuwenyu commented 3 years ago

The step value

wxf764571829 commented 3 years ago

epoch: 12

LearningRate:
  base_lr: 0.0025
  schedulers:

OptimizerBuilder:
  optimizer:
    momentum: 0.9
    type: Momentum
  regularizer:
    factor: 0.0001
    type: L2

Hello, by step do you mean the steps under Warmup here? I have changed this value, increasing it repeatedly, but the loss still ends up as NaN.

jerrywgz commented 3 years ago

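[image] On my side, with two GPUs, training on a VOC dataset with faster_rcnn_r50_fpn_1x_coco works normally. You could first try single-GPU training, then set shuffle to false in the reader config file and check whether the abnormal training is triggered by the same image every time. Alternatively, you can fine-tune, that is, load a model trained on the COCO dataset as pretrain_weights.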
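If a single image is suspected, one thing that can be ruled out offline is a malformed annotation. The sketch below is only a possible check, assuming the standard Pascal VOC XML layout (`size`, `object`, `bndbox` with `xmin`/`ymin`/`xmax`/`ymax`); the helper name `find_bad_boxes` is hypothetical, and nothing in this thread confirms that bad boxes are the cause here.

```python
import xml.etree.ElementTree as ET


def find_bad_boxes(xml_text):
    """Return a list of (object_name, reason) for suspicious boxes
    in one VOC-style annotation given as an XML string."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    # A width/height of 0 means "unknown", so the bounds check is skipped.
    width = float(size.findtext("width", "0")) if size is not None else 0.0
    height = float(size.findtext("height", "0")) if size is not None else 0.0

    problems = []
    for obj in root.iter("object"):
        name = obj.findtext("name", "?")
        box = obj.find("bndbox")
        if box is None:
            problems.append((name, "missing bndbox"))
            continue
        try:
            xmin, ymin, xmax, ymax = (
                float(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
            )
        except (TypeError, ValueError):
            # findtext returns None for a missing tag; float() then raises TypeError.
            problems.append((name, "non-numeric coordinate"))
            continue
        if xmax <= xmin or ymax <= ymin:
            problems.append((name, "zero or negative area"))
        elif xmin < 0 or ymin < 0 or (width > 0 and xmax > width) or (height > 0 and ymax > height):
            problems.append((name, "outside image bounds"))
    return problems
```

A typical use would be to loop over the files in the dataset's annotation directory, read each one as text, and print the file name whenever the returned list is non-empty.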

wxf764571829 commented 3 years ago

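I have tested single-GPU training and the result is the same. This is my own dataset, and it runs without problems under PaddleX on Windows 10 (CPU). With the same dataset on Ubuntu, after adjusting lr and the steps under warmup in the corresponding configs, I tried YOLO, R-CNN and other models, and all of them show typical gradient explosion.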

wxf764571829 commented 3 years ago

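I set shuffle to false and ran training again. loss_bbox_cls, loss_bbox_reg and loss grow with each iteration and eventually become NaN. Over several runs the NaN appeared at different positions, always within the first epoch: once at [ 580/3308], once at [ 80/3308], and once at [ 20/3308].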

jerrywgz commented 3 years ago

Could you please provide a small amount of data that reproduces the problem, along with the config files?

wxf764571829 commented 3 years ago

Roughly how much data do you need?

wxf764571829 commented 3 years ago

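Here is part of the data and the corresponding config files. I forgot to include some of the config files earlier; this set is complete. Link: https://pan.baidu.com/share/init?surl=_uPu7T5ZNhOtlt3fm3pPsQ Extraction code: myyk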

jerrywgz commented 3 years ago

I took a look at your data. Does it contain only one class?

wxf764571829 commented 3 years ago

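Yes, only one class. I initially set num_class to 1, but then saw someone in another issue say that the background class must be included, so I changed it to 2.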

jerrywgz commented 3 years ago

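[image] In dynamic-graph mode the class count consistently excludes the background class; in static-graph mode the class count needs to be increased by 1. After changing it to 1, training on the small dataset you provided is normal. The bbox regression branch loss still looks rather small; possibly because of the limited data, the model has not yet learned the positive samples well. I suggest trying a larger dataset, or using the model trained on COCO, https://paddledet.bj.bcebos.com/models/faster_rcnn_r50_fpn_1x_coco.pdparams, as pretrain_weights.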
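The counting rule stated above can be written down as a tiny helper. This is only a restatement of that rule as described in this comment; the function name `configured_num_classes` is hypothetical and is not a PaddleDetection API.

```python
def configured_num_classes(foreground_classes, static_graph=False):
    """Class count to put in the config, following the rule described above:
    dynamic graph excludes the background class, static graph adds 1 for it."""
    if foreground_classes < 1:
        raise ValueError("expected at least one foreground class")
    return foreground_classes + 1 if static_graph else foreground_classes


# A single-class dataset, as in this issue.
print(configured_num_classes(1))                     # dynamic graph: 1
print(configured_num_classes(1, static_graph=True))  # static graph: 2
```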

wxf764571829 commented 3 years ago

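This is a single-GPU run after I changed num_class to 1 and kept everything else the same as yours, except that it uses the large dataset. It still shows gradient explosion. I am wondering whether this is related to my environment, Ubuntu 16.04.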

(paddle) ubuntu01@ubuntu01-System-Product-Name:/home/code/paddle/PaddleDetection-release-2.1$ python tools/train.py -c configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.yml
/home/ubuntu01/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/tensor/creation.py:125: DeprecationWarning: np.object is a deprecated alias for the builtin object. To silence this warning, use object by itself. Doing this will not modify any behavior and is safe. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if data.dtype == np.object:
/home/ubuntu01/anaconda3/envs/paddle/lib/python3.8/site-packages/paddledet-2.1.0-py3.8.egg/ppdet/data/source/voc.py:92: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn(
[06/10 10:49:24] ppdet.data.source.voc WARNING: Illegal image file: dataset/roadsign_voc/VOCdevkit2007/VOC2007/JPEGImages/91.jpg, and it will be ignored
[06/10 10:49:24] ppdet.data.source.voc WARNING: Illegal image file: dataset/roadsign_voc/VOCdevkit2007/VOC2007/JPEGImages/61.jpg, and it will be ignored
[06/10 10:49:24] ppdet.data.source.voc WARNING: Illegal image file: dataset/roadsign_voc/VOCdevkit2007/VOC2007/JPEGImages/5.jpg, and it will be ignored
[06/10 10:49:24] ppdet.data.source.voc WARNING: Illegal image file: dataset/roadsign_voc/VOCdevkit2007/VOC2007/JPEGImages/62.jpg, and it will be ignored
[06/10 10:49:24] ppdet.data.source.voc WARNING: Illegal image file: dataset/roadsign_voc/VOCdevkit2007/VOC2007/JPEGImages/931.jpg, and it will be ignored
W0610 10:49:24.785656 9286 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 10.1, Runtime API Version: 10.1
W0610 10:49:24.788290 9286 device_context.cc:422] device: 0, cuDNN Version: 7.6.
/home/ubuntu01/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/tensor/creation.py:125: DeprecationWarning: np.object is a deprecated alias for the builtin object. To silence this warning, use object by itself. Doing this will not modify any behavior and is safe. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if data.dtype == np.object:
[06/10 10:49:27] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/ubuntu01/.cache/paddle/weights/ResNet50_cos_pretrained.pdparams
[06/10 10:49:27] ppdet.engine INFO: Epoch: [0] [ 0/6615] learning_rate: 0.000010 loss_rpn_cls: 0.700178 loss_rpn_reg: 0.012245 loss_bbox_cls: 0.765692 loss_bbox_reg: 0.000132 loss: 1.478247 eta: 4:51:11 batch_cost: 0.2201 data_cost: 0.0002 ips: 4.5435 images/s
[06/10 10:49:32] ppdet.engine INFO: Epoch: [0] [ 20/6615] learning_rate: 0.000011 loss_rpn_cls: 0.697189 loss_rpn_reg: 0.087577 loss_bbox_cls: 0.697171 loss_bbox_reg: 86374406375995382823944978432.000000 loss: 86374406375995382823944978432.000000 eta: 5:11:52 batch_cost: 0.2366 data_cost: 0.0002 ips: 4.2269 images/s
[06/10 10:49:37] ppdet.engine INFO: Epoch: [0] [ 40/6615] learning_rate: 0.000011 loss_rpn_cls: 0.696582 loss_rpn_reg: 0.068122 loss_bbox_cls: 0.500473 loss_bbox_reg: 222237815671985718072440782848.000000 loss: 222237815671985718072440782848.000000 eta: 5:17:14 batch_cost: 0.2442 data_cost: 0.0002 ips: 4.0946 images/s
^CTraceback (most recent call last):

jerrywgz commented 3 years ago

Link: https://pan.baidu.com/s/15XAGG_VIgJG97MOQXMmcLg Password: rp0v This is the data after my debugging. You can place it under the PaddleDetection directory and then run python tools/train.py -c test_data/configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.yml

wxf764571829 commented 3 years ago

The result is below; the gradients still explode:

(paddle) ubuntu01@ubuntu01-System-Product-Name:/home/code/paddle/PaddleDetection-release-2.1$ python tools/train.py -c test_data/configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.yml
/home/ubuntu01/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/tensor/creation.py:125: DeprecationWarning: np.object is a deprecated alias for the builtin object. To silence this warning, use object by itself. Doing this will not modify any behavior and is safe. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if data.dtype == np.object:
W0610 14:18:15.254403 11173 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 10.1, Runtime API Version: 10.1
W0610 14:18:15.257342 11173 device_context.cc:422] device: 0, cuDNN Version: 7.6.
/home/ubuntu01/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/tensor/creation.py:125: DeprecationWarning: np.object is a deprecated alias for the builtin object. To silence this warning, use object by itself. Doing this will not modify any behavior and is safe. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if data.dtype == np.object:
[06/10 14:18:17] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/ubuntu01/.cache/paddle/weights/ResNet50_cos_pretrained.pdparams
[06/10 14:18:17] ppdet.engine INFO: Epoch: [0] [ 0/30] learning_rate: 0.000010 loss_rpn_cls: 0.695479 loss_rpn_reg: 0.191895 loss_bbox_cls: 0.756262 loss_bbox_reg: 0.000104 loss: 1.643740 eta: 0:01:25 batch_cost: 0.2375 data_cost: 0.0002 ips: 4.2110 images/s
[06/10 14:18:22] ppdet.engine INFO: Epoch: [0] [20/30] learning_rate: 0.000011 loss_rpn_cls: 0.699317 loss_rpn_reg: 0.043817 loss_bbox_cls: 0.656201 loss_bbox_reg: 30036017106796001248673792.000000 loss: 30036017106796001248673792.000000 eta: 0:01:17 batch_cost: 0.2276 data_cost: 0.0001 ips: 4.3931 images/s
[06/10 14:18:28] ppdet.utils.checkpoint INFO: Save checkpoint: output/faster_rcnn_r50_fpn_1x_coco
[06/10 14:18:28] ppdet.engine INFO: Epoch: [1] [ 0/30] learning_rate: 0.000011 loss_rpn_cls: 0.699256 loss_rpn_reg: 0.015225 loss_bbox_cls: 0.527988 loss_bbox_reg: 1070040945795569183057053220864.000000 loss: 1070040945795569183057053220864.000000 eta: 0:01:14 batch_cost: 0.2237 data_cost: 0.0001 ips: 4.4710 images/s
[06/10 14:18:33] ppdet.engine INFO: Epoch: [1] [20/30] learning_rate: 0.000012 loss_rpn_cls: 0.697815 loss_rpn_reg: 0.048037 loss_bbox_cls: 0.304180 loss_bbox_reg: 12224268613829214304034596323328.000000 loss: 12224268613829214304034596323328.000000 eta: 0:01:09 batch_cost: 0.2254 data_cost: 0.0001 ips: 4.4359 images/s
[06/10 14:18:37] ppdet.utils.checkpoint INFO: Save checkpoint: output/faster_rcnn_r50_fpn_1x_coco

jerrywgz commented 3 years ago

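Are you sure you have not modified any other code? If the bbox_reg branch loss blows up like this, later training would be expected to affect the other branches, yet the losses of the RPN branch and the bbox_cls branch still look normal.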

wxf764571829 commented 3 years ago

I confirm that I have not modified any code; I have not even opened the .py files. Here is the end of the log:

[06/10 14:22:31] ppdet.engine INFO: Epoch: [10] [ 0/30] learning_rate: 0.000019 loss_rpn_cls: 0.604167 loss_rpn_reg: 0.014151 loss_bbox_cls: 0.053558 loss_bbox_reg: 7307942676923507792863324798976.000000 loss: 7307942676923507792863324798976.000000 eta: 0:00:13 batch_cost: 0.2378 data_cost: 0.0001 ips: 4.2050 images/s
[06/10 14:22:36] ppdet.engine INFO: Epoch: [10] [20/30] learning_rate: 0.000020 loss_rpn_cls: 0.581559 loss_rpn_reg: 0.040997 loss_bbox_cls: 0.042619 loss_bbox_reg: 7754301059115801562119164395520.000000 loss: 7754301059115801562119164395520.000000 eta: 0:00:09 batch_cost: 0.2465 data_cost: 0.0002 ips: 4.0574 images/s
[06/10 14:22:41] ppdet.utils.checkpoint INFO: Save checkpoint: output/faster_rcnn_r50_fpn_1x_coco
[06/10 14:22:41] ppdet.engine INFO: Epoch: [11] [ 0/30] learning_rate: 0.000020 loss_rpn_cls: 0.567019 loss_rpn_reg: 0.023953 loss_bbox_cls: 0.040032 loss_bbox_reg: 7839933506159474397765714313216.000000 loss: 7839933506159474397765714313216.000000 eta: 0:00:06 batch_cost: 0.2382 data_cost: 0.0002 ips: 4.1977 images/s
[06/10 14:22:46] ppdet.engine INFO: Epoch: [11] [20/30] learning_rate: 0.000021 loss_rpn_cls: 0.532371 loss_rpn_reg: 0.036571 loss_bbox_cls: 0.036838 loss_bbox_reg: 9653358503356006598700219498496.000000 loss: 9653358503356006598700219498496.000000 eta: 0:00:02 batch_cost: 0.2492 data_cost: 0.0002 ips: 4.0136 images/s
[06/10 14:22:50] ppdet.utils.checkpoint INFO: Save checkpoint: output/faster_rcnn_r50_fpn_1x_coco

wxf764571829 commented 3 years ago

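So I still cannot find the cause. Also, version 2.1 updated the Ubuntu 16.04 environment, which is why I am wondering whether something in my own environment is wrong.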

wxf764571829 commented 3 years ago

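To confirm that no code had been modified, I went through the quick installation guide again from scratch and trained with the data you sent back. Gradient explosion still occurs.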

wxf764571829 commented 3 years ago

May I ask what your overall environment is? If there is really no other way, I will try reinstalling my environment.

jerrywgz commented 3 years ago

I trained with CUDA 10.2, cuDNN 7.6.5 and Paddle 2.1. You could try training on AI Studio or with Docker, where the environment is relatively more stable.

wxf764571829 commented 3 years ago

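Because of driver constraints I can only install CUDA 10.1. After matching the rest of the environment to yours and training again, the problem persists.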