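Learning-rate question: loss keeps growing during training and eventually becomes NaN (release 2.1, 2 GPUs, lr 0.00025, nothing changed except the number of classes)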

wxf764571829 commented 3 years ago

[06/08 16:58:10] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/ubuntu01/.cache/paddle/weights/ResNet50_cos_pretrained.pdparams
[06/08 16:58:12] ppdet.engine INFO: Epoch: [0] [ 0/3308] learning_rate: 0.000025 loss_rpn_cls: 0.698832 loss_rpn_reg: 0.319019 loss_bbox_cls: 1.223035 loss_bbox_reg: 0.000154 loss: 2.241040 eta: 13:33:58 batch_cost: 1.2303 data_cost: 0.0002 ips: 0.8128 images/s
[06/08 16:58:17] ppdet.engine INFO: Epoch: [0] [ 20/3308] learning_rate: 0.000030 loss_rpn_cls: 0.695908 loss_rpn_reg: 0.092636 loss_bbox_cls: 1.146842 loss_bbox_reg: 8035386370816811357582655488.000000 loss: 8035386370816811357582655488.000000 eta: 3:18:52 batch_cost: 0.2543 data_cost: 0.0002 ips: 3.9328 images/s
[06/08 16:58:22] ppdet.engine INFO: Epoch: [0] [ 40/3308] learning_rate: 0.000034 loss_rpn_cls: 0.692222 loss_rpn_reg: 0.074917 loss_bbox_cls: 2149442433699935594479616.000000 loss_bbox_reg: 5083389209306981538767680045056.000000 loss: 5083391627158620768026029457408.000000 eta: 3:06:06 batch_cost: 0.2614 data_cost: 0.0001 ips: 3.8251 images/s

wxf764571829 commented 3 years ago

The dataset itself is VOC-format data that previously trained on Faster R-CNN without problems.

lyuwenyu commented 3 years ago

Which algorithm are you using now, Faster R-CNN?
You could try:
- increasing the number of warmup iterations
- removing warmup
- lowering the learning rate

wxf764571829 commented 3 years ago

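I am using faster_rcnn_r50_fpn_1x_coco. I have tried lowering the learning rate repeatedly but still get NaN. I don't see an iteration-count parameter for warmup, only start_factor and steps.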

lyuwenyu commented 3 years ago

You could increase steps somewhat: https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.1/configs/faster_rcnn/_base_/optimizer_1x.yml#L11

wxf764571829 commented 3 years ago

Increasing steps from 1000 to 100,000 and lowering lr from 0.0025 to 0.00000025 made no noticeable difference. My environment: Python 3.8, CUDA 10.1, Ubuntu 16.04, Paddle 2.1.

lyuwenyu commented 3 years ago

The steps value.

wxf764571829 commented 3 years ago

epoch: 12

LearningRate:
  base_lr: 0.0025
  schedulers:
  - !PiecewiseDecay
    gamma: 0.1
    milestones: [8, 11]
  - !LinearWarmup
    start_factor: 0.025
    steps: 3000

OptimizerBuilder:
  optimizer:
    momentum: 0.9
    type: Momentum
  regularizer:
    factor: 0.0001
    type: L2

Hello, by "step" do you mean the steps under LinearWarmup here? I have changed this value, increasing it repeatedly, but the loss still ends up as NaN.
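For reference, the effect of start_factor and steps can be sketched in a few lines of Python. This is a minimal sketch, assuming the warmup interpolates linearly from start_factor * base_lr up to base_lr over steps iterations; the function name warmup_lr is hypothetical and not a PaddleDetection API. With base_lr 0.00025 (from the issue title) and assumed values start_factor 0.1 and steps 1000, it reproduces the learning rates printed in the first log (0.000025, 0.000030, 0.000034 after rounding).

```python
def warmup_lr(base_lr: float, start_factor: float, steps: int, it: int) -> float:
    """Learning rate at iteration `it` under linear warmup (illustrative sketch)."""
    if steps <= 0 or it >= steps:
        # Warmup is disabled or already finished: use the base learning rate.
        return base_lr
    alpha = it / steps
    # Blend from start_factor (at it == 0) to 1.0 (at it == steps).
    factor = start_factor * (1.0 - alpha) + alpha
    return base_lr * factor


if __name__ == "__main__":
    # Assumed values: base_lr from the issue title, start_factor/steps are guesses.
    for it in (0, 20, 40, 1000):
        print(it, f"{warmup_lr(0.00025, 0.1, 1000, it):.6f}")
```

Under this assumption, a larger steps value only stretches the ramp. The learning rate during the first few dozen iterations stays close to start_factor * base_lr either way, so a loss that blows up by iteration 20 is unlikely to be fixed by a longer warmup alone.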

jerrywgz commented 3 years ago

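On my side, faster_rcnn_r50_fpn_1x_coco trains normally on a VOC dataset with two GPUs. You could first try single-GPU training, then set shuffle to false in the reader config and check whether the abnormal training is triggered by the same image every time. You could also fine-tune, that is, load a model trained on COCO as pretrain_weights.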
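A complementary offline check is to scan the VOC annotations for degenerate boxes before training. The sketch below is only an illustration: nothing in this thread establishes that bad annotations are the cause, the function name find_bad_boxes is hypothetical, and it assumes the usual VOC XML layout (size/width, size/height, object/bndbox).

```python
import xml.etree.ElementTree as ET


def find_bad_boxes(xml_text: str) -> list:
    """Return (object name, reason) pairs for suspicious boxes in one VOC annotation."""
    root = ET.fromstring(xml_text.strip())
    width = float(root.findtext("size/width"))
    height = float(root.findtext("size/height"))
    problems = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            float(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        if xmax <= xmin or ymax <= ymin:
            problems.append((name, "zero or negative size"))
        elif xmin < 0 or ymin < 0 or xmax > width or ymax > height:
            problems.append((name, "outside image bounds"))
    return problems


# Made-up annotation for illustration: the second box has zero width.
SAMPLE = """
<annotation>
  <size><width>100</width><height>80</height></size>
  <object><name>sign</name>
    <bndbox><xmin>10</xmin><ymin>10</ymin><xmax>50</xmax><ymax>40</ymax></bndbox>
  </object>
  <object><name>sign</name>
    <bndbox><xmin>60</xmin><ymin>20</ymin><xmax>60</xmax><ymax>30</ymax></bndbox>
  </object>
</annotation>
"""

if __name__ == "__main__":
    print(find_bad_boxes(SAMPLE))
```

To use it on a real dataset, read each file under the Annotations directory and pass its text to the function; any file that returns a non-empty list is worth inspecting by hand.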

wxf764571829 commented 3 years ago

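I tested on a single GPU and the result is the same. The dataset is my own. On Win10 (CPU) it trains fine with PaddleX. When I take the same dataset to Ubuntu, adjust the corresponding config (lr and the steps under warmup), and try YOLO, R-CNN and others, they all show typical gradient explosion.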

wxf764571829 commented 3 years ago

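I set shuffle to false and ran training again. loss_bbox_cls, loss_bbox_reg and loss keep growing with each iteration and finally become NaN. Over several runs the NaN appeared at different positions, always in the first epoch: once at [ 580/3308], once at [ 80/3308], and once at [ 20/3308].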

jerrywgz commented 3 years ago

Could you please provide a small amount of data and the config files needed to reproduce this?

wxf764571829 commented 3 years ago

Roughly how much data do you need?

wxf764571829 commented 3 years ago

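Here are part of the data and the config files. I forgot to include some of the config files a moment ago; this is the complete link: https://pan.baidu.com/share/init?surl=_uPu7T5ZNhOtlt3fm3pPsQ extraction code: myyk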

jerrywgz commented 3 years ago

I took a look. Does your dataset have only one class?

wxf764571829 commented 3 years ago

Yes, only one class. I originally set num_class to 1, but later saw someone in an issue say the background class has to be included, so I changed it to 2.

jerrywgz commented 3 years ago

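In dynamic-graph mode the number of classes never includes the background class; in static-graph mode it must be increased by 1. After changing it to 1, training is normal on the small dataset you provided. The loss of the bbox regression branch still looks fairly small, possibly because there is so little data and the model has not yet learned the positive samples well. I suggest trying on the large dataset, or using the model trained on COCO, https://paddledet.bj.bcebos.com/models/faster_rcnn_r50_fpn_1x_coco.pdparams, as pretrain_weights.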
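The class-count rule stated above can be written down as a tiny helper. This is only a sketch of that rule; configured_num_classes is a hypothetical name, not part of PaddleDetection.

```python
def configured_num_classes(foreground_classes: int, static_graph: bool) -> int:
    """Class count to put in the config, following the rule described in this thread."""
    if foreground_classes < 1:
        raise ValueError("need at least one foreground class")
    # Static graph: add one for the background class. Dynamic graph: do not.
    return foreground_classes + 1 if static_graph else foreground_classes


if __name__ == "__main__":
    print(configured_num_classes(1, static_graph=False))
    print(configured_num_classes(1, static_graph=True))
```

For the single-class dataset discussed here, trained in dynamic-graph mode on release 2.1, the rule gives a class count of 1.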

wxf764571829 commented 3 years ago

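This is single-GPU training after I changed num_class to 1 and kept everything else the same as yours, only on the large dataset. It still shows gradient explosion. I am wondering whether this is related to my environment, Ubuntu 16.04.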

(paddle) ubuntu01@ubuntu01-System-Product-Name:/home/code/paddle/PaddleDetection-release-2.1$ python tools/train.py -c configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.yml
/home/ubuntu01/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/tensor/creation.py:125: DeprecationWarning: np.object is a deprecated alias for the builtin object. To silence this warning, use object by itself. Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if data.dtype == np.object:
/home/ubuntu01/anaconda3/envs/paddle/lib/python3.8/site-packages/paddledet-2.1.0-py3.8.egg/ppdet/data/source/voc.py:92: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn(
[06/10 10:49:24] ppdet.data.source.voc WARNING: Illegal image file: dataset/roadsign_voc/VOCdevkit2007/VOC2007/JPEGImages/91.jpg, and it will be ignored
[06/10 10:49:24] ppdet.data.source.voc WARNING: Illegal image file: dataset/roadsign_voc/VOCdevkit2007/VOC2007/JPEGImages/61.jpg, and it will be ignored
[06/10 10:49:24] ppdet.data.source.voc WARNING: Illegal image file: dataset/roadsign_voc/VOCdevkit2007/VOC2007/JPEGImages/5.jpg, and it will be ignored
[06/10 10:49:24] ppdet.data.source.voc WARNING: Illegal image file: dataset/roadsign_voc/VOCdevkit2007/VOC2007/JPEGImages/62.jpg, and it will be ignored
[06/10 10:49:24] ppdet.data.source.voc WARNING: Illegal image file: dataset/roadsign_voc/VOCdevkit2007/VOC2007/JPEGImages/931.jpg, and it will be ignored
W0610 10:49:24.785656 9286 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 10.1, Runtime API Version: 10.1
W0610 10:49:24.788290 9286 device_context.cc:422] device: 0, cuDNN Version: 7.6.
/home/ubuntu01/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/tensor/creation.py:125: DeprecationWarning: np.object is a deprecated alias for the builtin object. To silence this warning, use object by itself. Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if data.dtype == np.object:
[06/10 10:49:27] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/ubuntu01/.cache/paddle/weights/ResNet50_cos_pretrained.pdparams
[06/10 10:49:27] ppdet.engine INFO: Epoch: [0] [ 0/6615] learning_rate: 0.000010 loss_rpn_cls: 0.700178 loss_rpn_reg: 0.012245 loss_bbox_cls: 0.765692 loss_bbox_reg: 0.000132 loss: 1.478247 eta: 4:51:11 batch_cost: 0.2201 data_cost: 0.0002 ips: 4.5435 images/s
[06/10 10:49:32] ppdet.engine INFO: Epoch: [0] [ 20/6615] learning_rate: 0.000011 loss_rpn_cls: 0.697189 loss_rpn_reg: 0.087577 loss_bbox_cls: 0.697171 loss_bbox_reg: 86374406375995382823944978432.000000 loss: 86374406375995382823944978432.000000 eta: 5:11:52 batch_cost: 0.2366 data_cost: 0.0002 ips: 4.2269 images/s
[06/10 10:49:37] ppdet.engine INFO: Epoch: [0] [ 40/6615] learning_rate: 0.000011 loss_rpn_cls: 0.696582 loss_rpn_reg: 0.068122 loss_bbox_cls: 0.500473 loss_bbox_reg: 222237815671985718072440782848.000000 loss: 222237815671985718072440782848.000000 eta: 5:17:14 batch_cost: 0.2442 data_cost: 0.0002 ips: 4.0946 images/s
^CTraceback (most recent call last):

jerrywgz commented 3 years ago

Link: https://pan.baidu.com/s/15XAGG_VIgJG97MOQXMmcLg password: rp0v. This is the data after my debugging. You can put it under the PaddleDetection directory and then run python tools/train.py -c test_data/configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.yml

wxf764571829 commented 3 years ago

The result is below; the gradients still explode.

(paddle) ubuntu01@ubuntu01-System-Product-Name:/home/code/paddle/PaddleDetection-release-2.1$ python tools/train.py -c test_data/configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.yml
/home/ubuntu01/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/tensor/creation.py:125: DeprecationWarning: np.object is a deprecated alias for the builtin object. To silence this warning, use object by itself. Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if data.dtype == np.object:
W0610 14:18:15.254403 11173 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 10.1, Runtime API Version: 10.1
W0610 14:18:15.257342 11173 device_context.cc:422] device: 0, cuDNN Version: 7.6.
/home/ubuntu01/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/tensor/creation.py:125: DeprecationWarning: np.object is a deprecated alias for the builtin object. To silence this warning, use object by itself. Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if data.dtype == np.object:
[06/10 14:18:17] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/ubuntu01/.cache/paddle/weights/ResNet50_cos_pretrained.pdparams
[06/10 14:18:17] ppdet.engine INFO: Epoch: [0] [ 0/30] learning_rate: 0.000010 loss_rpn_cls: 0.695479 loss_rpn_reg: 0.191895 loss_bbox_cls: 0.756262 loss_bbox_reg: 0.000104 loss: 1.643740 eta: 0:01:25 batch_cost: 0.2375 data_cost: 0.0002 ips: 4.2110 images/s
[06/10 14:18:22] ppdet.engine INFO: Epoch: [0] [20/30] learning_rate: 0.000011 loss_rpn_cls: 0.699317 loss_rpn_reg: 0.043817 loss_bbox_cls: 0.656201 loss_bbox_reg: 30036017106796001248673792.000000 loss: 30036017106796001248673792.000000 eta: 0:01:17 batch_cost: 0.2276 data_cost: 0.0001 ips: 4.3931 images/s
[06/10 14:18:28] ppdet.utils.checkpoint INFO: Save checkpoint: output/faster_rcnn_r50_fpn_1x_coco
[06/10 14:18:28] ppdet.engine INFO: Epoch: [1] [ 0/30] learning_rate: 0.000011 loss_rpn_cls: 0.699256 loss_rpn_reg: 0.015225 loss_bbox_cls: 0.527988 loss_bbox_reg: 1070040945795569183057053220864.000000 loss: 1070040945795569183057053220864.000000 eta: 0:01:14 batch_cost: 0.2237 data_cost: 0.0001 ips: 4.4710 images/s
[06/10 14:18:33] ppdet.engine INFO: Epoch: [1] [20/30] learning_rate: 0.000012 loss_rpn_cls: 0.697815 loss_rpn_reg: 0.048037 loss_bbox_cls: 0.304180 loss_bbox_reg: 12224268613829214304034596323328.000000 loss: 12224268613829214304034596323328.000000 eta: 0:01:09 batch_cost: 0.2254 data_cost: 0.0001 ips: 4.4359 images/s
[06/10 14:18:37] ppdet.utils.checkpoint INFO: Save checkpoint: output/faster_rcnn_r50_fpn_1x_coco

jerrywgz commented 3 years ago

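Are you sure you have not modified any other code? If the loss of the bbox_reg branch were this large, subsequent training would certainly affect the other branches, yet the losses of the rpn branch and the bbox_cls branch still look normal.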
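The observation that only one branch diverges can be checked mechanically from the training log. The sketch below parses a ppdet.engine log line in the format shown in this thread and lists the loss components that are non-finite or above a threshold. The helper names and the 1e4 threshold are arbitrary choices for illustration, not anything defined by PaddleDetection.

```python
import math
import re

# Matches "loss_xxx: value" pairs, and the total "loss: value", in one log line.
LOSS_RE = re.compile(r"\b(loss\w*): (\S+)")


def parse_losses(line: str) -> dict:
    """Map each loss name in the line to its float value (float() accepts 'nan')."""
    return {name: float(value) for name, value in LOSS_RE.findall(line)}


def diverged(losses: dict, limit: float = 1e4) -> list:
    """Names of individual loss components that are NaN/inf or exceed `limit`."""
    return [
        name
        for name, value in losses.items()
        if name != "loss" and (not math.isfinite(value) or abs(value) > limit)
    ]


# Iteration 20 from the first log in this thread, shortened to the relevant fields.
LINE = (
    "Epoch: [0] [ 20/3308] learning_rate: 0.000030 loss_rpn_cls: 0.695908 "
    "loss_rpn_reg: 0.092636 loss_bbox_cls: 1.146842 "
    "loss_bbox_reg: 8035386370816811357582655488.000000 "
    "loss: 8035386370816811357582655488.000000 eta: 3:18:52"
)

if __name__ == "__main__":
    print(diverged(parse_losses(LINE)))
```

Running this over every line of a log shows at which iteration each component first leaves a sane range, which makes it easier to compare runs across environments.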

wxf764571829 commented 3 years ago

I confirm that I have not modified any code; I have not even opened the .py files. Here is the end of the run:

[06/10 14:22:31] ppdet.engine INFO: Epoch: [10] [ 0/30] learning_rate: 0.000019 loss_rpn_cls: 0.604167 loss_rpn_reg: 0.014151 loss_bbox_cls: 0.053558 loss_bbox_reg: 7307942676923507792863324798976.000000 loss: 7307942676923507792863324798976.000000 eta: 0:00:13 batch_cost: 0.2378 data_cost: 0.0001 ips: 4.2050 images/s
[06/10 14:22:36] ppdet.engine INFO: Epoch: [10] [20/30] learning_rate: 0.000020 loss_rpn_cls: 0.581559 loss_rpn_reg: 0.040997 loss_bbox_cls: 0.042619 loss_bbox_reg: 7754301059115801562119164395520.000000 loss: 7754301059115801562119164395520.000000 eta: 0:00:09 batch_cost: 0.2465 data_cost: 0.0002 ips: 4.0574 images/s
[06/10 14:22:41] ppdet.utils.checkpoint INFO: Save checkpoint: output/faster_rcnn_r50_fpn_1x_coco
[06/10 14:22:41] ppdet.engine INFO: Epoch: [11] [ 0/30] learning_rate: 0.000020 loss_rpn_cls: 0.567019 loss_rpn_reg: 0.023953 loss_bbox_cls: 0.040032 loss_bbox_reg: 7839933506159474397765714313216.000000 loss: 7839933506159474397765714313216.000000 eta: 0:00:06 batch_cost: 0.2382 data_cost: 0.0002 ips: 4.1977 images/s
[06/10 14:22:46] ppdet.engine INFO: Epoch: [11] [20/30] learning_rate: 0.000021 loss_rpn_cls: 0.532371 loss_rpn_reg: 0.036571 loss_bbox_cls: 0.036838 loss_bbox_reg: 9653358503356006598700219498496.000000 loss: 9653358503356006598700219498496.000000 eta: 0:00:02 batch_cost: 0.2492 data_cost: 0.0002 ips: 4.0136 images/s
[06/10 14:22:50] ppdet.utils.checkpoint INFO: Save checkpoint: output/faster_rcnn_r50_fpn_1x_coco

wxf764571829 commented 3 years ago

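That is why I have not been able to find the cause. I also updated my Ubuntu 16.04 environment for release 2.1, so I have been wondering whether something in my own environment is wrong.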

wxf764571829 commented 3 years ago

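To make sure no code was modified, I went through the quick installation guide again from scratch and trained with the data you sent back. The gradient explosion still occurs.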

wxf764571829 commented 3 years ago

May I ask what your overall environment is? If there is really no other way, I will try reinstalling my environment.

jerrywgz commented 3 years ago

I trained with CUDA 10.2, cuDNN 7.6.5 and Paddle 2.1. You could try training on AI Studio or with Docker, where the environment is more stable.

wxf764571829 commented 3 years ago

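Because of my driver I can only install CUDA 10.1. After matching the rest of the environment to yours and training again, the problem persists.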

PaddlePaddle / PaddleDetection

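Learning-rate question: loss keeps growing during training and eventually becomes NaN (release 2.1, 2 GPUs, lr 0.00025, nothing changed except the number of classes) #3326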