PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0

[BUG] Single-GPU training of PP-YOLOE hangs #5735

Closed jimmy133719 closed 2 years ago

jimmy133719 commented 2 years ago

The PaddleDetection team appreciates any suggestions or problems you report.

Checklist:

  1. I have searched related issues but cannot get the expected help.
  2. I have read the FAQ documentation but cannot get the expected help.
  3. The bug has not been fixed in the latest version.

Describe the bug

Single-GPU training of PP-YOLOE hangs. Sometimes it happens only after several epochs, sometimes right after the first epoch finishes, and every time the hang occurs just as a new epoch begins. (Screenshot from 2022-04-18 14-52-08)

Reproduction

  1. What command or script did you run?
python tools/train.py -c RTK_configs/ppyoloe/ppyoloe_crn_s_300e_coco_Sigmoid.yml -r output/ppyoloe_crn_s_300e_coco_Sigmoid/53 --use_vdl=true --vdl_log_dir=vdl_dir/scalar

  2. Did you make any modifications to the code or config? Did you understand what you modified? Please provide the code that you modified. I changed snapshot_epoch from 10 to 1 (other values gave the same problem), batch_size from 32 to 8, and base_lr from 0.04 to 0.00125 (going from 8 GPUs to a single GPU, with batch_size reduced to 1/4).

  3. What dataset did you use? COCO

  4. Please provide the error messages or relevant log information.

Environment

  1. Please provide the version of Paddle and PaddleDetection you use: Paddle: 2.2.2.post111, PaddleDetection: 2.4.0

  2. Please provide the version of any other related tools/products used, such as PaddleServing or PaddleInference:

  3. Please provide the OS information, e.g., Linux: Ubuntu18.04

  4. Please provide the version of Python you used: 3.8.13

  5. Please provide the version of CUDA/cuDNN you used: 11.1/8.2.1

If your issue looks like an installation issue / environment issue, please first try to solve it yourself with the instructions in https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.1/docs/tutorials/INSTALL.md

wangxinxin08 commented 2 years ago

Could you share the config file you modified?

jimmy133719 commented 2 years ago

I only changed ppyoloe_crn_s_300e_coco.yml, as follows:

_BASE_: [
  '../datasets/coco_detection.yml',
  '../runtime.yml',
  './_base_/optimizer_300e.yml',
  './_base_/ppyoloe_crn.yml',
  './_base_/ppyoloe_reader.yml',
]

log_iter: 100
snapshot_epoch: 1
weights: output/ppyoloe_crn_s_300e_coco_Sigmoid_finetune/model_final

pretrain_weights: https://paddledet.bj.bcebos.com/models/pretrained/CSPResNetb_s_pretrained.pdparams
depth_mult: 0.33
width_mult: 0.50

TrainReader:
  batch_size: 8

LearningRate:
  base_lr: 0.00125

wangxinxin08 commented 2 years ago

@jimmy133719 So you only changed the learning rate?

jimmy133719 commented 2 years ago

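@wangxinxin08 Sorry, I forgot to mention that I replaced Hard Sigmoid with Sigmoid in the EffectiveSELayer part of my model. I'm not sure whether that could be the cause. The change, in ppdet/modeling/backbones/cspresnet.py, is as follows: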

# self.attn = EffectiveSELayer(ch_mid, act='hardsigmoid')
self.attn = EffectiveSELayer(ch_mid, act='sigmoid')

Nothing else was modified. Thanks!

wangxinxin08 commented 2 years ago

@jimmy133719 I'd suggest leaving this part unmodified for now, because the backbone was pretrained with hardsigmoid. That said, this change shouldn't cause the hang.
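
To illustrate why the activation swap matters for a backbone pretrained with hardsigmoid, here is a minimal plain-Python sketch comparing the two functions. It assumes the common hard sigmoid definition clip(x / 6 + 0.5, 0, 1); the helper names are illustrative and are not taken from PaddleDetection.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def hard_sigmoid(x: float, slope: float = 1.0 / 6.0, offset: float = 0.5) -> float:
    """Piecewise-linear approximation: clip(slope * x + offset, 0, 1).

    The default slope and offset are an assumption about the definition in use.
    """
    return max(0.0, min(1.0, slope * x + offset))


if __name__ == "__main__":
    # Print both activations side by side to see where they diverge.
    for x in (-6.0, -3.0, -1.0, 0.0, 1.0, 3.0, 6.0):
        s, h = sigmoid(x), hard_sigmoid(x)
        print(f"x={x:+.1f}  sigmoid={s:.4f}  hard_sigmoid={h:.4f}  gap={abs(s - h):.4f}")
```

The two functions agree at x = 0 but differ elsewhere, so weights tuned against one see slightly shifted channel-attention scales under the other until training adapts.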

jimmy133719 commented 2 years ago

@wangxinxin08 I changed it because the environment I'll deploy to later doesn't support hardsigmoid.

wangxinxin08 commented 2 years ago

That's fine. If you train for the full 300 epochs, the impact should be small.

wangxinxin08 commented 2 years ago

If it's convenient, could you share your data so we can better locate the bug?

jimmy133719 commented 2 years ago

What data would you like me to provide?

wangxinxin08 commented 2 years ago

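My mistake, I see you're using the COCO dataset. Since you're training on a single GPU, in principle it shouldn't hang. You could set worker_num to 0 and train again to see exactly what is causing the problem.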
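
As a complement to rerunning with worker_num set to 0, one generic way to see where a stuck Python process is waiting is the standard-library faulthandler module. The sketch below is only an assumption about how a training entry point could be instrumented; install_hang_watchdog and remove_hang_watchdog are hypothetical helper names, not part of PaddleDetection.

```python
import faulthandler
import sys


def install_hang_watchdog(timeout_s: float = 600.0, file=None) -> None:
    """Dump all thread tracebacks to `file` every `timeout_s` seconds.

    `file` must expose a real file descriptor (fileno()); it defaults to stderr.
    The dumps repeat until remove_hang_watchdog() is called.
    """
    target = sys.stderr if file is None else file
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)


def remove_hang_watchdog() -> None:
    """Cancel the periodic traceback dump installed above."""
    faulthandler.cancel_dump_traceback_later()
```

Calling install_hang_watchdog() near the start of the training script (a hypothetical placement) would periodically write the current stack of every thread, which helps tell a blocked data-loading wait apart from a stall elsewhere.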

jimmy133719 commented 2 years ago

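@wangxinxin08 Hello, I'd also like to ask about training accuracy. I tried fine-tuning from the officially provided ppyoloe_crn_s_300e_coco pretrained weights, using the modified config shown above. The AP and loss trends after training are shown in the figures below; after converging, AP is about 10% below that of ppyoloe_crn_s_300e_coco. I initially thought the cause was replacing Hard Sigmoid with Sigmoid, but AP also drops when fine-tuning with Hard Sigmoid kept, so I'm not sure whether the learning rate setting is responsible. I'm currently training on a single GPU with batch size 8, so learning rate = 0.04 x (8 x 1) / (32 x 8) = 0.00125.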
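
The arithmetic above follows the linear scaling rule, where the learning rate is scaled in proportion to the total batch size. A minimal sketch using the numbers from this thread (scale_lr is a hypothetical helper name, not a PaddleDetection function):

```python
def scale_lr(base_lr: float, base_batch_per_gpu: int, base_gpus: int,
             batch_per_gpu: int, gpus: int) -> float:
    """Scale a reference learning rate linearly with the total batch size."""
    base_total = base_batch_per_gpu * base_gpus  # reference setup: 32 x 8 = 256
    new_total = batch_per_gpu * gpus             # setup in this thread: 8 x 1 = 8
    return base_lr * new_total / base_total


if __name__ == "__main__":
    # Reference values as stated in the thread: base_lr 0.04 with batch 32 on 8 GPUs.
    print(scale_lr(0.04, base_batch_per_gpu=32, base_gpus=8, batch_per_gpu=8, gpus=1))
```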

[Figures: loss curve; mAP curve]

wangxinxin08 commented 2 years ago

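@jimmy133719 Are you using the COCO-pretrained model to train on COCO again? We generally don't recommend that. The COCO-pretrained model has already converged, so when you restart training the learning rate suddenly jumps up, which causes accuracy to drop. If you train for the full 300 epochs, accuracy should recover.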