[BUG] Single-GPU PP-YOLOE training hangs

jimmy133719 commented 2 years ago

The PaddleDetection team appreciates any suggestions or problem reports you submit.

Checklist:

I have searched related issues but cannot get the expected help.
I have read the FAQ documentation but cannot get the expected help.
The bug has not been fixed in the latest version.

Describe the bug

Single-GPU PP-YOLOE training hangs. Sometimes the hang appears only after several epochs, sometimes right after the first epoch, and every time it happens just as a new epoch begins. [Screenshot from 2022-04-18 14-52-08]

Reproduction

What command or script did you run?

python tools/train.py -c RTK_configs/ppyoloe/ppyoloe_crn_s_300e_coco_Sigmoid.yml -r output/ppyoloe_crn_s_300e_coco_Sigmoid/53 --use_vdl=true --vdl_log_dir=vdl_dir/scalar

Did you make any modifications to the code or config? Did you understand what you modified? Please provide the code that you modified. I changed snapshot_epoch from 10 to 1 (I also tried other values and the problem persists), batch_size from 32 to 8, and base_lr from 0.04 to 0.00125 (going from 8 GPUs to a single GPU and reducing batch_size to 1/4).
What dataset did you use? COCO
Please provide the error messages or relevant log information.

Environment

Please provide the version of Paddle and PaddleDetection you use: Paddle: 2.2.2.post111 PaddleDetection: 2.4.0
Please provide the version of any other related tools/products used, such as PaddleServing or PaddleInference:
Please provide the OS information, e.g., Linux: Ubuntu18.04
Please provide the version of Python you used: 3.8.13
Please provide the version of CUDA/cuDNN you used: 11.1/8.2.1

If your issue looks like an installation issue / environment issue, please first try to solve it yourself with the instructions in https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.1/docs/tutorials/INSTALL.md

wangxinxin08 commented 2 years ago

Could you share the config file you modified?

jimmy133719 commented 2 years ago

> Could you share the config file you modified?

I only changed ppyoloe_crn_s_300e_coco.yml, as follows:

_BASE_: [
  '../datasets/coco_detection.yml',
  '../runtime.yml',
  './_base_/optimizer_300e.yml',
  './_base_/ppyoloe_crn.yml',
  './_base_/ppyoloe_reader.yml',
]

log_iter: 100
snapshot_epoch: 1
weights: output/ppyoloe_crn_s_300e_coco_Sigmoid_finetune/model_final

pretrain_weights: https://paddledet.bj.bcebos.com/models/pretrained/CSPResNetb_s_pretrained.pdparams
depth_mult: 0.33
width_mult: 0.50

TrainReader:
  batch_size: 8

LearningRate:
  base_lr: 0.00125

wangxinxin08 commented 2 years ago

@jimmy133719 So the learning rate is the only thing you changed?

jimmy133719 commented 2 years ago

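@wangxinxin08 Sorry, I forgot to mention that in the model's EffectiveSELayer I replaced Hard Sigmoid with Sigmoid. I am not sure whether this could be the cause. The change, in ppdet/modeling/backbones/cspresnet.py, is as follows: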

# self.attn = EffectiveSELayer(ch_mid, act='hardsigmoid')
self.attn = EffectiveSELayer(ch_mid, act='sigmoid')

Nothing else was modified. Thanks!

wangxinxin08 commented 2 years ago

@jimmy133719 You could leave this part unmodified for now, because the backbone was pretrained with hardsigmoid. That said, this change should not cause the hang.

jimmy133719 commented 2 years ago

@wangxinxin08 The reason I changed it is that the environment I will deploy to later does not support hardsigmoid.

wangxinxin08 commented 2 years ago

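> @wangxinxin08 The reason I changed it is that the environment I will deploy to later does not support hardsigmoid.

That's fine. If you train for 300 epochs, the impact should be small.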
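
To make the activation swap discussed here concrete, the following is a minimal sketch in plain Python comparing the two gate functions. It assumes the commonly used hard-sigmoid form clip(x/6 + 0.5, 0, 1); the exact slope and offset in a given framework version should be checked against its documentation. The helper names are illustrative and are not taken from PaddleDetection.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def hard_sigmoid(x: float, slope: float = 1.0 / 6.0, offset: float = 0.5) -> float:
    """Piecewise-linear approximation, clipped to [0, 1].

    Assumption: slope=1/6 and offset=0.5, a widely used definition.
    """
    return min(1.0, max(0.0, slope * x + offset))


def max_gap(lo: float = -8.0, hi: float = 8.0, steps: int = 1601) -> float:
    """Largest absolute difference between the two gates on a grid over [lo, hi]."""
    gap = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        gap = max(gap, abs(sigmoid(x) - hard_sigmoid(x)))
    return gap


if __name__ == "__main__":
    for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  hard_sigmoid={hard_sigmoid(x):.4f}")
    print(f"largest gap on [-8, 8]: {max_gap():.4f}")
```

Under this assumed definition the two functions agree at x = 0 and differ by a modest amount elsewhere, which is consistent with the view that weights pretrained with one gate need some training to adapt to the other.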

wangxinxin08 commented 2 years ago

If it is convenient, could you share your data so that we can locate the bug more easily?

jimmy133719 commented 2 years ago

What data would you like me to provide?

wangxinxin08 commented 2 years ago

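> What data would you like me to provide?

I misread; you are using the COCO dataset. Since you are training on a single GPU, a hang should not occur in principle. You can set worker_num to 0 and train again to see what exactly is causing the problem.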
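
As a complement to the worker_num suggestion, one generic way to see where a stuck Python process is waiting is the standard-library faulthandler module. The sketch below is not part of PaddleDetection and was not proposed by the maintainers; the function names and the idea of calling it near the start of a training script are assumptions for illustration.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s: float, log_file=sys.stderr, repeat: bool = False) -> None:
    """Dump the stack of every thread to log_file if the process is still
    running timeout_s seconds from now. log_file must be a real file object
    (it needs a file descriptor)."""
    faulthandler.dump_traceback_later(timeout_s, repeat=repeat, file=log_file)


def disarm_hang_watchdog() -> None:
    """Cancel a pending dump, for example after an epoch finishes normally."""
    faulthandler.cancel_dump_traceback_later()


# Hypothetical usage inside a training loop:
#   arm_hang_watchdog(1800)      # longer than one epoch is expected to take
#   ... run one epoch ...
#   disarm_hang_watchdog()
```

If the dump shows the main thread blocked while waiting for a batch, that points toward data loading, which is the same question that training with worker_num set to 0 is meant to answer.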

jimmy133719 commented 2 years ago

> Could you share the config file you modified?

> I only changed ppyoloe_crn_s_300e_coco.yml, as follows:
_BASE_: [
  '../datasets/coco_detection.yml',
  '../runtime.yml',
  './_base_/optimizer_300e.yml',
  './_base_/ppyoloe_crn.yml',
  './_base_/ppyoloe_reader.yml',
]

log_iter: 100
snapshot_epoch: 1
weights: output/ppyoloe_crn_s_300e_coco_Sigmoid_finetune/model_final

pretrain_weights: https://paddledet.bj.bcebos.com/models/pretrained/CSPResNetb_s_pretrained.pdparams
depth_mult: 0.33
width_mult: 0.50

TrainReader:
  batch_size: 8

LearningRate:
  base_lr: 0.00125
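@wangxinxin08 Hello, I have a separate question about training accuracy. I tried fine-tuning from the official ppyoloe_crn_s_300e_coco pretrained weights, using the modified config above. The AP and loss trends after training are shown in the figures below; after convergence, AP is about 10% lower than that of ppyoloe_crn_s_300e_coco. I first assumed the cause was replacing Hard Sigmoid with Sigmoid, but AP also drops when I fine-tune with Hard Sigmoid kept, so I am not sure whether the learning rate setting is responsible. I am currently training on a single GPU with batch size = 8, so learning rate = 0.04 x (8 x 1) / (32 x 8) = 0.00125.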

[Figures: loss curve; mAP curve]
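
The learning-rate arithmetic in this comment follows the linear scaling convention: the reference rate is multiplied by the ratio of the new total batch size to the reference total batch size. A minimal sketch, with a hypothetical function name and parameter names:

```python
def scale_lr(ref_lr: float, ref_batch_per_gpu: int, ref_gpus: int,
             batch_per_gpu: int, gpus: int) -> float:
    """Scale a reference learning rate linearly with the total batch size."""
    ref_total = ref_batch_per_gpu * ref_gpus   # 32 * 8 = 256 in the reference setup
    total = batch_per_gpu * gpus               # 8 * 1 = 8 in the single-GPU setup
    return ref_lr * total / ref_total


if __name__ == "__main__":
    lr = scale_lr(ref_lr=0.04, ref_batch_per_gpu=32, ref_gpus=8,
                  batch_per_gpu=8, gpus=1)
    print(f"scaled base_lr: {lr:.5f}")
```

With the values from this thread, the result matches the base_lr of 0.00125 used in the config.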

wangxinxin08 commented 2 years ago

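@jimmy133719 Are you using the COCO-pretrained model to train on COCO again? We generally do not recommend that. The COCO-pretrained model has already converged, so when training restarts, the learning rate suddenly jumps up, which lowers accuracy. If you train for the full 300 epochs, the accuracy should recover.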
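
The learning-rate jump described here can be illustrated with a generic schedule. This is only a sketch, assuming a linear warmup followed by cosine decay to zero; the real schedule lives in optimizer_300e.yml, which is not shown in this thread, so warmup_epochs and the function name are hypothetical.

```python
import math


def lr_at(epoch: int, base_lr: float = 0.00125,
          total_epochs: int = 300, warmup_epochs: int = 5) -> float:
    """Learning rate at a given epoch: linear warmup, then cosine decay.

    Assumption: illustrative schedule shape, not read from the actual config.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    end_of_previous_run = lr_at(299)   # rate at which converged weights were last updated
    start_of_new_run = lr_at(5)        # rate right after warmup in a fresh run
    print(f"end of a 300-epoch run : {end_of_previous_run:.2e}")
    print(f"fresh run after warmup : {start_of_new_run:.2e}")
```

Under these assumptions, weights that finished a full schedule were last updated with a rate close to zero, while a fresh run quickly returns to the full base_lr. That gap is the sudden increase mentioned above, and the accuracy is expected to recover as the new schedule decays again.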
