PaddlePaddle / Paddle

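PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 『飞桨』 (PaddlePaddle): high-performance single-machine and distributed training, and cross-platform deployment, for deep learning & machine learning)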
http://www.paddlepaddle.org/
Apache License 2.0
22.12k stars 5.55k forks

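V100 GPU: training the DBNet model from the PaddleOCR repo shows large single-GPU performance fluctuation and a low multi-GPU linear speedup ratio; initial suspicion is the dataloader implementation #58713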

Open caizhi-mt opened 11 months ago

caizhi-mt commented 11 months ago

Describe the Bug

Paddle image: paddlepaddle/paddle:2.5.2-gpu-cuda12.0-cudnn8.9-trt8.6, dynamic graph mode. Training the DBNet model from PaddleOCR: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.5/doc/doc_ch/algorithm_det_db.md

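Single-GPU performance fluctuates considerably. With synchronous mode enabled, the profiler timeline shows bubbles (idle gaps). Our initial diagnosis: after a compute function returns, reacquiring the GIL takes a random and sometimes very long time.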
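The GIL hypothesis above can be probed outside the training loop. The following is a minimal sketch, not part of the original report: `gil_reacquire_latency` is a hypothetical helper that measures how far a thread overshoots a short `time.sleep()` (which releases the GIL) while optional pure-Python background threads compete for the GIL. It only illustrates the effect; it does not reproduce Paddle's dataloader threads.

```python
import threading
import time


def gil_reacquire_latency(samples=50, busy_threads=0, sleep_s=0.001):
    """Return per-sample overshoot (seconds) of a GIL-releasing sleep.

    time.sleep() releases the GIL; after waking, the calling thread must
    reacquire it, so the overshoot beyond sleep_s includes the GIL wait
    (plus ordinary OS scheduling delay).
    """
    stop = threading.Event()

    def spin():
        # Pure-Python busy loop that keeps competing for the GIL.
        while not stop.is_set():
            sum(range(1000))

    workers = [threading.Thread(target=spin, daemon=True)
               for _ in range(busy_threads)]
    for w in workers:
        w.start()

    overshoots = []
    try:
        for _ in range(samples):
            start = time.perf_counter()
            time.sleep(sleep_s)
            elapsed = time.perf_counter() - start
            # Clamp at zero to absorb clock granularity effects.
            overshoots.append(max(0.0, elapsed - sleep_s))
    finally:
        stop.set()
        for w in workers:
            w.join()
    return overshoots


if __name__ == "__main__":
    for n in (0, 1, 2):
        result = gil_reacquire_latency(busy_threads=n)
        print(f"busy_threads={n}: max overshoot {max(result) * 1e3:.3f} ms")
```

Comparing the maximum overshoot with `busy_threads=0` against `busy_threads=1` or more gives a rough sense of how much a competing Python thread can delay GIL reacquisition on a given machine.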

Test command: python tools/train.py -c configs/det/det_mv3_db.yml -o Global.pretrained_model=./pretrain_models/MobileNetV3_large_x0_5_pretrained

Additional Supplementary Information

```
[2023/11/06 16:26:21] ppocr INFO: Architecture :
[2023/11/06 16:26:21] ppocr INFO: Backbone :
[2023/11/06 16:26:21] ppocr INFO: model_name : large
[2023/11/06 16:26:21] ppocr INFO: name : MobileNetV3
[2023/11/06 16:26:21] ppocr INFO: scale : 0.5
[2023/11/06 16:26:21] ppocr INFO: Head :
[2023/11/06 16:26:21] ppocr INFO: k : 50
[2023/11/06 16:26:21] ppocr INFO: name : DBHead
[2023/11/06 16:26:21] ppocr INFO: Neck :
[2023/11/06 16:26:21] ppocr INFO: name : DBFPN
[2023/11/06 16:26:21] ppocr INFO: out_channels : 256
[2023/11/06 16:26:21] ppocr INFO: Transform : None
[2023/11/06 16:26:21] ppocr INFO: algorithm : DB
[2023/11/06 16:26:21] ppocr INFO: model_type : det
[2023/11/06 16:26:21] ppocr INFO: Eval :
[2023/11/06 16:26:21] ppocr INFO: dataset :
[2023/11/06 16:26:21] ppocr INFO: data_dir : ./train_data/icdar2015/text_localization/
[2023/11/06 16:26:21] ppocr INFO: label_file_list : ['./train_data/icdar2015/text_localization/test_icdar2015_label.txt']
[2023/11/06 16:26:21] ppocr INFO: name : SimpleDataSet
[2023/11/06 16:26:21] ppocr INFO: transforms :
[2023/11/06 16:26:21] ppocr INFO: DecodeImage :
[2023/11/06 16:26:21] ppocr INFO: channel_first : False
[2023/11/06 16:26:21] ppocr INFO: img_mode : BGR
[2023/11/06 16:26:21] ppocr INFO: DetLabelEncode : None
[2023/11/06 16:26:21] ppocr INFO: DetResizeForTest :
[2023/11/06 16:26:21] ppocr INFO: image_shape : [736, 1280]
[2023/11/06 16:26:21] ppocr INFO: NormalizeImage :
[2023/11/06 16:26:21] ppocr INFO: mean : [0.485, 0.456, 0.406]
[2023/11/06 16:26:21] ppocr INFO: order : hwc
[2023/11/06 16:26:21] ppocr INFO: scale : 1./255.
[2023/11/06 16:26:21] ppocr INFO: std : [0.229, 0.224, 0.225]
[2023/11/06 16:26:21] ppocr INFO: ToCHWImage : None
[2023/11/06 16:26:21] ppocr INFO: KeepKeys :
[2023/11/06 16:26:21] ppocr INFO: keep_keys : ['image', 'shape', 'polys', 'ignore_tags']
[2023/11/06 16:26:21] ppocr INFO: loader :
[2023/11/06 16:26:21] ppocr INFO: batch_size_per_card : 1
[2023/11/06 16:26:21] ppocr INFO: drop_last : False
[2023/11/06 16:26:21] ppocr INFO: num_workers : 8
[2023/11/06 16:26:21] ppocr INFO: shuffle : False
[2023/11/06 16:26:21] ppocr INFO: use_shared_memory : False
[2023/11/06 16:26:21] ppocr INFO: Global :
[2023/11/06 16:26:21] ppocr INFO: cal_metric_during_train : False
[2023/11/06 16:26:21] ppocr INFO: checkpoints : None
[2023/11/06 16:26:21] ppocr INFO: distributed : False
[2023/11/06 16:26:21] ppocr INFO: epoch_num : 1200
[2023/11/06 16:26:21] ppocr INFO: eval_batch_step : [0, 2000]
[2023/11/06 16:26:21] ppocr INFO: infer_img : doc/imgs_en/img_10.jpg
[2023/11/06 16:26:21] ppocr INFO: log_smooth_window : 20
[2023/11/06 16:26:21] ppocr INFO: pretrained_model : ./pretrain_models/MobileNetV3_large_x0_5_pretrained
[2023/11/06 16:26:21] ppocr INFO: print_batch_step : 1
[2023/11/06 16:26:21] ppocr INFO: save_epoch_step : 1200
[2023/11/06 16:26:21] ppocr INFO: save_inference_dir : None
[2023/11/06 16:26:21] ppocr INFO: save_model_dir : ./output/db_mv3/
[2023/11/06 16:26:21] ppocr INFO: save_res_path : ./output/det_db/predicts_db.txt
[2023/11/06 16:26:21] ppocr INFO: test_performance : False
[2023/11/06 16:26:21] ppocr INFO: use_gpu : True
[2023/11/06 16:26:21] ppocr INFO: use_visualdl : False
[2023/11/06 16:26:21] ppocr INFO: use_xpu : False
[2023/11/06 16:26:21] ppocr INFO: Loss :
[2023/11/06 16:26:21] ppocr INFO: alpha : 5
[2023/11/06 16:26:21] ppocr INFO: balance_loss : True
[2023/11/06 16:26:21] ppocr INFO: beta : 10
[2023/11/06 16:26:21] ppocr INFO: main_loss_type : DiceLoss
[2023/11/06 16:26:21] ppocr INFO: name : DBLoss
[2023/11/06 16:26:21] ppocr INFO: ohem_ratio : 3
[2023/11/06 16:26:21] ppocr INFO: Metric :
[2023/11/06 16:26:21] ppocr INFO: main_indicator : hmean
[2023/11/06 16:26:21] ppocr INFO: name : DetMetric
[2023/11/06 16:26:21] ppocr INFO: Optimizer :
[2023/11/06 16:26:21] ppocr INFO: beta1 : 0.9
[2023/11/06 16:26:21] ppocr INFO: beta2 : 0.999
[2023/11/06 16:26:21] ppocr INFO: lr :
[2023/11/06 16:26:21] ppocr INFO: learning_rate : 0.001
[2023/11/06 16:26:21] ppocr INFO: name : Adam
[2023/11/06 16:26:21] ppocr INFO: regularizer :
[2023/11/06 16:26:21] ppocr INFO: factor : 0
[2023/11/06 16:26:21] ppocr INFO: name : L2
[2023/11/06 16:26:21] ppocr INFO: PostProcess :
[2023/11/06 16:26:21] ppocr INFO: box_thresh : 0.6
[2023/11/06 16:26:21] ppocr INFO: max_candidates : 1000
[2023/11/06 16:26:21] ppocr INFO: name : DBPostProcess
[2023/11/06 16:26:21] ppocr INFO: thresh : 0.3
[2023/11/06 16:26:21] ppocr INFO: unclip_ratio : 1.5
[2023/11/06 16:26:21] ppocr INFO: Train :
[2023/11/06 16:26:21] ppocr INFO: dataset :
[2023/11/06 16:26:21] ppocr INFO: data_dir : ./train_data/icdar2015/text_localization/
[2023/11/06 16:26:21] ppocr INFO: label_file_list : ['./train_data/icdar2015/text_localization/train_icdar2015_label.txt']
[2023/11/06 16:26:21] ppocr INFO: name : SimpleDataSet
[2023/11/06 16:26:21] ppocr INFO: ratio_list : [1.0]
[2023/11/06 16:26:21] ppocr INFO: transforms :
[2023/11/06 16:26:21] ppocr INFO: DecodeImage :
[2023/11/06 16:26:21] ppocr INFO: channel_first : False
[2023/11/06 16:26:21] ppocr INFO: img_mode : BGR
[2023/11/06 16:26:21] ppocr INFO: DetLabelEncode : None
[2023/11/06 16:26:21] ppocr INFO: IaaAugment :
[2023/11/06 16:26:21] ppocr INFO: augmenter_args :
[2023/11/06 16:26:21] ppocr INFO: args :
[2023/11/06 16:26:21] ppocr INFO: p : 0.5
[2023/11/06 16:26:21] ppocr INFO: type : Fliplr
[2023/11/06 16:26:21] ppocr INFO: args :
[2023/11/06 16:26:21] ppocr INFO: rotate : [-10, 10]
[2023/11/06 16:26:21] ppocr INFO: type : Affine
[2023/11/06 16:26:21] ppocr INFO: args :
[2023/11/06 16:26:21] ppocr INFO: size : [0.5, 3]
[2023/11/06 16:26:21] ppocr INFO: type : Resize
[2023/11/06 16:26:21] ppocr INFO: EastRandomCropData :
[2023/11/06 16:26:21] ppocr INFO: keep_ratio : True
[2023/11/06 16:26:21] ppocr INFO: max_tries : 50
[2023/11/06 16:26:21] ppocr INFO: size : [640, 640]
[2023/11/06 16:26:21] ppocr INFO: MakeBorderMap :
[2023/11/06 16:26:21] ppocr INFO: shrink_ratio : 0.4
[2023/11/06 16:26:21] ppocr INFO: thresh_max : 0.7
[2023/11/06 16:26:21] ppocr INFO: thresh_min : 0.3
[2023/11/06 16:26:21] ppocr INFO: MakeShrinkMap :
[2023/11/06 16:26:21] ppocr INFO: min_text_size : 8
[2023/11/06 16:26:21] ppocr INFO: shrink_ratio : 0.4
[2023/11/06 16:26:21] ppocr INFO: NormalizeImage :
[2023/11/06 16:26:21] ppocr INFO: mean : [0.485, 0.456, 0.406]
[2023/11/06 16:26:21] ppocr INFO: order : hwc
[2023/11/06 16:26:21] ppocr INFO: scale : 1./255.
[2023/11/06 16:26:21] ppocr INFO: std : [0.229, 0.224, 0.225]
[2023/11/06 16:26:21] ppocr INFO: ToCHWImage : None
[2023/11/06 16:26:21] ppocr INFO: KeepKeys :
[2023/11/06 16:26:21] ppocr INFO: keep_keys : ['image', 'threshold_map', 'threshold_mask', 'shrink_map', 'shrink_mask']
[2023/11/06 16:26:21] ppocr INFO: loader :
[2023/11/06 16:26:21] ppocr INFO: batch_size_per_card : 30
[2023/11/06 16:26:21] ppocr INFO: drop_last : False
[2023/11/06 16:26:21] ppocr INFO: num_workers : 8
[2023/11/06 16:26:21] ppocr INFO: shuffle : True
[2023/11/06 16:26:21] ppocr INFO: use_shared_memory : False
[2023/11/06 16:26:21] ppocr INFO: profiler_options : None
[2023/11/06 16:26:21] ppocr INFO: train with paddle 2.5.1 and device Place(gpu:0)
[2023/11/06 16:26:21] ppocr INFO: Initialize indexs of datasets:['./train_data/icdar2015/text_localization/train_icdar2015_label.txt']
[2023/11/06 16:26:21] ppocr INFO: Initialize indexs of datasets:['./train_data/icdar2015/text_localization/test_icdar2015_label.txt']
W1106 16:26:21.977221 1275 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 12.2, Runtime API Version: 12.0
W1106 16:26:21.978426 1275 gpu_resources.cc:149] device: 0, cuDNN Version: 8.9.
[2023/11/06 16:26:22] ppocr INFO: load pretrain successful from ./pretrain_models/MobileNetV3_large_x0_5_pretrained
[2023/11/06 16:26:22] ppocr INFO: train dataloader has 34 iters
[2023/11/06 16:26:22] ppocr INFO: valid dataloader has 500 iters
[2023/11/06 16:26:22] ppocr INFO: During the training process, after the 0th iteration, an evaluation is run every 2000 iterations
```


No response

onecatcn commented 11 months ago

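The community has reported that the dataloader is re-initialized before each epoch, and that training behaves normally once this re-initialization is bypassed. Please check whether this is the same problem. See this issue for reference: https://github.com/PaddlePaddle/PaddleOCR/issues/11160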
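One way to check for per-epoch re-initialization cost is to time how long each epoch takes to deliver its first batch. The sketch below is illustrative only: `time_epoch_startup` is a hypothetical helper, and `loader` is assumed to be any re-iterable object (for example a dataloader instance) whose `iter()` call starts a fresh epoch.

```python
import time


def time_epoch_startup(loader, epochs=3):
    """Return, per epoch, the seconds from iter(loader) to the first batch."""
    startup = []
    for _ in range(epochs):
        start = time.perf_counter()
        it = iter(loader)   # worker start-up, if any, is triggered around here
        next(it, None)      # wait for the first batch (None if loader is empty)
        startup.append(time.perf_counter() - start)
        for _ in it:        # drain the remaining batches of this epoch
            pass
    return startup


if __name__ == "__main__":
    # Stand-in loader; replace with the real train dataloader.
    fake_loader = [list(range(4)) for _ in range(10)]
    for epoch, seconds in enumerate(time_epoch_startup(fake_loader)):
        print(f"epoch {epoch}: first batch after {seconds * 1e3:.3f} ms")
```

If the start-up time is large and roughly constant across epochs, worker re-creation is a plausible contributor. Whether an option such as `persistent_workers` is available on `paddle.io.DataLoader` depends on the installed Paddle version and should be confirmed against its documentation.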

caizhi-mt commented 11 months ago

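> The community has reported that the dataloader is re-initialized before each epoch, and that training behaves normally once this re-initialization is bypassed. Please check whether this is the same problem. See this issue for reference: PaddlePaddle/PaddleOCR#11160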

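This is not the same problem as the one described in that issue. I tested the workaround from that issue, and single-GPU performance still fluctuates considerably.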
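To separate dataloader stalls from compute-side stalls within an epoch, each step can be split into "time waiting for the batch" and "time running the step". This is a minimal sketch under stated assumptions: `split_step_times`, `summarize`, `step_fn`, and `sync_fn` are hypothetical names, and `sync_fn` stands for whatever device synchronization call the framework provides (for Paddle on GPU, `paddle.device.cuda.synchronize` would be a candidate).

```python
import statistics
import time


def split_step_times(loader, step_fn, sync_fn=None, max_steps=None):
    """Time each step as (data wait, compute) and return both lists."""
    data_wait, compute = [], []
    it = iter(loader)
    while max_steps is None or len(compute) < max_steps:
        t0 = time.perf_counter()
        try:
            batch = next(it)        # blocks while the loader prepares a batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)              # forward/backward/optimizer step
        if sync_fn is not None:
            sync_fn()               # wait for asynchronous device work
        t2 = time.perf_counter()
        data_wait.append(t1 - t0)
        compute.append(t2 - t1)
    return data_wait, compute


def summarize(times):
    """Mean, max, and population standard deviation of a list of timings."""
    return {
        "mean": statistics.mean(times),
        "max": max(times),
        "stdev": statistics.pstdev(times),
    }


if __name__ == "__main__":
    # Stand-in loader and step; replace with the real ones.
    waits, steps = split_step_times(range(20), lambda b: sum(range(10000)))
    print("data wait:", summarize(waits))
    print("compute:  ", summarize(steps))
```

A large spread in the data-wait timings points toward the dataloader, while a large spread in the compute timings with synchronization enabled points toward the compute path or the host-side gaps around it.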