PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0

PP-YOLO training is extremely slow #2680

Closed shuxsu closed 2 years ago

shuxsu commented 3 years ago

max_iters=13000, batch size=12. I have tried worker_num=1, 2, 4, 8, and 16 with no meaningful difference in time, and bufsize=1 and 2 also made little difference.

13000 iterations take 20 hours. Paddle version 1.8.4, PaddleDetection version 2.0.1, running the project on an AI Studio V100. The training set has 260 images, each around 1440x1440.

shuxsu commented 3 years ago

Why do two runs with identical configuration take such different times? With all settings the same, one run finishes in 10 hours on 2000 training images, while the other takes 2 days on 300 training images. What could cause the smaller dataset to train more slowly? Iteration count, batch size, and everything else are identical.

liuhuiCNN commented 3 years ago

Hi, could you confirm that both runs used the GPU? Based on your description, the gap in training speed sounds far too large to be reasonable.

shuxsu commented 3 years ago

Both were trained on AI Studio GPUs with identical configuration; only the datasets differ.

shuxsu commented 3 years ago

> Hi, could you confirm that both runs used the GPU? Based on your description, the gap in training speed sounds far too large to be reasonable.

Can you explain why it is this slow?

shuxsu commented 3 years ago
architecture: YOLOv3
use_gpu: true
max_iters: 80000
log_iter: 20
save_dir: output
snapshot_iter: 4000
metric: COCO
pretrain_weights: https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams
weights: output/ppyolo/model_final
num_classes: 2
use_fine_grained_loss: true
use_ema: true
ema_decay: 0.9998

YOLOv3:
  backbone: ResNet
  yolo_head: YOLOv3Head
  use_fine_grained_loss: true

ResNet:
  norm_type: sync_bn
  freeze_at: 0
  freeze_norm: false
  norm_decay: 0.
  depth: 50
  feature_maps: [3, 4, 5]
  variant: d
  dcn_v2_stages: [5]

YOLOv3Head:
  anchor_masks: [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
  anchors: [[10, 13], [16, 30], [33, 23],
            [30, 61], [62, 45], [59, 119],
            [116, 90], [156, 198], [373, 326]]
  norm_decay: 0.
  coord_conv: true
  iou_aware: true
  iou_aware_factor: 0.4
  scale_x_y: 1.05
  spp: true
  yolo_loss: YOLOv3Loss
  nms: MatrixNMS
  drop_block: true

YOLOv3Loss:
  ignore_thresh: 0.7
  scale_x_y: 1.05
  label_smooth: false
  use_fine_grained_loss: true
  iou_loss: DiouLossYolo
  iou_aware_loss: IouAwareLoss

DiouLossYolo:
  loss_weight: 1.0
  max_height: 608
  max_width: 608

IouAwareLoss:
  loss_weight: 1.0
  max_height: 608
  max_width: 608

MatrixNMS:
    background_label: -1
    keep_top_k: 100
    normalized: false
    score_threshold: 0.01
    post_threshold: 0.01

LearningRate:
  base_lr: 0.000125
  schedulers:
  - !PiecewiseDecay
    gamma: 0.1
    milestones:
    - 60000
    - 70000
  - !LinearWarmup
    start_factor: 0.
    steps: 1000

OptimizerBuilder:
  optimizer:
    momentum: 0.9
    type: Momentum
  regularizer:
    factor: 0.0005
    type: L2

_READER_: 'ppyolo_g_reader.yml'
shuxsu commented 3 years ago
TrainReader:
  inputs_def:
    fields: ['image', 'gt_bbox', 'gt_class', 'gt_score']
    num_max_boxes: 50
  dataset:
    !COCODataSet
      image_dir: train
      anno_path: annotations/instance_train.json
      dataset_dir: /home/aistudio/work/green
      with_background: false
  sample_transforms:
    - !DecodeImage
      to_rgb: True
      with_mixup: True
    - !MixupImage
      alpha: 1.5
      beta: 1.5
    - !RandomFlipImage
      is_normalized: false
    - !NormalizeBox {}
    - !PadBox
      num_max_boxes: 50
    - !BboxXYXY2XYWH {}
  batch_transforms:
  - !RandomShape
    sizes: [320, 352, 384, 416, 448, 480, 512, 544, 576, 608]
    random_inter: True
  - !NormalizeImage
    mean: [0.485, 0.456, 0.406]
    std: [0.229, 0.224, 0.225]
    is_scale: True
    is_channel_first: false
  - !Permute
    to_bgr: false
    channel_first: True
  # Gt2YoloTarget is only used when use_fine_grained_loss set as true,
  # this operator will be deleted automatically if use_fine_grained_loss
  # is set as false
  - !Gt2YoloTarget
    anchor_masks: [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
    anchors: [[10, 13], [16, 30], [33, 23],
              [30, 61], [62, 45], [59, 119],
              [116, 90], [156, 198], [373, 326]]
    downsample_ratios: [32, 16, 8]
  batch_size: 4
  shuffle: true
  mixup_epoch: 150
  drop_last: true
  worker_num: 1
  bufsize: 1
  use_process: true

EvalReader:
  inputs_def:
    fields: ['image', 'im_size', 'im_id']
    num_max_boxes: 50
  dataset:
    !COCODataSet
      image_dir: val
      anno_path: annotations/instance_val.json
      dataset_dir: /home/aistudio/work/green
      with_background: false
  sample_transforms:
    - !DecodeImage
      to_rgb: True
    - !ResizeImage
      target_size: 608
      interp: 2
    - !NormalizeImage
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      is_scale: True
      is_channel_first: false
    - !PadBox
      num_max_boxes: 50
    - !Permute
      to_bgr: false
      channel_first: True
  batch_size: 2
  drop_empty: false
  worker_num: 1
  bufsize: 1

TestReader:
  inputs_def:
    image_shape: [3, 608, 608]
    fields: ['image', 'im_size', 'im_id']
  dataset:
    !ImageFolder
      anno_path: /home/aistudio/work/green/annotations/instance_test.json
      with_background: false
  sample_transforms:
    - !DecodeImage
      to_rgb: True
    - !ResizeImage
      target_size: 608
      interp: 2
    - !NormalizeImage
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      is_scale: True
      is_channel_first: false
    - !Permute
      to_bgr: false
      channel_first: True
  batch_size: 1
shuxsu commented 3 years ago

batch size=1 is a bit faster, but the results are much worse; with batch size=4:

[screenshot: training log]

How can I speed this up?

yghstill commented 3 years ago

@shuxsu You describe the datasets as identical, yet one run is slow and one is fast. Did the image sizes change between the two runs? The larger the images, the more time data preprocessing takes. To judge training speed, look at the ips metric: the higher the ips, the faster the training. Looking at your config file, try setting worker_num in TrainReader to 8 or 16 and bufsize to 8 or 16, and see whether that speeds things up.

shuxsu commented 3 years ago

ips is 6.03316 images/sec. The thing is, when I train on Cityscapes, whose images are large (1024x2048), the time is reasonable, but when I train on my custom dataset, whose images are all smaller than Cityscapes', it takes even longer. Also, isn't the total batch size equal to bufsize x batch size? If my bufsize is larger, then the total batch size is larger, so wouldn't that be more time-consuming?


yghstill commented 3 years ago

bufsize is only the buffer size for data preprocessing; it is related to the batch size and can be set equal to the total batch size. The total batch size = batch_size * number of GPU cards. You can run top -c to check CPU utilization while training; PP-YOLO training involves heavy data preprocessing and is inherently time-consuming.
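The arithmetic above also explains the original report: with the numbers posted in this thread (260 images, batch size 12, 13000 iterations), the dataset is looped hundreds of times. A quick sketch (function names are illustrative, not PaddleDetection APIs):

```python
# Sketch: effective batch size and the number of epochs implied by an
# iteration-based schedule like PP-YOLO's. Helper names are hypothetical.

def total_batch_size(per_card_batch: int, num_gpus: int) -> int:
    # Per the maintainer's reply: total batch = batch_size per card * GPU cards
    return per_card_batch * num_gpus

def epochs_from_iters(max_iters: int, dataset_size: int,
                      per_card_batch: int, num_gpus: int = 1) -> float:
    # Each iteration consumes one total batch of images.
    images_seen = max_iters * total_batch_size(per_card_batch, num_gpus)
    return images_seen / dataset_size

# Numbers from this thread: 260 training images, batch_size=12, max_iters=13000
print(total_batch_size(12, 1))            # 12
print(epochs_from_iters(13000, 260, 12))  # 600.0 -> the small set is looped 600 times
```

Note that bufsize does not appear anywhere in this calculation: it only controls how many preprocessed batches are buffered ahead of the GPU, not how many images each step consumes.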

qingqing01 commented 3 years ago

@shuxsu Are the iteration counts the same for the two runs? The default PP-YOLO config uses a very large number of iterations for COCO; for your own data you can reduce it based on how many epochs you actually need. Also, if the data can be made public, post it and we will test it ourselves.
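Following that suggestion, one way to shrink the schedule is to scale max_iters down and keep the PiecewiseDecay milestones at the same relative positions (~75% and ~88% of max_iters, as in the posted config). The numbers below are illustrative, not a tuned recipe:

```yaml
# Hypothetical adjustment for a ~260-image dataset (values are examples only)
max_iters: 8000        # down from the COCO default of 80000
LearningRate:
  base_lr: 0.000125
  schedulers:
  - !PiecewiseDecay
    gamma: 0.1
    milestones: [6000, 7000]   # same ~75% / ~88% positions as 60000/70000
  - !LinearWarmup
    start_factor: 0.
    steps: 500
```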

shuxsu commented 3 years ago

All configuration is exactly the same; only the datasets differ. Links to the custom dataset and the Cityscapes dataset: cityscape: https://aistudio.baidu.com/aistudio/datasetdetail/79240 green: https://aistudio.baidu.com/aistudio/datasetdetail/80823


heavengate commented 3 years ago

I see that both TrainReader.worker_num and TrainReader.bufsize are set to 1. YOLO-family models have fairly complex preprocessing and depend heavily on multi-process concurrency to keep up. Try increasing worker_num and bufsize: worker_num can be raised as far as the number of CPU cores on your machine, and bufsize can be set to the number of training cards or twice that. Also, for a small dataset, we suggest disabling image mixup by setting TrainReader.mixup_epoch to 0.
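Applied to the reader config posted earlier in this thread, that advice amounts to overrides along these lines (the exact values depend on your CPU core count and number of GPUs; these are examples):

```yaml
# Illustrative TrainReader overrides based on the advice above
TrainReader:
  worker_num: 8      # up to the number of CPU cores on the machine
  bufsize: 8         # the number of training cards, or twice that
  use_process: true  # keep multi-process preprocessing enabled
  mixup_epoch: 0     # disable mixup for small datasets (was 150)
```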

617475jordan commented 3 years ago

I noticed this too: training with ppyolo_r50vd_dcn_voc.yml takes only three or four hours, while ppyolov2_r50vd_dcn_voc.yml takes more than a day, and I do not know why. My machine is an i9-10900K with an RTX 3090 and 128 GB of RAM.

ShawnXsw commented 3 years ago

I hit this on the v2.3 branch with ppyolo-tiny as well: preprocessing is inefficient and the GPU sits idle for long stretches. The disks are behind a RAID card, so it should not be an I/O stall; the CPU is a Xeon Gold and its utilization is also very low, with worker_num: 8. Occasionally both CPU and GPU utilization stay at 0% for long periods.

paddle-bot-old[bot] commented 2 years ago

Since this issue has not been updated for more than three months, it will be closed. If it is not solved or there is a follow-up, please reopen it at any time and we will continue to follow up. It is recommended to pull and try the latest code first.

zhoushuang66 commented 1 year ago

I am using the PP-YOLO on the release/2.0-rc branch. In configs/ppyolo/ppyolo_reader.yml, when I set batch_size=4 the estimated training time is 40h, while with the default batch_size=24 it is 11x24h. Why is the gap in training time so large?

bnbncch commented 4 months ago

I initially used PaddleDetection release 2.5.2 with paddle 2.2.2-gpu, CUDA 11.3, and Ubuntu 20.04, and ran into very slow training with the GPU mostly idle. While reading other issues later, someone pointed out that it was a CUDA version problem. With the same configuration on CUDA 10.2 and Ubuntu 18.04, training indeed became much faster.