PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra-lightweight OCR system; supports recognition of 80+ languages; provides data annotation and synthesis tools; supports training and deployment on server, mobile, embedded, and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

A bottleneck occurs when loading data during the first epoch (PP-OCRv3) #6800

Closed. cgyeon-muhayu closed this issue 1 year ago.

cgyeon-muhayu commented 2 years ago

First of all, thank you for your awesome project! I ran into a problem while training the rec model, so I am asking for your help.

When training with a very large dataset, a bottleneck occurs while data is being loaded during the first epoch. The image below is a graph of GPU utilization over the first epoch: [image: GPU utilization during the first epoch with a large dataset]. You can see that the GPU is used normally during evaluation, but GPU utilization drops again during training. After the first epoch, once all the data has been loaded into memory, the GPU runs normally: [image: GPU utilization after the first epoch]. If the dataset is small, GPU utilization is good from the start: [image: GPU utilization with a small dataset].

I would like to resolve this bottleneck. Do you have any suggestions for how to fix it?

Finally, I have attached the contents of my YAML file. Thanks!

Global:
  debug: false
  use_gpu: true
  epoch_num: 10
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: {my_save_model_dir}
  save_epoch_step: 2
  eval_batch_step: [0, 2000]
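  # compute the evaluation metric on training batches during training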
  cal_metric_during_train: true
  pretrained_model:
  checkpoints:
  save_inference_dir:
  use_visualdl: false
  infer_img: 
  character_dict_path: {my_character_dict_path}
  max_text_length: &max_text_length 25
  infer_mode: false
  use_space_char: false
  distributed: true
  save_res_path: {my_save_res_path}

Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine
    learning_rate: 0.001
    warmup_epoch: 5
  regularizer:
    name: L2
    factor: 3.0e-05

Architecture:
  model_type: rec
  algorithm: SVTR
  Transform:
  Backbone:
    name: MobileNetV1Enhance
    scale: 0.5
    last_conv_stride: [1, 2]
    last_pool_type: avg
  Head:
    name: MultiHead
    head_list:
      - CTCHead:
          Neck:
            name: svtr
            dims: 64
            depth: 2
            hidden_dims: 120
            use_guide: True
          Head:
            fc_decay: 0.00001
      - SARHead:
          enc_dim: 512
          max_text_length: *max_text_length

Loss:
  name: MultiLoss
  loss_config_list:
    - CTCLoss:
    - SARLoss:

PostProcess:  
  name: CTCLabelDecode

Metric:
  name: RecMetric
  main_indicator: acc
  ignore_space: False

Train:
  dataset:
    name: SimpleDataSet
    data_dir: {my_data_dir}
    ext_op_transform_idx: 1
    label_file_list:
    - {my_label_file_list}
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
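    # RecConAug: with probability prob, concatenates up to ext_data_num extra samples (images and labels) into one training sample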
    - RecConAug:
        prob: 0.5
        ext_data_num: 2
        image_shape: [48, 320, 3]
    - RecAug:
    - MultiLabelEncode:
    - RecResizeImg:
        image_shape: [3, 48, 320]
    - KeepKeys:
        keep_keys:
        - image
        - label_ctc
        - label_sar
        - length
        - valid_ratio
  loader:
    shuffle: true
    batch_size_per_card: 32
    drop_last: true
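    # number of worker subprocesses used to load and preprocess data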
    num_workers: 4

Eval:
  dataset:
    name: SimpleDataSet
    data_dir: {my_data_dir}
    label_file_list:
    - {my_label_file_list}
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    - MultiLabelEncode:
    - RecResizeImg:
        image_shape: [3, 48, 320]
    - KeepKeys:
        keep_keys:
        - image
        - label_ctc
        - label_sar
        - length
        - valid_ratio
  loader:
    shuffle: false
    drop_last: false
    batch_size_per_card: 32
    num_workers: 4

littletomatodonkey commented 2 years ago

You can:

  1. Pull the latest code so that the dataloader is not re-initialized at every epoch (link).
  2. Set Global.cal_metric_during_train to False.
  3. Disable RecConAug to speed up data loading (see the config sketch below).
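
For reference, here is a minimal sketch of how items 2 and 3 would look in the config above (all other keys unchanged; item 1 is a code update rather than a config change):

Global:
  cal_metric_during_train: false  # skip metric computation on training batches

Train:
  dataset:
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    # RecConAug removed to cut per-sample CPU work in the dataloader
    - RecAug:
    - MultiLabelEncode:
    - RecResizeImg:
        image_shape: [3, 48, 320]
    - KeepKeys:
        keep_keys:
        - image
        - label_ctc
        - label_sar
        - length
        - valid_ratio

Increasing loader.num_workers (and, if memory allows, batch_size_per_card) is another common way to hide data-loading latency, though it is not part of the reply above.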