PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra-lightweight OCR system; supports recognition of 80+ languages; provides data annotation and synthesis tools; supports training and deployment on server, mobile, embedded, and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

A bottleneck occurs when loading data during the first epoch (PP-OCRv3) #6800

Closed. cgyeon-muhayu closed this issue 1 year ago.

cgyeon-muhayu commented 2 years ago

First of all, thank you for your awesome project! I ran into a problem while training the rec model, so I am asking for your help.

When training with a very large dataset, a bottleneck occurs while data is being loaded during the first epoch. The image below is a graph of GPU utilization over the first epoch: [image: GPU utilization during the first epoch with a large dataset]. You can see that the GPU is used normally during evaluation, but GPU utilization drops again during training. After the first epoch, once all the data has been loaded into memory, the GPU runs normally: [image: GPU utilization after the first epoch]. If the dataset is small, GPU utilization is good from the start: [image: GPU utilization with a small dataset].

I would like to resolve this bottleneck. Do you have any suggestions for how to fix it?

Finally, I have attached the contents of my YAML file. Thanks!

Global:
  debug: false
  use_gpu: true
  epoch_num: 10
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: {my_save_model_dir}
  save_epoch_step: 2
  eval_batch_step: [0, 2000]
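  # compute the evaluation metric on training batches during training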
  cal_metric_during_train: true
  pretrained_model:
  checkpoints:
  save_inference_dir:
  use_visualdl: false
  infer_img: 
  character_dict_path: {my_character_dict_path}
  max_text_length: &max_text_length 25
  infer_mode: false
  use_space_char: false
  distributed: true
  save_res_path: {my_save_res_path}

Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine
    learning_rate: 0.001
    warmup_epoch: 5
  regularizer:
    name: L2
    factor: 3.0e-05

Architecture:
  model_type: rec
  algorithm: SVTR
  Transform:
  Backbone:
    name: MobileNetV1Enhance
    scale: 0.5
    last_conv_stride: [1, 2]
    last_pool_type: avg
  Head:
    name: MultiHead
    head_list:
      - CTCHead:
          Neck:
            name: svtr
            dims: 64
            depth: 2
            hidden_dims: 120
            use_guide: True
          Head:
            fc_decay: 0.00001
      - SARHead:
          enc_dim: 512
          max_text_length: *max_text_length

Loss:
  name: MultiLoss
  loss_config_list:
    - CTCLoss:
    - SARLoss:

PostProcess:  
  name: CTCLabelDecode

Metric:
  name: RecMetric
  main_indicator: acc
  ignore_space: False

Train:
  dataset:
    name: SimpleDataSet
    data_dir: {my_data_dir}
    ext_op_transform_idx: 1
    label_file_list:
    - {my_label_file_list}
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
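    # RecConAug: with probability prob, concatenates up to ext_data_num extra samples (images and labels) into one training sample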
    - RecConAug:
        prob: 0.5
        ext_data_num: 2
        image_shape: [48, 320, 3]
    - RecAug:
    - MultiLabelEncode:
    - RecResizeImg:
        image_shape: [3, 48, 320]
    - KeepKeys:
        keep_keys:
        - image
        - label_ctc
        - label_sar
        - length
        - valid_ratio
  loader:
    shuffle: true
    batch_size_per_card: 32
    drop_last: true
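    # number of worker subprocesses used to load and preprocess data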
    num_workers: 4

Eval:
  dataset:
    name: SimpleDataSet
    data_dir: {my_data_dir}
    label_file_list:
    - {my_label_file_list}
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    - MultiLabelEncode:
    - RecResizeImg:
        image_shape: [3, 48, 320]
    - KeepKeys:
        keep_keys:
        - image
        - label_ctc
        - label_sar
        - length
        - valid_ratio
  loader:
    shuffle: false
    drop_last: false
    batch_size_per_card: 32
    num_workers: 4

littletomatodonkey commented 2 years ago

You can:

  1. Pull the latest code so that the dataloader is not re-initialized at every epoch (link).
  2. Set Global.cal_metric_during_train to False.
  3. Disable RecConAug to speed up data loading (see the config sketch below).
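
For reference, here is a minimal sketch of how items 2 and 3 would look in the config above (all other keys unchanged; item 1 is a code update rather than a config change):

Global:
  cal_metric_during_train: false  # skip metric computation on training batches

Train:
  dataset:
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    # RecConAug removed to cut per-sample CPU work in the dataloader
    - RecAug:
    - MultiLabelEncode:
    - RecResizeImg:
        image_shape: [3, 48, 320]
    - KeepKeys:
        keep_keys:
        - image
        - label_ctc
        - label_sar
        - length
        - valid_ratio

Increasing loader.num_workers (and, if memory allows, batch_size_per_card) is another common way to hide data-loading latency, though it is not part of the reply above.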