PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

Recognition accuracy won't increase beyond 50% #13938

Open aspaul20 opened 1 month ago

aspaul20 commented 1 month ago

🔎 Search before asking

🐛 Bug (Problem description)

I am trying to train the ch_PP-OCRv4_rec.yml recognition model on a custom dataset with very long images: up to 2480 pixels wide and roughly 50 pixels tall. Many of the images also contain long strings, up to 135 characters each. When training with the default configuration and max_text_length set to 140, accuracy plateaus at about 50% and the model never improves further.

From some research I learned that when max_text_length is increased, the image width should be increased as well so that the resized images don't become too blurry. I also turned off the extra augmentations RecAug and RecConAug because they were warping the images and making them unreadable. Performance still does not improve. My config is now as follows (a rough width sanity check is sketched after the config):

Global:
  debug: false
  use_gpu: true
  epoch_num: 250
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./output/waug
  save_epoch_step: 10
  eval_batch_step: [0, 2000]
  cal_metric_during_train: true
  pretrained_model: weights/ch/ch_PP-OCRv4_rec_train/student
  checkpoints:
  save_inference_dir:
  use_visualdl: false
  infer_img: doc/imgs_words/ch/word_1.jpg
  character_dict_path: ppocr/utils/ppocr_keys_v1.txt
  max_text_length: &max_text_length 140
  infer_mode: false
  use_space_char: true
  distributed: true
  save_res_path: ./output/rec/predicts_ppocrv3.txt

Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine
    learning_rate: 0.001
    warmup_epoch: 5
  regularizer:
    name: L2
    factor: 3.0e-05

Architecture:
  model_type: rec
  algorithm: SVTR_LCNet
  Transform:
  Backbone:
    name: PPLCNetV3
    scale: 0.95
  Head:
    name: MultiHead
    head_list:
      - CTCHead:
          Neck:
            name: svtr
            dims: 120
            depth: 2
            hidden_dims: 120
            kernel_size: [1, 3]
            use_guide: True
          Head:
            fc_decay: 0.00001
      - NRTRHead:
          nrtr_dim: 384
          max_text_length: *max_text_length

Loss:
  name: MultiLoss
  loss_config_list:
    - CTCLoss:
    - NRTRLoss:

PostProcess:  
  name: CTCLabelDecode

Metric:
  name: RecMetric
  main_indicator: acc

Train:
  dataset:
    name: MultiScaleDataSet
    ds_width: false
    data_dir: base_sents
    ext_op_transform_idx: 1
    label_file_list:
    - base_sents/base_sents.txt
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    - MultiLabelEncode:
        gtc_encode: NRTRLabelEncode
    - KeepKeys:
        keep_keys:
        - image
        - label_ctc
        - label_gtc
        - length
        - valid_ratio
  sampler:
    name: MultiScaleSampler
    scales: [[640, 32], [640, 48], [640, 64]]
    first_bs: &bs 24
    fix_bs: false
    divided_factor: [8, 16] # w, h
    is_training: True
  loader:
    shuffle: true
    batch_size_per_card: *bs
    drop_last: true
    num_workers: 8
Eval:
  dataset:
    name: SimpleDataSet
    data_dir: base_sents
    label_file_list:
    - base_sents/base_sents.txt
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    - MultiLabelEncode:
        gtc_encode: NRTRLabelEncode
    - RecResizeImg:
        image_shape: [3, 48, 320]
    - KeepKeys:
        keep_keys:
        - image
        - label_ctc
        - label_gtc
        - length
        - valid_ratio
  loader:
    shuffle: false
    drop_last: false
    batch_size_per_card: 2
    num_workers: 4
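
As a rough sanity check on the width choice above (a sketch based on the reasoning in this report, not an official formula; it assumes the roughly 8x horizontal downsampling of the PP-OCR recognition backbones, so verify the exact factor for PPLCNetV3 in your build): with a CTC head, the number of output time steps must be at least the label length, so the resize width puts a hard cap on how many characters can be decoded.

def ctc_time_steps(resize_width, downsample=8):
    # Each CTC output step covers roughly `downsample` input pixels horizontally.
    return resize_width // downsample

def min_width_for_label(max_text_length, downsample=8):
    # CTC needs at least one time step per character (more if characters repeat).
    return max_text_length * downsample

print(ctc_time_steps(640))        # 80 steps for the 640-wide sampler scales
print(min_width_for_label(140))   # 1120 pixels would be needed for max_text_length = 140
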

Current training log (the accuracy does not improve much regardless of how many more epochs run):

[2024/10/02 12:12:08] ppocr INFO: epoch: [31/250], global_step: 830, lr: 0.000975, acc: 0.416666, norm_edit_dis: 0.437640, CTCLoss: 0.020622, NRTRLoss: 1.209766, loss: 1.234407, avg_reader_cost: 0.00011 s, avg_batch_cost: 0.47500 s, avg_samples: 18.0, ips: 37.89484 samples/s, eta: 0:50:12, max_mem_reserved: 7172 MB, max_mem_allocated: 6961 MB

About the dataset: I have ~450 original images, each augmented 100x to produce ~45,000 samples. Results don't change whether I use the augmented dataset or the original one. Even overfitting on this training data would be acceptable, but the model doesn't get there either. Please help.

🏃‍♂️ Environment (Running environment)

    Ubuntu 22.04 
    Python 3.10.14
    paddleocr 2.8.1
    paddlepaddle-gpu 2.6.2

🌰 Minimal Reproducible Example (Minimal demo to reproduce the problem)

python3 tools/train.py -c configs/rec/PP-OCRv4/ch_PP-OCRv4_rec.yml

I cannot share the dataset.

jingsongliujing commented 3 weeks ago

It is suggested to try fine-tuning the model instead of training it from scratch: https://paddlepaddle.github.io/PaddleOCR/en/ppocr/model_train/finetune.html
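
For reference, the fine-tuning workflow in that guide essentially comes down to pointing Global.pretrained_model at the downloaded training weights, either in the YAML (as in the config above) or as a command-line override (a sketch reusing the paths already given in this issue):

python3 tools/train.py -c configs/rec/PP-OCRv4/ch_PP-OCRv4_rec.yml \
    -o Global.pretrained_model=weights/ch/ch_PP-OCRv4_rec_train/student
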

aspaul20 commented 2 weeks ago

@jingsongliujing I believe I am already fine-tuning the model. See this line in the config: pretrained_model: weights/ch/ch_PP-OCRv4_rec_train/student