PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

Config of SVTR-CPPD (large) #11198

Closed: trantuankhoi closed this 2 months ago

trantuankhoi commented 10 months ago

I have successfully trained the SVTR-CPPD (base version) model and achieved excellent results. However, I would like to improve the metrics further, so I am experimenting with SVTR-large as the backbone. I have tried searching for and applying other SVTR-large (original) configs like here, but the accuracy is still 0% after a period of training. Over the same training time, SVTR-base reached about 10%. I think the problem lies in the head config. If anyone has tried SVTR-CPPD (large) before, could you let me know whether any of my configs are wrong?

My config:

```yaml
Global:
  use_gpu: True
  epoch_num: 20
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: /HDD/kedanhcaptraitim/CPPD/test
  save_epoch_step: 1
  # evaluation is run every 5000 iterations after the 0th iteration
  eval_batch_step: [0, 5000]
  cal_metric_during_train: True
  pretrained_model:
  checkpoints: /HDD/kedanhcaptraitim/CPPD/cppd_v1.2.0/best_accuracy
  save_inference_dir: ./resources/
  rec_model_dir:
  use_visualdl: True
  visualdl_file_name: vdlrecords
  infer_img: doc/imgs_words_en/word_10.png
  # for data or label process
  character_dict_path: ppocr/utils/dict/vietnamese_dict.txt
  character_type: korean
  max_text_length: 128
  infer_mode: False
  use_space_char: True
  save_res_path: ./output/rec/predicts_svtr_cppd_base.txt

Optimizer:
  name: AdamW
  beta1: 0.9
  beta2: 0.99
  epsilon: 1.e-8
  weight_decay: 0.05
  no_weight_decay_name: norm pos_embed char_node_embed pos_node_embed char_pos_embed vis_pos_embed
  one_dim_param_no_weight_decay: True
  lr:
    name: Cosine
    learning_rate: 0.000375 # 4gpus 256bs
    warmup_epoch: 4

Architecture:
  model_type: rec
  algorithm: CPPD
  Transform:
  Backbone:
    name: SVTRNet
    img_size: [32, 768]
    patch_merging: 'Conv'
    embed_dim: [192, 256, 512]
    depth: [6, 6, 9]
    num_heads: [6, 8, 16]
    mixer: ['Conv', 'Conv', 'Conv', 'Conv', 'Conv', 'Conv', 'Conv', 'Conv', 'Conv', 'Conv', 'Global', 'Global', 'Global', 'Global', 'Global', 'Global', 'Global', 'Global', 'Global', 'Global', 'Global']
    local_mixer: [[7, 11], [7, 11], [7, 11]]
    last_stage: False
    prenorm: True
  Head:
    name: CPPDHead
    dim: 512
    vis_seq: 384
    num_layer: 3
    max_len: 128

Loss:
  name: CPPDLoss
  ignore_index: &ignore_index 100 # must be greater than the number of character classes
  smoothing: True
  sideloss_weight: 1.0

PostProcess:
  name: CPPDLabelDecode

Metric:
  name: FschoolMetricEvaluation
  main_indicator: acc
  # Save prediction's log
  prediction_log: True # save JSON log
  # DOTS AND COMMAS config
  dots_and_commas: False
  max_of_difference_character: 2 # at most 2 incorrect characters allowed
  # FILL THE BLANK config
  fill_the_blank: True
  threshold_variance: 0.5 # the prediction-to-target string length ratio must lie within [threshold_variance, 1]

Train:
  dataset:
    name: LMDBDataSet
    data_dir: /HDD/kedanhcaptraitim/Data/train/train_lmdb_v2.21.0
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - CPPDLabelEncode: # Class handling label
          ignore_index: *ignore_index
      - SVTRRecResizeImg:
          image_shape: [3, 32, 768]
          padding: False
      - KeepKeys:
          keep_keys: ['image', 'label', 'label_node', 'length'] # dataloader will return list in this order
  loader:
    shuffle: True
    batch_size_per_card: 8
    drop_last: True
    num_workers: 2
    use_shared_memory: True

Eval:
  dataset:
    name: SimpleDataSet
    data_dir: /HDD/kedanhcaptraitim/Data/test/private_test/FQA_v1_final_19.10
    label_file_list: [ "/HDD/kedanhcaptraitim/Data/test/private_test/FQA_v1_final_19.10/test.txt" ]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - CPPDLabelEncode: # Class handling label
          ignore_index: *ignore_index
      - SVTRRecResizeImg:
          image_shape: [3, 32, 768]
          padding: True
      - KeepKeys:
          keep_keys: ['image', 'label', 'label_node','length'] # dataloader will return list in this order
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 8
    num_workers: 2
    use_shared_memory: True
```
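
As a shape sanity check on the config above: in SVTRNet the patch embedding downsamples H and W by 4, and each of the two `Conv` patch-merging steps halves the height again, so the head's visual sequence length should be (H/16) x (W/4). This is my reading of the backbone code rather than anything stated in the thread, but it does reproduce the `vis_seq: 384` above:

```yaml
# Sketch: how vis_seq follows from img_size in SVTRNet (assumed downsampling:
# patch embedding /4 in H and W, then two patch-merging steps of /2 in H only).
Backbone:
  img_size: [32, 768] # -> feature map (32/16) x (768/4) = 2 x 192
Head:
  name: CPPDHead
  vis_seq: 384 # 2 * 192, the flattened visual sequence length
```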
trantuankhoi commented 10 months ago

Hi @Topdu, could you take a look here, please? Thanks in advance.

Topdu commented 10 months ago

Please check whether the following configs are right:

```yaml
learning_rate: 0.000375 # if using a small batch size, the lr should be lowered

ignore_index: &ignore_index 100 # must be greater than the number of character classes

- SVTRRecResizeImg:
    image_shape: [3, 32, 768]
    padding: False # should be True, to match the eval config

local_mixer: [[7, 11], [7, 11], [7, 11]] # should be [[5, 5], [5, 5], [5, 5]] if using Conv mixer
```
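
A note on that last point: as far as I can tell from the SVTRNet code (an assumption, not stated in the thread), `local_mixer` means different things depending on each stage's mixer, which is why the [7, 11] windows from attention-based SVTR configs don't carry over to an all-Conv front end:

```yaml
# Assumed semantics of local_mixer per stage (from reading SVTRNet):
#   mixer 'Conv'  -> local_mixer is the depthwise-conv kernel size, e.g. [5, 5]
#   mixer 'Local' -> local_mixer is the local-attention window, e.g. [7, 11]
local_mixer: [[5, 5], [5, 5], [5, 5]]
```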

trantuankhoi commented 10 months ago

Thanks for your reply and suggestions

My batch size is 96, so I will set lr = 0.000046875 (based on the paper). As for my language dictionary, it has 233 characters, so I will set ignore_index: &ignore_index 234 from now on. That said, I also used ignore_index 100 for my base config and it worked, although I don't know why.
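
For the record, the arithmetic behind that number appears to be the linear lr scaling rule; a reference lr of 0.0005 at a total batch size of 1024 is an assumption on my part, but it reproduces the stated figure exactly:

```yaml
# Linear lr scaling sketch. Assumption: reference lr 0.0005 at total batch
# size 1024 (4 GPUs x 256), scaled down linearly for a total batch of 96:
#   0.0005 * 96 / 1024 = 0.000046875
Optimizer:
  lr:
    name: Cosine
    learning_rate: 0.000046875 # batch size 96
    warmup_epoch: 4
```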

The other configs look correct to me. Are there any notes on the head config that I might have missed?

Topdu commented 10 months ago

The head config is correct. The lr may be too small and, from experience, should be set to no less than 0.0001. Also, as noted above: local_mixer: [[7, 11], [7, 11], [7, 11]] should be [[5, 5], [5, 5], [5, 5]] if using the Conv mixer.
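
Putting both replies together, the concrete edits to the large config would look roughly like this (a sketch assembled from the suggestions in this thread, not an official config):

```yaml
# Sketch: the thread's suggested fixes applied together. ignore_index 234
# assumes the 233-character dictionary mentioned above.
Optimizer:
  lr:
    learning_rate: 0.0001 # scaled down, but no less than 1e-4 per the advice above

Architecture:
  Backbone:
    local_mixer: [[5, 5], [5, 5], [5, 5]] # [5, 5] kernels for the Conv mixer

Loss:
  ignore_index: &ignore_index 234 # greater than the number of character classes

Train:
  dataset:
    transforms:
      - SVTRRecResizeImg:
          image_shape: [3, 32, 768]
          padding: True # match the eval transform
```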

trantuankhoi commented 10 months ago

Thanks a lot. I will experiment one more time following your suggestions.

trantuankhoi commented 3 months ago

Hi @Topdu

I'm currently training the SVTR_CPPD (base) model and noticed that the train/loss_edge metric is significantly higher than the train/loss_node metric. I'm not sure if this is normal behavior, and I was curious to see if you observed this phenomenon in your experiments as well.

[screenshot: VisualDL training curves, with train/loss_edge noticeably higher than train/loss_node]
Topdu commented 3 months ago

Yes! We see this phenomenon in our experiments as well.