Topdu / OpenOCR

OpenOCR: A general OCR system with accuracy and efficiency. It supports 24 Scene Text Recognition methods trained from scratch on large-scale real datasets, and the latest methods will continue to be added.
Apache License 2.0

Support for Training with Extremely Long Images (e.g., [48, 800] or Aspect Ratio ≥ 16) #41

Open leduy-it opened 3 days ago

leduy-it commented 3 days ago

First of all, thank you for the outstanding work on this repository. I have been using your implementation to fine-tune a model for recognizing extremely long text in Chinese. My input image dimensions can reach up to [48, 800], with aspect ratios approximately ≥ 16. While your current configuration supports a maximum aspect ratio of 4, I have modified it to handle these longer sequences, and my results are promising with an accuracy of around 90%. However, I aim to boost this accuracy further, ideally above 95%, given that my dataset is relatively straightforward.

Specific Questions:

    - RecResizeImg:
        image_shape: [3, 48, 800]
    - MultiLabelEncode:
        gtc_encode: NRTRLabelEncode
    - KeepKeys:
        - image
        - label_ctc
        - label_gtc
        - length
import math

import cv2
import numpy as np


class RecResizeImg(object):
    # Signature reconstructed from the assignments below; the default values are assumptions.
    def __init__(
        self,
        image_shape,
        infer_mode=False,
        eval_mode=False,
        character_dict_path=None,
        padding=True,
        **kwargs,
    ):
        self.image_shape = image_shape
        self.infer_mode = infer_mode
        self.eval_mode = eval_mode
        self.character_dict_path = character_dict_path
        self.padding = padding

    def __call__(self, data):
        img = data["image"]
        if self.eval_mode or (self.infer_mode and self.character_dict_path is not None):
            # Evaluation/inference: the target width may grow with the image's aspect ratio.
            norm_img, valid_ratio = resize_norm_img_chinese(img, self.image_shape)
        else:
            # Training: fixed target shape, optionally padded on the right.
            norm_img, valid_ratio = resize_norm_img(img, self.image_shape, self.padding)

        data["image"] = norm_img
        data["valid_ratio"] = valid_ratio
        return data


def resize_norm_img(img, image_shape, padding=True, interpolation=cv2.INTER_LINEAR):
    imgC, imgH, imgW = image_shape
    h = img.shape[0]
    w = img.shape[1]
    if not padding:
        # Stretch to the full target size, ignoring the aspect ratio.
        resized_image = cv2.resize(img, (imgW, imgH), interpolation=interpolation)
        resized_w = imgW
    else:
        # Keep the aspect ratio, capping the width at imgW.
        ratio = w / float(h)
        if math.ceil(imgH * ratio) > imgW:
            resized_w = imgW
        else:
            resized_w = int(math.ceil(imgH * ratio))
        resized_image = cv2.resize(img, (resized_w, imgH))
    resized_image = resized_image.astype("float32")
    if image_shape[0] == 1:
        resized_image = resized_image / 255
        resized_image = resized_image[np.newaxis, :]
    else:
        resized_image = resized_image.transpose((2, 0, 1)) / 255
    # Normalize to [-1, 1].
    resized_image -= 0.5
    resized_image /= 0.5
    # Zero-pad on the right up to imgW.
    padding_im = np.zeros((imgC, imgH, imgW), dtype=np.float32)
    padding_im[:, :, 0:resized_w] = resized_image
    valid_ratio = min(1.0, float(resized_w / imgW))
    return padding_im, valid_ratio


def resize_norm_img_chinese(img, image_shape):
    imgC, imgH, imgW = image_shape
    # todo: change to 0 and modified image shape
    max_wh_ratio = imgW * 1.0 / imgH
    h, w = img.shape[0], img.shape[1]
    ratio = w * 1.0 / h
    # Widen the target if the image is longer than the configured shape.
    max_wh_ratio = max(max_wh_ratio, ratio)
    imgW = int(imgH * max_wh_ratio)
    if math.ceil(imgH * ratio) > imgW:
        resized_w = imgW
    else:
        resized_w = int(math.ceil(imgH * ratio))
    resized_image = cv2.resize(img, (resized_w, imgH))
    resized_image = resized_image.astype("float32")
    if image_shape[0] == 1:
        resized_image = resized_image / 255
        resized_image = resized_image[np.newaxis, :]
    else:
        resized_image = resized_image.transpose((2, 0, 1)) / 255
    resized_image -= 0.5
    resized_image /= 0.5
    padding_im = np.zeros((imgC, imgH, imgW), dtype=np.float32)
    padding_im[:, :, 0:resized_w] = resized_image
    valid_ratio = min(1.0, float(resized_w / imgW))
    return padding_im, valid_ratio
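
To make the padding arithmetic in resize_norm_img concrete, here is a minimal sketch that reproduces only the width and valid_ratio computation for the padding=True branch, so it runs without OpenCV. The helper name `padded_width` is hypothetical and is not part of PaddleOCR or OpenOCR.

```python
import math


def padded_width(h, w, image_shape):
    """Return (resized_w, valid_ratio) as in resize_norm_img with padding=True.

    Hypothetical helper for illustration; it mirrors only the width logic.
    """
    _, img_h, img_w = image_shape
    ratio = w / float(h)
    # The resized width keeps the aspect ratio but is capped at the target width.
    resized_w = min(img_w, int(math.ceil(img_h * ratio)))
    valid_ratio = min(1.0, resized_w / img_w)
    return resized_w, valid_ratio


shape = [3, 48, 800]
print(padded_width(32, 400, shape))   # aspect ratio 12.5  -> (600, 0.75)
print(padded_width(32, 1000, shape))  # aspect ratio 31.25 -> capped: (800, 1.0)
print(padded_width(32, 64, shape))    # aspect ratio 2     -> (96, 0.12)
```

With image_shape [3, 48, 800], any image whose aspect ratio exceeds 800 / 48 (about 16.7) is squeezed horizontally to fit, while short images end up mostly as zero padding.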

Any guidance or insights you could provide would be greatly appreciated. Thank you again for maintaining and improving this fantastic project!

Topdu commented 2 days ago

Please post the config file here.

  1. How does the accuracy vary with text length on the test set?
  2. Are the text images from multiple scenes or a single scene, such as a document?
leduy-it commented 2 days ago
  1. I'll share my config for your review first. Once I evaluate accuracy across different text lengths on the test set with the new checkpoint, I'll update you.
  2. I'm working on a single scene, specifically documents.

And just to emphasize what I said before: I'm working with your PaddleOCR implementation, not this repo. I've noticed some differences, such as how input data resizing is handled. If there are significant differences between the two implementations, please let me know what modifications I would need to make to achieve performance comparable to this original repository.

Global:
  debug: false
  use_gpu: true
  epoch_num: 5
  log_smooth_window: 20
  print_batch_step: 500
  save_model_dir: output/rec_svtrv2_training_2811_02_48x800
  save_epoch_step: 10
  eval_batch_step: [0, 3000]
  cal_metric_during_train: True
  pretrained_model: output/rec_svtrv2_training_2811_01_48x800/best_accuracy.pdparams
  use_visualdl: false
  infer_img: doc/imgs_words/ch/word_1.jpg
  character_dict_path: ./cn_6843_dict.txt
  max_text_length: &max_text_length 100
  infer_mode: false
  use_space_char: true
  distributed: true
  save_res_path: ./output/rec/predicts_svrtv2.txt

Optimizer:
  name: AdamW
  beta1: 0.9
  beta2: 0.999
  epsilon: 1.e-8
  weight_decay: 0.05
  no_weight_decay_name: norm
  one_dim_param_no_weight_decay: True
  lr:
    name: Cosine
    learning_rate: 0.00001  # small LR: continuing fine-tuning after changing the input width from 640 to 800
    # warmup_epoch: 1
    warmup_steps: 23000  # warm-up changed from epochs to steps, to continue fine-tuning after changing the width from 640 to 800

Architecture:
  model_type: rec
  algorithm: SVTR_HGNet
  Backbone:
    name: SVTRv2
    use_pos_embed: False
    dims: [128, 256, 384]
    depths: [6, 6, 6]
    num_heads: [4, 8, 12]
    mixer: [['Conv','Conv','Conv','Conv','Conv','Conv'],['Conv','Conv','Global','Global','Global','Global'],['Global','Global','Global','Global','Global','Global']]
    local_k: [[5, 5], [5, 5], [-1, -1]]
    sub_k: [[2, 1], [2, 1], [-1, -1]]
    last_stage: False
    use_pool: True
  Head:
    name: MultiHead
    head_list:
      - CTCHead:
          Neck:
            name: svtr
            dims: 256
            depth: 2
            hidden_dims: 256
            kernel_size: [1, 3]
            use_guide: True
          Head:
            fc_decay: 0.00001
      - NRTRHead:
          nrtr_dim: 384
          max_text_length: *max_text_length
          num_decoder_layers: 2

Loss:
  name: MultiLoss
  loss_config_list:
    - CTCLoss:
    - NRTRLoss:

PostProcess:
  name: CTCLabelDecode

Metric:
  name: RecMetric
  main_indicator: acc

Train:
  dataset:
    name: LMDBDataSet
    data_dir: ./CN_LMDB/TRAINSET
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    # - RecAug:
    - ParseQRecAug:  # replaces RecAug for long text, so the text does not become unrecognizable
    - RecResizeImg:
        image_shape: [3, 48, 800]
    - MultiLabelEncode:
        gtc_encode: NRTRLabelEncode
    - KeepKeys:
        keep_keys:
        - image
        - label_ctc
        - label_gtc
        - length
        # - valid_ratio
  loader:
    shuffle: true
    batch_size_per_card: 26
    drop_last: true
    num_workers: 8

Eval:
  dataset:
    name: LMDBDataSet
    data_dir: ./CN_LMDB/VALIDATION_SET
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    - MultiLabelEncode:
        gtc_encode: NRTRLabelEncode
    - RecResizeImg:
        image_shape: [3, 48, 800]
    - KeepKeys:
        keep_keys:
        - image
        - label_ctc
        - label_gtc
        - length
        # - valid_ratio
  loader:
    shuffle: false
    drop_last: false
    batch_size_per_card: 26
    num_workers: 4

For privacy, I need to set every dataset name to X.

Group Length Results

Each group is defined as:

group_labels = [label for label in valid_labels if len(label) <= max_length and len(label) > (max_length - 5)]

| Dataset Name | Accuracy Avg | Accuracy 5 | Accuracy 10 | Accuracy 15 | Accuracy 20 | Accuracy 25 | Accuracy 30 | Accuracy 35 | Accuracy 40 | Norm Edit Avg | Norm Edit 5 | Norm Edit 10 | Norm Edit 15 | Norm Edit 20 | Norm Edit 25 | Norm Edit 30 | Norm Edit 35 | Norm Edit 40 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X | 0.878446 | - | 0.928571 | 0.938272 | 0.872340 | 0.758621 | 0.999990 | 0.999990 | - | 0.991359 | - | 0.984127 | 0.994490 | 0.991213 | 0.985682 | 1.000000 | 1.000000 | - |
| X | 0.759409 | 0.750000 | 0.787879 | 0.606557 | 0.625000 | 0.500000 | - | - | - | 0.954413 | 0.930087 | 0.963324 | 0.954132 | 0.965872 | 0.916667 | - | - | - |
| X | 0.926185 | 0.926026 | 0.999995 | - | - | - | - | - | - | 0.964395 | 0.964318 | 1.000000 | - | - | - | - | - | - |
| X | 0.952561 | 0.953539 | 0.500000 | - | - | - | - | - | - | 0.976011 | 0.976319 | 0.833334 | - | - | - | - | - | - |
| X | 0.893976 | 0.898176 | 0.645161 | - | - | - | - | - | - | 0.968258 | 0.968840 | 0.933756 | - | - | - | - | - | - |
| X | 0.740319 | 0.861905 | 0.811736 | 0.707317 | 0.610000 | 0.500000 | 0.361702 | 0.333333 | 0.700000 | 0.961678 | 0.958634 | 0.961440 | 0.968510 | 0.970417 | 0.960568 | 0.948004 | 0.961215 | 0.981152 |
| X | 0.848485 | 0.809523 | 0.850000 | 1.000000 | 1.000000 | 1.000000 | 0.833332 | 0.750000 | - | 0.975273 | 0.957143 | 0.972538 | 1.000000 | 1.000000 | 1.000000 | 0.993827 | 0.984849 | - |
| X | 0.948509 | - | - | 0.948509 | - | - | - | - | - | 0.995319 | - | - | 0.995319 | - | - | - | - | - |
| X | 0.937669 | - | - | 0.937669 | - | - | - | - | - | 0.993115 | - | 0.939394 | 0.993335 | - | - | - | - | - |
| X | 0.987167 | 0.975238 | 0.987974 | 0.992122 | 0.988573 | 0.990510 | 0.991315 | 0.992521 | 0.987753 | 0.933410 | 0.959002 | 0.949047 | 0.948371 | 0.914588 | 0.903638 | 0.921731 | 0.890909 | 0.887324 |

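
The length grouping described above can be sketched as follows. This is a minimal illustration, not the script that produced the table: the function names, the (prediction, label) pair format, and the normalized edit distance definition (1 - distance / longer length) are assumptions. The per-group sample count is included because a group with only a few samples gives a noisy accuracy.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def group_metrics(pairs, max_lengths=(5, 10, 15, 20, 25, 30, 35, 40), step=5):
    """pairs: iterable of (prediction, label). Returns {max_length: stats or None}."""
    pairs = list(pairs)
    results = {}
    for max_length in max_lengths:
        # Same rule as in the post: max_length - 5 < len(label) <= max_length.
        group = [(p, l) for p, l in pairs
                 if max_length - step < len(l) <= max_length]
        if not group:
            results[max_length] = None  # would appear as "-" in the table
            continue
        acc = sum(p == l for p, l in group) / len(group)
        norm_edit = sum(
            1 - edit_distance(p, l) / max(len(p), len(l), 1) for p, l in group
        ) / len(group)
        results[max_length] = {"count": len(group), "acc": acc, "norm_edit": norm_edit}
    return results


# Toy data: two labels fall in the "5" group, one in the "10" group.
pairs = [("abc", "abc"), ("abcd", "abce"), ("abcdefg", "abcdefg")]
for max_length, stats in group_metrics(pairs, max_lengths=(5, 10)).items():
    print(max_length, stats)
```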
Topdu commented 2 days ago

The config file looks correct. The test results show that accuracy does not decrease as text length increases, which may be caused by an uneven number of test samples across text lengths. In general, the longer the text, the lower the recognition accuracy, so the test set may need to be rebuilt before the evaluation results can be considered reliable.

To improve accuracy further, a deeper analysis of the data is probably needed: the distribution of the training and test sets, and the bad cases, especially in the test sets with lower accuracy. The training strategy and model should then be chosen according to that data distribution. Here are two suggestions; we are not sure whether they will be effective:

  1. Remove the NRTR head and train with SVTRv2 + CTC only.
  2. Change the resize strategy: group images with the same aspect ratio and train with dynamic sizes instead of padding. You can refer to the MSR section in the SVTRv2 paper.
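
The second suggestion, grouping images by aspect ratio and resizing each group to its own width rather than padding everything to 800, could look roughly like this. It is a sketch of the general bucketing idea only, not the MSR implementation from the SVTRv2 paper or OpenOCR; the bucket ratios, the function names, and the (height, width) input format are all assumptions.

```python
import random
from collections import defaultdict


def assign_bucket(h, w, bucket_ratios):
    """Pick the smallest bucket ratio that fits w / h; fall back to the largest."""
    ratio = w / float(h)
    for r in bucket_ratios:
        if ratio <= r:
            return r
    return bucket_ratios[-1]


def make_batches(sizes, bucket_ratios=(4, 8, 12, 16), img_h=48, batch_size=4, seed=0):
    """sizes: list of (h, w), one per sample.

    Returns a shuffled list of (target_width, [sample indices]); every sample
    in a batch belongs to the same aspect-ratio bucket.
    """
    buckets = defaultdict(list)
    for idx, (h, w) in enumerate(sizes):
        buckets[assign_bucket(h, w, bucket_ratios)].append(idx)

    rng = random.Random(seed)
    batches = []
    for r, indices in buckets.items():
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            # All images in this batch would be resized to (img_h * r, img_h).
            batches.append((img_h * r, indices[start:start + batch_size]))
    rng.shuffle(batches)
    return batches


sizes = [(48, 100), (48, 150), (32, 300), (48, 700), (48, 760), (32, 90)]
for target_w, indices in make_batches(sizes, batch_size=2):
    print(target_w, indices)
```

Because each batch has a single target width, the images in it can be resized directly to that width without zero padding. The tensor shape then differs from batch to batch, which the data loader and model need to tolerate.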