leduy-it opened this issue 3 days ago
Please post the config file here.
And just to emphasize what I said before: I'm working with your PaddleOCR implementation, not this repo. I've noticed some differences, such as how input data resizing is handled. If there are significant differences between the two implementations, please let me know what modifications I would need to make to achieve performance comparable to this original repository.
```yaml
Global:
  debug: false
  use_gpu: true
  epoch_num: 5
  log_smooth_window: 20
  print_batch_step: 500
  save_model_dir: output/rec_svtrv2_training_2811_02_48x800
  save_epoch_step: 10
  eval_batch_step: [0, 3000]
  cal_metric_during_train: True
  pretrained_model: output/rec_svtrv2_training_2811_01_48x800/best_accuracy.pdparams
  checkpoints:
  save_inference_dir:
  use_visualdl: false
  infer_img: doc/imgs_words/ch/word_1.jpg
  character_dict_path: ./cn_6843_dict.txt
  max_text_length: &max_text_length 100
  infer_mode: false
  use_space_char: true
  distributed: true
  save_res_path: ./output/rec/predicts_svrtv2.txt

Optimizer:
  name: AdamW
  beta1: 0.9
  beta2: 0.999
  epsilon: 1.e-8
  weight_decay: 0.05
  no_weight_decay_name: norm
  one_dim_param_no_weight_decay: True
  lr:
    name: Cosine
    learning_rate: 0.00001  # LR kept small because I am continuing fine-tuning after changing the width from 640 to 800
    # warmup_epoch: 1
    warmup_steps: 23000  # Changed warmup from epochs to steps to continue fine-tuning after changing the width from 640 to 800

Architecture:
  model_type: rec
  algorithm: SVTR_HGNet
  Transform:
  Backbone:
    name: SVTRv2
    use_pos_embed: False
    dims: [128, 256, 384]
    depths: [6, 6, 6]
    num_heads: [4, 8, 12]
    mixer: [['Conv','Conv','Conv','Conv','Conv','Conv'],['Conv','Conv','Global','Global','Global','Global'],['Global','Global','Global','Global','Global','Global']]
    local_k: [[5, 5], [5, 5], [-1, -1]]
    sub_k: [[2, 1], [2, 1], [-1, -1]]
    last_stage: False
    use_pool: True
  Head:
    name: MultiHead
    head_list:
      - CTCHead:
          Neck:
            name: svtr
            dims: 256
            depth: 2
            hidden_dims: 256
            kernel_size: [1, 3]
            use_guide: True
          Head:
            fc_decay: 0.00001
      - NRTRHead:
          nrtr_dim: 384
          max_text_length: *max_text_length
          num_decoder_layers: 2

Loss:
  name: MultiLoss
  loss_config_list:
    - CTCLoss:
    - NRTRLoss:

PostProcess:
  name: CTCLabelDecode

Metric:
  name: RecMetric
  main_indicator: acc

Train:
  dataset:
    name: LMDBDataSet
    data_dir: ./CN_LMDB/TRAINSET
    transforms:
      - DecodeImage:
          img_mode: BGR
          channel_first: false
      # - RecAug:
      - ParseQRecAug:  # Modified for long text, so long samples do not end up unrecognizable
      - RecResizeImg:
          image_shape: [3, 48, 800]
      - MultiLabelEncode:
          gtc_encode: NRTRLabelEncode
      - KeepKeys:
          keep_keys:
            - image
            - label_ctc
            - label_gtc
            - length
            # - valid_ratio
  loader:
    shuffle: true
    batch_size_per_card: 26
    drop_last: true
    num_workers: 8

Eval:
  dataset:
    name: LMDBDataSet
    data_dir: ./CN_LMDB/VALIDATION_SET
    transforms:
      - DecodeImage:
          img_mode: BGR
          channel_first: false
      - MultiLabelEncode:
          gtc_encode: NRTRLabelEncode
      - RecResizeImg:
          image_shape: [3, 48, 800]
      - KeepKeys:
          keep_keys:
            - image
            - label_ctc
            - label_gtc
            - length
            # - valid_ratio
  loader:
    shuffle: false
    drop_last: false
    batch_size_per_card: 26
    num_workers: 4
```
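One thing worth double-checking when swapping `warmup_epoch` for `warmup_steps`: what 23000 steps means in epochs depends on dataset size and GPU count. A back-of-the-envelope sketch, where the dataset size and card count are purely illustrative assumptions (neither is stated in this thread):

```python
# Hypothetical arithmetic relating warmup_steps to warmup epochs.
dataset_size = 600_000        # assumed LMDB training-set size (not stated in the thread)
batch_size_per_card = 26      # from the config above
num_cards = 4                 # assumed GPU count
steps_per_epoch = dataset_size // (batch_size_per_card * num_cards)
warmup_epochs = 23_000 / steps_per_epoch
print(f"steps/epoch = {steps_per_epoch}, warmup ≈ {warmup_epochs:.2f} epochs")
```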
For privacy reasons I need to set the dataset names to X. Group-by-length results: each group is built as

```python
group_labels = [label for label in valid_labels if len(label) <= max_length and len(label) > (max_length - 5)]
```

so each column covers a 5-character window of label lengths (a sketch of how the per-bucket metrics can be computed follows the table).

| Dataset Name | Accuracy Avg | Accuracy 5 | Accuracy 10 | Accuracy 15 | Accuracy 20 | Accuracy 25 | Accuracy 30 | Accuracy 35 | Accuracy 40 | Norm Edit Avg | Norm Edit 5 | Norm Edit 10 | Norm Edit 15 | Norm Edit 20 | Norm Edit 25 | Norm Edit 30 | Norm Edit 35 | Norm Edit 40 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X | 0.878446 | - | 0.928571 | 0.938272 | 0.872340 | 0.758621 | 0.999990 | 0.999990 | - | 0.991359 | - | 0.984127 | 0.994490 | 0.991213 | 0.985682 | 1.000000 | 1.000000 | - |
| X | 0.759409 | 0.750000 | 0.787879 | 0.606557 | 0.625000 | 0.500000 | - | - | - | 0.954413 | 0.930087 | 0.963324 | 0.954132 | 0.965872 | 0.916667 | - | - | - |
| X | 0.926185 | 0.926026 | 0.999995 | - | - | - | - | - | - | 0.964395 | 0.964318 | 1.000000 | - | - | - | - | - | - |
| X | 0.952561 | 0.953539 | 0.500000 | - | - | - | - | - | - | 0.976011 | 0.976319 | 0.833334 | - | - | - | - | - | - |
| X | 0.893976 | 0.898176 | 0.645161 | - | - | - | - | - | - | 0.968258 | 0.968840 | 0.933756 | - | - | - | - | - | - |
| X | 0.740319 | 0.861905 | 0.811736 | 0.707317 | 0.610000 | 0.500000 | 0.361702 | 0.333333 | 0.700000 | 0.961678 | 0.958634 | 0.961440 | 0.968510 | 0.970417 | 0.960568 | 0.948004 | 0.961215 | 0.981152 |
| X | 0.848485 | 0.809523 | 0.850000 | 1.000000 | 1.000000 | 1.000000 | 0.833332 | 0.750000 | - | 0.975273 | 0.957143 | 0.972538 | 1.000000 | 1.000000 | 1.000000 | 0.993827 | 0.984849 | - |
| X | 0.948509 | - | - | 0.948509 | - | - | - | - | - | 0.995319 | - | - | 0.995319 | - | - | - | - | - |
| X | 0.937669 | - | - | 0.937669 | - | - | - | - | - | 0.993115 | - | 0.939394 | 0.993335 | - | - | - | - | - |
| X | 0.987167 | 0.975238 | 0.987974 | 0.992122 | 0.988573 | 0.990510 | 0.991315 | 0.992521 | 0.987753 | 0.933410 | 0.959002 | 0.949047 | 0.948371 | 0.914588 | 0.903638 | 0.921731 | 0.890909 | 0.887324 |
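The grouping rule above generalizes to one bucket per 5-character window. A stdlib-only sketch of how accuracy and normalized edit distance per bucket could be computed (the helper names are mine, and this may differ from PaddleOCR's exact `RecMetric` implementation):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def bucket_metrics(preds, labels, max_length):
    # One bucket per the rule above: (max_length - 5) < len(label) <= max_length.
    pairs = [(p, l) for p, l in zip(preds, labels)
             if (max_length - 5) < len(l) <= max_length]
    if not pairs:
        return None  # rendered as "-" in the table
    acc = sum(p == l for p, l in pairs) / len(pairs)
    norm_edit = sum(1 - levenshtein(p, l) / max(len(p), len(l), 1)
                    for p, l in pairs) / len(pairs)
    return acc, norm_edit
```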
The config file shown seems to be correct. As can be seen from the test-set results, accuracy does not decrease as text length increases, which may be caused by a non-uniform distribution of sample counts over text length in the test set; in general, the longer the text, the lower the recognition accuracy. This suggests the test set may need to be reconstructed to make the evaluation results reliable.

As for further improving accuracy, it may be necessary to analyze the data in depth: the data distributions of the training and test sets, and the bad cases, especially for the subsets with poorer accuracy. The training strategy and model then need to be set according to that distribution. Here are a few suggestions that may or may not be effective:

1. Remove the NRTR head and use only SVTRv2 + CTC for training.
2. For the resize strategy, try grouping images with similar aspect ratios and then use a dynamic-size training strategy instead of padding; you can refer to the MSR section of the SVTRv2 paper (see the sketch below).
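A minimal sketch of the aspect-ratio bucketing idea in suggestion 2. This is not the repo's actual MSR implementation; the bucket edges, target height, and helper names are all illustrative assumptions:

```python
import random
from collections import defaultdict

def build_ratio_buckets(samples, edges=(2, 4, 8, 12, 16, 24)):
    """Group sample indices by aspect ratio (w/h) into buckets.

    `samples` is a list of (width, height) pairs; `edges` are
    illustrative bucket boundaries, not values from the repo or paper.
    """
    buckets = defaultdict(list)
    for idx, (w, h) in enumerate(samples):
        ratio = w / float(h)
        # First edge the ratio fits under; the last bucket catches the rest.
        key = next((e for e in edges if ratio <= e), edges[-1])
        buckets[key].append(idx)
    return buckets

def iter_batches(buckets, batch_size, img_h=48):
    """Yield (target_shape, batch_indices). Every sample in a batch shares
    one resize width, so none of them needs heavy padding."""
    for edge, indices in buckets.items():
        random.shuffle(indices)
        target_w = int(img_h * edge)  # dynamic width per bucket
        for i in range(0, len(indices), batch_size):
            yield (img_h, target_w), indices[i : i + batch_size]
```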
First of all, thank you for the outstanding work on this repository. I have been using your implementation to fine-tune a model for recognizing extremely long text in Chinese. My input image dimensions can reach up to [48, 800], with aspect ratios approximately ≥ 16. While your current configuration supports a maximum aspect ratio of 4, I have modified it to handle these longer sequences, and my results are promising with an accuracy of around 90%. However, I aim to boost this accuracy further, ideally above 95%, given that my dataset is relatively straightforward.
Specific Questions:
```python
import math

import cv2
import numpy as np


def resize_norm_img(img, image_shape, padding=True, interpolation=cv2.INTER_LINEAR):
    imgC, imgH, imgW = image_shape
    h = img.shape[0]
    w = img.shape[1]
    if not padding:
        # Stretch directly to the target shape, ignoring the aspect ratio.
        resized_image = cv2.resize(img, (imgW, imgH), interpolation=interpolation)
        resized_w = imgW
    else:
        # Keep the aspect ratio; the remaining width is zero-padded below.
        ratio = w / float(h)
        if math.ceil(imgH * ratio) > imgW:
            resized_w = imgW
        else:
            resized_w = int(math.ceil(imgH * ratio))
        resized_image = cv2.resize(img, (resized_w, imgH))
    resized_image = resized_image.astype("float32")
    if image_shape[0] == 1:
        resized_image = resized_image / 255
        resized_image = resized_image[np.newaxis, :]
    else:
        resized_image = resized_image.transpose((2, 0, 1)) / 255
    # Normalize to [-1, 1].
    resized_image -= 0.5
    resized_image /= 0.5
    padding_im = np.zeros((imgC, imgH, imgW), dtype=np.float32)
    padding_im[:, :, 0:resized_w] = resized_image
    valid_ratio = min(1.0, float(resized_w / imgW))
    return padding_im, valid_ratio
```
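For reference, a usage sketch under the config above (the image path is the `infer_img` value from the config; the output shape assumes the 3-channel branch):

```python
# Usage sketch: pad-resize a word crop to the [3, 48, 800] shape used above.
img = cv2.imread("doc/imgs_words/ch/word_1.jpg")  # path taken from infer_img in the config
padded, valid_ratio = resize_norm_img(img, image_shape=[3, 48, 800], padding=True)
print(padded.shape)   # (3, 48, 800)
print(valid_ratio)    # fraction of the 800-px width occupied by real content
```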
```python
def resize_norm_img_chinese(img, image_shape):
    imgC, imgH, imgW = image_shape
    # todo: change to 0 and modified image shape
    ...
```
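The post truncates the function here. For comparison, the body of `resize_norm_img_chinese` in PaddleOCR (paraphrased from memory of `ppocr/data/imaug/rec_img_aug.py`; verify against your checkout) continues roughly as below. The key difference from `resize_norm_img` is that it widens `imgW` to fit the image's aspect ratio before the same pad-and-normalize steps:

```python
    # Continuation sketch, reusing math / cv2 / np imported above.
    max_wh_ratio = imgW * 1.0 / imgH
    h, w = img.shape[0], img.shape[1]
    ratio = w * 1.0 / h
    max_wh_ratio = max(max_wh_ratio, ratio)
    imgW = int(imgH * max_wh_ratio)   # width grows with the aspect ratio
    if math.ceil(imgH * ratio) > imgW:
        resized_w = imgW
    else:
        resized_w = int(math.ceil(imgH * ratio))
    resized_image = cv2.resize(img, (resized_w, imgH))
    resized_image = resized_image.astype("float32")
    if image_shape[0] == 1:
        resized_image = resized_image / 255
        resized_image = resized_image[np.newaxis, :]
    else:
        resized_image = resized_image.transpose((2, 0, 1)) / 255
    resized_image -= 0.5
    resized_image /= 0.5
    padding_im = np.zeros((imgC, imgH, imgW), dtype=np.float32)
    padding_im[:, :, 0:resized_w] = resized_image
    valid_ratio = min(1.0, float(resized_w / imgW))
    return padding_im, valid_ratio
```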