PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkit based on PaddlePaddle: a practical, ultra-lightweight OCR system that supports recognition of 80+ languages, provides data annotation and synthesis tools, and supports training and deployment on server, mobile, embedded, and IoT devices.
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

Low Validation Accuracy for PPOCRv4 if Image resolution is changed #13252

Closed ManikSinghSarmaal closed 3 months ago

ManikSinghSarmaal commented 3 months ago

Problem Description

For about a month I have been hitting a problem (specifically with PP-OCRv4) when I change the image resolution from the default [3, 48, 320] to [3, 32, 150]. Something seems off: PP-OCRv4's config introduces MultiScaleDataset for training while evaluation uses SimpleDataset, whereas PP-OCRv3's config used SimpleDataset for both train and eval. The main problem: if you change the image resolution to [3, 32, 150] in the v4 config, training accuracy is high (around 95%) while evaluation on that data gives very poor results. This is not simply overfitting: I tried training on train+eval data combined, and the logs showed around 98% training accuracy while evaluation on the eval subset was around 40%. I also tried setting both the train and eval dataloaders to MultiScaleDataset, and both to SimpleDataset, but it didn't help. I cannot understand how training on some data can show 98%+ accuracy while evaluation on a small subset of that same data gives around 40%. The same happens with the training data itself: the training logs show 98% accuracy, but evaluating that same training data with best_accuracy.pdparams gives about 58%, similar to the eval result. What could be the reason for this discrepancy?

Runtime Environment

Reproduction Code

My config is (truncated; `eval_batch_step` and `head_list` were cut off in the paste):

```yaml
Global:
  debug: true
  use_gpu: true
  epoch_num: 500
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./output/finally_v4
  save_epoch_step: 100
  eval_batch_step:

Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine
    learning_rate: 0.0005
    warmup_epoch: 5
  regularizer:
    name: L2
    factor: 3.0e-05

Architecture:
  model_type: rec
  algorithm: SVTR_LCNet
  Transform: null
  Backbone:
    name: PPLCNetV3
    scale: 0.95
  Head:
    name: MultiHead
    head_list:
```

My train data is T1+V1 and my eval data is V1.

Complete Error Message

```text
ppocr INFO: epoch: [199/500], global_step: 71970, lr: 0.000337, acc: 0.950521, norm_edit_dis: 0.993425, CTCLoss: 0.303188, NRTRLoss: 0.711274, loss: 1.013391, avg_reader_cost: 0.00021 s, avg_batch_cost: 0.78093 s, avg_samples: 84.8, ips: 108.58877 samples/s, eta: 1 day, 4:10:31, max_mem_reserved: 12328 MB, max_mem_allocated: 12052 MB
[2024/07/03 05:54:06] ppocr INFO: epoch: [199/500], global_step: 71980, lr: 0.000337, acc: 0.953125, norm_edit_dis: 0.993425, CTCLoss: 0.268923, NRTRLoss: 0.712596, loss: 0.986175, avg_reader_cost: 0.00024 s, avg_batch_cost: 0.79516 s, avg_samples: 64.0, ips: 80.48709 samples/s, eta: 1 day, 4:10:19, max_mem_reserved: 12328 MB, max_mem_allocated: 12052 MB
[2024/07/03 05:54:14] ppocr INFO: epoch: [199/500], global_step: 71990, lr: 0.000337, acc: 0.958333, norm_edit_dis: 0.992793, CTCLoss: 0.238311, NRTRLoss: 0.712596, loss: 0.950907, avg_reader_cost: 0.00021 s, avg_batch_cost: 0.79268 s, avg_samples: 75.2, ips: 94.86814 samples/s, eta: 1 day, 4:10:08, max_mem_reserved: 12328 MB, max_mem_allocated: 12052 MB
[2024/07/03 05:54:22] ppocr INFO: epoch: [199/500], global_step: 72000, lr: 0.000337, acc: 0.942708, norm_edit_dis: 0.992840, CTCLoss: 0.244233, NRTRLoss: 0.708435, loss: 0.952667, avg_reader_cost: 0.00021 s, avg_batch_cost: 0.77737 s, avg_samples: 78.4, ips: 100.85287 samples/s, eta: 1 day, 4:09:57, max_mem_reserved: 12328 MB, max_mem_allocated: 12052 MB
eval model:: 100%|██████████| 34/34 [00:04<00:00, 8.21it/s]
[2024/07/03 05:54:26] ppocr INFO: cur metric, acc: 0.48791821447974393, norm_edit_dis: 0.9153030167021731, fps: 2919.5675084361587
[2024/07/03 05:54:26] ppocr INFO: best metric, acc: 0.5864312254032732, is_float16: False, norm_edit_dis: 0.9353985370995163, fps: 2784.2707554228186, best_epoch: 188
[2024/07/03 05:54:34] ppocr INFO: epoch: [199/500], global_step: 72010, lr: 0.000337, acc: 0.955729, norm_edit_dis: 0.994558, CTCLoss: 0.214817, NRTRLoss: 0.706834, loss: 0.922900, avg_reader_cost: 0.00024 s, avg_batch_cost: 0.78858 s, avg_samples: 72.0, ips: 91.30309 samples/s, eta: 1 day, 4:09:45, max_mem_reserved: 12328 MB, max_mem_allocated: 12052 MB
[2024/07/03 05:54:42] ppocr INFO: epoch: [199/500], global_step: 72020, lr: 0.000337, acc: 0.953125, norm_edit_dis: 0.993750, CTCLoss: 0.224658, NRTRLoss: 0.709030, loss: 0.934898, avg_reader_cost: 0.00021 s, avg_batch_cost: 0.78265 s, avg_samples: 67.2, ips: 85.86192 samples/s, eta: 1 day, 4:09:34, max_mem_reserved: 12328 MB, max_mem_allocated: 12052 MB
```

Possible Solutions

Please help me with this issue.

Appendix

Topdu commented 3 months ago

MultiScaleDataset is a training strategy used to improve accuracy; it cannot be used during evaluation. Please refer to the PP-OCRv4 configuration file to set the parameters of the Eval dataset.
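For reference, the Eval section in the PP-OCRv4 recognition configs is typically shaped like the sketch below. The paths and the exact transform parameters here are placeholders for illustration, not copied from the upstream file; check the shipped `ch_PP-OCRv4_rec.yml` for the authoritative values:

```yaml
Eval:
  dataset:
    name: SimpleDataSet            # plain dataset; MultiScaleDataset is train-only
    data_dir: ./train_data/        # placeholder path
    label_file_list:
      - ./train_data/val_list.txt  # placeholder path
    transforms:
      - DecodeImage:
          img_mode: BGR
          channel_first: false
      - MultiLabelEncode:
          gtc_encode: NRTRLabelEncode
      - RecResizeImg:
          image_shape: [3, 48, 320]
      - KeepKeys:
          keep_keys: [image, label_ctc, label_gtc, length, valid_ratio]
  loader:
    shuffle: false
    drop_last: false
    batch_size_per_card: 128
    num_workers: 4
```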

ManikSinghSarmaal commented 3 months ago

Thanks for the help, but I figured it out. In the backbone used in PP-OCRv4, i.e. rec_lcnetv3.py, the forward pass applies adaptive_avg_pool2d to a fixed size in training mode but plain avg_pool2d in evaluation mode:

```python
if self.training:
    x = F.adaptive_avg_pool2d(x, [1, 40])
else:
    x = F.avg_pool2d(x, [3, 2])
return x
```

This works well with image size [48, 320] (height, width), because for that input shape the adaptive and the plain pooling coincidentally produce the same output size in both modes. But if you change the resolution to (32, 150), the final feature shapes differ between training and evaluation: training yields the fixed [1, 40] defined by adaptive_avg_pool2d, while evaluation uses plain avg_pool2d, which gives [1, 19]. This was the main reason I was getting lower accuracy after changing the resolution from the default 48×320. I still don't know whether this was intentional, or why adaptive average pooling was dropped in evaluation mode. Kindly elaborate on this.
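The mismatch can be checked with plain shape arithmetic, no Paddle required. The sketch below assumes the feature map reaching this pooling layer is roughly ceil(W/4) wide (an assumption for illustration, not measured from the real PPLCNetV3); with that, the eval-mode pooled width matches the adaptive target of 40 for a 320-wide input but not for a 150-wide one:

```python
import math

def avg_pool_out(size, kernel, stride=None):
    """Output length along one axis of an average pool using the
    Paddle/PyTorch convention: stride defaults to the kernel size,
    no padding, floor mode."""
    stride = kernel if stride is None else stride
    return (size - kernel) // stride + 1

ADAPTIVE_WIDTH = 40  # adaptive_avg_pool2d(x, [1, 40]) always yields width 40

for img_w in (320, 150):
    # Assumed x4 horizontal downsampling in the backbone (illustrative only).
    feat_w = math.ceil(img_w / 4)
    eval_w = avg_pool_out(feat_w, kernel=2)  # width kernel of [3, 2] is 2
    print(f"img_w={img_w}: feat_w={feat_w}, eval width={eval_w}, "
          f"matches training width={eval_w == ADAPTIVE_WIDTH}")
# img_w=320 -> eval width 40 (matches); img_w=150 -> eval width 19 (mismatch)
```

So with the default 320-wide images the two branches agree by coincidence, and any other width silently breaks the train/eval consistency.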

xiaomaxiao commented 1 month ago

Because the training stage's MultiScaleSampler uses three different heights (32, 48, and 64), adaptive_avg_pool2d(x, [1, 40]) is used to handle the varying heights, but the 40 stays fixed even if you change the image width to 150. In evaluation mode, F.avg_pool2d(x, [3, 2]) is used instead to handle different image widths. If you want to train with other widths, e.g. 150 or 640, you can:

  1. Use x = F.avg_pool2d(x, [3, 2]) in both training and evaluation, or
  2. Remove MultiScaleSampler and just use height 48.
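Option 1 amounts to removing the `self.training` branch so the same fixed-kernel pool runs in both modes. The NumPy sketch below emulates that patched pooling head end to end (the feature-map size of 3×38 for a 32×150 input is an assumption for illustration, and `avg_pool2d` here is a minimal stand-in for `F.avg_pool2d`, not the Paddle implementation):

```python
import numpy as np

def avg_pool2d(x, kernel):
    """Minimal NumPy stand-in for F.avg_pool2d with stride == kernel size,
    no padding, floor mode. x has shape (N, C, H, W)."""
    kh, kw = kernel
    n, c, h, w = x.shape
    oh, ow = (h - kh) // kh + 1, (w - kw) // kw + 1
    x = x[:, :, :oh * kh, :ow * kw]  # drop the remainder rows/cols
    return x.reshape(n, c, oh, kh, ow, kw).mean(axis=(3, 5))

def pool_head(x):
    # Patched final pooling (option 1): no more `if self.training` branch,
    # so training and evaluation are identical by construction.
    return avg_pool2d(x, (3, 2))

# Assumed feature map for a 32x150 input (illustrative shape).
feat = np.random.rand(1, 512, 3, 38)
train_out = pool_head(feat)  # training mode
eval_out = pool_head(feat)   # evaluation mode
print(train_out.shape, eval_out.shape)  # same shape in both modes
```

The trade-off is that the output width now varies with the input width, so any downstream head must tolerate a variable sequence length, which is why option 2 (dropping MultiScaleSampler and fixing the height at 48) is the simpler path if you only need one resolution.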