leduy-it opened this issue 3 days ago
Please post the config file here.
And just to emphasize what I said before: I'm working with your PaddleOCR implementation, not this repo. I've noticed some differences, such as how input data resizing is handled. If there are significant differences between the two implementations, please let me know what modifications I would need to make to achieve performance comparable to this original repository.
```yaml
Global:
  debug: false
  use_gpu: true
  epoch_num: 5
  log_smooth_window: 20
  print_batch_step: 500
  save_model_dir: output/rec_svtrv2_training_2811_02_48x800
  save_epoch_step: 10
  eval_batch_step: [0, 3000]
  cal_metric_during_train: True
  pretrained_model: output/rec_svtrv2_training_2811_01_48x800/best_accuracy.pdparams
  checkpoints:
  save_inference_dir:
  use_visualdl: false
  infer_img: doc/imgs_words/ch/word_1.jpg
  character_dict_path: ./cn_6843_dict.txt
  max_text_length: &max_text_length 100
  infer_mode: false
  use_space_char: true
  distributed: true
  save_res_path: ./output/rec/predicts_svrtv2.txt

Optimizer:
  name: AdamW
  beta1: 0.9
  beta2: 0.999
  epsilon: 1.e-8
  weight_decay: 0.05
  no_weight_decay_name: norm
  one_dim_param_no_weight_decay: True
  lr:
    name: Cosine
    learning_rate: 0.00001  # LR kept small because I am continuing fine-tuning after changing the width from 640 to 800
    # warmup_epoch: 1
    warmup_steps: 23000  # Changed warmup from epochs to steps to continue fine-tuning after changing the width from 640 to 800

Architecture:
  model_type: rec
  algorithm: SVTR_HGNet
  Transform:
  Backbone:
    name: SVTRv2
    use_pos_embed: False
    dims: [128, 256, 384]
    depths: [6, 6, 6]
    num_heads: [4, 8, 12]
    mixer: [['Conv','Conv','Conv','Conv','Conv','Conv'],['Conv','Conv','Global','Global','Global','Global'],['Global','Global','Global','Global','Global','Global']]
    local_k: [[5, 5], [5, 5], [-1, -1]]
    sub_k: [[2, 1], [2, 1], [-1, -1]]
    last_stage: False
    use_pool: True
  Head:
    name: MultiHead
    head_list:
      - CTCHead:
          Neck:
            name: svtr
            dims: 256
            depth: 2
            hidden_dims: 256
            kernel_size: [1, 3]
            use_guide: True
          Head:
            fc_decay: 0.00001
      - NRTRHead:
          nrtr_dim: 384
          max_text_length: *max_text_length
          num_decoder_layers: 2

Loss:
  name: MultiLoss
  loss_config_list:
    - CTCLoss:
    - NRTRLoss:

PostProcess:
  name: CTCLabelDecode

Metric:
  name: RecMetric
  main_indicator: acc

Train:
  dataset:
    name: LMDBDataSet
    data_dir: ./CN_LMDB/TRAINSET
    transforms:
      - DecodeImage:
          img_mode: BGR
          channel_first: false
      # - RecAug:
      - ParseQRecAug:  # Modified for long text, so long samples do not end up unrecognizable
      - RecResizeImg:
          image_shape: [3, 48, 800]
      - MultiLabelEncode:
          gtc_encode: NRTRLabelEncode
      - KeepKeys:
          keep_keys:
            - image
            - label_ctc
            - label_gtc
            - length
            # - valid_ratio
  loader:
    shuffle: true
    batch_size_per_card: 26
    drop_last: true
    num_workers: 8

Eval:
  dataset:
    name: LMDBDataSet
    data_dir: ./CN_LMDB/VALIDATION_SET
    transforms:
      - DecodeImage:
          img_mode: BGR
          channel_first: false
      - MultiLabelEncode:
          gtc_encode: NRTRLabelEncode
      - RecResizeImg:
          image_shape: [3, 48, 800]
      - KeepKeys:
          keep_keys:
            - image
            - label_ctc
            - label_gtc
            - length
            # - valid_ratio
  loader:
    shuffle: false
    drop_last: false
    batch_size_per_card: 26
    num_workers: 4
```
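One thing worth double-checking when swapping `warmup_epoch` for `warmup_steps`: what 23000 steps means in epochs depends on dataset size and GPU count. A back-of-the-envelope sketch, where the dataset size and card count are purely illustrative assumptions (neither is stated in this thread):

```python
# Hypothetical arithmetic relating warmup_steps to warmup epochs.
dataset_size = 600_000        # assumed LMDB training-set size (not stated in the thread)
batch_size_per_card = 26      # from the config above
num_cards = 4                 # assumed GPU count
steps_per_epoch = dataset_size // (batch_size_per_card * num_cards)
warmup_epochs = 23_000 / steps_per_epoch
print(f"steps/epoch = {steps_per_epoch}, warmup ≈ {warmup_epochs:.2f} epochs")
```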
For privacy reasons I need to set the dataset names to X. Group-by-length results: each group is built as

```python
group_labels = [label for label in valid_labels if len(label) <= max_length and len(label) > (max_length - 5)]
```

so each column covers a 5-character window of label lengths (a sketch of how the per-bucket metrics can be computed follows the table).

| Dataset Name | Accuracy Avg | Accuracy 5 | Accuracy 10 | Accuracy 15 | Accuracy 20 | Accuracy 25 | Accuracy 30 | Accuracy 35 | Accuracy 40 | Norm Edit Avg | Norm Edit 5 | Norm Edit 10 | Norm Edit 15 | Norm Edit 20 | Norm Edit 25 | Norm Edit 30 | Norm Edit 35 | Norm Edit 40 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X | 0.878446 | - | 0.928571 | 0.938272 | 0.872340 | 0.758621 | 0.999990 | 0.999990 | - | 0.991359 | - | 0.984127 | 0.994490 | 0.991213 | 0.985682 | 1.000000 | 1.000000 | - |
| X | 0.759409 | 0.750000 | 0.787879 | 0.606557 | 0.625000 | 0.500000 | - | - | - | 0.954413 | 0.930087 | 0.963324 | 0.954132 | 0.965872 | 0.916667 | - | - | - |
| X | 0.926185 | 0.926026 | 0.999995 | - | - | - | - | - | - | 0.964395 | 0.964318 | 1.000000 | - | - | - | - | - | - |
| X | 0.952561 | 0.953539 | 0.500000 | - | - | - | - | - | - | 0.976011 | 0.976319 | 0.833334 | - | - | - | - | - | - |
| X | 0.893976 | 0.898176 | 0.645161 | - | - | - | - | - | - | 0.968258 | 0.968840 | 0.933756 | - | - | - | - | - | - |
| X | 0.740319 | 0.861905 | 0.811736 | 0.707317 | 0.610000 | 0.500000 | 0.361702 | 0.333333 | 0.700000 | 0.961678 | 0.958634 | 0.961440 | 0.968510 | 0.970417 | 0.960568 | 0.948004 | 0.961215 | 0.981152 |
| X | 0.848485 | 0.809523 | 0.850000 | 1.000000 | 1.000000 | 1.000000 | 0.833332 | 0.750000 | - | 0.975273 | 0.957143 | 0.972538 | 1.000000 | 1.000000 | 1.000000 | 0.993827 | 0.984849 | - |
| X | 0.948509 | - | - | 0.948509 | - | - | - | - | - | 0.995319 | - | - | 0.995319 | - | - | - | - | - |
| X | 0.937669 | - | - | 0.937669 | - | - | - | - | - | 0.993115 | - | 0.939394 | 0.993335 | - | - | - | - | - |
| X | 0.987167 | 0.975238 | 0.987974 | 0.992122 | 0.988573 | 0.990510 | 0.991315 | 0.992521 | 0.987753 | 0.933410 | 0.959002 | 0.949047 | 0.948371 | 0.914588 | 0.903638 | 0.921731 | 0.890909 | 0.887324 |
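The grouping rule above generalizes to one bucket per 5-character window. A stdlib-only sketch of how accuracy and normalized edit distance per bucket could be computed (the helper names are mine, and this may differ from PaddleOCR's exact `RecMetric` implementation):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def bucket_metrics(preds, labels, max_length):
    # One bucket per the rule above: (max_length - 5) < len(label) <= max_length.
    pairs = [(p, l) for p, l in zip(preds, labels)
             if (max_length - 5) < len(l) <= max_length]
    if not pairs:
        return None  # rendered as "-" in the table
    acc = sum(p == l for p, l in pairs) / len(pairs)
    norm_edit = sum(1 - levenshtein(p, l) / max(len(p), len(l), 1)
                    for p, l in pairs) / len(pairs)
    return acc, norm_edit
```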
The config file shown seems to be correct. As can be seen from the test-set results, accuracy does not decrease as text length increases, which may be caused by a non-uniform distribution of sample counts over text length in the test set; in general, the longer the text, the lower the recognition accuracy. This suggests the test set may need to be reconstructed to make the evaluation results reliable.

As for further improving accuracy, it may be necessary to analyze the data in depth: the data distributions of the training and test sets, and the bad cases, especially for the subsets with poorer accuracy. The training strategy and model then need to be set according to that distribution. Here are a few suggestions that may or may not be effective:

1. Remove the NRTR head and use only SVTRv2 + CTC for training.
2. For the resize strategy, try grouping images with similar aspect ratios and then use a dynamic-size training strategy instead of padding; you can refer to the MSR section of the SVTRv2 paper (see the sketch below).
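A minimal sketch of the aspect-ratio bucketing idea in suggestion 2. This is not the repo's actual MSR implementation; the bucket edges, target height, and helper names are all illustrative assumptions:

```python
import random
from collections import defaultdict

def build_ratio_buckets(samples, edges=(2, 4, 8, 12, 16, 24)):
    """Group sample indices by aspect ratio (w/h) into buckets.

    `samples` is a list of (width, height) pairs; `edges` are
    illustrative bucket boundaries, not values from the repo or paper.
    """
    buckets = defaultdict(list)
    for idx, (w, h) in enumerate(samples):
        ratio = w / float(h)
        # First edge the ratio fits under; the last bucket catches the rest.
        key = next((e for e in edges if ratio <= e), edges[-1])
        buckets[key].append(idx)
    return buckets

def iter_batches(buckets, batch_size, img_h=48):
    """Yield (target_shape, batch_indices). Every sample in a batch shares
    one resize width, so none of them needs heavy padding."""
    for edge, indices in buckets.items():
        random.shuffle(indices)
        target_w = int(img_h * edge)  # dynamic width per bucket
        for i in range(0, len(indices), batch_size):
            yield (img_h, target_w), indices[i : i + batch_size]
```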
First of all, thank you for the outstanding work on this repository. I have been using your implementation to fine-tune a model for recognizing extremely long text in Chinese. My input image dimensions can reach up to [48, 800], with aspect ratios approximately ≥ 16. While your current configuration supports a maximum aspect ratio of 4, I have modified it to handle these longer sequences, and my results are promising with an accuracy of around 90%. However, I aim to boost this accuracy further, ideally above 95%, given that my dataset is relatively straightforward.
Specific Questions:
```python
import math

import cv2
import numpy as np


def resize_norm_img(img, image_shape, padding=True, interpolation=cv2.INTER_LINEAR):
    imgC, imgH, imgW = image_shape
    h = img.shape[0]
    w = img.shape[1]
    if not padding:
        # Stretch directly to the target shape, ignoring the aspect ratio.
        resized_image = cv2.resize(img, (imgW, imgH), interpolation=interpolation)
        resized_w = imgW
    else:
        # Keep the aspect ratio; the remaining width is zero-padded below.
        ratio = w / float(h)
        if math.ceil(imgH * ratio) > imgW:
            resized_w = imgW
        else:
            resized_w = int(math.ceil(imgH * ratio))
        resized_image = cv2.resize(img, (resized_w, imgH))
    resized_image = resized_image.astype("float32")
    if image_shape[0] == 1:
        resized_image = resized_image / 255
        resized_image = resized_image[np.newaxis, :]
    else:
        resized_image = resized_image.transpose((2, 0, 1)) / 255
    # Normalize to [-1, 1].
    resized_image -= 0.5
    resized_image /= 0.5
    padding_im = np.zeros((imgC, imgH, imgW), dtype=np.float32)
    padding_im[:, :, 0:resized_w] = resized_image
    valid_ratio = min(1.0, float(resized_w / imgW))
    return padding_im, valid_ratio
```
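For reference, a usage sketch under the config above (the image path is the `infer_img` value from the config; the output shape assumes the 3-channel branch):

```python
# Usage sketch: pad-resize a word crop to the [3, 48, 800] shape used above.
img = cv2.imread("doc/imgs_words/ch/word_1.jpg")  # path taken from infer_img in the config
padded, valid_ratio = resize_norm_img(img, image_shape=[3, 48, 800], padding=True)
print(padded.shape)   # (3, 48, 800)
print(valid_ratio)    # fraction of the 800-px width occupied by real content
```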
```python
def resize_norm_img_chinese(img, image_shape):
    imgC, imgH, imgW = image_shape
    # todo: change to 0 and modified image shape
    ...
```
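The post truncates the function here. For comparison, the body of `resize_norm_img_chinese` in PaddleOCR (paraphrased from memory of `ppocr/data/imaug/rec_img_aug.py`; verify against your checkout) continues roughly as below. The key difference from `resize_norm_img` is that it widens `imgW` to fit the image's aspect ratio before the same pad-and-normalize steps:

```python
    # Continuation sketch, reusing math / cv2 / np imported above.
    max_wh_ratio = imgW * 1.0 / imgH
    h, w = img.shape[0], img.shape[1]
    ratio = w * 1.0 / h
    max_wh_ratio = max(max_wh_ratio, ratio)
    imgW = int(imgH * max_wh_ratio)   # width grows with the aspect ratio
    if math.ceil(imgH * ratio) > imgW:
        resized_w = imgW
    else:
        resized_w = int(math.ceil(imgH * ratio))
    resized_image = cv2.resize(img, (resized_w, imgH))
    resized_image = resized_image.astype("float32")
    if image_shape[0] == 1:
        resized_image = resized_image / 255
        resized_image = resized_image[np.newaxis, :]
    else:
        resized_image = resized_image.transpose((2, 0, 1)) / 255
    resized_image -= 0.5
    resized_image /= 0.5
    padding_im = np.zeros((imgC, imgH, imgW), dtype=np.float32)
    padding_im[:, :, 0:resized_w] = resized_image
    valid_ratio = min(1.0, float(resized_w / imgW))
    return padding_im, valid_ratio
```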