gyr-kdgc commented 3 years ago

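I trained a text recognition model on my own invoice data: a bit over 19,000 training images and about 5,000 test images, using ResNet34. After 1,000 epochs, acc plateaued at around 89% and would not go higher. In testing, the overall results are acceptable, but even for text boxes that are cropped cleanly, the model sometimes drops characters. For example, "2020年12月04日" loses the final "日", and "有限公司" is recognized as "有限司". Which training parameters should I adjust? Or is there something I can change at deployment time to improve this?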

The training config file is below:

    Global:
      use_gpu: true
      epoch_num: 1500
      log_smooth_window: 20
      print_batch_step: 30
      save_model_dir: ./invoice_output/rec_chinese_common_v2.0
      save_epoch_step: 100
      # evaluation is run every 5000 iterations after the 4000th iteration
      eval_batch_step: [0, 15000]
      cal_metric_during_train: True
      pretrained_model:
      checkpoints:
      save_inference_dir:
      use_visualdl: False
      infer_img: doc/imgs_words/ch/word_1.jpg
      # for data or label process
      character_dict_path: ppocr/utils/ppocr_keys_v1.txt
      character_type: ch
      max_text_length: 50 # 25
      infer_mode: False
      use_space_char: False

    Optimizer:
      name: Adam
      beta1: 0.9
      beta2: 0.999
      lr:
        name: Cosine
        learning_rate: 0.001
      regularizer:
        name: 'L2'
        factor: 0.00004

    Architecture:
      model_type: rec
      algorithm: CRNN
      Transform:
      Backbone:
        name: ResNet
        layers: 34 # supports 18, 34, 50, 101, 152, 200
      Neck:
        name: SequenceEncoder
        encoder_type: rnn
        hidden_size: 256
      Head:
        name: CTCHead
        fc_decay: 0.00004

    Loss:
      name: CTCLoss

    PostProcess:
      name: CTCLabelDecode

    Metric:
      name: RecMetric
      main_indicator: acc

    Train:
      dataset:
        name: SimpleDataSet
        data_dir: ./invoice_dataset/rec_data
        label_file_list: ["./invoice_dataset/rec_data/rec_gt_train.txt"]
        transforms:
          - DecodeImage: # load image
              img_mode: BGR
              channel_first: False
          - RecAug:
              use_tia: True
              aug_prob: 0.5
          - CTCLabelEncode: # Class handling label
          - RecResizeImg:
              image_shape: [3, 32, 320]
          - KeepKeys:
              keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
      loader:
        shuffle: True
        batch_size_per_card: 64
        drop_last: False
        num_workers: 0

    Eval:
      dataset:
        name: SimpleDataSet
        data_dir: ./invoice_dataset/rec_data
        label_file_list: ["./invoice_dataset/rec_data/rec_gt_test.txt"]
        transforms:
          - DecodeImage: # load image
              img_mode: BGR
              channel_first: False
          - CTCLabelEncode: # Class handling label
          - RecResizeImg:
              image_shape: [3, 32, 320]
          - KeepKeys:
              keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
      loader:
        shuffle: False
        drop_last: False
        batch_size_per_card: 16
        num_workers: 0
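Since the config raises `max_text_length` from 25 to 50, it may help to check how long the labels actually are before deciding on that value. The sketch below is only an illustration: it assumes each label-file line has the form `image_path<TAB>label`, and the helper names `label_length_histogram` and `share_longer_than` as well as the sample lines are hypothetical.

```python
from collections import Counter


def label_length_histogram(lines, sep="\t"):
    """Count how many labels have each character length.

    `lines` is any iterable of 'image_path<sep>label' strings
    (assumed label-file layout).
    """
    hist = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if sep not in line:
            continue  # skip malformed lines without a separator
        _, label = line.split(sep, 1)
        hist[len(label)] += 1
    return hist


def share_longer_than(hist, limit):
    """Fraction of labels whose length exceeds `limit`."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    longer = sum(n for length, n in hist.items() if length > limit)
    return longer / total


# Hypothetical sample lines standing in for a real label file.
sample = ["a.jpg\t2020年12月04日", "b.jpg\t有限公司", "broken line"]
hist = label_length_histogram(sample)
print(sorted(hist.items()))
print(share_longer_than(hist, 25))
```

For a real run, pass an open file instead of `sample`, for example `open("./invoice_dataset/rec_data/rec_gt_train.txt", encoding="utf-8")`. If almost no labels exceed 25 characters, the default `max_text_length` may already be enough.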

ioracion commented 3 years ago

I recall an issue saying that, in the current version, max_text_length is best left at its default (25); otherwise it can cause problems.

gyr-kdgc commented 3 years ago

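> I recall an issue saying that, in the current version, max_text_length is best left at its default (25); otherwise it can cause problems.

My text recognition doesn't really have any major problems, just the minor ones described above, so it probably has little to do with max_text_length, right? I suspect my dataset is too small. Digits and letters are recognized fairly accurately; it is only Chinese text that occasionally has dropped or wrong characters.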
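To see whether the dropped characters follow a pattern (for example, mostly Chinese characters or mostly the last character of a line), one option is to diff each label against its prediction and tally the deletions. This is a minimal sketch using Python's standard `difflib`; the function name `dropped_chars` is hypothetical, and the two pairs are the examples from the original post.

```python
import difflib
from collections import Counter


def dropped_chars(label, pred):
    """Return characters that appear in `label` but were deleted in `pred`.

    Only pure deletions are reported; substitutions show up as
    'replace' opcodes and are ignored here.
    """
    matcher = difflib.SequenceMatcher(a=label, b=pred, autojunk=False)
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "delete":
            missing.extend(label[i1:i2])
    return missing


# (label, prediction) pairs taken from the examples in this thread.
pairs = [
    ("有限公司", "有限司"),
    ("2020年12月04日", "2020年12月04"),
]
counts = Counter(ch for label, pred in pairs for ch in dropped_chars(label, pred))
print(counts.most_common())
```

Running this over the whole test set gives a ranked list of the characters the model drops most often, which can guide what extra samples to collect.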

tink2123 commented 3 years ago

Are you using our default dictionary? You could add more samples similar to the failures to target these bad cases specifically.
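One way to decide which samples to add is to check how often the problem characters occur in the training labels. The sketch below is an assumption about how that could be done, not a PaddleOCR utility: `char_frequencies`, `rare_chars`, the sample labels, and the threshold are all hypothetical.

```python
from collections import Counter


def char_frequencies(labels):
    """Count how often each character occurs across all label strings."""
    freq = Counter()
    for label in labels:
        freq.update(label)
    return freq


def rare_chars(freq, chars, min_count):
    """Return the characters from `chars` seen fewer than `min_count` times."""
    return [c for c in chars if freq[c] < min_count]


# Hypothetical training labels; in practice, read them from the label file.
train_labels = ["有限公司", "有限责任公司", "2020年12月04日"]
freq = char_frequencies(train_labels)

# Characters the model was reported to drop in this thread.
problem_chars = "公日"
print(rare_chars(freq, problem_chars, min_count=2))
```

Characters that come back as rare are candidates for extra samples. If the dropped characters turn out to be common in the training set, the cause is more likely elsewhere.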

gyr-kdgc commented 3 years ago

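> Are you using our default dictionary? You could add more samples similar to the failures to target these bad cases specifically.

Yes, I'm using the default dictionary. The main parameters I changed are max_text_length, batch_size, use_tia=true, and use_space_char set to False. I just wanted to know whether there is any established experience with this kind of problem, since the official recognition models handle these cases accurately. In practice these errors don't affect my detection much. If there's no such experience, never mind; it's probably because my dataset isn't large enough.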

kingwpf commented 3 years ago

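Try manually padding a few sequence steps' worth of blank space onto the right side of a test image. If the missing characters are then recognized, the max_length you trained with is too large. The default input is 80 sequence steps with a max_length of 25; if you change max_length to 50, each label character corresponds to fewer than 2 sequence steps.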
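The padding experiment suggested above could be tried with a few lines of NumPy. This is a sketch under stated assumptions: the crop is already resized to height 32, one sequence step covers 4 pixels of width (320 px input / 80 steps), and a white fill resembles the invoice background. The helper `pad_right` and the stand-in image are hypothetical.

```python
import numpy as np


def pad_right(image, pad_width, fill=255):
    """Append `pad_width` blank columns to the right of an H x W x C image."""
    height, _, channels = image.shape
    blank = np.full((height, pad_width, channels), fill, dtype=image.dtype)
    return np.concatenate([image, blank], axis=1)


# Assumption: 4 pixels of width per sequence step (320 px / 80 steps).
PIXELS_PER_STEP = 320 // 80

# Stand-in for a real text crop; replace with an image loaded from disk.
crop = np.zeros((32, 200, 3), dtype=np.uint8)

# Pad three sequence steps' worth of blank space on the right.
padded = pad_right(crop, 3 * PIXELS_PER_STEP)
print(padded.shape)  # (32, 212, 3)
```

Run the recognizer on both `crop` and `padded`. If the trailing character only appears for the padded version, that supports the explanation that the sequence is too short for the label length.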

paddle-bot-old[bot] commented 2 years ago

Since you haven't replied for more than 3 months, we have closed this issue/pr. If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up. It is recommended to pull and try the latest code first.

Summerxu86 commented 1 year ago

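Hello, has your problem been solved? I'm also working on invoice recognition and the results aren't very accurate, so I'd like to ask for your advice.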

tuobay commented 1 year ago

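> Try manually padding a few sequence steps' worth of blank space onto the right side of a test image. If the missing characters are then recognized, the max_length you trained with is too large. The default input is 80 sequence steps with a max_length of 25; if you change max_length to 50, each label character corresponds to fewer than 2 sequence steps.

Could you explain the last sentence, "if you change max_length to 50, each label character corresponds to fewer than 2 sequence steps"? So besides max_length being a tunable parameter, is the sequence length tunable as well?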
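One possible reading of that sentence, sketched as arithmetic. The assumption, which the thread does not state explicitly, is that the number of sequence steps equals the input width divided by a fixed downsampling factor of 4. That matches the 80 steps quoted for the 320-pixel-wide `image_shape` in the config. Under that assumption the sequence length is not a separate setting; it follows from the width in `image_shape`. The function name `steps_per_char` is hypothetical.

```python
def steps_per_char(image_width, max_text_length, downsample=4):
    """Return (sequence steps, steps available per label character).

    `downsample` is the assumed width reduction of the backbone.
    """
    steps = image_width // downsample
    return steps, steps / max_text_length


print(steps_per_char(320, 25))  # (80, 3.2)
print(steps_per_char(320, 50))  # (80, 1.6)
```

With `max_text_length: 25` there are about 3.2 steps per character at the maximum label length; with 50 there are only 1.6. CTC needs at least one step per character, plus a blank between repeated characters, so a full-length label leaves little slack. Widening `image_shape` would raise the step count under the same assumption.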

PaddlePaddle / PaddleOCR

My self-trained text recognition model drops characters. Are there any good ways to optimize this? #2591
