Add a rare-character model

PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)

Apache License 2.0

44.1k stars, 7.81k forks

Background

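Following the requirements collection (https://github.com/PaddlePaddle/PaddleOCR/issues/10334) and the discussion at the weekly technical seminar (https://github.com/PaddlePaddle/PaddleOCR/issues/10223), we settled on the task of adding a rare-character model.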

Steps

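Replace the existing dictionary txt with a dictionary extended to cover the Table of General Standard Chinese Characters (《通用规范汉字表》).
On the existing datasets, balance the corpus through data synthesis, copy-paste, and similar methods, then retrain the PPOCRV3 detection and recognition models.
Compare the retrained models' detection and recognition accuracy on common text and on rare characters against the best PPOCRV3 models. The goal is that accuracy on common characters stays the same or improves, while accuracy on rare characters improves further.
Submit a PR to ppocr to replace the best model.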
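
As a rough illustration of the first two steps, here is a minimal Python sketch. It assumes the dictionary is a plain list of single characters (for example, read from a txt file with one character per line) and that the training labels are available as plain strings. The function names `merge_dictionary` and `underrepresented`, and the `min_count` threshold, are hypothetical helpers for this example and are not part of PaddleOCR.

```python
from collections import Counter


def merge_dictionary(existing_chars, extra_chars):
    """Return the existing characters followed by any new ones, without duplicates.

    Existing entries keep their original order, so the indices of characters
    already in the dictionary do not change.
    """
    seen = set()
    merged = []
    for ch in list(existing_chars) + list(extra_chars):
        if ch and ch not in seen:
            seen.add(ch)
            merged.append(ch)
    return merged


def underrepresented(labels, dictionary, min_count):
    """List dictionary characters that occur fewer than min_count times in labels.

    These are the candidates for extra synthetic or copy-paste samples.
    """
    counts = Counter(ch for text in labels for ch in text)
    return [ch for ch in dictionary if counts[ch] < min_count]


# Toy data standing in for the real dictionary, character table, and labels.
existing = ["的", "一", "是"]
character_table = ["一", "是", "龘", "犇"]
labels = ["的一是", "一是的", "是的犇"]

merged = merge_dictionary(existing, character_table)
rare = underrepresented(labels, merged, min_count=2)
print(merged)
print(rare)
```

In practice the two input lists would be read from the current dictionary file and from a text copy of the character table, and the characters returned by `underrepresented` would drive how many synthetic samples to generate for each one.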

Add a rare-character model #10390