PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (a practical ultra-lightweight OCR system that supports recognition of 80+ languages, provides data annotation and synthesis tools, and supports training and deployment on server, mobile, embedded, and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

How did the team decide on the PaddleOCR Chinese characters in ppocr_keys_v1.txt? #11668

Closed by saikrishna431 3 months ago

saikrishna431 commented 6 months ago

Hi, I would like to understand the basis for selecting the characters in the Chinese vocabulary file ppocr_keys_v1.txt.

I see that the characters are a combination of Simplified and Traditional Chinese.

tink2123 commented 6 months ago

This dictionary selects common characters in Chinese to ensure that most scenarios are covered.

saikrishna431 commented 6 months ago

There are more than 20,000 Chinese characters in total, of which only 6,600 are selected. Can you elaborate on how these characters were selected? And is there a master list of characters from which the vocabulary in ppocr_keys_v1.txt was drawn?
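
The counts discussed in this thread can be checked by inspecting the dictionary file directly. Below is a minimal Python sketch, assuming the file is UTF-8 with one entry per line (an assumption, not confirmed in this thread). `summarize_dict` and `is_cjk_ideograph` are hypothetical helper names, and the sample lines are illustrative stand-ins for the real file contents.

```python
from collections import Counter


def is_cjk_ideograph(ch: str) -> bool:
    """True if ch is a single character in the basic CJK Unified Ideographs block."""
    return len(ch) == 1 and 0x4E00 <= ord(ch) <= 0x9FFF


def summarize_dict(lines):
    """Count entries in a one-entry-per-line dictionary (assumed format)."""
    # Strip only line endings so that an entry such as a space is preserved.
    entries = [line.rstrip("\r\n") for line in lines]
    entries = [e for e in entries if e]  # drop empty lines
    kinds = Counter("cjk" if is_cjk_ideograph(e) else "other" for e in entries)
    return {
        "total": len(entries),
        "unique": len(set(entries)),
        "cjk": kinds["cjk"],
        "other": kinds["other"],
    }


# Illustrative sample only; not taken from ppocr_keys_v1.txt.
sample = ["的\n", "一\n", "國\n", "国\n", "A\n", "7\n"]
summary = summarize_dict(sample)
print(summary)
```

To run it on a local copy of the file, pass `open("ppocr_keys_v1.txt", encoding="utf-8")` in place of `sample`. Code points alone do not distinguish Simplified from Traditional forms, so measuring that split would need separate variant data.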