PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0
44.71k stars, 7.86k forks

Cyrillic text recognition #13468

Closed KaraElan closed 1 week ago

KaraElan commented 4 months ago

As far as I understand, this is currently a bug in the recognition of Cyrillic-based languages.

Discussed in https://github.com/PaddlePaddle/PaddleOCR/discussions/13309

Originally posted by **KaraElan** July 8, 2024

Hello, I am trying to recognize text written in the Cyrillic alphabet (Russian, to be exact). Yet when I enable "rec""cyrillic" I still get Latin characters in the output and, as far as I can see, "dict_path": "./ppocr/utils/dict/cyrillic_dict.txt" includes Latin characters. This drastically decreases recognition accuracy, since the Latin and Cyrillic alphabets contain many similar-looking characters. Is it possible to get output containing only Cyrillic characters automatically, without retraining the model or doing custom postprocessing?

UserWangZz commented 4 months ago

Latin letters are included in the dictionary because they still appear in some scenes. For the problem you describe, you may need to generate a dictionary containing only Cyrillic characters and retrain the model with it. Alternatively, you could delete the Latin letters from the recognition results, though I don't think this is an effective method.
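
For illustration, the two options above could be sketched roughly as follows. This is a minimal sketch, not part of PaddleOCR: the helper names `strip_latin` and `cyrillic_only_entries` are hypothetical, and it assumes the dictionary file lists one character per line and that "Latin" means the ASCII letters A-Z and a-z.

```python
import re

# Assumption: "Latin" means ASCII letters only; digits, punctuation
# and Cyrillic characters are left untouched.
_LATIN = re.compile(r"[A-Za-z]")


def strip_latin(text: str) -> str:
    """Delete ASCII Latin letters from a recognized string (post-processing)."""
    return _LATIN.sub("", text)


def cyrillic_only_entries(entries):
    """Keep only dictionary entries that contain no ASCII Latin letters."""
    return [entry for entry in entries if not _LATIN.search(entry)]


if __name__ == "__main__":
    # Post-processing a recognition result (hypothetical sample string).
    print(strip_latin("\u041f\u0440\u0438\u0432\u0435\u0442 abc 123"))

    # Filtering dictionary entries, assuming one character per line.
    sample_dict = ["\u0430", "a", "\u0411", "B", "1", "."]
    print(cyrillic_only_entries(sample_dict))
```

Note that deleting Latin letters only removes characters; it does not recover the Cyrillic letter the model should have produced, which is why the reply above considers it ineffective. A filtered dictionary is only useful together with retraining, as the reply says.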

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open for 90 days with no activity.

github-actions[bot] commented 1 week ago

This issue was closed because it has been inactive for 14 days since being marked as stale.