PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

Cyrillic text recognition #13468

Open KaraElan opened 1 month ago

KaraElan commented 1 month ago

As far as I understand, this is currently a bug in the recognition of Cyrillic-based languages.

Discussed in https://github.com/PaddlePaddle/PaddleOCR/discussions/13309

Originally posted by **KaraElan** July 8, 2024

Hello, I am trying to recognize text written in the Cyrillic alphabet (Russian, to be exact). Yet when I enable "rec""cyrillic" I still get Latin characters in the output and, as far as I can see, "dict_path": "./ppocr/utils/dict/cyrillic_dict.txt" includes Latin characters. This drastically reduces recognition accuracy, as the Latin and Cyrillic alphabets contain many similar-looking characters. Is it possible to automatically get output containing only Cyrillic characters, without retraining the model or doing custom postprocessing?

UserWangZz commented 1 month ago

Latin letters are included because they still appear in some scenes. For the problem you describe, you may need to generate a dictionary containing only Cyrillic characters and retrain the model with it. Alternatively, you can delete the Latin letters from the recognition results, although I don't think this is an effective method.
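
For illustration, the two options above could be sketched roughly as follows. This is plain Python and does not call any PaddleOCR API. All function names are hypothetical. The dictionary filter assumes the dictionary file holds one character per line, and the homoglyph mapping is an extra heuristic that was not proposed in this thread: it can corrupt text that legitimately contains Latin words.

```python
import re

# Matches basic Latin letters only; digits and punctuation are kept.
LATIN_LETTERS = re.compile(r"[A-Za-z]")

# Assumption: Latin letters that are visually close to a Cyrillic letter.
# Written with escapes so the code points are unambiguous.
HOMOGLYPHS = str.maketrans({
    "A": "\u0410", "B": "\u0412", "C": "\u0421", "E": "\u0415",
    "H": "\u041D", "K": "\u041A", "M": "\u041C", "O": "\u041E",
    "P": "\u0420", "T": "\u0422", "X": "\u0425",
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043E",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
})


def cyrillic_only_dict(entries):
    """Option 1: drop dictionary entries that contain Latin letters.

    `entries` is an iterable of dictionary lines (assumed one character
    per line). A model must be retrained with the resulting dictionary.
    """
    stripped = (entry.rstrip("\n") for entry in entries)
    return [entry for entry in stripped if not LATIN_LETTERS.search(entry)]


def drop_latin(text):
    """Option 2: delete Latin letters from a recognized string."""
    return LATIN_LETTERS.sub("", text)


def clean(text):
    """Heuristic variant: map look-alike Latin letters to Cyrillic,
    then delete any Latin letters that remain."""
    return drop_latin(text.translate(HOMOGLYPHS))
```

Here `clean` would be applied to each recognized string after OCR. Simply deleting letters loses characters the model misread as Latin, which is presumably why deletion alone is considered ineffective; the mapping step recovers only the unambiguous look-alikes.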