Cyrillic text recognition

KaraElan commented 4 months ago

As far as I understand, this is currently a bug in the recognition of Cyrillic-based languages.

Discussed in https://github.com/PaddlePaddle/PaddleOCR/discussions/13309

Originally posted by **KaraElan** July 8, 2024

Hello, I am trying to recognize text written in the Cyrillic alphabet (Russian, to be exact). Yet when I enable "rec""cyrillic" I still get Latin characters in the output and, as far as I can see, "dict_path": "./ppocr/utils/dict/cyrillic_dict.txt" includes Latin characters. This drastically decreases recognition accuracy, since the Latin and Cyrillic alphabets contain many similar-looking characters. Is it possible to get output containing only Cyrillic characters automatically, without retraining the model or doing custom postprocessing?
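For anyone who wants to confirm which dictionary entries are Latin, here is a minimal sketch. It assumes the dictionary file is plain UTF-8 text with one entry per line; the helper names `script_of`, `split_by_script`, and `load_dict` are hypothetical and not part of PaddleOCR.

```python
import unicodedata
from pathlib import Path


def script_of(ch):
    """Classify one character by the prefix of its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith("LATIN"):
        return "latin"
    if name.startswith("CYRILLIC"):
        return "cyrillic"
    return "other"  # digits, punctuation, symbols, unnamed characters


def split_by_script(entries):
    """Group dictionary entries by the script of their first character."""
    groups = {"latin": [], "cyrillic": [], "other": []}
    for entry in entries:
        if entry:  # skip empty lines
            groups[script_of(entry[0])].append(entry)
    return groups


def load_dict(path):
    """Assumption: one dictionary entry per line, UTF-8 encoded."""
    return Path(path).read_text(encoding="utf-8").splitlines()


if __name__ == "__main__":
    # Small in-memory sample; replace with load_dict(<your dict path>).
    sample = ["a", "\u0430", "B", "\u0416", "1"]
    for script, items in split_by_script(sample).items():
        print(script, len(items))
```

Running `split_by_script(load_dict(...))` on the dictionary you actually use shows how many Latin entries the recognizer can emit.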

UserWangZz commented 4 months ago

Latin letters are included because they still appear in some scenes. For the problem you mentioned, you may need to generate a dictionary containing only Cyrillic characters and retrain the model. Alternatively, you could delete the Latin letters from the recognition results (though I don't think this is an effective method).
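Both suggestions can be sketched in plain Python. This is only an illustration under stated assumptions: it works on ordinary strings (the recognized text, or dictionary lines), it does not call any PaddleOCR API, and the look-alike table and all function names are hypothetical. Mapping look-alikes is offered as a swapped-in alternative to plain deletion; it will corrupt text that really is Latin.

```python
import unicodedata

# Assumption: a hand-picked table of Latin letters that look like Cyrillic
# ones. It is not exhaustive and is not taken from PaddleOCR.
LOOKALIKES = str.maketrans(
    "ABCEHKMOPTXaceopxy",
    "\u0410\u0412\u0421\u0415\u041d\u041a\u041c\u041e\u0420\u0422\u0425"
    "\u0430\u0441\u0435\u043e\u0440\u0445\u0443",
)


def is_latin(ch):
    """True if the character's Unicode name marks it as a Latin letter."""
    return unicodedata.name(ch, "").startswith("LATIN")


def map_lookalikes(text):
    """Replace Latin look-alikes with their Cyrillic counterparts."""
    return text.translate(LOOKALIKES)


def drop_latin(text):
    """Delete every Latin letter; digits, spaces and punctuation remain."""
    return "".join(ch for ch in text if not is_latin(ch))


def cyrillic_only_dict(entries):
    """Keep dictionary entries that contain no Latin letters.

    A model trained with the original dictionary would still need to be
    retrained before it could use the filtered one.
    """
    return [e for e in entries if e and not any(is_latin(ch) for ch in e)]


if __name__ == "__main__":
    recognized = "MOCKBA 2024"  # Latin letters misread in place of Cyrillic
    print(map_lookalikes(recognized))
    print(drop_latin(map_lookalikes(recognized)))
```

Mapping first and deleting second keeps characters that were only misclassified as Latin, and removes whatever Latin letters have no Cyrillic look-alike.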

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open for 90 days with no activity.

github-actions[bot] commented 1 week ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

Cyrillic text recognition #13468
