allmalab / problems

Challenges to solve in Azerbaijani NLP

Fine-tuning Tesseract OCR engine to recognize certain characters #4

Open ceferisbarov opened 6 months ago

ceferisbarov commented 6 months ago

Google's Tesseract OCR engine works quite well for most languages. However, it does not recognize the "«" and "»" characters (guillemets), which are used extensively in Azerbaijani texts. It is possible to fine-tune the model for special characters; in fact, Google provides a detailed tutorial for this. This is an open problem, and we would love to see a solution. We are also open to collaboration, although we cannot commit to it full-time.
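Before fine-tuning, it may help to quantify how often the two characters are lost. The following is a minimal sketch, not part of the original report: the function name `guillemet_recall` and the sample sentence are hypothetical, and the OCR output is passed in as a plain string. In practice that string might come from something like `pytesseract.image_to_string(image, lang="aze")`, which is an assumption about the setup rather than something stated in this issue.

```python
from collections import Counter

# The two characters this issue is about.
GUILLEMETS = ("«", "»")


def guillemet_recall(ground_truth: str, ocr_output: str) -> dict:
    """Return, per guillemet, the share of ground-truth occurrences
    that also show up in the OCR output.

    This is a count-based comparison (positions are ignored), capped at 1.0.
    A character that never occurs in the ground truth maps to None.
    """
    truth_counts = Counter(ground_truth)
    ocr_counts = Counter(ocr_output)
    recall = {}
    for ch in GUILLEMETS:
        expected = truth_counts[ch]
        if expected == 0:
            recall[ch] = None  # nothing to recognize, so recall is undefined
        else:
            found = min(ocr_counts[ch], expected)
            recall[ch] = found / expected
    return recall


# Hypothetical example: the OCR output replaced both guillemets with
# straight double quotes.
truth = "O, «Kitabi-Dədə Qorqud» dastanını oxudu."
ocr = 'O, "Kitabi-Dədə Qorqud" dastanını oxudu.'
print(guillemet_recall(truth, ocr))  # {'«': 0.0, '»': 0.0}
```

Running the same check on a held-out set of transcribed lines before and after fine-tuning would give a simple way to tell whether a retrained model actually fixes the problem.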