Gujarati, Hindi and Sanskrit Language OCR not working

ciur / papermerge

Open Source Document Management System for Digital Archives (Scanned Documents)

https://papermerge.com

Apache License 2.0


Gujarati, Hindi and Sanskrit Language OCR not working #583

Closed vikithakar closed 7 months ago

vikithakar commented 8 months ago

Screenshot from 2024-01-22 18-03-00

Description of Issue

After building Papermerge with Gujarati, Hindi and Sanskrit language support, uploading a file and running OCR on it produces incorrect OCR text. I think tesseract-ocr itself is consistent in the text output it gives for the file, but Papermerge seems to lack the fonts or character sets needed to display the OCR text in these languages.

Build Details

Dockerfile to add tesseract-ocr to papermerge

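FROM papermerge/papermerge:3.0.2
# Refresh the package index, then install the Hindi, Gujarati and Sanskrit Tesseract data
RUN apt-get update && \
    apt-get install -y tesseract-ocr-hin tesseract-ocr-guj tesseract-ocr-san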
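
To confirm that the language packs actually landed in the image, one option is to inspect the output of `tesseract --list-langs` inside the container. The sketch below is only an illustration: `parse_list_langs` and `missing_languages` are hypothetical helper names, and it assumes the usual output layout of one header line followed by one language code per line.

```python
import shutil
import subprocess

# Codes for the three packs installed by the Dockerfile above.
REQUIRED = {"hin", "guj", "san"}


def parse_list_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output.

    Assumes the first line is a header ("List of available languages ...")
    and every following non-empty line is a single language code.
    """
    lines = [line.strip() for line in output.splitlines()]
    return {line for line in lines[1:] if line}


def missing_languages(output: str, required: set[str] = REQUIRED) -> set[str]:
    """Return the required codes that Tesseract did not report."""
    return required - parse_list_langs(output)


if __name__ == "__main__":
    if shutil.which("tesseract") is None:
        print("tesseract not found on PATH")
    else:
        result = subprocess.run(
            ["tesseract", "--list-langs"],
            capture_output=True,
            text=True,
            check=False,
        )
        # Some Tesseract versions print the list on stderr, so check both.
        report = result.stdout or result.stderr
        print("missing:", sorted(missing_languages(report)) or "none")
```

If the script reports nothing missing, the Tesseract side of the build is fine and the problem lies in how the OCR language is selected.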

Info:

Papermerge Version 3.0.2

ciur commented 8 months ago

Thank you for reporting the issue!

ciur commented 8 months ago

@vikithakar

To make this work, I need to include the Gujarati, Hindi and Sanskrit codes here and here. For the second list, I need the name of each language written in that language itself; for example, fra in French is "Français"; ell in Greek is "Ελληνικά".

Could you please provide the native spelling of the language name for Gujarati, Hindi and Sanskrit?

guj in Gujarati is "..." ?
hin in Hindi is "..." ?
san in Sanskrit is "..." ?

vikithakar commented 8 months ago

@ciur Original Language Name

guj in Gujarati is ગુજરાતી
hin in Hindi is हिंदी
san in Sanskrit is संस्कृत
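
The code-to-name list discussed above can be pictured as a simple lookup table. This is a minimal sketch, not Papermerge's actual data structure: `NATIVE_NAMES` and `display_name` are hypothetical names, and the entries are only the languages mentioned in this thread.

```python
# Native-language display names keyed by Tesseract language code.
# Hypothetical table for illustration; only codes from this thread are listed.
NATIVE_NAMES = {
    "deu": "Deutsch",
    "fra": "Français",
    "ell": "Ελληνικά",
    "guj": "ગુજરાતી",
    "hin": "हिंदी",
    "san": "संस्कृत",
}


def display_name(code: str) -> str:
    """Return the native name for a language code, or raise if it is unknown."""
    try:
        return NATIVE_NAMES[code]
    except KeyError:
        raise ValueError(f"unsupported OCR language code: {code!r}") from None


if __name__ == "__main__":
    for code in ("guj", "hin", "san"):
        print(code, display_name(code))
```

A code that is installed in Tesseract but absent from such a table cannot be offered as a choice, which is why both lists need the new entries.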

ciur commented 8 months ago

@vikithakar

PR for adding the above-mentioned languages.

The change will be available in the 3.0.3 release.

Note that you will still need to build your own image as before. However, when you start Papermerge, don't forget to set the PAPERMERGE__OCR__DEFAULT_LANGUAGE variable so that imported documents are OCRed in the "default OCR" language.
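
As a rough illustration of passing that variable at startup, the sketch below assembles a `docker run` command line. It rests on several assumptions: the variable is spelled `PAPERMERGE__OCR__DEFAULT_LANGUAGE` with double underscores, the image tag `papermerge-indic:3.0.3` is a made-up name for a locally built image, `docker_run_args` is a hypothetical helper, and every other setting a real deployment needs is left out.

```python
import shlex


def docker_run_args(image: str, default_lang: str) -> list[str]:
    """Build a `docker run` argument list that sets the default OCR language.

    Only the OCR variable is included; other required settings are omitted.
    """
    return [
        "docker",
        "run",
        "-e",
        f"PAPERMERGE__OCR__DEFAULT_LANGUAGE={default_lang}",
        image,
    ]


if __name__ == "__main__":
    # "papermerge-indic:3.0.3" is a hypothetical tag for the custom image.
    args = docker_run_args("papermerge-indic:3.0.3", "hin")
    print(shlex.join(args))
```

Using `hin` here means imported documents are OCRed as Hindi unless another language is chosen.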

In the screenshot you uploaded with the ticket, you can see that the document was OCRed with the OCR language set to "German" (the deu code corresponds to German). That is why those strange characters appear.

lang-codes

ciur commented 8 months ago

@vikithakar

Here is a screenshot of the working app (as mentioned above, this will be part of 3.0.3):

papermerge-with-hindi-text