ciur / papermerge

Open Source Document Management System for Digital Archives (Scanned Documents)
https://papermerge.com
Apache License 2.0
2.55k stars, 267 forks

Gujarati, Hindi and Sanskrit Language OCR not working #583

Closed vikithakar closed 9 months ago

vikithakar commented 10 months ago

Screenshot from 2024-01-22 18-03-00

Description of Issue

After building Papermerge with Gujarati, Hindi and Sanskrit language support, uploading a file and running OCR on it produces incorrect OCR text. I think tesseract-ocr itself is consistent in the text output it gives for the file, but it seems that Papermerge lacks the fonts or character sets needed to display the OCR text in these languages.

Build Details

Dockerfile to add the tesseract-ocr language packs to Papermerge

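FROM papermerge/papermerge:3.0.2
# Refresh the package index first; a bare install can fail if the base image ships without package lists
RUN apt-get update && apt-get install -y tesseract-ocr-hin tesseract-ocr-guj tesseract-ocr-san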
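
Before looking at Papermerge itself, it may help to confirm that the three language packs actually landed in the image. The following is a minimal sketch, not part of Papermerge: it parses the text printed by `tesseract --list-langs` and reports which of the required codes are absent. The function names and the `REQUIRED` tuple are hypothetical, chosen only for illustration.

```python
import subprocess

# Language codes installed by the Dockerfile above (guj, hin, san).
REQUIRED = ("guj", "hin", "san")


def parse_list_langs(output):
    """Turn the text printed by `tesseract --list-langs` into a set of codes."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.add(line)
    return langs


def missing_languages(output, required=REQUIRED):
    """Return the required codes that do not appear in the listing."""
    installed = parse_list_langs(output)
    return [code for code in required if code not in installed]


def tesseract_listing():
    """Run tesseract and return its listing (requires the tesseract binary)."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract versions print the listing to stderr, so combine both.
    return result.stdout + result.stderr
```

Inside the container, `missing_languages(tesseract_listing())` should return an empty list if all three packs are installed.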

Info:

ciur commented 10 months ago

Thank you for reporting the issue!

ciur commented 10 months ago

@vikithakar

In order to make this work, I need to include the Gujarati, Hindi and Sanskrit codes here and here. For the second list, I need the name of each language written in that language; for example, fra (French) is "Français" and ell (Greek) is "Ελληνικά".

Could you please provide the native spelling of the language names for Gujarati, Hindi and Sanskrit?
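
To illustrate the two lists, here is a rough sketch of how a list of codes and a code-to-name mapping could fit together. This is not the actual Papermerge source; the names `OCR_LANG_CODES`, `OCR_LANG_NAMES` and `language_choices` are hypothetical. The "Français" and "Ελληνικά" entries come from this thread; the other spellings are my own assumption and were not taken from the thread.

```python
# Hypothetical first list: the Tesseract codes the app accepts.
OCR_LANG_CODES = ["deu", "ell", "fra", "guj", "hin", "san"]

# Hypothetical second list: each code mapped to the language's own name.
# Only the fra and ell entries are quoted from the thread; the rest are
# assumed spellings and should be double-checked by a native speaker.
OCR_LANG_NAMES = {
    "deu": "Deutsch",
    "ell": "Ελληνικά",
    "fra": "Français",
    "guj": "ગુજરાતી",
    "hin": "हिन्दी",
    "san": "संस्कृतम्",
}


def language_choices(codes=OCR_LANG_CODES, names=OCR_LANG_NAMES):
    """Pair every code with its display name, failing loudly on gaps."""
    missing = [code for code in codes if code not in names]
    if missing:
        raise KeyError(f"no display name for language codes: {missing}")
    return [(code, names[code]) for code in codes]
```

Keeping the check in `language_choices` means a code added to the first list without a matching name in the second is caught immediately rather than showing up as a blank entry in the UI.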

vikithakar commented 10 months ago

@ciur Original Language Name

ciur commented 10 months ago

@vikithakar

PR for adding the above-mentioned languages.

The change will be available in the 3.0.3 release.

Note that you will still need to build your image as before. However, when you start Papermerge, don't forget to set the PAPERMERGE__OCR__DEFAULT_LANGUAGE variable so that imported documents are OCRed in the "default OCR" language.

In the screenshot you uploaded to this ticket, you can see that the document was OCRed with the OCR language set to German (the deu code corresponds to German). That is why you see those strange characters.

lang-codes
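
The effect of the default-language variable can be sketched roughly as follows. This is an illustration under stated assumptions, not Papermerge's actual lookup: the variable name is assumed to be spelled with double underscores, the "deu" fallback is inferred from the screenshot discussed above, and `default_ocr_language` is a hypothetical helper.

```python
import os

# Assumed spelling of the environment variable (double underscores).
ENV_VAR = "PAPERMERGE__OCR__DEFAULT_LANGUAGE"
# Assumed fallback, inferred from the screenshot showing "deu" (German).
FALLBACK_LANG = "deu"


def default_ocr_language(environ=None, installed=None):
    """Pick the OCR language for imported documents.

    environ:   mapping to read the variable from (defaults to os.environ).
    installed: optional collection of available Tesseract codes to validate
               against.
    """
    environ = os.environ if environ is None else environ
    lang = environ.get(ENV_VAR, "").strip() or FALLBACK_LANG
    if installed is not None and lang not in installed:
        raise ValueError(f"OCR language {lang!r} is not installed")
    return lang
```

With the variable unset, the sketch falls back to "deu", which matches the symptom in the ticket: Hindi text run through the German model comes out as strange Latin characters.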

ciur commented 10 months ago

@vikithakar

Here is a screenshot of the working app (as mentioned above, this will be part of 3.0.3):

papermerge-with-hindi-text