Closed by vikithakar 9 months ago
Thank you for reporting the issue!
@vikithakar
In order to make this work, I need to include the Gujarati, Hindi and Sanskrit codes here and here. For the second list, I need the respective language name written in the original language; for example, fra in French is "Français"; ell in Greek is "Ελληνικά".
Could you please provide the original writing of the language name for Gujarati, Hindi and Sanskrit?
guj in Gujarati is "..." ?
hin in Hindi is "..." ?
san in Sanskrit is "..." ?
@ciur
Original Language Name
guj in Gujarati is ગુજરાતી
hin in Hindi is हिंदी
san in Sanskrit is संस्कृत
@vikithakar
PR for adding the above-mentioned languages.
The change will be available in the 3.0.3 release.
Note that you will need to build your image as before. However, when you start Papermerge, don't forget to set the PAPERMERGE__OCR__DEFAULT_LANGUAGE variable so that imported documents are OCRed in the "default OCR" language.
In the screenshot you uploaded to the ticket, you can see that the document was OCRed with the OCR language set to "German" (the deu code corresponds to the German language). That is why you see those strange characters.
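To make the two lists discussed in this thread more concrete, here is a minimal Python sketch: one list of supported language codes, and one mapping from each code to the language name in its own script. The names `SUPPORTED_OCR_LANGS`, `NATIVE_LANG_NAMES` and `resolve_ocr_language` are hypothetical and are not taken from the Papermerge code base, and the fallback to deu is only an assumption made for the example.

```python
# Hypothetical sketch; names are illustrative, not Papermerge's actual code.

# First list: language codes accepted for OCR.
SUPPORTED_OCR_LANGS = ["deu", "ell", "fra", "guj", "hin", "san"]

# Second list: each code mapped to the language name in its own script.
NATIVE_LANG_NAMES = {
    "deu": "Deutsch",
    "ell": "Ελληνικά",
    "fra": "Français",
    "guj": "ગુજરાતી",
    "hin": "हिंदी",
    "san": "संस्कृत",
}


def resolve_ocr_language(requested=None, default="deu"):
    """Return the OCR language code to use, falling back to `default`.

    The "deu" fallback is an assumption for this example only.
    """
    code = requested or default
    if code not in SUPPORTED_OCR_LANGS:
        raise ValueError(f"unsupported OCR language code: {code!r}")
    return code


if __name__ == "__main__":
    for code in ("guj", "hin", "san"):
        print(code, NATIVE_LANG_NAMES[resolve_ocr_language(code)])
```

In this sketch, leaving the language unset silently selects the fallback, which would produce the kind of mismatch described above: a Gujarati document OCRed as if it were German.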
@vikithakar
Here is a screenshot of the working app (as mentioned above, this will be part of 3.0.3):
Description of Issue
After building Papermerge with Gujarati, Hindi and Sanskrit language support, when you upload files and run OCR on them, it churns out OCR text which is not correct. I think tesseract-ocr is consistent in the text output it gives for the file, but it seems that Papermerge does not have the fonts or character sets to display the OCR text in that language.
Build Details
Dockerfile to add tesseract-ocr to papermerge
Info: