OCR fails whenever I use a language other than the bundled English, no matter whether the language files come from tessdata or tessdata-fast.
java.io.IOException: Command process failed with exit code 7. Error message: DEBUG ocrmypdf - ocrmypdf 15.4.2
DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--version']
DEBUG ocrmypdf.subprocess - Found tesseract 5.3.4
DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--version']
DEBUG ocrmypdf.subprocess - Running: ['gs', '--version']
DEBUG ocrmypdf.subprocess - Found gs 10.2.1
DEBUG ocrmypdf.subprocess - Running: ['gs', '--version']
DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--list-langs']
DEBUG ocrmypdf.subprocess.tesseract - stdout/stderr = [DS] Profile read from file (tesseract_opencl_profile_devices.dat).
[DS] Device[1] 0:(null) score is 0.197839
[DS] Selected Device[1]: "(null)" (Native)
List of available languages in "/usr/share/tessdata/" (3):
eng
fin
ita
DEBUG ocrmypdf.helpers - pikepdf mmap enabled
DEBUG ocrmypdf.helpers - os.symlink(/tmp/input_7731425510989647362.pdf, /tmp/ocrmypdf.io.mz47xnde/origin)
DEBUG ocrmypdf.helpers - os.symlink(/tmp/ocrmypdf.io.mz47xnde/origin, /tmp/ocrmypdf.io.mz47xnde/origin.pdf)
DEBUG root - Gathering info with 1 thread workers
DEBUG ocrmypdf.helpers - pikepdf mmap enabled
DEBUG ocrmypdf.builtin_plugins.tesseract_ocr - Using Tesseract OpenMP thread limit 3
This shows up in the logs:
DEBUG ocrmypdf.subprocess - 1 Running: ['tesseract', '-l', 'fin', '/tmp/ocrmypdf.io.mz47xnde/000001_ocr.png', '/tmp/ocrmypdf.io.mz47xnde/000001_ocr_hocr', 'hocr', 'txt']
INFO ocrmypdf._exec.tesseract - 1 [tesseract] [DS] Profile read from file (tesseract_opencl_profile_devices.dat).
INFO ocrmypdf._exec.tesseract - 1 [tesseract] [DS] Device[1] 0:(null) score is 0.197839
INFO ocrmypdf._exec.tesseract - 1 [tesseract] [DS] Selected Device[1]: "(null)" (Native)
ERROR ocrmypdf._exec.tesseract - 1 [tesseract] Error opening data file /usr/share/tessdata/fin.traineddata
INFO ocrmypdf._exec.tesseract - 1 [tesseract] Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
INFO ocrmypdf._exec.tesseract - 1 [tesseract] Failed loading language 'fin'
INFO ocrmypdf._exec.tesseract - 1 [tesseract] Tesseract couldn't load any languages!
INFO ocrmypdf._exec.tesseract - 1 [tesseract] Could not initialize tesseract.
OCR works, but only with the already bundled English.
It might be a permissions problem, but as far as I can tell everything is fine: I can access the Tesseract files from within the container, and they belong to root.
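To double-check the permissions theory, something like the following could be run inside the container as the same user that runs OCR. It is only a rough sketch: it tries to open each traineddata file rather than just listing it, since a file can appear in a directory listing and still fail to open. The helper names check_one and check_traineddata are made up for illustration; the directory and language codes are taken from the log above.

```python
from pathlib import Path


def check_one(path):
    """Return a short status string for a single traineddata file."""
    try:
        # A symlink whose target does not exist (e.g. an unmounted volume).
        if path.is_symlink() and not path.exists():
            return "broken symlink"
        if not path.exists():
            return "missing"
        if not path.is_file():
            return "not a regular file"
        # Actually open the file: this is the step Tesseract fails on.
        with open(path, "rb") as fh:
            head = fh.read(1)
    except OSError as exc:
        return f"cannot open: {exc.strerror}"
    return "ok" if head else "empty file"


def check_traineddata(tessdata_dir, langs):
    """Map each language code to the status of <tessdata_dir>/<lang>.traineddata."""
    return {
        lang: check_one(Path(tessdata_dir) / f"{lang}.traineddata")
        for lang in langs
    }


if __name__ == "__main__":
    # Directory and languages as reported by 'tesseract --list-langs' above.
    report = check_traineddata("/usr/share/tessdata", ["eng", "fin", "ita"])
    for lang, status in report.items():
        print(f"{lang}: {status}")
```

If fin and ita report anything other than "ok" while eng is fine, the problem is with how the extra files reach the container rather than with Tesseract itself.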
As I mentioned in #1068, I'm using a Podman quadlet.
For this issue, I used this .container file
The version is 0.22.8.
I tried running podman as root, but it still results in the same issue.
Anyway, English OCR works pretty well for Latin-script languages.