java.io.IOException: Command process failed with exit code 7. Error opening data file

artur-sannikov commented 7 months ago

As I mentioned in #1068, I'm using Podman Quadlet.

For this issue, I used the following .container file:

[Container]
Image=docker.io/frooodle/s-pdf:latest
AutoUpdate=registry
PublishPort=8080:8080
Volume=/location/of/trainingData:/usr/share/tesseract-ocr/5/tessdata:Z
Volume=/location/of/logs/logs:/logs:Z

[Service]
Restart=always

[Install]
WantedBy=default.target

The Stirling-PDF version is 0.22.8.

OCR fails whenever I select a custom (non-bundled) language, regardless of whether the training data comes from tessdata or tessdata_fast.

java.io.IOException: Command process failed with exit code 7. Error message:   DEBUG ocrmypdf - ocrmypdf 15.4.2
  DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--version']
  DEBUG ocrmypdf.subprocess - Found tesseract 5.3.4
  DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--version']
  DEBUG ocrmypdf.subprocess - Running: ['gs', '--version']
  DEBUG ocrmypdf.subprocess - Found gs 10.2.1
  DEBUG ocrmypdf.subprocess - Running: ['gs', '--version']
  DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--list-langs']
  DEBUG ocrmypdf.subprocess.tesseract - stdout/stderr = [DS] Profile read from file (tesseract_opencl_profile_devices.dat).
[DS] Device[1] 0:(null) score is 0.197839
[DS] Selected Device[1]: "(null)" (Native)
List of available languages in "/usr/share/tessdata/" (3):
eng
fin
ita

  DEBUG ocrmypdf.helpers - pikepdf mmap enabled
  DEBUG ocrmypdf.helpers - os.symlink(/tmp/input_7731425510989647362.pdf, /tmp/ocrmypdf.io.mz47xnde/origin)
  DEBUG ocrmypdf.helpers - os.symlink(/tmp/ocrmypdf.io.mz47xnde/origin, /tmp/ocrmypdf.io.mz47xnde/origin.pdf)
  DEBUG root - Gathering info with 1 thread workers
  DEBUG ocrmypdf.helpers - pikepdf mmap enabled

  DEBUG ocrmypdf.builtin_plugins.tesseract_ocr - Using Tesseract OpenMP thread limit 3

The logs show the following:

  DEBUG ocrmypdf.subprocess -    1  Running: ['tesseract', '-l', 'fin', '/tmp/ocrmypdf.io.mz47xnde/000001_ocr.png', '/tmp/ocrmypdf.io.mz47xnde/000001_ocr_hocr', 'hocr', 'txt']
   INFO ocrmypdf._exec.tesseract -    1  [tesseract] [DS] Profile read from file (tesseract_opencl_profile_devices.dat).
   INFO ocrmypdf._exec.tesseract -    1  [tesseract] [DS] Device[1] 0:(null) score is 0.197839
   INFO ocrmypdf._exec.tesseract -    1  [tesseract] [DS] Selected Device[1]: "(null)" (Native)
  ERROR ocrmypdf._exec.tesseract -    1  [tesseract] Error opening data file /usr/share/tessdata/fin.traineddata
   INFO ocrmypdf._exec.tesseract -    1  [tesseract] Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
   INFO ocrmypdf._exec.tesseract -    1  [tesseract] Failed loading language 'fin'
   INFO ocrmypdf._exec.tesseract -    1  [tesseract] Tesseract couldn't load any languages!
   INFO ocrmypdf._exec.tesseract -    1  [tesseract] Could not initialize tesseract.
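To pull the relevant facts out of a long ocrmypdf log like the one above, a small script can help. This is a minimal sketch, not part of Stirling-PDF or ocrmypdf: the function name `summarize_tesseract_failures` is hypothetical, and the regular expressions only assume the message wording quoted in this issue, which other Tesseract versions may phrase differently.

```python
import re

# These patterns mirror the log lines quoted in this issue (assumption:
# other Tesseract/ocrmypdf versions may word the messages differently).
LIST_RE = re.compile(r'List of available languages in "([^"]+)"')
OPEN_ERR_RE = re.compile(r"Error opening data file (\S+)")
LOAD_ERR_RE = re.compile(r"Failed loading language '([^']+)'")


def summarize_tesseract_failures(log_text):
    """Collect the tessdata directory, unopened files and failed languages."""
    listed = LIST_RE.search(log_text)
    return {
        # Directory that `tesseract --list-langs` reported, if present.
        "tessdata_dir": listed.group(1) if listed else None,
        # Every data file Tesseract said it could not open.
        "unopened_files": OPEN_ERR_RE.findall(log_text),
        # Every language Tesseract said it failed to load.
        "failed_languages": LOAD_ERR_RE.findall(log_text),
    }


if __name__ == "__main__":
    import sys

    # Read a saved log from stdin, e.g. `python summarize.py < ocr.log`.
    print(summarize_tesseract_failures(sys.stdin.read()))
```

Comparing `tessdata_dir` and `unopened_files` against the container-side path in the `Volume=` or `-v` mapping shows whether Tesseract is looking in the directory that was actually mounted.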

OCR works, but only with the bundled English language.

This might be a permissions problem, but as far as I can tell everything is fine: I can access the Tesseract files from inside the container, and they belong to root.
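One way to test the permissions theory is to try opening each training file as the same user the application runs as. Listing a directory only needs read access to the directory, while loading a language needs read access to the file itself, so a language can appear in `--list-langs` and still fail to open. The following is a minimal sketch under assumptions: `check_traineddata` is a hypothetical helper, the default path `/usr/share/tessdata` is taken from the log in this issue, and it presumes a Python interpreter is available inside the container.

```python
from pathlib import Path


def check_traineddata(tessdata_dir):
    """Map each *.traineddata file name to 'ok' or the error hit when reading it."""
    results = {}
    for path in sorted(Path(tessdata_dir).glob("*.traineddata")):
        try:
            # Reading one byte is enough to prove the file can be opened.
            with open(path, "rb") as handle:
                handle.read(1)
            results[path.name] = "ok"
        except OSError as exc:
            # Typically PermissionError when the file cannot be read.
            results[path.name] = f"{type(exc).__name__}: {exc.strerror}"
    return results


if __name__ == "__main__":
    import sys

    # Default directory is the one named in the error message (assumption).
    directory = sys.argv[1] if len(sys.argv) > 1 else "/usr/share/tessdata"
    for name, status in check_traineddata(directory).items():
        print(f"{name}: {status}")
```

Any entry that is not `ok` points at a file the process cannot read, even if it shows up in the language list.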

I also tried running Podman as root:

sudo podman run -d \
  -p 8080:8080 \
  -v /location/of/trainingData:/usr/share/tessdata \
  -v /location/of/logs:/logs \
  --name stirling-pdf \
  frooodle/s-pdf:latest

But it still results in the same issue.

Anyway, English OCR works pretty well for Latin-script languages.

Frooodle commented 7 months ago

Is this still ongoing?

Frooodle commented 7 months ago

I believe it has been fixed.

artur-sannikov commented 7 months ago

Hello, I just don't use Tesseract because the included English OCR is good enough for the languages I need.

Stirling-Tools / Stirling-PDF

java.io.IOException: Command process failed with exit code 7. Error opening data file #1071