Stirling-Tools / Stirling-PDF

#1 Locally hosted web application that allows you to perform various operations on PDF files
https://stirlingpdf.com
MIT License

[Bug]: OCR fails with exit Code 4 #1838

Closed (jotpunktopunkt closed this 1 month ago)

jotpunktopunkt commented 1 month ago

Installation Method

Docker

The Problem

hi

thanks for putting so much and great effort into stirling pdf, it's great!

i have the default eng.traineddata and the 15 MB deu.traineddata installed

whenever i try to run OCR over a pdf, no matter the language (most often i try the German package), i get an error stating exit code 4

the last line of the container log (portainer) states: WARNING ocrmypdf._pipelines._common - Output file: The generated PDF is INVALID

i have deleted every folder and re-initialized the container, but to no avail: it made no difference. i don't know where to keep looking to fix this, and i don't know which logs exactly to provide

Version of Stirling-PDF

0.28.3

Last Working Version of Stirling-PDF

never

Page Where the Problem Occurred

OCR

Docker Configuration

version: '3.3'
services:
  stirling-pdf:
    image: frooodle/s-pdf:latest
    ports:
      - '8080:8080'
    volumes:
      - /root/stirling-pdf/trainingData:/usr/share/tessdata #Required for extra OCR languages
      - /root/stirling-pdf/extraConfigs:/configs
      - /root/stirling-pdf/customFiles:/customFiles/
    environment:
      - DOCKER_ENABLE_SECURITY=false
    restart: always

Relevant Log Output

Portainer Log: (last line)
WARNING ocrmypdf._pipelines._common - Output file: The generated PDF is INVALID

Stack Trace:
java.io.IOException: Command process failed with exit code 4. Error message:   DEBUG ocrmypdf - ocrmypdf 16.1.1
  DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--version']
  DEBUG ocrmypdf.subprocess - Found tesseract 5.3.4
  DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--version']
  DEBUG ocrmypdf.subprocess - Running: ['gs', '--version']
  DEBUG ocrmypdf.subprocess - Found gs 10.3.1
  DEBUG ocrmypdf.subprocess - Running: ['gs', '--version']
  DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--list-langs']
  DEBUG ocrmypdf.subprocess.tesseract - stdout/stderr = [DS] Profile read from file (tesseract_opencl_profile_devices.dat).
[DS] Device[1] 0:(null) score is 0.197885
[DS] Selected Device[1]: "(null)" (Native)
List of available languages in "/usr/share/tessdata/" (2):
deu
eng

  DEBUG ocrmypdf.helpers - pikepdf mmap enabled
  DEBUG ocrmypdf.helpers - os.symlink(/tmp/input_4473422127442166983.pdf, /tmp/ocrmypdf.io._0jwju8h/origin)
  DEBUG ocrmypdf.helpers - os.symlink(/tmp/ocrmypdf.io._0jwju8h/origin, /tmp/ocrmypdf.io._0jwju8h/origin.pdf)
  DEBUG root - Gathering info with 1 thread workers
  DEBUG ocrmypdf.helpers - pikepdf mmap enabled

  DEBUG ocrmypdf.builtin_plugins.tesseract_ocr - Using Tesseract OpenMP thread limit 1
   INFO ocrmypdf._pipelines.ocr - Start processing 2 pages concurrently
  DEBUG ocrmypdf.helpers - pikepdf mmap enabled
  DEBUG ocrmypdf._pipeline -    1  Rasterize with png16m, rotation 0
  DEBUG ocrmypdf.subprocess -    1  Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=png16m', '-dFirstPage=1', '-dLastPage=1', '-r300.000000x300.000000', '-dPDFSTOPONERROR', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io._0jwju8h/origin.pdf']
  DEBUG ocrmypdf.helpers - pikepdf mmap enabled
  DEBUG ocrmypdf._pipeline -    2  Rasterize with png16m, rotation 0
  DEBUG ocrmypdf.subprocess -    2  Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=png16m', '-dFirstPage=2', '-dLastPage=2', '-r300.000000x300.000000', '-dPDFSTOPONERROR', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io._0jwju8h/origin.pdf']
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IHDR' 16 13
  DEBUG PIL.PngImagePlugin -    1  STREAM b'sRGB' 41 1
  DEBUG PIL.PngImagePlugin -    1  STREAM b'pHYs' 54 9
  DEBUG PIL.PngImagePlugin -    1  STREAM b'tEXt' 75 32
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IDAT' 119 8192
  DEBUG ocrmypdf._exec.ghostscript -    1  Rotating output by 0
  DEBUG PIL.PngImagePlugin -    2  STREAM b'IHDR' 16 13
  DEBUG PIL.PngImagePlugin -    2  STREAM b'sRGB' 41 1
  DEBUG PIL.PngImagePlugin -    2  STREAM b'pHYs' 54 9
  DEBUG PIL.PngImagePlugin -    2  STREAM b'tEXt' 75 32
  DEBUG PIL.PngImagePlugin -    2  STREAM b'IDAT' 119 8192
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IHDR' 16 13
  DEBUG PIL.PngImagePlugin -    1  STREAM b'pHYs' 41 9
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IDAT' 62 65536
  DEBUG ocrmypdf._pipeline -    1  resolution (299.9994, 299.9994)
  DEBUG PIL.PngImagePlugin -    2  STREAM b'IHDR' 16 13
  DEBUG PIL.PngImagePlugin -    2  STREAM b'pHYs' 41 9
  DEBUG PIL.PngImagePlugin -    2  STREAM b'IDAT' 62 65536
  DEBUG ocrmypdf._pipeline -    2  resolution (299.9994, 299.9994)
  DEBUG ocrmypdf.subprocess -    1  Running: ['tesseract', '-l', 'eng', '/tmp/ocrmypdf.io._0jwju8h/000001_ocr.png', '/tmp/ocrmypdf.io._0jwju8h/000001_ocr_hocr', 'hocr', 'txt']
  DEBUG ocrmypdf.subprocess -    2  Running: ['tesseract', '-l', 'eng', '/tmp/ocrmypdf.io._0jwju8h/000002_ocr.png', '/tmp/ocrmypdf.io._0jwju8h/000002_ocr_hocr', 'hocr', 'txt']
   INFO ocrmypdf._exec.tesseract -    1  [tesseract] [DS] Profile read from file (tesseract_opencl_profile_devices.dat).
   INFO ocrmypdf._exec.tesseract -    1  [tesseract] [DS] Device[1] 0:(null) score is 0.197885
   INFO ocrmypdf._exec.tesseract -    1  [tesseract] [DS] Selected Device[1]: "(null)" (Native)
  DEBUG ocrmypdf.hocrtransform._hocr -    1  pikepdf.Matrix(0.24, 0, 0, -0.24, 0, 414.72)
  DEBUG ocrmypdf.hocrtransform._hocr -    1  pikepdf.Matrix(1, 0, 0, 1, 0, 1728)
  DEBUG ocrmypdf._pipeline -    3  Rasterize with png16m, rotation 0
  DEBUG ocrmypdf.subprocess -    3  Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=png16m', '-dFirstPage=3', '-dLastPage=3', '-r300.000000x300.000000', '-dPDFSTOPONERROR', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io._0jwju8h/origin.pdf']
  DEBUG ocrmypdf._graft -    1  Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 90) -> 270
  DEBUG ocrmypdf._graft -    1  Grafting
  DEBUG ocrmypdf._graft -    1  Grafting with ctm pikepdf.Matrix(6.12323e-17, 1, -1, 6.12323e-17, 414.72, 0)
  DEBUG ocrmypdf._graft -    1  Page rotation: (content, auto) -> page = (90, 0) -> 90
   INFO ocrmypdf._exec.tesseract -    2  [tesseract] [DS] Profile read from file (tesseract_opencl_profile_devices.dat).
   INFO ocrmypdf._exec.tesseract -    2  [tesseract] [DS] Device[1] 0:(null) score is 0.197885
   INFO ocrmypdf._exec.tesseract -    2  [tesseract] [DS] Selected Device[1]: "(null)" (Native)
  DEBUG ocrmypdf.hocrtransform._hocr -    2  pikepdf.Matrix(0.24, 0, 0, -0.24, 0, 414.72)
  DEBUG ocrmypdf.hocrtransform._hocr -    2  pikepdf.Matrix(1, 0, 0, 1, 0, 1728)
  DEBUG ocrmypdf._pipeline -    4  Rasterize with png16m, rotation 0
  DEBUG ocrmypdf._graft -    2  Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
  DEBUG ocrmypdf._graft -    2  Grafting
  DEBUG ocrmypdf._graft -    2  Grafting with ctm pikepdf.Matrix(1, 0, 0, 1, 0, 0)
  DEBUG ocrmypdf._graft -    2  Page rotation: (content, auto) -> page = (0, 0) -> 0
  DEBUG ocrmypdf.subprocess -    4  Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=png16m', '-dFirstPage=4', '-dLastPage=4', '-r300.000000x300.000000', '-dPDFSTOPONERROR', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io._0jwju8h/origin.pdf']
  DEBUG PIL.PngImagePlugin -    3  STREAM b'IHDR' 16 13
  DEBUG PIL.PngImagePlugin -    3  STREAM b'sRGB' 41 1
  DEBUG PIL.PngImagePlugin -    3  STREAM b'pHYs' 54 9
  DEBUG PIL.PngImagePlugin -    3  STREAM b'tEXt' 75 32
  DEBUG PIL.PngImagePlugin -    3  STREAM b'IDAT' 119 8192
  DEBUG PIL.PngImagePlugin -    3  STREAM b'IHDR' 16 13
  DEBUG PIL.PngImagePlugin -    3  STREAM b'pHYs' 41 9
  DEBUG PIL.PngImagePlugin -    3  STREAM b'IDAT' 62 65536
  DEBUG ocrmypdf._pipeline -    3  resolution (299.9994, 299.9994)
  DEBUG ocrmypdf.subprocess -    3  Running: ['tesseract', '-l', 'eng', '/tmp/ocrmypdf.io._0jwju8h/000003_ocr.png', '/tmp/ocrmypdf.io._0jwju8h/000003_ocr_hocr', 'hocr', 'txt']
  DEBUG PIL.PngImagePlugin -    4  STREAM b'IHDR' 16 13
  DEBUG PIL.PngImagePlugin -    4  STREAM b'sRGB' 41 1
  DEBUG PIL.PngImagePlugin -    4  STREAM b'pHYs' 54 9
  DEBUG PIL.PngImagePlugin -    4  STREAM b'tEXt' 75 32
  DEBUG PIL.PngImagePlugin -    4  STREAM b'IDAT' 119 8192
   INFO ocrmypdf._exec.tesseract -    3  [tesseract] [DS] Profile read from file (tesseract_opencl_profile_devices.dat).
   INFO ocrmypdf._exec.tesseract -    3  [tesseract] [DS] Device[1] 0:(null) score is 0.197885
   INFO ocrmypdf._exec.tesseract -    3  [tesseract] [DS] Selected Device[1]: "(null)" (Native)
  DEBUG ocrmypdf.hocrtransform._hocr -    3  pikepdf.Matrix(0.24, 0, 0, -0.24, 0, 414.72)
  DEBUG ocrmypdf.hocrtransform._hocr -    3  eng
  DEBUG ocrmypdf.hocrtransform._hocr -    3  pikepdf.Matrix(1, 0, 0, 1, 2144, 363)
  DEBUG ocrmypdf.hocrtransform._hocr -    3  eng
  DEBUG ocrmypdf.hocrtransform._hocr -    3  pikepdf.Matrix(1, 0, 0, 1, 2236, 784)
  DEBUG ocrmypdf.hocrtransform._hocr -    3  eng
  DEBUG ocrmypdf.hocrtransform._hocr -    3  pikepdf.Matrix(1, 0, 0, 1, 2202, 851)
  DEBUG ocrmypdf.hocrtransform._hocr -    3  eng
  DEBUG ocrmypdf.hocrtransform._hocr -    3  pikepdf.Matrix(1, 0, 0, 1, 2299, 918)
  DEBUG ocrmypdf.hocrtransform._hocr -    3  eng
  DEBUG ocrmypdf.hocrtransform._hocr -    3  pikepdf.Matrix(1, 0, 0, 1, 2201, 1053)
  DEBUG ocrmypdf.hocrtransform._hocr -    3  eng
  DEBUG ocrmypdf.hocrtransform._hocr -    3  pikepdf.Matrix(1, 0, 0, 1, 2099, 1120)
  DEBUG ocrmypdf.hocrtransform._hocr -    3  eng
  DEBUG ocrmypdf.hocrtransform._hocr -    3  pikepdf.Matrix(1, 0, 0, 1, 2134, 1500)
  DEBUG ocrmypdf._pipeline -    5  Rasterize with png16m, rotation 0
  DEBUG ocrmypdf.subprocess -    5  Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=png16m', '-dFirstPage=5', '-dLastPage=5', '-r300.000000x300.000000', '-dPDFSTOPONERROR', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io._0jwju8h/origin.pdf']
  DEBUG ocrmypdf._graft -    3  Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 90) -> 270
  DEBUG ocrmypdf._graft -    3  Grafting
  DEBUG ocrmypdf._graft -    3  Grafting with ctm pikepdf.Matrix(6.12323e-17, 1, -1, 6.12323e-17, 414.72, 0)
  DEBUG ocrmypdf._graft -    3  Page rotation: (content, auto) -> page = (90, 0) -> 90
  DEBUG PIL.PngImagePlugin -    4  STREAM b'IHDR' 16 13
  DEBUG PIL.PngImagePlugin -    4  STREAM b'pHYs' 41 9
  DEBUG PIL.PngImagePlugin -    4  STREAM b'IDAT' 62 65536
  DEBUG ocrmypdf._pipeline -    4  resolution (299.9994, 299.9994)
  DEBUG PIL.PngImagePlugin -    5  STREAM b'IHDR' 16 13
  DEBUG PIL.PngImagePlugin -    5  STREAM b'sRGB' 41 1
  DEBUG PIL.PngImagePlugin -    5  STREAM b'pHYs' 54 9
  DEBUG PIL.PngImagePlugin -    5  STREAM b'tEXt' 75 32
  DEBUG PIL.PngImagePlugin -    5  STREAM b'IDAT' 119 8192
  DEBUG ocrmypdf.subprocess -    4  Running: ['tesseract', '-l', 'eng', '/tmp/ocrmypdf.io._0jwju8h/000004_ocr.png', '/tmp/ocrmypdf.io._0jwju8h/000004_ocr_hocr', 'hocr', 'txt']
   INFO ocrmypdf._exec.tesseract -    4  [tesseract] [DS] Profile read from file (tesseract_opencl_profile_devices.dat).
   INFO ocrmypdf._exec.tesseract -    4  [tesseract] [DS] Device[1] 0:(null) score is 0.197885
   INFO ocrmypdf._exec.tesseract -    4  [tesseract] [DS] Selected Device[1]: "(null)" (Native)
  DEBUG ocrmypdf.hocrtransform._hocr -    4  pikepdf.Matrix(0.24, 0, 0, -0.24, 0, 414.72)
  DEBUG ocrmypdf.hocrtransform._hocr -    4  pikepdf.Matrix(1, 0, 0, 1, 0, 1728)
  DEBUG ocrmypdf._graft -    4  Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 90) -> 270
  DEBUG ocrmypdf._graft -    4  Grafting
  DEBUG ocrmypdf._graft -    4  Grafting with ctm pikepdf.Matrix(6.12323e-17, 1, -1, 6.12323e-17, 414.72, 0)
  DEBUG ocrmypdf._graft -    4  Page rotation: (content, auto) -> page = (90, 0) -> 90
  DEBUG PIL.PngImagePlugin -    5  STREAM b'IHDR' 16 13
  DEBUG PIL.PngImagePlugin -    5  STREAM b'pHYs' 41 9
  DEBUG PIL.PngImagePlugin -    5  STREAM b'IDAT' 62 65536
  DEBUG ocrmypdf._pipeline -    5  resolution (299.9994, 299.9994)
  DEBUG ocrmypdf.subprocess -    5  Running: ['tesseract', '-l', 'eng', '/tmp/ocrmypdf.io._0jwju8h/000005_ocr.png', '/tmp/ocrmypdf.io._0jwju8h/000005_ocr_hocr', 'hocr', 'txt']
   INFO ocrmypdf._exec.tesseract -    5  [tesseract] [DS] Profile read from file (tesseract_opencl_profile_devices.dat).
   INFO ocrmypdf._exec.tesseract -    5  [tesseract] [DS] Device[1] 0:(null) score is 0.197885
   INFO ocrmypdf._exec.tesseract -    5  [tesseract] [DS] Selected Device[1]: "(null)" (Native)
  DEBUG ocrmypdf.hocrtransform._hocr -    5  pikepdf.Matrix(0.24, 0, 0, -0.24, 0, 414.72)
  DEBUG ocrmypdf.hocrtransform._hocr -    5  pikepdf.Matrix(1, 0, 0, 1, 0, 1728)
  DEBUG ocrmypdf._graft -    5  Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 90) -> 270
  DEBUG ocrmypdf._graft -    5  Grafting
  DEBUG ocrmypdf._graft -    5  Grafting with ctm pikepdf.Matrix(6.12323e-17, 1, -1, 6.12323e-17, 414.72, 0)
  DEBUG ocrmypdf._graft -    5  Page rotation: (content, auto) -> page = (90, 0) -> 90

   INFO ocrmypdf._pipelines.ocr - Postprocessing...
  DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--version']
  DEBUG ocrmypdf.optimize - xref 21: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - Recursing into Form XObject /OCR-CULJawvqh9ta0pjNun0Uiw in page 0
  DEBUG ocrmypdf.optimize - xref 22: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 20: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 19: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - Recursing into Form XObject /OCR-qntw7-IZlNfdI8oY_HnKZA in page 1
  DEBUG ocrmypdf.optimize - xref 27: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 25: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 28: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 26: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 32: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 33: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 31: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 34: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - Recursing into Form XObject /OCR-ze9_Ru8HAjoQB4Gg_zb23Q in page 2
  DEBUG ocrmypdf.optimize - xref 39: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 37: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 40: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - Recursing into Form XObject /OCR-r9juWJ2yacyE0bNHx8TAIw in page 3
  DEBUG ocrmypdf.optimize - xref 38: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - Recursing into Form XObject /OCR-DfGL5-UczTlYS58sUZ-Zgw in page 4
  DEBUG ocrmypdf.optimize - xref 45: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 44: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 46: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 43: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 20: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.optimize - XrefExt(xref=20, ext='.png')
  DEBUG ocrmypdf.optimize - xref 22: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.optimize - XrefExt(xref=22, ext='.png')
  DEBUG ocrmypdf.optimize - xref 25: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.optimize - XrefExt(xref=25, ext='.png')
  DEBUG ocrmypdf.optimize - xref 27: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.optimize - XrefExt(xref=27, ext='.png')
  DEBUG ocrmypdf.optimize - xref 31: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.optimize - XrefExt(xref=31, ext='.png')
  DEBUG ocrmypdf.optimize - xref 33: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.optimize - XrefExt(xref=33, ext='.png')
  DEBUG ocrmypdf.optimize - xref 37: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.optimize - XrefExt(xref=37, ext='.png')
  DEBUG ocrmypdf.optimize - xref 39: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.optimize - XrefExt(xref=39, ext='.png')
  DEBUG ocrmypdf.optimize - xref 43: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.optimize - XrefExt(xref=43, ext='.png')
  DEBUG ocrmypdf.optimize - xref 45: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.optimize - XrefExt(xref=45, ext='.png')
  DEBUG ocrmypdf.optimize - Optimizable images: JPEGs: 0 PNGs: 10

  DEBUG ocrmypdf.optimize - xref 21: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - Recursing into Form XObject /OCR-CULJawvqh9ta0pjNun0Uiw in page 0
  DEBUG ocrmypdf.optimize - xref 22: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 20: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 19: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - Recursing into Form XObject /OCR-qntw7-IZlNfdI8oY_HnKZA in page 1
  DEBUG ocrmypdf.optimize - xref 27: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 25: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 28: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 26: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 32: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 33: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 31: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 34: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - Recursing into Form XObject /OCR-ze9_Ru8HAjoQB4Gg_zb23Q in page 2
  DEBUG ocrmypdf.optimize - xref 39: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 37: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 40: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - Recursing into Form XObject /OCR-r9juWJ2yacyE0bNHx8TAIw in page 3
  DEBUG ocrmypdf.optimize - xref 38: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - Recursing into Form XObject /OCR-DfGL5-UczTlYS58sUZ-Zgw in page 4
  DEBUG ocrmypdf.optimize - xref 45: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 44: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 46: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 43: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 20: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.optimize - xref 20: marking this JPEG as deflatable
  DEBUG ocrmypdf.optimize - xref 22: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.optimize - xref 22: marking this JPEG as deflatable
  DEBUG ocrmypdf.optimize - xref 25: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.optimize - xref 25: marking this JPEG as deflatable
  DEBUG ocrmypdf.optimize - xref 27: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.optimize - xref 27: marking this JPEG as deflatable
  DEBUG ocrmypdf.optimize - xref 31: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.optimize - xref 31: marking this JPEG as deflatable
  DEBUG ocrmypdf.optimize - xref 33: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.optimize - xref 33: marking this JPEG as deflatable
  DEBUG ocrmypdf.optimize - xref 37: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.optimize - xref 37: marking this JPEG as deflatable
  DEBUG ocrmypdf.optimize - xref 39: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.optimize - xref 39: marking this JPEG as deflatable
  DEBUG ocrmypdf.optimize - xref 43: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.optimize - xref 43: marking this JPEG as deflatable
  DEBUG ocrmypdf.optimize - xref 45: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.optimize - xref 45: marking this JPEG as deflatable

  DEBUG ocrmypdf.optimize - xref 21: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - Recursing into Form XObject /OCR-CULJawvqh9ta0pjNun0Uiw in page 0
  DEBUG ocrmypdf.optimize - xref 22: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 20: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 19: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - Recursing into Form XObject /OCR-qntw7-IZlNfdI8oY_HnKZA in page 1
  DEBUG ocrmypdf.optimize - xref 27: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 25: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 28: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 26: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 32: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 33: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 31: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 34: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - Recursing into Form XObject /OCR-ze9_Ru8HAjoQB4Gg_zb23Q in page 2
  DEBUG ocrmypdf.optimize - xref 39: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 37: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 40: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - Recursing into Form XObject /OCR-r9juWJ2yacyE0bNHx8TAIw in page 3
  DEBUG ocrmypdf.optimize - xref 38: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - Recursing into Form XObject /OCR-DfGL5-UczTlYS58sUZ-Zgw in page 4
  DEBUG ocrmypdf.optimize - xref 45: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 44: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 46: treating as an optimization candidate
  DEBUG ocrmypdf.optimize - xref 43: treating as an optimization candidate
  DEBUG ocrmypdf.subprocess - Running: ['jbig2', '--version']
  DEBUG ocrmypdf.optimize - xref 19: skipping image with unsupported Decode table
  DEBUG ocrmypdf.optimize - xref 20: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.subprocess - Running: ['jbig2', '--version']
  DEBUG ocrmypdf.optimize - xref 21: skipping image with unsupported Decode table
  DEBUG ocrmypdf.optimize - xref 22: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.optimize - xref 25: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.subprocess - Running: ['jbig2', '--version']
  DEBUG ocrmypdf.optimize - xref 26: skipping image with unsupported Decode table
  DEBUG ocrmypdf.optimize - xref 27: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.subprocess - Running: ['jbig2', '--version']
  DEBUG ocrmypdf.optimize - xref 28: skipping image with unsupported Decode table
  DEBUG ocrmypdf.optimize - xref 31: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.subprocess - Running: ['jbig2', '--version']
  DEBUG ocrmypdf.optimize - xref 32: skipping image with unsupported Decode table
  DEBUG ocrmypdf.optimize - xref 33: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.subprocess - Running: ['jbig2', '--version']
  DEBUG ocrmypdf.optimize - xref 34: skipping image with unsupported Decode table
  DEBUG ocrmypdf.optimize - xref 37: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.subprocess - Running: ['jbig2', '--version']
  DEBUG ocrmypdf.optimize - xref 38: skipping image with unsupported Decode table
  DEBUG ocrmypdf.optimize - xref 39: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.subprocess - Running: ['jbig2', '--version']
  DEBUG ocrmypdf.optimize - xref 40: skipping image with unsupported Decode table
  DEBUG ocrmypdf.optimize - xref 43: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.subprocess - Running: ['jbig2', '--version']
  DEBUG ocrmypdf.optimize - xref 44: skipping image with unsupported Decode table
  DEBUG ocrmypdf.optimize - xref 45: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
  DEBUG ocrmypdf.subprocess - Running: ['jbig2', '--version']
  DEBUG ocrmypdf.optimize - xref 46: skipping image with unsupported Decode table
  DEBUG ocrmypdf.optimize - Optimizable images: JBIG2 groups: 0

  DEBUG ocrmypdf.helpers - os.symlink(/tmp/ocrmypdf.io._0jwju8h/optimize.opt.pdf, /tmp/ocrmypdf.io._0jwju8h/optimize.pdf)
  DEBUG ocrmypdf.subprocess - Running: ['jbig2', '--version']
  DEBUG ocrmypdf.subprocess - Running: ['pngquant', '--version']
   INFO ocrmypdf._pipeline - Image optimization ratio: 1.01 savings: 1.0%
   INFO ocrmypdf._pipeline - Total file size ratio: 1.00 savings: -0.4%
  DEBUG ocrmypdf._pipeline - /tmp/ocrmypdf.io._0jwju8h/optimize.pdf -> /tmp/output_1963043256439923823.pdf
  ERROR ocrmypdf.helpers - WARNING: /tmp/output_1963043256439923823.pdf (offset 141189): error decoding stream data for object 21 0: Not a JPEG file: starts with 0x78 0x9c
WARNING ocrmypdf.helpers - WARNING: /tmp/output_1963043256439923823.pdf (offset 141189): stream will be re-processed without filtering to avoid data loss
  ERROR ocrmypdf.helpers - WARNING: /tmp/output_1963043256439923823.pdf (offset 148801): error decoding stream data for object 26 0: Not a JPEG file: starts with 0x78 0x9c
WARNING ocrmypdf.helpers - WARNING: /tmp/output_1963043256439923823.pdf (offset 148801): stream will be re-processed without filtering to avoid data loss
  ERROR ocrmypdf.helpers - WARNING: /tmp/output_1963043256439923823.pdf (offset 243867): error decoding stream data for object 31 0: Not a JPEG file: starts with 0x78 0x9c
WARNING ocrmypdf.helpers - WARNING: /tmp/output_1963043256439923823.pdf (offset 243867): stream will be re-processed without filtering to avoid data loss
  ERROR ocrmypdf.helpers - WARNING: /tmp/output_1963043256439923823.pdf (offset 315601): error decoding stream data for object 36 0: Not a JPEG file: starts with 0x78 0x9c
WARNING ocrmypdf.helpers - WARNING: /tmp/output_1963043256439923823.pdf (offset 315601): stream will be re-processed without filtering to avoid data loss
WARNING ocrmypdf._pipelines._common - Output file: The generated PDF is INVALID
    at stirling.software.SPDF.utils.ProcessExecutor.runCommandWithOutputHandling(ProcessExecutor.java:190)
    at stirling.software.SPDF.utils.ProcessExecutor.runCommandWithOutputHandling(ProcessExecutor.java:85)
    at stirling.software.SPDF.controller.api.misc.OCRController.processPdfWithOCR(OCRController.java:148)
    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
    at java.base/java.lang.reflect.Method.invoke(Method.java:580)
    at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:255)
    at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:188)
    at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:118)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:926)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:831)
    at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87)
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1089)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:979)
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1014)
    at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:914)
    at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:547)
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:885)
    at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:614)
    at org.eclipse.jetty.ee10.servlet.ServletHolder.handle(ServletHolder.java:736)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1614)
    at org.eclipse.jetty.ee10.websocket.servlet.WebSocketUpgradeFilter.doFilter(WebSocketUpgradeFilter.java:195)
    at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1586)
    at stirling.software.SPDF.config.MetricsFilter.doFilterInternal(MetricsFilter.java:61)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
    at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1586)
    at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
    at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1586)
    at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
    at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1586)
    at org.springframework.web.filter.ServerHttpObservationFilter.doFilterInternal(ServerHttpObservationFilter.java:113)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
    at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1586)
    at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
    at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1586)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$MappedServlet.handle(ServletHandler.java:1547)
    at org.eclipse.jetty.ee10.servlet.ServletChannel.dispatch(ServletChannel.java:824)
    at org.eclipse.jetty.ee10.servlet.ServletChannel.handle(ServletChannel.java:436)
    at org.eclipse.jetty.ee10.servlet.ServletHandler.handle(ServletHandler.java:464)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:575)
    at org.eclipse.jetty.ee10.servlet.SessionHandler.handle(SessionHandler.java:703)
    at org.eclipse.jetty.server.handler.ContextHandler.handle(ContextHandler.java:858)
    at org.eclipse.jetty.server.Server.handle(Server.java:181)
    at org.eclipse.jetty.server.internal.HttpChannelState$HandlerInvoker.run(HttpChannelState.java:648)
    at org.eclipse.jetty.server.internal.HttpConnection.onFillable(HttpConnection.java:403)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:322)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:99)
    at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:478)
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:441)
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:293)
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.run(AdaptiveExecutionStrategy.java:201)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:311)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:979)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1209)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1164)
    at java.base/java.lang.Thread.run(Thread.java:1583)
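
A note on the `Not a JPEG file: starts with 0x78 0x9c` errors near the end of the log above: `0x78 0x9c` is the usual two-byte header of a zlib (Flate) stream at default compression, whereas a JPEG starts with `0xFF 0xD8`. That suggests the decoder was handed data that was still Flate-compressed where it expected raw JPEG data, though the log alone does not say why. A minimal Python sketch of that byte-level check follows; the helper name `sniff_stream` is hypothetical and not part of ocrmypdf or Stirling-PDF.

```python
import zlib

JPEG_SOI = b"\xff\xd8"  # JPEG start-of-image marker


def sniff_stream(data: bytes) -> str:
    """Guess the encoding of a raw stream from its first two bytes.

    Hypothetical helper for illustration only.
    """
    if data[:2] == JPEG_SOI:
        return "jpeg"
    # zlib header (RFC 1950): low nibble of byte 0 is 8 (deflate) and the
    # two header bytes, read as a big-endian integer, are a multiple of 31.
    if len(data) >= 2 and data[0] & 0x0F == 8 and ((data[0] << 8) | data[1]) % 31 == 0:
        return "zlib"
    return "unknown"


# The bytes quoted in the log are recognised as a zlib header.
print(sniff_stream(b"\x78\x9c" + b"rest of stream"))

# A (fake) JPEG payload wrapped in Flate looks like zlib until it is inflated.
wrapped = zlib.compress(JPEG_SOI + b"fake jpeg payload")
print(sniff_stream(wrapped))
print(sniff_stream(zlib.decompress(wrapped)))
```

This matches the `/FlateDecode /DCTDecode` filter chain the optimizer reports for the same images: the JPEG data sits inside a Flate layer that has to be removed first.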

Additional Information

No response

Browsers Affected

Microsoft Edge

No Duplicate of the Issue

Frooodle commented 1 month ago

Can you reproduce it on other public instances like https://stirlingpdf.io/ or https://pdf.adminforge.de/?

And does it happen for all PDFs? Are you able to share one?

Frooodle commented 1 month ago

And which settings are you running for the OCR?

jotpunktopunkt commented 1 month ago

[image: screenshot of the OCR settings]

those are the settings i'm using. i can't share the pdf file in question, but i will make a test pdf and see if i can reproduce it, so i can share that. i'll report back asap

and if it is of relevance: it's a proxmox vm running dietpi with 2 cores, with docker and the portainer agent installed. docker compose is being used with the config above. i access stirling pdf via nginx proxy manager

jotpunktopunkt commented 1 month ago

interestingly, i get working documents when i use the "correct skewed angle" option [image: screenshot of the option]

maybe, if it is relevant, all pdfs "OCR'ed" so far are scans of papers...so yes, they could have been slightly skewed, but i don't know why that would trigger an exit code 4 and an invalid file

edit: would it help you if i'd send you one of the files? is there a way to pm it or such?

Frooodle commented 1 month ago

You can send it to me on Discord, but no other way. What happens if you choose OCR mode as Forced? Does that also work?

jotpunktopunkt commented 1 month ago

force ocr does work, thank you! but why is that? what's the difference, and what mistake on my side does exit code 4 point to? should i prep files differently?
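
On the exit code itself: as far as the OCRmyPDF documentation describes it, exit code 4 means an output file was written but did not pass validation as a PDF, which matches the `The generated PDF is INVALID` warning in the log. A minimal Python sketch of such a lookup follows. The table is a local copy written for illustration, not imported from ocrmypdf, and the names `OCRMYPDF_EXIT_CODES` and `describe_exit_code` are hypothetical.

```python
# Local lookup table based on OCRmyPDF's documented exit codes
# (an assumption for illustration; check the docs for your version).
OCRMYPDF_EXIT_CODES = {
    0: "ok",
    1: "bad_args",
    2: "input_file",
    3: "missing_dependency",
    4: "invalid_output_pdf",
    5: "file_access_error",
    6: "already_done_ocr",
    7: "child_process_error",
    8: "encrypted_pdf",
    9: "invalid_config",
    10: "pdfa_conversion_failed",
    15: "other_error",
}


def describe_exit_code(code: int) -> str:
    """Return a short name for an ocrmypdf exit code, or a fallback."""
    return OCRMYPDF_EXIT_CODES.get(code, f"unrecognised exit code {code}")


print(describe_exit_code(4))
```

So the code points at the output file failing validation rather than at how the input was prepared.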

Frooodle commented 1 month ago

I'm not actually sure what the cause is. Normally it's font issues, and Forced mode converts each page into an image before running OCR.
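
For anyone wanting to try the same workaround against ocrmypdf directly, here is a minimal Python sketch that assembles the command line. It assumes the `ocrmypdf` CLI with its `-l`, `--force-ocr` and `--deskew` options. The function name `build_ocrmypdf_command` and the file names are hypothetical, and this is not the exact invocation Stirling-PDF uses internally.

```python
from typing import List, Sequence


def build_ocrmypdf_command(
    input_pdf: str,
    output_pdf: str,
    languages: Sequence[str] = ("eng",),
    force_ocr: bool = False,
    deskew: bool = False,
) -> List[str]:
    """Assemble an ocrmypdf argument list (illustrative helper only)."""
    # ocrmypdf accepts several languages joined with '+', e.g. "deu+eng".
    cmd = ["ocrmypdf", "-l", "+".join(languages)]
    if force_ocr:
        # Rasterize every page and OCR it, even if it already has text.
        cmd.append("--force-ocr")
    if deskew:
        # Straighten slightly rotated scans before OCR.
        cmd.append("--deskew")
    cmd += [input_pdf, output_pdf]
    return cmd


print(build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf", ["deu", "eng"], force_ocr=True))
```

The resulting list can be passed to `subprocess.run(...)` inside the container. Both workarounds reported in this thread (Forced mode and the skew correction) make ocrmypdf re-render the page images, which may be why either one avoids the invalid output, but that is a guess rather than a confirmed diagnosis.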