Stirling-Tools / Stirling-PDF

#1 Locally hosted web application that allows you to perform various operations on PDF files
MIT License

OCR Error, No such file or directory: '/tmp/ocrmypdf.io.fez4ih5m/000001_ocr_hocr.hocr' #1368

StoiaCode closed 1 month ago

StoiaCode commented 3 months ago

I feel like I'm missing something obvious, but I can't quite pin it down. No matter what language or what PDF I use, I get the following error.

java.io.IOException: Command process failed with exit code 15. Error message:   DEBUG ocrmypdf - ocrmypdf 16.1.1
  DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--version']
  DEBUG ocrmypdf.subprocess - Found tesseract 5.3.4
  DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--version']
  DEBUG ocrmypdf.subprocess - Running: ['gs', '--version']
  DEBUG ocrmypdf.subprocess - Found gs 10.2.1
  DEBUG ocrmypdf.subprocess - Running: ['gs', '--version']
  DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--list-langs']
  DEBUG ocrmypdf.subprocess.tesseract - stdout/stderr = [DS] Profile read from file (tesseract_opencl_profile_devices.dat).
[DS] Device[1] 0:(null) score is 0.261972
[DS] Selected Device[1]: "(null)" (Native)
List of available languages in "/usr/share/tessdata/" (41):
deu
deu_frak
deu_latf
eng
script/Arabic
script/Armenian
script/Bengali
script/Canadian_Aboriginal
script/Cherokee
script/Cyrillic
script/Devanagari
script/Ethiopic
script/Fraktur
script/Georgian
script/Greek
script/Gujarati
script/Gurmukhi
script/HanS
script/HanS_vert
script/HanT
script/HanT_vert
script/Hangul
script/Hangul_vert
script/Hebrew
script/Japanese
script/Japanese_vert
script/Kannada
script/Khmer
script/Lao
script/Latin
script/Malayalam
script/Myanmar
script/Oriya
script/Sinhala
script/Syriac
script/Tamil
script/Telugu
script/Thaana
script/Thai
script/Tibetan
script/Vietnamese

  DEBUG ocrmypdf.helpers - pikepdf mmap enabled
  DEBUG ocrmypdf.helpers - os.symlink(/tmp/input_13094943151699031899.pdf, /tmp/ocrmypdf.io.fez4ih5m/origin)
  DEBUG ocrmypdf.helpers - os.symlink(/tmp/ocrmypdf.io.fez4ih5m/origin, /tmp/ocrmypdf.io.fez4ih5m/origin.pdf)
  DEBUG root - Gathering info with 1 thread workers
  DEBUG ocrmypdf.helpers - pikepdf mmap enabled

  DEBUG ocrmypdf.builtin_plugins.tesseract_ocr - Using Tesseract OpenMP thread limit 3
  DEBUG ocrmypdf.helpers - pikepdf mmap enabled
  DEBUG ocrmypdf._pipeline -    1  Rasterize with pngmono, rotation 0
  DEBUG ocrmypdf.subprocess -    1  Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=pngmono', '-dFirstPage=1', '-dLastPage=1', '-r200.183607x200.183607', '-dPDFSTOPONERROR', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.fez4ih5m/origin.pdf']
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IHDR' 16 13
  DEBUG PIL.PngImagePlugin -    1  STREAM b'iCCP' 41 2296
  DEBUG PIL.PngImagePlugin -    1  iCCP profile name b'default_gray.icc'
  DEBUG PIL.PngImagePlugin -    1  Compression method 0
  DEBUG PIL.PngImagePlugin -    1  STREAM b'pHYs' 2349 9
  DEBUG PIL.PngImagePlugin -    1  STREAM b'tEXt' 2370 32
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IDAT' 2414 8192
  DEBUG ocrmypdf._exec.ghostscript -    1  Rotating output by 0
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IHDR' 16 13
  DEBUG PIL.PngImagePlugin -    1  STREAM b'iCCP' 41 2291
  DEBUG PIL.PngImagePlugin -    1  iCCP profile name b'ICC Profile'
  DEBUG PIL.PngImagePlugin -    1  Compression method 0
  DEBUG PIL.PngImagePlugin -    1  STREAM b'pHYs' 2344 9
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IDAT' 2365 30609
  DEBUG ocrmypdf._pipeline -    1  resolution (200.1774, 200.1774)
  DEBUG ocrmypdf.subprocess -    1  Running: ['tesseract', '-l', 'eng', '/tmp/ocrmypdf.io.fez4ih5m/000001_ocr.png', '/tmp/ocrmypdf.io.fez4ih5m/000001_ocr_hocr', 'hocr', 'txt']
   INFO ocrmypdf._exec.tesseract -    1  [tesseract] [DS] Profile read from file (tesseract_opencl_profile_devices.dat).
   INFO ocrmypdf._exec.tesseract -    1  [tesseract] [DS] Device[1] 0:(null) score is 0.261972
   INFO ocrmypdf._exec.tesseract -    1  [tesseract] [DS] Selected Device[1]: "(null)" (Native)
  ERROR ocrmypdf._exec.tesseract -    1  [tesseract] read_params_file: Can't open hocr
  ERROR ocrmypdf._exec.tesseract -    1  [tesseract] read_params_file: Can't open txt

  ERROR ocrmypdf._pipelines._common - An exception occurred while executing the pipeline
Traceback (most recent call last):
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py", line 249, in cli_exception_handler
    return fn(options, plugin_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 191, in _run_pipeline
    optimize_messages = exec_concurrent(context, executor)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 118, in exec_concurrent
    executor(
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__
    self._execute(
  File "/usr/lib/python3.12/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 144, in _execute
    result = future.result()
             ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 82, in _exec_page_sync
    ocr_out, text_out = _image_to_ocr_text(page_context, ocr_image_out)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 64, in _image_to_ocr_text
    ocr_out = render_hocr_page(hocr_out, page_context)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 750, in render_hocr_page
    if hocr.stat().st_size == 0:
       ^^^^^^^^^^^
  File "/usr/lib/python3.12/pathlib.py", line 840, in stat
    return os.stat(self, follow_symlinks=follow_symlinks)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ocrmypdf.io.fez4ih5m/000001_ocr_hocr.hocr'
    at stirling.software.SPDF.utils.ProcessExecutor.runCommandWithOutputHandling(ProcessExecutor.java:190)
    at stirling.software.SPDF.utils.ProcessExecutor.runCommandWithOutputHandling(ProcessExecutor.java:85)
    at stirling.software.SPDF.controller.api.misc.OCRController.processPdfWithOCR(OCRController.java:148)
    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
    at java.base/java.lang.reflect.Method.invoke(Method.java:580)
    at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:255)
    at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:188)
    at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:118)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:925)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:830)
    at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87)
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1089)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:979)
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1014)
    at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:914)
    at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:547)
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:885)
    at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:614)
    at org.eclipse.jetty.ee10.servlet.ServletHolder.handle(ServletHolder.java:736)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1614)
    at org.eclipse.jetty.ee10.websocket.servlet.WebSocketUpgradeFilter.doFilter(WebSocketUpgradeFilter.java:195)
    at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1586)
    at stirling.software.SPDF.config.MetricsFilter.doFilterInternal(MetricsFilter.java:62)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
    at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1586)
    at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
    at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1586)
    at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
    at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1586)
    at org.springframework.web.filter.ServerHttpObservationFilter.doFilterInternal(ServerHttpObservationFilter.java:109)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
    at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1586)
    at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
    at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1586)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$MappedServlet.handle(ServletHandler.java:1547)
    at org.eclipse.jetty.ee10.servlet.ServletChannel.dispatch(ServletChannel.java:814)
    at org.eclipse.jetty.ee10.servlet.ServletChannel.handle(ServletChannel.java:431)
    at org.eclipse.jetty.ee10.servlet.ServletHandler.handle(ServletHandler.java:464)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:571)
    at org.eclipse.jetty.ee10.servlet.SessionHandler.handle(SessionHandler.java:703)
    at org.eclipse.jetty.server.handler.ContextHandler.handle(ContextHandler.java:765)
    at org.eclipse.jetty.server.Server.handle(Server.java:179)
    at org.eclipse.jetty.server.internal.HttpChannelState$HandlerInvoker.run(HttpChannelState.java:619)
    at org.eclipse.jetty.server.internal.HttpConnection.onFillable(HttpConnection.java:411)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:322)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:99)
    at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:478)
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:441)
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:293)
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.run(AdaptiveExecutionStrategy.java:201)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:410)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:971)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1201)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1156)
    at java.base/java.lang.Thread.run(Thread.java:1583)

Frooodle commented 3 months ago

What does your /usr/share/tessdata look like?

StoiaCode commented 3 months ago

[image]

Inside the script folder is this (I guess I could clean that up, but does that matter? I'll try): [image]

And the tessconfigs folder is just empty.
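
The two "read_params_file: Can't open hocr" and "Can't open txt" errors in the log, together with the empty tessconfigs folder, suggest that Tesseract could not find its hocr and txt config files, which would explain why no .hocr output was ever written. A minimal Python sketch to check for them follows. It assumes Tesseract resolves config names under the configs/ and tessconfigs/ subfolders of the tessdata directory; the function name missing_tesseract_configs is hypothetical and not part of Stirling-PDF or OCRmyPDF.

```python
from pathlib import Path

# Assumption: Tesseract resolves a config name such as "hocr" against these
# subdirectories of the tessdata directory.
CONFIG_SUBDIRS = ("configs", "tessconfigs")


def missing_tesseract_configs(tessdata_dir, names=("hocr", "txt")):
    """Return the config names that are not found under tessdata_dir."""
    root = Path(tessdata_dir)
    missing = []
    for name in names:
        candidates = [root / sub / name for sub in CONFIG_SUBDIRS]
        if not any(path.is_file() for path in candidates):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Path taken from the log above; adjust it for your container or host.
    print(missing_tesseract_configs("/usr/share/tessdata"))
```

An empty list would mean both config files are present and the cause lies elsewhere; any names printed are the ones Tesseract would likely fail to open.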

Frooodle commented 3 months ago

How did you get these language packs?

StoiaCode commented 3 months ago

I downloaded them from the GitHub repo linked in the OCR guide. I took the easy way: I downloaded the entire repo as a zip and transferred the files via SFTP.

Frooodle commented 1 month ago

Sorry for the late reply. Does this still happen on the latest version? Closing the ticket for now, but happy to re-open it if you can still reproduce. It might be caused by the exact language files you use and/or the directory paths.
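
When re-testing on a newer version, the ERROR lines in the OCRmyPDF output are the quickest way to see whether Tesseract still fails to open its configs. A small Python sketch that pulls them out of a saved log is shown here. It assumes the "LEVEL logger - message" layout seen in the log above; error_lines is a hypothetical helper, not an OCRmyPDF or Stirling-PDF function.

```python
import re

# Matches lines such as:
#   "  ERROR ocrmypdf._exec.tesseract -    1  [tesseract] read_params_file: Can't open hocr"
ERROR_LINE = re.compile(r"^\s*ERROR\s+(?P<logger>\S+)\s+-\s+(?P<message>.*)$")


def error_lines(log_text):
    """Return (logger, message) pairs for every ERROR-level log line."""
    found = []
    for line in log_text.splitlines():
        match = ERROR_LINE.match(line)
        if match:
            found.append((match.group("logger"), match.group("message").strip()))
    return found
```

Running it over the log in the first comment should surface the two read_params_file lines and the pipeline exception, without the DEBUG noise or the Java stack trace.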