Stirling-Tools / Stirling-PDF

#1 Locally hosted web application that allows you to perform various operations on PDF files
https://stirlingpdf.com
MIT License

[Bug]: PDFs that are indigestible to OCRmyPDF cause error messages #1470

Open noseshimself opened 5 months ago

noseshimself commented 5 months ago

The Problem

If a PDF file has an "unusual color space" (whatever that is), OCRmyPDF needs additional parameters to process it and fails with this message: "Vibration-Sensor_Manuals_EU.pdf: Error occurred while consuming document Vibration-Sensor_Manuals_EU.pdf: ColorConversionNeededError: The input PDF has an unusual color space. Use --color-conversion-strategy to convert to a common color space such as RGB, or use --output-type pdf to skip PDF/A conversion and retain the original color space." As expected, tools like PDF-to-PDF/A throw errors on such a file, too.

An example file will be attached. I hope. Vibration-Sensor_Manuals_EU.pdf

Version of Stirling-PDF

0.20.2

Last Working Version of Stirling-PDF

No response

Page Where the Problem Occurred

No response

Docker Configuration

No response

Relevant Log Output

java.io.IOException: Command process failed with exit code 1
    at stirling.software.SPDF.utils.ProcessExecutor.runCommandWithOutputHandling(ProcessExecutor.java:192)
    at stirling.software.SPDF.utils.ProcessExecutor.runCommandWithOutputHandling(ProcessExecutor.java:82)
    at stirling.software.SPDF.controller.api.converters.ConvertPDFToPDFA.pdfToPdfA(ConvertPDFToPDFA.java:56)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:261)
    at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:189)
    at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:118)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:917)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:829)
    at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87)
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1089)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:979)
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1014)
    at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:914)
    at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:590)
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:885)
    at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:658)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:205)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:149)
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:51)

Additional Information

Yes, the color space of this file is completely messed up. I wonder whether it is worth checking for this before passing files to external tools, and using different parameters in that case (see the OCRmyPDF message).

Browsers Affected

No response

No Duplicate of the Issue

Frooodle commented 5 months ago

I can't reproduce this on the latest Stirling-PDF version using your example file. I wonder if this was patched in a later OCRmyPDF version that Docker has pulled.

Frooodle commented 5 months ago

Are you able to update and see if you can still reproduce it? I also tried on our public instance and saw no issue there either: https://stirlingpdf.io/ocr-pdf

noseshimself commented 5 months ago

I'm now getting the expected error message:

Error
Internal Server Error:java.io.IOException: Command process failed with exit code 1. Error message:
Start processing 4 pages concurrently
1 skipping all processing on this page
3 skipping all processing on this page
4 skipping all processing on this page
5 skipping all processing on this page
6 skipping all processing on this page
7 skipping all processing on this page
8 skipping all processing on this page
9 skipping all processing on this page
10 skipping all processing on this page
11 skipping all processing on this page
12 skipping all processing on this page
13 skipping all processing on this page
14 skipping all processing on this page
15 skipping all processing on this page
16 skipping all processing on this page
17 skipping all processing on this page
18 skipping all processing on this page
19 skipping all processing on this page
20 skipping all processing on this page
21 skipping all processing on this page
22 skipping all processing on this page
23 skipping all processing on this page
24 skipping all processing on this page
25 skipping all processing on this page
26 skipping all processing on this page
27 skipping all processing on this page
28 skipping all processing on this page
29 skipping all processing on this page
30 skipping all processing on this page
31 skipping all processing on this page
32 skipping all processing on this page
33 skipping all processing on this page
34 skipping all processing on this page
35 skipping all processing on this page
36 skipping all processing on this page
37 skipping all processing on this page
38 skipping all processing on this page
39 skipping all processing on this page
40 skipping all processing on this page
41 skipping all processing on this page
42 skipping all processing on this page
43 skipping all processing on this page
44 skipping all processing on this page
45 skipping all processing on this page
46 skipping all processing on this page
47 skipping all processing on this page
48 skipping all processing on this page
49 skipping all processing on this page
50 skipping all processing on this page
51 skipping all processing on this page
Postprocessing...
ColorConversionNeededError: The input PDF has an unusual color space. Use --color-conversion-strategy to convert to a common color space such as RGB, or use --output-type pdf to skip PDF/A conversion and retain the original color space.


See the last few words: ColorConversionNeededError: The input PDF has an unusual color space. Use --color-conversion-strategy to convert to a common color space such as RGB, or use --output-type pdf to skip PDF/A conversion and retain the original color space.

I guess additional parameters are necessary in this case.
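The "additional parameters" idea can be sketched as a retry loop. This is only an illustration in Python, not Stirling-PDF's actual (Java) code: the helper names `build_attempts`, `run_subprocess` and `ocr_with_fallback` are hypothetical, and the assumption is that OCRmyPDF is invoked as `ocrmypdf <input> <output>`. The two extra flags are the ones named in the error message above.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

# Marker text taken from the error message quoted in this thread.
COLOR_ERROR = "ColorConversionNeededError"

# A runner takes a command and returns (exit code, stderr text).
Runner = Callable[[Sequence[str]], Tuple[int, str]]


def build_attempts(src: str, dst: str) -> List[List[str]]:
    """Commands to try in order (hypothetical helper).

    The fallback flags are the two suggested by the OCRmyPDF message:
    convert colors to RGB first, then give up on PDF/A output.
    """
    return [
        ["ocrmypdf", src, dst],
        ["ocrmypdf", "--color-conversion-strategy", "RGB", src, dst],
        ["ocrmypdf", "--output-type", "pdf", src, dst],
    ]


def run_subprocess(cmd: Sequence[str]) -> Tuple[int, str]:
    """Default runner: execute the command and capture stderr."""
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    return proc.returncode, proc.stderr


def ocr_with_fallback(src: str, dst: str, runner: Runner = run_subprocess) -> List[str]:
    """Return the command that succeeded.

    Only a color-space failure triggers the next attempt; any other
    failure is reported immediately.
    """
    stderr = ""
    for cmd in build_attempts(src, dst):
        code, stderr = runner(cmd)
        if code == 0:
            return cmd
        if COLOR_ERROR not in stderr:
            break
    raise RuntimeError(f"ocrmypdf failed: {stderr.strip()}")
```

Passing the runner as a parameter keeps the retry logic testable without OCRmyPDF installed. Whether converting to RGB or skipping PDF/A is the better default is a judgment call for the maintainers; the order here simply follows the order in the error message.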

Frooodle commented 5 months ago

Still very confused that I don't see any error using your example file at https://stirlingpdf.io/ocr-pdf. Are you not using Docker?

sharifm-informatica commented 3 months ago

Tried this on https://stirlingpdf.io/ocr-pdf with another file. Might be related.

java.io.IOException: Command process failed with exit code 15. Error message:
DEBUG ocrmypdf - ocrmypdf 16.1.1
DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--version']
DEBUG ocrmypdf.subprocess - Found tesseract 5.3.4
DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--version']
DEBUG ocrmypdf.subprocess - Running: ['gs', '--version']
DEBUG ocrmypdf.subprocess - Found gs 10.3.1
DEBUG ocrmypdf.subprocess - Running: ['gs', '--version']
DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--list-langs']
DEBUG ocrmypdf.subprocess.tesseract - stdout/stderr = [DS] Profile read from file (tesseract_opencl_profile_devices.dat).
[DS] Device[1] 0:(null) score is 0.197885

List of available languages in "/usr/share/tessdata/" (2):
eng
osd

DEBUG ocrmypdf.helpers - pikepdf mmap enabled
DEBUG ocrmypdf.helpers - os.symlink(/tmp/input_13655885555280694422.pdf, /tmp/ocrmypdf.io.asboxfvx/origin)
DEBUG ocrmypdf.helpers - os.symlink(/tmp/ocrmypdf.io.asboxfvx/origin, /tmp/ocrmypdf.io.asboxfvx/origin.pdf)
DEBUG root - Gathering info with 1 thread workers
DEBUG ocrmypdf.helpers - pikepdf mmap enabled

DEBUG ocrmypdf.builtin_plugins.tesseract_ocr - Using Tesseract OpenMP thread limit 3
DEBUG ocrmypdf.helpers - pikepdf mmap enabled
DEBUG ocrmypdf._pipeline - 1 Rasterize with pngmono, rotation 0
DEBUG ocrmypdf.subprocess - 1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=pngmono', '-dFirstPage=1', '-dLastPage=1', '-r600.000000x600.000000', '-dPDFSTOPONERROR', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.asboxfvx/origin.pdf']
DEBUG PIL.PngImagePlugin - 1 STREAM b'IHDR' 16 13
DEBUG PIL.PngImagePlugin - 1 STREAM b'iCCP' 41 2296
DEBUG PIL.PngImagePlugin - 1 iCCP profile name b'default_gray.icc'
DEBUG PIL.PngImagePlugin - 1 Compression method 0
DEBUG PIL.PngImagePlugin - 1 STREAM b'pHYs' 2349 9
DEBUG PIL.PngImagePlugin - 1 STREAM b'tEXt' 2370 32
DEBUG PIL.PngImagePlugin - 1 STREAM b'IDAT' 2414 8192
DEBUG ocrmypdf._exec.ghostscript - 1 Rotating output by 0
DEBUG ocrmypdf.subprocess - 1 Running: ['tesseract', '-l', 'eng', '--psm', '2', '/tmp/ocrmypdf.io.asboxfvx/000001_rasterize.png', 'stdout']
DEBUG ocrmypdf._exec.tesseract - 1 Deskew angle: -2.807
DEBUG PIL.PngImagePlugin - 1 STREAM b'IHDR' 16 13
DEBUG PIL.PngImagePlugin - 1 STREAM b'iCCP' 41 2291
DEBUG PIL.PngImagePlugin - 1 iCCP profile name b'ICC Profile'
DEBUG PIL.PngImagePlugin - 1 Compression method 0
DEBUG PIL.PngImagePlugin - 1 STREAM b'pHYs' 2344 9
DEBUG PIL.PngImagePlugin - 1 STREAM b'IDAT' 2365 65536
DEBUG PIL.PngImagePlugin - 1 STREAM b'IHDR' 16 13
DEBUG PIL.PngImagePlugin - 1 STREAM b'iCCP' 41 2291
DEBUG PIL.PngImagePlugin - 1 iCCP profile name b'ICC Profile'
DEBUG PIL.PngImagePlugin - 1 Compression method 0
DEBUG PIL.PngImagePlugin - 1 STREAM b'pHYs' 2344 9
DEBUG PIL.PngImagePlugin - 1 STREAM b'IDAT' 2365 65536
DEBUG ocrmypdf._pipeline - 1 resolution (599.9988, 599.9988)
DEBUG ocrmypdf._pipeline - 1 convert
DEBUG PIL.PngImagePlugin - 1 STREAM b'IHDR' 16 13
DEBUG PIL.PngImagePlugin - 1 STREAM b'iCCP' 41 2291
DEBUG PIL.PngImagePlugin - 1 iCCP profile name b'ICC Profile'
DEBUG PIL.PngImagePlugin - 1 Compression method 0
DEBUG PIL.PngImagePlugin - 1 STREAM b'pHYs' 2344 9
DEBUG PIL.PngImagePlugin - 1 STREAM b'IDAT' 2365 65536
DEBUG img2pdf - 1 PIL format = PNG
DEBUG img2pdf - 1 imgformat = PNG
DEBUG img2pdf - 1 input dpi = 600 x 600
DEBUG img2pdf - 1 rotation = 0°
DEBUG img2pdf - 1 input colorspace = 1
DEBUG img2pdf - 1 width x height = 1342px x 6492px
DEBUG img2pdf - 1 read_images() embeds a PNG

ERROR ocrmypdf._pipelines._common - An exception occurred while executing the pipeline
Traceback (most recent call last):
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py", line 249, in cli_exception_handler
    return fn(options, plugin_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 191, in _run_pipeline
    optimize_messages = exec_concurrent(context, executor)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 118, in exec_concurrent
    executor(
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__
    self._execute(
  File "/usr/lib/python3.12/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 144, in _execute
    result = future.result()
             ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 79, in _exec_page_sync
    ocr_image_out, pdf_page_from_image_out, orientation_correction = process_page(
                                                                     ^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py", line 407, in process_page
    pdf_page_from_image_out = create_pdf_page_from_image(
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 727, in create_pdf_page_from_image
    img2pdf.convert(
  File "/usr/lib/python3.12/site-packages/img2pdf.py", line 2739, in convert
    pagewidth, pageheight, imgwidthpdf, imgheightpdf = kwargs["layout_fun"](
                                                       ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/img2pdf.py", line 2499, in layout_fun
    raise NegativeDimensionError(
img2pdf.NegativeDimensionError: one border dimension is larger than half of the respective page dimension
    at stirling.software.SPDF.utils.ProcessExecutor.runCommandWithOutputHandling(ProcessExecutor.java:190)
    at stirling.software.SPDF.utils.ProcessExecutor.runCommandWithOutputHandling(ProcessExecutor.java:85)
    at stirling.software.SPDF.controller.api.misc.OCRController.processPdfWithOCR(OCRController.java:148)
    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
    at java.base/java.lang.reflect.Method.invoke(Method.java:580)
    at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:255)
    at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:188)
    at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:118)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:926)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:831)
    at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87)
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1089)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:979)
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1014)
    at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:914)
    at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:547)
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:885)
    at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:614)
    at org.eclipse.jetty.ee10.servlet.ServletHolder.handle(ServletHolder.java:736)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1614)
    at org.eclipse.jetty.ee10.websocket.servlet.WebSocketUpgradeFilter.doFilter(WebSocketUpgradeFilter.java:195)
    at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1586)
    at stirling.software.SPDF.config.MetricsFilter.doFilterInternal(MetricsFilter.java:61)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
    at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1586)
    at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
    at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1586)
    at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
    at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1586)
    at org.springframework.web.filter.ServerHttpObservationFilter.doFilterInternal(ServerHttpObservationFilter.java:113)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
    at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1586)
    at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
    at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1586)
    at org.eclipse.jetty.ee10.servlet.ServletHandler$MappedServlet.handle(ServletHandler.java:1547)
    at org.eclipse.jetty.ee10.servlet.ServletChannel.dispatch(ServletChannel.java:824)
    at org.eclipse.jetty.ee10.servlet.ServletChannel.handle(ServletChannel.java:436)
    at org.eclipse.jetty.ee10.servlet.ServletHandler.handle(ServletHandler.java:464)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:575)
    at org.eclipse.jetty.ee10.servlet.SessionHandler.handle(SessionHandler.java:703)
    at org.eclipse.jetty.server.handler.ContextHandler.handle(ContextHandler.java:858)
    at org.eclipse.jetty.server.Server.handle(Server.java:181)
    at org.eclipse.jetty.server.internal.HttpChannelState$HandlerInvoker.run(HttpChannelState.java:648)
    at org.eclipse.jetty.server.internal.HttpConnection.onFillable(HttpConnection.java:403)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:322)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:99)
    at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:478)
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:441)
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:293)
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.run(AdaptiveExecutionStrategy.java:201)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:311)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:979)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1209)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1164)
    at java.base/java.lang.Thread.run(Thread.java:1583)
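
Since the question earlier in the thread was whether a newer OCRmyPDF version behaves differently, it may help to pull the tool versions out of a debug log like the one above when comparing two instances. A minimal sketch, assuming the log wording shown in this comment ("ocrmypdf - ocrmypdf X" and "Found NAME X"); the function name `tool_versions` is hypothetical.

```python
import re


def tool_versions(log: str) -> dict:
    """Collect tool versions mentioned in an OCRmyPDF debug log.

    Relies only on the phrasing visible in the log pasted above;
    other OCRmyPDF versions may word these lines differently.
    """
    versions = {}
    own = re.search(r"ocrmypdf - ocrmypdf (\d+(?:\.\d+)*)", log)
    if own:
        versions["ocrmypdf"] = own.group(1)
    for name, version in re.findall(r"Found (\w+) (\d+(?:\.\d+)*)", log):
        versions[name] = version
    return versions


sample = (
    "DEBUG ocrmypdf - ocrmypdf 16.1.1 "
    "DEBUG ocrmypdf.subprocess - Found tesseract 5.3.4 "
    "DEBUG ocrmypdf.subprocess - Found gs 10.3.1 "
)
print(tool_versions(sample))
```

Running the same function over logs from a local install and from the public instance would show at a glance whether the two are using the same OCRmyPDF, Tesseract and Ghostscript versions.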