Stirling-Tools / Stirling-PDF

#1 Locally hosted web application that allows you to perform various operations on PDF files
GNU General Public License v3.0

[Bug]: PDFs that are indigestible to OCRmyPDF cause error messages #1470

Open noseshimself opened 1 month ago

noseshimself commented 1 month ago

The Problem

If a PDF file has an "unusual color space" (whatever that is) and needs additional parameters for recognition, tools like PDF-to-PDF/A also throw errors, as is to be expected. The message:

"Vibration-Sensor_Manuals_EU.pdf: Error occurred while consuming document Vibration-Sensor_Manuals_EU.pdf: ColorConversionNeededError: The input PDF has an unusual color space. Use --color-conversion-strategy to convert to a common color space such as RGB, or use --output-type pdf to skip PDF/A conversion and retain the original color space."

An example file will be attached. I hope. Vibration-Sensor_Manuals_EU.pdf

Version of Stirling-PDF

0.20.2

Last Working Version of Stirling-PDF

No response

Page Where the Problem Occurred

No response

Docker Configuration

No response

Relevant Log Output

java.io.IOException: Command process failed with exit code 1
    at stirling.software.SPDF.utils.ProcessExecutor.runCommandWithOutputHandling(ProcessExecutor.java:192)
    at stirling.software.SPDF.utils.ProcessExecutor.runCommandWithOutputHandling(ProcessExecutor.java:82)
    at stirling.software.SPDF.controller.api.converters.ConvertPDFToPDFA.pdfToPdfA(ConvertPDFToPDFA.java:56)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:261)
    at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:189)
    at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:118)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:917)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:829)
    at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87)
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1089)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:979)
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1014)
    at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:914)
    at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:590)
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:885)
    at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:658)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:205)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:149)
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:51)

Additional Information

Yes, the color space of this file is completely messed up. I wonder if it is worth checking this before passing files to external tools and using different parameters in case (see the OCRmyPDF message).

Browsers Affected

No response

No Duplicate of the Issue

Frooodle commented 1 month ago

I can't reproduce this on the latest Stirling-PDF version using your example file. I wonder if this was patched in a later OCRmyPDF version that Docker has pulled.

Frooodle commented 1 month ago

Are you able to update and see if you can reproduce? I also tried on our public instance and saw no issue there either: https://stirlingpdf.io/ocr-pdf

noseshimself commented 1 month ago

I'm now getting the expected error message:

Error
Internal Server Error:java.io.IOException: Command process failed with exit code 1. Error message: Start processing 4 pages concurrently 1 skipping all processing on this page 3 skipping all processing on this page 4 skipping all processing on this page 5 skipping all processing on this page 6 skipping all processing on this page 7 skipping all processing on this page 8 skipping all processing on this page 9 skipping all processing on this page 10 skipping all processing on this page 11 skipping all processing on this page 12 skipping all processing on this page 13 skipping all processing on this page 14 skipping all processing on this page 15 skipping all processing on this page 16 skipping all processing on this page 17 skipping all processing on this page 18 skipping all processing on this page 19 skipping all processing on this page 20 skipping all processing on this page 21 skipping all processing on this page 22 skipping all processing on this page 23 skipping all processing on this page 24 skipping all processing on this page 25 skipping all processing on this page 26 skipping all processing on this page 27 skipping all processing on this page 28 skipping all processing on this page 29 skipping all processing on this page 30 skipping all processing on this page 31 skipping all processing on this page 32 skipping all processing on this page 33 skipping all processing on this page 34 skipping all processing on this page 35 skipping all processing on this page 36 skipping all processing on this page 37 skipping all processing on this page 38 skipping all processing on this page 39 skipping all processing on this page 40 skipping all processing on this page 41 skipping all processing on this page 42 skipping all processing on this page 43 skipping all processing on this page 44 skipping all processing on this page 45 skipping all processing on this page 46 skipping all processing on this page 47 skipping all processing on this page 48 skipping all processing on 
this page 49 skipping all processing on this page 50 skipping all processing on this page 51 skipping all processing on this page Postprocessing... ColorConversionNeededError: The input PDF has an unusual color space. Use --color-conversion-strategy to convert to a common color space such as RGB, or use --output-type pdf to skip PDF/A conversion and retain the original color space.


See the last few words: ColorConversionNeededError: The input PDF has an unusual color space. Use --color-conversion-strategy to convert to a common color space such as RGB, or use --output-type pdf to skip PDF/A conversion and retain the original color space.

I guess additional parameters are necessary in this case.
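To make the idea concrete, here is a rough sketch of how a retry with additional parameters could look. It is not taken from the Stirling-PDF code base: the class `OcrFallback` and both method names are hypothetical, and the value `RGB` for `--color-conversion-strategy` is an assumption based on the wording of the error message. The two flags themselves are the ones OCRmyPDF suggests in its output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Stirling-PDF.
public class OcrFallback {
    // Error name as it appears in the OCRmyPDF message quoted in this issue.
    static final String COLOR_ERROR = "ColorConversionNeededError";

    /** Returns true if the captured process output reports the color space error. */
    static boolean needsColorFallback(String processOutput) {
        return processOutput != null && processOutput.contains(COLOR_ERROR);
    }

    /**
     * Builds the command for a second attempt.
     * keepPdfA = true:  keep the arguments and add --color-conversion-strategy RGB
     *                   (the value RGB is an assumption).
     * keepPdfA = false: replace any existing --output-type with "--output-type pdf",
     *                   which skips PDF/A conversion.
     */
    static List<String> retryCommand(List<String> command, boolean keepPdfA) {
        if (command.isEmpty()) {
            throw new IllegalArgumentException("command must not be empty");
        }
        List<String> retry = new ArrayList<>();
        retry.add(command.get(0)); // program name stays first
        if (keepPdfA) {
            retry.add("--color-conversion-strategy");
            retry.add("RGB");
            retry.addAll(command.subList(1, command.size()));
        } else {
            retry.add("--output-type");
            retry.add("pdf");
            for (int i = 1; i < command.size(); i++) {
                if ("--output-type".equals(command.get(i))) {
                    i++; // also skip the value that follows the old flag
                    continue;
                }
                retry.add(command.get(i));
            }
        }
        return retry;
    }
}
```

For the OCR page, `retryCommand(cmd, false)` would probably be acceptable, since the result does not have to be PDF/A. For PDF-to-PDF/A, only the `keepPdfA = true` variant makes sense, because `--output-type pdf` would defeat the purpose of the conversion.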

Frooodle commented 1 month ago

I'm still very confused that I don't see any error using your example file at https://stirlingpdf.io/ocr-pdf. Are you not using Docker?