coherentgraphics / cpdf-binaries

PDF Command Line Tools binaries for Linux, Mac, Windows

While batch processing PDFs, cpdf seems to get stuck on malformed file for hours #48

Closed: j4ffle closed this issue 3 years ago

j4ffle commented 3 years ago

I'm splitting PDFs on their bookmarks by running the following command in a loop:

/data/user/bin/cpdf -split-bookmarks 0 $file -o /data/user/Data/Reports/pdf/$subDirName/$filePrefix"_"%%%".pdf"

The first example shows the output when the program gets stuck on a malformed file and never moves past it. The second shows the program working through a malformed file in a few seconds. Nearly all the malformed files behave as in the second example, but a couple of files seem to stall the program indefinitely. Note the timestamps.

filename=INTC-US30.pdf; duration=195; datetime=07/24/20 00:51:31; exitStatus=0
INTC-US31
For non-commercial use only
To purchase a license visit http://www.coherentpdf.com/

filename=INTC-US31.pdf; duration=195; datetime=07/24/20 00:51:40; exitStatus=0
INTC-US32
For non-commercial use only
To purchase a license visit http://www.coherentpdf.com/

Because of error Pdf.PDFError("Could not find EOF marker whilst reading file INTC-US32.pdf at position 1883779123"), will read as malformed.
Attempting to reconstruct the malformed pdf INTC-US32.pdf...
Terminated

I terminated the program at 8 AM on 7/25/2020, more than 24 hours later, and this is the output.

In this second example, compare the timestamp of the file just before the malformed file with the timestamp once the program has moved past it: it takes only a few seconds. From what I can tell, both files are corrupted. When I try to open the first file, it takes a minute before I'm told the document is damaged. When I try to open the second, it takes only a few seconds. I'd like to process all the good files without getting hung up on the malformed ones.

filename=IAC-US36.pdf; duration=182; datetime=07/24/20 00:38:27; exitStatus=0
IAC-US37

Because of error Pdf.PDFError("Could not find EOF marker whilst reading file IAC-US37.pdf at position 1018210"), will read as malformed.
Attempting to reconstruct the malformed pdf IAC-US37.pdf...
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1011405")
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1011417")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1011433")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1011823")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1012569")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1012583")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1012598")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1012636")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1012734")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1013140")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1013445")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1013996")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1014049")
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1014073")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1014101")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1014345")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1014367")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1014386")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1014410")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1014474")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1014527")
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1014551")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1014579")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1016572")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1016607")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1016629")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1016648")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1016672")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1016692")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1016871")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1016886")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1016895")
list length 0
Recovering from Lex error: Pdf.PDFError("Lexing Hexstring whilst reading file IAC-US37.pdf at position 1016904")
list length 0
Read 557 objects
get_single_pdf: failed to read malformed PDF file. Consider using -gs-malformed
filename=IAC-US37.pdf; duration=182; datetime=07/24/20 00:38:35; exitStatus=2
IAC-US38

johnwhitington commented 3 years ago

Can you supply an example file? This is probably very easy to fix with one, and very hard to fix without one.

If necessary, you can send it to me privately.

j4ffle commented 3 years ago

IAC-US37.pdf is one of the malformed files that cpdf moves past quickly. GE-US23.pdf is one that it gets hung up on. The INTC-US32 file was too large to upload (1.75 GB). Now that I see its size, it makes sense that it takes longer to try to open.

IAC-US37.pdf GE-US23.pdf

johnwhitington commented 3 years ago

Thanks. I've tried those files and get the same results as you, except that IAC-US37.pdf seems valid (but empty) here. Did you send the right one?

But the real problem here is the big one it makes no progress on, right? Can you use some large-file-sending service to send it to me? It's john at coherentgraphics dot co dot uk. Cpdf should always make progress towards failure or success on a malformed file, and should never hang. So this looks like a real bug.

Another option would be to add a command-line flag to cpdf telling it never to try to correct malformed files. The problem there, of course, is that you may have slightly malformed files which cpdf could easily fix, and which would then not get fixed.

johnwhitington commented 3 years ago

Thanks for the file. Meanwhile, it turns out that I have previously added the functionality to fail immediately on malformed files:

gorge:jflake john$ cpdf -split-bookmarks 0 INTC-US32.pdf -o %%%%.pdf -error-on-malformed
get_single_pdf: failed to read malformed PDF file. Consider using -gs-malformed
gorge:jflake john$ 

So this would be a workaround until I can provide a proper fix.

If that flag is not in your cpdf version, please let me know, and I can provide a more up-to-date build.
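
As a rough illustration of how this workaround might fit into a batch loop, here is a minimal sketch. The function names `split_one` and `split_all` and the `CPDF` variable are made up for this example; the only cpdf options used are the ones shown above, and any non-zero exit status is taken to mean the file was skipped.

```shell
#!/bin/sh
# Sketch only: CPDF, split_one and split_all are illustrative names,
# not part of cpdf itself.
CPDF="${CPDF:-cpdf}"

# split_one FILE OUTDIR
# Split one PDF on its top-level bookmarks, failing at once if it is malformed.
# Returns cpdf's exit status.
split_one() {
    "$CPDF" -split-bookmarks 0 "$1" \
        -o "$2/$(basename "$1" .pdf)_%%%.pdf" -error-on-malformed
}

# split_all OUTDIR FILE...
# Try every file, report failures on stderr, and keep going.
split_all() {
    outdir=$1
    shift
    failed=0
    for file in "$@"; do
        if split_one "$file" "$outdir"; then
            echo "ok: $file"
        else
            status=$?   # exit status of the failed split_one call
            echo "failed (exit $status): $file" >&2
            failed=$((failed + 1))
        fi
    done
    echo "$failed file(s) failed"
}
```

For example, `split_all /path/to/output *.pdf` would attempt every PDF in the current directory and print a count of the failures at the end.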

j4ffle commented 3 years ago

That works. I'll use that option for now. Thank you!

johnwhitington commented 3 years ago

Great. I'll keep your big file and update this bug when I find a fix.

johnwhitington commented 3 years ago

This comment arrived via GitHub email, but did not appear in this bug, for some reason:

I just used the following (adding -gs gs -gs-malformed) and it was able to work past the file. It just produced an empty pdf, but it didn't get hung up on the file. The problem there is that the program flags it as a successful split instead of as a malformed pdf. But I can work with that.

/data/flakej/bin/cpdf -split-bookmarks 0 $file -gs gs -gs-malformed -o /data/flakej/Data/AnalystReports/pdf/$subDirName/$filePrefix"_"%%%".pdf" -error-on-malformed

johnwhitington commented 3 years ago

Yes, I think -error-on-malformed and -gs-malformed should be mutually exclusive. I think I will make the combination an error.
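
Until a build with that check is in use, a wrapper script could reject the combination before calling cpdf. The following is only a sketch; `check_malformed_flags` is a hypothetical helper name, not something cpdf provides.

```shell
#!/bin/sh
# Sketch only: check_malformed_flags is a hypothetical helper.
# It returns 1 if the argument list contains both -error-on-malformed
# and -gs-malformed, and 0 otherwise.
check_malformed_flags() {
    seen_error=0
    seen_gs=0
    for arg in "$@"; do
        case "$arg" in
            -error-on-malformed) seen_error=1 ;;
            -gs-malformed) seen_gs=1 ;;
        esac
    done
    if [ "$seen_error" -eq 1 ] && [ "$seen_gs" -eq 1 ]; then
        echo "-error-on-malformed and -gs-malformed cannot be combined" >&2
        return 1
    fi
    return 0
}
```

A wrapper might call `check_malformed_flags "$@" || exit 2` before passing the same arguments on to cpdf.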

johnwhitington commented 3 years ago

Fixed the second problem in https://github.com/johnwhitington/cpdf-source/commit/40170283be604eddb71b3d9966d99e7cf264d2e7

j4ffle commented 3 years ago

Yes. I deleted my comment as I realized it didn’t make sense to use both.

Thanks!! Good addition to flag that combination as an error.

johnwhitington commented 3 years ago

Fixed in https://github.com/johnwhitington/camlpdf/commit/5bbecaa8794253b6f5fa5e07fd276164b0581ace

(The file now fails reconstruction more quickly, after I removed a bottleneck. Still takes 15 minutes though, which is too long. Enough progress for now.)
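
While reconstruction can still take this long, one way to keep a batch job moving is to cap the time allowed per file. This is a sketch that assumes the GNU coreutils `timeout` command is available (it is not installed by default on every platform); `run_with_limit` is an illustrative name. `timeout` exits with status 124 when the limit is reached.

```shell
#!/bin/sh
# Sketch only: run_with_limit is an illustrative name.
# Assumes GNU coreutils `timeout` is on the PATH.

# run_with_limit LIMIT COMMAND [ARG...]
# Run COMMAND, killing it if it runs longer than LIMIT
# (seconds, or any duration that `timeout` accepts).
# Returns the command's exit status, or 124 if the limit was hit.
run_with_limit() {
    limit=$1
    shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after $limit: $*" >&2
    fi
    return "$status"
}
```

For example, `run_with_limit 300 cpdf -split-bookmarks 0 "$file" -o "out_%%%.pdf"` would give up on a file after five minutes, and the caller could log status 124 separately from cpdf's own failures.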