Closed by j4ffle 3 years ago
Can you supply an example file? This is probably very easy to fix with one, and very hard to fix without one.
If necessary, you can send it to me privately.
IAC-US37.pdf is one of the malformed files that it moves past quickly. GE-US23.pdf is one that it gets hung up on. The INTC-US32 file was too large to upload (1.75 GB). Now that I see its size, it makes sense that it takes longer to open.
Thanks. I've tried those files and get the same results as you, except that IAC-US37.pdf seems valid (but empty) here. Did you send the right one?
But the real problem here is the big one it makes no progress on, right? Can you use some large-file-sending service to send it to me? It's john at coherentgraphics dot co dot uk. Cpdf should always make progress towards failure or success on a malformed file, and should never hang. So this looks like a real bug.
Another option would be to add a command line flag to cpdf to tell it never to try to correct malformed files. The problem there, of course, is that you may have slightly malformed files which cpdf could easily fix, which would then not get fixed.
Thanks for the file. Meanwhile, it turns out that I have previously added the functionality to fail immediately on malformed files:
gorge:jflake john$ cpdf -split-bookmarks 0 INTC-US32.pdf -o %%%%.pdf -error-on-malformed
get_single_pdf: failed to read malformed PDF file. Consider using -gs-malformed
gorge:jflake john$
So this would be a workaround until I can provide a proper fix.
If that flag is not in your cpdf version, please let me know, and I can provide a more up-to-date build.
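For a batch run, the workaround above could be wrapped roughly as follows. This is only a sketch: the function name `split_or_skip`, the `CPDF` variable, and the skipped-file log are made up for illustration, and it assumes cpdf returns a non-zero exit status when `-error-on-malformed` rejects a file.

```shell
#!/bin/sh
# Path to the cpdf binary; the CPDF variable is a hypothetical override hook.
CPDF="${CPDF:-cpdf}"

# split_or_skip INPUT OUTDIR LOG   (hypothetical helper, not part of cpdf)
# Splits INPUT on its bookmarks, failing fast on malformed files.
# Assumption: cpdf exits non-zero when -error-on-malformed rejects the file.
split_or_skip() {
    in=$1
    outdir=$2
    log=$3
    prefix=$(basename "$in" .pdf)
    if "$CPDF" -split-bookmarks 0 "$in" -o "$outdir/${prefix}_%%%.pdf" -error-on-malformed; then
        return 0
    fi
    # Record the file so it can be inspected later, then report failure.
    printf '%s\n' "$in" >> "$log"
    return 1
}
```

In a loop, this might be called as `for f in *.pdf; do split_or_skip "$f" out skipped.log; done`, leaving a list of rejected files in `skipped.log` while the good files are processed.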
That works. I'll use that option for now. Thank you!
Great. I'll keep your big file and update this bug when I find a fix.
This comment arrived via Github email, but did not appear in this bug, for some reason:
I just used the following (adding -gs gs -gs-malformed) and it was able to work past the file. It just produced an empty pdf, but it didn't get hung up on the file. The problem there is that the program flags it as a successful split instead of as a malformed pdf. But I can work with that.
/data/flakej/bin/cpdf -split-bookmarks 0 $file -gs gs -gs-malformed -o /data/flakej/Data/AnalystReports/pdf/$subDirName/$filePrefix"_"%%%".pdf" -error-on-malformed
Yes, I think -error-on-malformed and -gs-malformed should be mutually exclusive. I think I will make the combination an error.
Fixed the second problem in https://github.com/johnwhitington/cpdf-source/commit/40170283be604eddb71b3d9966d99e7cf264d2e7
Yes. I deleted my comment as I realized it didn’t make sense to use both.
Thanks!! Good addition to flag that combination as an error.
Fixed in https://github.com/johnwhitington/camlpdf/commit/5bbecaa8794253b6f5fa5e07fd276164b0581ace
(The file now fails reconstruction more quickly, after I removed a bottleneck. Still takes 15 minutes though, which is too long. Enough progress for now.)
I'm splitting pdfs on their bookmarks by using the following code in a loop:
The first example is the output when the program gets stuck on the malformed file and never seems to be able to move past it. The second example is when the program works through the malformed file in a few seconds. Nearly all the malformed files perform as in the second example, but there are a couple of files the program seems to never be able to move past. Notice the time stamps.
I terminated the program at 8 AM on 7/25/2020, more than 24 hours later, and this is the output.
In this second example, notice the time stamp for the file just prior to the malformed file and the time stamp when it has moved past the malformed file. It only takes a few seconds. From what I can tell, both files are corrupted. When I attempt to open the first file, it takes a minute before telling me the document is damaged. When I try to open the second one, it takes only a few seconds before telling me the document is damaged. I'd like to be able to get all the good files processed without getting hung up on the malformed files.
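Independently of any fix in cpdf itself, one way to keep a batch moving is to put a time limit on each invocation. The sketch below assumes GNU coreutils `timeout` is installed (it exits with status 124 when the limit is hit); the helper name `run_with_limit` and the limit value are hypothetical.

```shell
#!/bin/sh
# run_with_limit SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND under GNU coreutils `timeout` and returns its exit status.
# `timeout` itself exits with 124 when the time limit is reached.
run_with_limit() {
    limit=$1
    shift
    status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}
```

For example, `run_with_limit 600 cpdf -split-bookmarks 0 "$file" -o out_%%%.pdf` would give up on a single file after ten minutes so the loop can continue with the next one.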