coherentgraphics / cpdf-binaries

PDF Command Line Tools binaries for Linux, Mac, Windows

cpdf encountered an error. No or malformed /Pages #46

soubhikchatterjee closed 2 years ago

soubhikchatterjee commented 4 years ago

@johnwhitington

I am trying to fetch the page count of some PDF files. cpdf reported the page count for the majority of them, but for some it threw an error saying the PDF is corrupt, even though the files seemed fine when I opened them.

Just in case you want to try them out, here are some of the PDF files that cpdf claimed were corrupt.

Soubhiks-MacBook-Pro:mac soubhikchatterjee$ ./cpdf -pages "/Users/soubhikchatterjee/Downloads/SampleScan.pdf"
For non-commercial use only
To purchase a license visit http://www.coherentpdf.com/

cpdf encountered an error. Technical details follow:

No or malformed /Pages

Why does cpdf behave this way for some PDF files?

johnwhitington commented 4 years ago

The documents have indexes which are off-by-one, leading to cpdf picking up the wrong object to look at for the page tree. Here is the diagnostic output of another PDF tool:

WARNING: SampleScan.pdf: reported number of objects (55) inconsistent with actual number of objects (56)
WARNING: SampleScan.pdf: file is damaged
WARNING: SampleScan.pdf (object 53 0, file position 16342138): expected 53 0 obj
WARNING: SampleScan.pdf: Attempting to reconstruct cross-reference table
17
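To make the off-by-one problem concrete, here is a minimal Python sketch of the kind of check that produces a warning like "expected 53 0 obj": it reads a classic cross-reference table and confirms that each in-use entry points at the matching `N G obj` header. It is not how cpdf or the tool quoted above works internally. The function name `check_xref` is made up for illustration, and the sketch assumes an uncompressed xref table found via the last `startxref`, ignoring cross-reference streams and incremental updates.

```python
import re

# One xref entry: 10-digit byte offset, 5-digit generation, 'n' (in use) or 'f' (free).
ENTRY = re.compile(rb"(\d{10}) (\d{5}) ([nf])\s*")
# Subsection header: first object number, then the number of entries that follow.
SUBSECTION = re.compile(rb"\s*(\d+) (\d+)\s*")


def check_xref(data: bytes) -> list:
    """Return a list of problems found in a classic (uncompressed) xref table."""
    start = data.rfind(b"startxref")
    if start < 0:
        return ["no startxref found"]
    m = re.match(rb"startxref\s+(\d+)", data[start:])
    if not m:
        return ["malformed startxref"]
    pos = int(m.group(1))
    if data[pos:pos + 4] != b"xref":
        return ["startxref does not point at a classic xref table"]
    pos += 4

    problems = []
    while True:
        sub = SUBSECTION.match(data, pos)
        if not sub:
            break  # reached 'trailer' or something unexpected
        first, count = int(sub.group(1)), int(sub.group(2))
        pos = sub.end()
        for num in range(first, first + count):
            entry = ENTRY.match(data, pos)
            if not entry:
                problems.append(f"object {num}: malformed xref entry")
                return problems
            pos = entry.end()
            if entry.group(3) != b"n":
                continue  # free entries do not point at an object
            offset, gen = int(entry.group(1)), int(entry.group(2))
            expected = b"%d %d obj" % (num, gen)
            # The recorded offset must land exactly on the object's header.
            if not data.startswith(expected, offset):
                problems.append(
                    f"object {num} {gen}, file position {offset}: "
                    f"expected {expected.decode()}"
                )
    return problems
```

A file whose table is shifted by one position, as described above, would produce one problem line per affected object, while a consistent table yields an empty list.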

Unfortunately, scan software often produces malformed PDFs, because the files are created by software written specially for the task on the scanner. And, of course, they would not have shipped the scanner unless the files open in Adobe Reader, so that's their test -- looks fine to them, so must be ok.

Cpdf has a way of throwing away the object index for such files, and simply reading the objects in order, but it is not exposed. I propose to add a flag to cpdf:

cpdf -read-as-if-malformed

This would allow you to get on with your job and work with these files. In the future, we would like to add auto-detection of such problems, and automatically treat them as malformed when we discover an inconsistency in the object table. The problem at the moment is that cpdf manages to read the file ok, and has no idea it has read the wrong objects.

If you are a commercial customer, I can do this in the next day or so. If you are not, it will have to wait for the next release.

soubhikchatterjee commented 3 years ago

Hello @johnwhitington

Thanks for the reply and sorry for replying late.

Do you have any timeline in mind for the next release containing this feature?

johnwhitington commented 3 years ago

It looks like the option already exists (undocumented) under the name -debug-malformed. Can you try that?
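For anyone scripting this, here is a rough Python sketch of the retry: ask for the page count normally, and only if that fails, ask again with `-debug-malformed`. Several things here are assumptions rather than documented behaviour: that the flag can be placed before `-pages`, that cpdf exits with a non-zero status on the error shown earlier in this thread, and that the page count appears as a line of digits on standard output. The helper names `run_cpdf` and `page_count` are made up for illustration.

```python
import subprocess


def run_cpdf(args):
    """Run cpdf (assumed to be on PATH) and return (exit code, stdout)."""
    proc = subprocess.run(["cpdf", *args], capture_output=True, text=True)
    return proc.returncode, proc.stdout


def page_count(path, run=run_cpdf):
    """Return the page count, retrying with -debug-malformed, or None on failure."""
    # First attempt is a plain read; the second uses the undocumented flag.
    for extra in ([], ["-debug-malformed"]):
        code, out = run([*extra, "-pages", path])
        # Skip any banner lines and keep only lines that are purely digits.
        digits = [ln.strip() for ln in out.splitlines() if ln.strip().isdecimal()]
        if code == 0 and digits:
            return int(digits[-1])
    return None
```

The `run` parameter exists only so the logic can be exercised without cpdf installed; in normal use `page_count("SampleScan.pdf")` is enough.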

soubhikchatterjee commented 3 years ago

Thanks @johnwhitington

Appreciate that 👍

myangga commented 3 years ago

Hello. This may not apply to your case, but I got the same error message when trying to merge two PDFs, and only with some PDFs. I found the difference: the ones that worked were PDF 1.3, while the ones that failed were PDF 1.7. My version of cpdf was 2.2. After upgrading to 2.3.1, the same documents merge every time, whatever the PDF version. To be clear, merging was the operation that failed here, and upgrading cpdf solved the issue.

micschk commented 3 years ago

[...] Here is the diagnostic output of another PDF tool:

WARNING: SampleScan.pdf: reported number of objects (55) inconsistent with actual number of objects (56)
WARNING: SampleScan.pdf: file is damaged
WARNING: SampleScan.pdf (object 53 0, file position 16342138): expected 53 0 obj
WARNING: SampleScan.pdf: Attempting to reconstruct cross-reference table
17

If I may ask: which PDF tool produced this output? We're considering scanning all input files for issues before processing them with cpdf. Would this tool be a suitable candidate for that?

andyinsf commented 3 years ago

@johnwhitington Thanks, this worked for me in cpdf 2.3.1 (2017.09.01). Output is from HP Smart scanning software.