Closed soubhikchatterjee closed 2 years ago
The documents have object indexes which are off by one, so cpdf picks up the wrong object when it looks for the page tree. Here is the diagnostic output of another PDF tool:
WARNING: SampleScan.pdf: reported number of objects (55) inconsistent with actual number of objects (56)
WARNING: SampleScan.pdf: file is damaged
WARNING: SampleScan.pdf (object 53 0, file position 16342138): expected 53 0 obj
WARNING: SampleScan.pdf: Attempting to reconstruct cross-reference table
17
Unfortunately, scan software often produces malformed PDFs, because the files are created by software written specially for the task on the scanner. And, of course, they would not have shipped the scanner unless the files opened in Adobe Reader, so that's their test: it looks fine to them, so it must be OK.
Cpdf has a way of throwing away the object index for such files, and simply reading the objects in order, but it is not exposed. I propose to add a flag to cpdf:
cpdf -read-as-if-malformed
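To illustrate what "throwing away the object index and reading the objects in order" can mean, here is a minimal Python sketch. It is not cpdf's implementation: the function name `scan_objects` and the regular expression are assumptions made for this example. It scans the raw bytes for object headers such as `53 0 obj` and records where each one starts.

```python
import re

# Matches an object header such as "53 0 obj". The lookbehind stops a match
# from starting in the middle of a longer number.
OBJ_HEADER = re.compile(rb"(?<!\d)(\d+)[ \t\r\n]+(\d+)[ \t\r\n]+obj\b")


def scan_objects(data: bytes) -> dict:
    """Map (object number, generation) -> byte offset by scanning the file
    from start to end, ignoring any cross-reference table."""
    offsets = {}
    for m in OBJ_HEADER.finditer(data):
        key = (int(m.group(1)), int(m.group(2)))
        # If an object is defined more than once, the later definition wins.
        offsets[key] = m.start()
    return offsets


if __name__ == "__main__":
    sample = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n2 0 obj\n<< >>\nendobj\n"
    for (num, gen), offset in sorted(scan_objects(sample).items()):
        print(f"{num} {gen} obj at byte {offset}")
```

A scan like this can be fooled by binary stream data that happens to contain text resembling an object header, so a real tool would need to be more careful than this sketch.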
This would allow you to get on with your job and work with these files. In the future, we would like to add auto-detection of such problems, and automatically treat them as malformed when we discover an inconsistency in the object table. The problem at the moment is that cpdf manages to read the file ok, and has no idea it has read the wrong objects.
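As a rough idea of what detecting such an inconsistency could look like, the Python sketch below checks that every in-use entry of a classic cross-reference table points at a matching `N G obj` header. It is only an illustration, not how cpdf works or will work: the name `find_xref_mismatches` is hypothetical, and the sketch handles only the last classic `xref` table in the file. It does not handle cross-reference streams or follow earlier tables in incrementally updated files.

```python
import re

OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")


def find_xref_mismatches(data: bytes) -> list:
    """Check each in-use entry of the last classic xref table against the file."""
    pos = data.rfind(b"startxref")
    if pos < 0:
        return ["no startxref keyword found"]
    tokens = data[pos + len(b"startxref"):].split()
    if not tokens or not tokens[0].isdigit():
        return ["startxref is not followed by an offset"]
    xref_pos = int(tokens[0])
    if data[xref_pos:xref_pos + 4] != b"xref":
        return ["startxref does not point at a classic xref table"]

    lines = data[xref_pos:].splitlines()
    mismatches = []
    i = 1  # skip the "xref" keyword line
    while i < len(lines):
        header = lines[i].split()
        # A subsection header is "first count"; anything else ends the table.
        if len(header) != 2 or not all(t.isdigit() for t in header):
            break
        first, count = int(header[0]), int(header[1])
        for k, line in enumerate(lines[i + 1:i + 1 + count]):
            fields = line.split()
            if len(fields) != 3 or fields[2] != b"n":
                continue  # free entry or malformed line
            if not (fields[0].isdigit() and fields[1].isdigit()):
                continue
            offset, gen = int(fields[0]), int(fields[1])
            num = first + k
            m = OBJ_HEADER.match(data, offset)
            if m is None or (int(m.group(1)), int(m.group(2))) != (num, gen):
                mismatches.append(
                    f"object {num} {gen}, file position {offset}: "
                    f"expected {num} {gen} obj"
                )
        i += 1 + count
    return mismatches
```

An empty list means every in-use entry pointed at the object it claimed to. Any entry that lands on a different object, or on no object header at all, is reported.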
If you are a commercial customer, I can do this in the next day or so. If you are not, it will have to wait for the next release.
Hello @johnwhitington
Thanks for the reply and sorry for replying late.
Do you have any timeline in mind for the next release containing this feature?
It looks like the option already exists (undocumented) under the name -debug-malformed. Can you try that?
Thanks @johnwhitington
Appreciate that 👍
Hello. I may be wrong about your case, but I got the same error message when trying to merge two PDFs, and only with some files. I found the difference: the files that worked were PDF version 1.3, while the ones that failed were version 1.7. My version of cpdf was 2.2. After upgrading to 2.3.1, the same documents merge correctly every time, regardless of PDF version. I should stress that merging was the operation that failed here, but upgrading cpdf solved the issue.
[...] Here is the diagnostic output of another PDF tool:
WARNING: SampleScan.pdf: reported number of objects (55) inconsistent with actual number of objects (56)
WARNING: SampleScan.pdf: file is damaged
WARNING: SampleScan.pdf (object 53 0, file position 16342138): expected 53 0 obj
WARNING: SampleScan.pdf: Attempting to reconstruct cross-reference table
17
If I may ask: what PDF tool is this the output of? We're considering scanning all input files for issues before processing them with cpdf. Would this tool be a suitable candidate for that?
@johnwhitington Thanks, this worked for me in cpdf 2.3.1 (2017.09.01). Output is from HP Smart scanning software.
@johnwhitington
I am trying to fetch the page count of some PDF files. cpdf was able to show the page count for the majority of them, but for some it threw an error saying the PDF is corrupt, even though the files seemed fine when I opened them.
Just in case you want to try them out, here are some of the PDF files that cpdf claimed were corrupt.
Why is cpdf behaving this way for some PDF files?