PhilterPaper / Perl-PDF-Builder

Extended version of the popular PDF::API2 Perl-based PDF library for creating, reading, and modifying PDF documents
https://www.catskilltech.com/FreeSW/product/PDF%2DBuilder/title/PDF%3A%3ABuilder/freeSW_full

Xref regression in 3.015 since 3.013 #101

Closed kiwiroy closed 5 years ago

kiwiroy commented 5 years ago

When opening an affected PDF the following exception is received:

Malformed cross reference list in PDF file  -- no object 0 (free list head)

About the testfile

Validation from pdf-online:

File    testfile.pdf
Compliance  pdf1.4
Result  Document validated successfully.
Details
Validating file "testfile.pdf" for conformance level pdf1.4
The document's meta data is either missing or inconsistent or corrupt.
The document does conform to the PDF 1.4 standard.
Done.

PhilterPaper commented 5 years ago

Interesting. Thank you for the report. This testfile.pdf was produced elsewhere, and PDF::Builder gives a fatal exception because of the structure of the cross reference table (the listed message)? It's saying that there was no item "0" in the xref table. There was also no item "1" with only one subsection, which is (as I understand it) technically an illegal structure, but most Readers seem to give it a pass. In 3.014 I added the extra check for what appeared to me to be a catastrophically wrong xref table.

If you are reading in such a file, PDF::Builder does not actually parse and process the xref table, so possibly I could just give a warning (instead of a fatal error). If this PDF is otherwise passing validation (what is that "meta data" message about?), that may be satisfactory. Can you point to any documentation on what's allowed, that PDF::Builder isn't following? It would be great if you could copy-paste the xref table section here, to look at.

kiwiroy commented 5 years ago

@PhilterPaper yes, it is an externally produced file which I have no control over. I agree that the xref table appears to be wrong. I'm on macOS and have re-exported the original using Preview, which, as you guessed, reads and renders the file fine. The re-exported file no longer triggers the fatal exception. However, there are still some warnings with the resultant file.

Warning: xref active object 70 entry with bad length 00000
Warning: xref active object 72 entry with bad length 00000

Unsure what the "meta data" message is about. I only linked there as it was previously linked in another ticket - I'd hoped for more information from that resource myself.

What's the best method of exporting the xref table from the file? It is publicly available at the URL here.

PhilterPaper commented 5 years ago

I take it that it's the School-Trustees-Booklet.pdf we're looking at? It appears that it has been updated at least three times, with new material appended on the end each time (the normal way), and sometimes new content inserted at the beginning (???), and in the process, it got a rather strange format (that Adobe Reader, nevertheless, can read). I don't know if the Reader is doing any fixup or cleanup, but it's not asking me to save the PDF when I'm done reading it, so maybe it isn't doing anything.

You have to use a text editor such as ViM on the PDF itself, to separate lines (if run together with ^M between them) and extract xref tables through copy and paste. Be careful not to SAVE the modified PDF, or it will probably be corrupted!
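As an alternative to hunting through the file by hand, the landmarks can be located with a short script. This is only a sketch, assuming a classic (uncompressed) xref layout like the one in this file. The function name `find_xref_markers` and the script name are made up for illustration and are not part of PDF::Builder. The same byte sequences could also turn up by chance inside binary stream data, so treat each hit as a candidate to check rather than as a confirmed table.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return [keyword, byte offset] pairs for every cross-reference landmark
# found in a string of raw PDF bytes.
sub find_xref_markers {
    my ($bytes) = @_;
    my @found;
    # "startxref" is listed first so that its embedded "xref" is not
    # reported a second time as a table start.
    while ($bytes =~ /(startxref|xref|%%EOF)/g) {
        push @found, [ $1, pos($bytes) - length($1) ];
    }
    return @found;
}

# Hypothetical usage: perl find_xref.pl some.pdf
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";
    my $bytes = do { local $/; <$fh> };    # slurp the whole file, unmodified
    close $fh;
    printf "%-9s at byte offset %d\n", @$_ for find_xref_markers($bytes);
}
```

Because the file is only read, never written, there is no risk of saving a corrupted copy.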

Here are the sections in question (around xref, startxref, and %%EOF). There are 33 object entries, starting with object 305, in the first xref table, which appears to start at around offset 116. Notice that it claims the xref is at offset 0, when actually it is approximately 116. That may be a problem. The startxref offset is supposed to be the location of xref:

%PDF-1.4
%âãÏÓ
305 0 obj
<</Linearized 1/L 809075/O 308/E 93766/N 28/T 802859/H [ 956 786]>>
endobj

xref             <---- at offset 116
305 33
0000000016 00000 n
0000001742 00000 n
0000001906 00000 n
...
0000070950 00000 n
0000093697 00000 n
0000000956 00000 n
trailer
<</Size 338
/Root 306 0 R
/Info 304 0 R
/ID[<F4571C85EF4F4F11BD5549F5ADF1E35A><722BD98D3FDB4116B4FC729C20B99321>]
/Prev 802849>>      <----  earlier content in section 2
startxref
0
%%EOF

Then the second section at xref offset 802849, with entries for 305 objects (0 through 304), but its startxref points back at 116 (the first xref table). Note that this one has a proper "65535" free list start:

xref      <---- at offset 802849
0 305
0000000000 65535 f
0000093766 00000 n
0000094145 00000 n
...
0000788466 00000 n
0000788592 00000 n
0000802663 00000 n
trailer
<</Size 305
/ID[<F4571C85EF4F4F11BD5549F5ADF1E35A><722BD98D3FDB4116B4FC729C20B99321>]>>
startxref
116
%%EOF

The third addition to the PDF, with its own free list head (object 0) and four subsections in all, covering objects 0, 303, 304, 306, and 338. Presumably the offset for xref is at 825633. Note that it claims the previous (/Prev) section xref is at 116 (the first xref table!):

xref      <---- at offset 825633
0 1
0000000000 65535 f
303 2
0000809075 00000 n
0000825194 00000 n
306 1
0000825380 00000 n
338 1
0000825589 00000 n
trailer
<</Size 339
/Root 306 0 R
/Info 304 0 R
/ID[<F4571C85EF4F4F11BD5549F5ADF1E35A><6457BB3EF676410D9985A417E84FCEA3>]
/Prev 116>>     <---- earlier content in section 1
startxref
825633
%%EOF
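To make the subsection layout concrete, here is a minimal sketch that turns the text of one xref section into a table keyed by object number. The function name `parse_xref_section` is hypothetical, and this is not how PDF::Builder itself reads the table. It assumes the classic layout of a two-number subsection header followed by entries with a 10-digit offset, a 5-digit generation, and `n` or `f`.

```perl
use strict;
use warnings;

# Parse the text of one classic xref section (from the "xref" keyword up to,
# but not including, "trailer") into { object number => [offset, gen, type] }.
sub parse_xref_section {
    my ($text) = @_;
    my (%entries, $next_obj);
    for my $line (split /\r\n|\r|\n/, $text) {
        if ($line =~ /^\s*(\d+)\s+(\d+)\s*$/) {
            # Subsection header: first object number, then entry count.
            $next_obj = $1;
        }
        elsif (defined $next_obj
            && $line =~ /^\s*(\d{10})\s+(\d{5})\s+([nf])\s*$/) {
            # Entry line: assign it to the next object number in sequence.
            $entries{ $next_obj++ } = [ $1 + 0, $2 + 0, $3 ];
        }
    }
    return \%entries;
}

# Section 3 as quoted above.
my $section3 = <<'END';
xref
0 1
0000000000 65535 f
303 2
0000809075 00000 n
0000825194 00000 n
306 1
0000825380 00000 n
338 1
0000825589 00000 n
END

my $table = parse_xref_section($section3);
print join(', ', sort { $a <=> $b } keys %$table), "\n";  # 0, 303, 304, 306, 338
print exists $table->{0} ? "has free list head\n" : "no object 0\n";
```

Run against the first table (the one starting `305 33`), the same check reports no object 0, which is the condition the fatal message complains about.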

Finally, the fourth section, again with a complex xref table (free list, objects 278, 303, and 304). Presumably the xref offset is at about 842352. It points back (/Prev) to section 3's xref at 825633:

304 0 obj
<</CreationDate(D:20170207150356+13'00')/Creator(Adobe InDesign CC 2017 \(Macintosh\))/ModDate(D:20170207150721+13'00')/Producer(Adobe PDF Library 15.0)/Trapped/False>>
endobj
xref          <----- at offset 842352
0 1
0000000000 65535 f
278 1
0000825913 00000 n
303 2
0000826047 00000 n
0000842166 00000 n
trailer
<</Size 339
/Root 306 0 R
/Info 304 0 R
/ID[<F4571C85EF4F4F11BD5549F5ADF1E35A><C8DE0E1C5312493CA1113FE139C715A6>]
/Prev 825633>>   <---- earlier content in section 3
startxref     <---- first table that will be found
842352
%%EOF    <---- very end of file

Most of the PDF sections contain a lot of metadata. Somewhere in there is something the validator evidently doesn't like. Everyone seems to agree that the Root is at object 306 and the Info is at object 304.

This structure is so bizarre that I'm going to have to think about it for a while. In the meantime, you could disable the fatal check in .../lib/PDF/Builder/Basic/PDF/File.pm around line 1282, change

            die "Malformed cross reference list in PDF file $self->{' fname'} -- no object 0 (free list head)\n";

to

         #  die "Malformed cross reference list in PDF file $self->{' fname'} -- no object 0 (free list head)\n";

and see if that helps. I may change that to a warn, but first I want to understand what's going on with this particular file. Going by the /Prev entries in the trailers, section 4 points back to section 3, section 3 points back to section 1, section 1 points back to section 2, and section 2 (which has no /Prev) claims to be the earliest(?). It's quite strange, to say the least, but apparently it's legal.
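Reading the /Prev values directly from the four trailers quoted above, the order in which the sections get visited can be sketched as follows. The offsets are copied from the listings. The helper name `prev_chain` and the loop check are illustrative assumptions, not PDF::Builder code.

```perl
use strict;
use warnings;

# xref offset => /Prev value from that section's trailer, as quoted in this
# thread (undef where the trailer has no /Prev key).
my %prev_of = (
    842352 => 825633,    # section 4
    825633 => 116,       # section 3
    116    => 802849,    # section 1
    802849 => undef,     # section 2
);

# Walk the chain from a starting xref offset, newest section first.
sub prev_chain {
    my ($start, $prev) = @_;
    my (@chain, %seen);
    my $offset = $start;
    while (defined $offset) {
        die "Loop in /Prev chain at offset $offset\n" if $seen{$offset}++;
        die "No xref section recorded at offset $offset\n"
            unless exists $prev->{$offset};
        push @chain, $offset;
        $offset = $prev->{$offset};
    }
    return @chain;
}

# Start from the final startxref value (842352).
print join(' -> ', prev_chain(842352, \%prev_of)), "\n";
# 842352 -> 825633 -> 116 -> 802849
```

Only the last section in this chain (the table at 802849) and the two appended sections carry an object 0 entry. The table at 116 does not.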

kiwiroy commented 5 years ago

@PhilterPaper changing the die to warn results in the following output:

Malformed cross reference list in PDF file  -- no object 0 (free list head)
Warning: object 0 next generation is not 65535.
Warning: corrupted free object list: next=0 is not a free object.

The additional messages from lines 1310 and 1314 correspond to what you've said above.

Unsurprisingly, adding $xlist->{'0'} = [0, 65535, 'f']; to the modified else block silences the two new messages, but I doubt that is a fix.

PhilterPaper commented 5 years ago

There's some question as to whether any Reader still even uses the free block list. I'll have to think some more about whether it's worth the bother of checking the PDF xref table structure and issuing warnings (at least, regarding the free block list). Maybe I'll put them under a diagnostic flag or something, so you can run with them silenced.

Does it seem to be working correctly, despite the warning messages about the corrupted xref table? Adding the new entry (as you show) fixes the symptom, but I don't think it will fix the modified PDF when it's written back out.

PhilterPaper commented 5 years ago

OK, I think I have it fixed. If you could, would you try the latest changes? There are quite a few other changes, but if you look in Builder.pm and Basic/PDF/File.pm at the lines involving $options{'-diags'}, you can see that I changed all the 'die' to 'warn', and wrapped a test for -diags=>1 around them. The default behavior is now to not output any message (warning), but just let the bad PDF structure go through. If -diags=>1 is added to open(), a warning message is given and sometimes a fixup is attempted. Anyway, your test case seems to go through OK now. Please let me know if it seems to work well for you, and I'll release it as-is.
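The general pattern described here (collect problems, stay silent by default, warn only when diagnostics are requested) might look roughly like the sketch below. This is not the actual code in Basic/PDF/File.pm. The function names `check_free_list_head` and `report` are made up, and the table shape is assumed to match the `$xlist->{'0'} = [0, 65535, 'f']` entry mentioned earlier in the thread.

```perl
use strict;
use warnings;

# Check the free list head of a parsed xref table.
# $xlist maps object number => [offset or next free object, generation, type].
# Problems are returned as messages; nothing dies.
sub check_free_list_head {
    my ($xlist) = @_;
    my @problems;
    my $head = $xlist->{0};
    if (!defined $head) {
        push @problems, 'no object 0 (free list head)';
    }
    else {
        push @problems, 'object 0 is not a free entry'
            if $head->[2] ne 'f';
        push @problems, 'object 0 generation is not 65535'
            if $head->[1] != 65535;
    }
    return @problems;
}

# Report problems only when the caller asked for diagnostics.
# Returns the number of problems found either way.
sub report {
    my ($xlist, %options) = @_;
    my @problems = check_free_list_head($xlist);
    if ($options{'-diags'}) {
        warn "Warning: $_\n" for @problems;
    }
    return scalar @problems;
}

# First two entries of the first xref table quoted above (starts at 305).
my %first_table = (305 => [16, 0, 'n'], 306 => [1742, 0, 'n']);
report(\%first_table);                 # silent by default
report(\%first_table, -diags => 1);    # warns: no object 0 (free list head)
```

With the real library, the equivalent switch is the -diags=>1 option passed to open(), as described in the comment above.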

kiwiroy commented 5 years ago

@PhilterPaper all works without a hitch. Thank you.

PhilterPaper commented 5 years ago

Great! Closing ticket, will be in next release (3.016), by the end of August.