PhilterPaper / Perl-PDF-Builder

Extended version of the popular PDF::API2 Perl-based PDF library for creating, reading, and modifying PDF documents
https://www.catskilltech.com/FreeSW/product/PDF%2DBuilder/title/PDF%3A%3ABuilder/freeSW_full

[RT 136648] a simple open() + save() adds extra content past the EOF, causing Adobe Reader to repair the file #166

Open PhilterPaper opened 3 years ago

PhilterPaper commented 3 years ago

Subject: a simple open() + save() adds extra content past the EOF, causing Adobe Reader to repair the file
To: bug-PDF-API2@rt.cpan.org
From: chrispy@synopsys.com
Date: Fri, 28 May 2021 21:13:13 -0400

If I do a simple open/save:

#!/usr/bin/perl
use PDF::API2;
my $pdf = PDF::API2->open('orig.pdf');
$pdf->saveas('rewritten.pdf');

the output file is identical to the input file, except for new content added after the original %%EOF (note the double %%EOF now):

6737
%%EOF
xref
0 1
0000000000 65535 f 
trailer
<< /Type /XRef /Filter /FlateDecode /ID [ <a2e78ef36ff1bc1c88233f0a2a324a39> <a2e78ef36ff1bc1c88233f0a2a324a39> ] /Info 1 0 R /Length 183 /Prev 6737 /Root 25 0 R /Size 64 /W [ 1 8 2 ] >>
startxref
7164
%%EOF

When the resulting file is opened in Adobe Reader, it is repaired (and a repair dialog appears/disappears very quickly). When the file is closed, Adobe Reader prompts to save the repaired/updated file.

Attachment: orig.pdf
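One quick way to confirm the symptom described above is to count the %%EOF markers and list the startxref offsets in the raw bytes of each file. The following is only a minimal sketch using core Perl; the helper names scan_pdf_tail_markers and slurp_binary are made up for this illustration and are not part of PDF::API2 or PDF::Builder. It does a plain byte search, so it could in principle be fooled by the same byte sequences occurring inside stream data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count %%EOF markers and collect startxref offsets (in file order)
# from the raw bytes of a PDF. Hypothetical helper, plain byte search only.
sub scan_pdf_tail_markers {
    my ($bytes) = @_;
    my $eof_count  = () = $bytes =~ /%%EOF/g;
    my @startxrefs = $bytes =~ /startxref\s+(\d+)/g;
    return ($eof_count, \@startxrefs);
}

# Read a whole file in binary mode.
sub slurp_binary {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    local $/;
    my $bytes = <$fh>;
    close($fh);
    return $bytes;
}

# Report on every file named on the command line.
for my $path (@ARGV) {
    my ($eofs, $offsets) = scan_pdf_tail_markers(slurp_binary($path));
    printf "%s: %d %%%%EOF marker(s), startxref offsets: %s\n",
        $path, $eofs, join(', ', @$offsets);
}
```

Run against orig.pdf and rewritten.pdf, a difference of one in the %%EOF count per save would correspond to the appended section shown above.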

PhilterPaper commented 3 years ago

Sat May 29 10:59:45 2021 PMPERRY@cpan.org - Correspondence added

Interesting. I tried this with the latest (2.040) PDF::API2, and it gave me the error

Error opening 'orig.pdf': Permission denied at C:/Strawberry/perl/site/lib/PDF/API2/Basic/PDF/File.pm line 231.

The orig.pdf file is Read/Write (I checked with the DOS attrib command, on Windows 10), so I don't know what it's complaining about, or why it's different from your run. Recent PDF::API2 releases (at least as far back as 2.039) check if you're trying to update a Read/Only PDF, so I don't know why it's unhappy. What version are you running?

I tried it with PDF::Builder 3.023-beta, and it ran OK. It did give warnings that objects 19-23, and 30, are children (Kids) of objects 24 and 26, but do not declare their Parent. That may or may not cause problems. Also, I see that orig.pdf, while declared to be version 1.4 (with 1.5 Version override), uses a cross reference stream (PDF-1.5) rather than a table. That might have something to do with the new object 64 (cross reference stream) appended to the end of the file. The rewritten.pdf appeared to be clean -- Adobe Acrobat Reader did not ask to save a "fixed" copy.
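The version mismatch mentioned above (a header that declares an earlier version while the file uses a cross reference stream, a PDF 1.5 feature) can be spotted with a rough byte-level check. This is a sketch in core Perl under stated assumptions: the function names are hypothetical, it only looks at the header line (it does not read a /Version override in the Root object), and it simply searches for a /Type /XRef dictionary entry.

```perl
use strict;
use warnings;

# Version declared on the first line of the file, e.g. "1.4"; undef if absent.
sub header_version {
    my ($bytes) = @_;
    my ($version) = $bytes =~ /\A%PDF-(\d+\.\d+)/;
    return $version;
}

# 1 if the raw bytes contain a dictionary entry /Type /XRef, else 0.
sub has_xref_stream {
    my ($bytes) = @_;
    return $bytes =~ m{/Type\s*/XRef\b} ? 1 : 0;
}

# 1 when the header declares a version below 1.5 but a cross reference
# stream appears to be present; 0 otherwise (or if there is no header).
sub header_understates_version {
    my ($bytes) = @_;
    my $version = header_version($bytes);
    return 0 unless defined $version;
    my ($major, $minor) = split /\./, $version;
    my $below_1_5 = $major < 1 || ($major == 1 && $minor < 5);
    return ($below_1_5 && has_xref_stream($bytes)) ? 1 : 0;
}
```

A true result is not an error in itself, since the thread notes that a /Version override in the Root object is legal; it only flags the combination discussed here.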

PhilterPaper commented 3 years ago

Sat May 29 12:00:14 2021 chrispitude@gmail.com - Correspondence added

Well that is strange! File permissions shouldn't be preserved through site attachments, so something else must be going on. I'm currently on Ubuntu 20.04, using the latest PDF::API2 (2.040 to test the fix for "133131: Fix incorrect endianness of 64-bit XRef stream entry widths"). I'll try installing Strawberry Perl in Windows 10 to see if I can reproduce the behavior.

PDF::Builder also writes extra content at the end, but (1) it's slightly different:

6737
%%EOF

64 0 obj << /Type /XRef /DecodeParms << /Columns 4 /Predictor 12 >> /Filter /FlateDecode /ID [ <a2e78ef36ff1bc1c88233f0a2a324a39> <a2e78ef36ff1bc1c88233f0a2a324a39> ] /Index [ 0 1 64 1 ] /Info 1 0 R /Length 16 /Prev 6737 /Root 25 0 R /Size 65 /W [ 1 2 1 ] >>
stream
xÚcb

and (2) it doesn't provoke Adobe Reader to repair the file.

I don't know enough about PDF to understand the difference, but I am attaching both output files if you're curious to have a look.

PDF::Builder also issues the following messages:

PDF Integrity Check: object 24.0 claims 19.0 as a child (/Kids), but 19.0 claims no Parent!
PDF Integrity Check: object 24.0 claims 20.0 as a child (/Kids), but 20.0 claims no Parent!
PDF Integrity Check: object 24.0 claims 21.0 as a child (/Kids), but 21.0 claims no Parent!
PDF Integrity Check: object 24.0 claims 22.0 as a child (/Kids), but 22.0 claims no Parent!
PDF Integrity Check: object 24.0 claims 23.0 as a child (/Kids), but 23.0 claims no Parent!
PDF Integrity Check: object 26.0 claims 30.0 as a child (/Kids), but 30.0 claims no Parent!

Are these messages something I should relay back to the software developer? (The application is Oxygen XML Author, which uses Apache FOP internally for the actual publishing.)

PhilterPaper commented 3 years ago

Sat May 29 13:24:02 2021 PMPERRY@cpan.org - Correspondence added

Regarding the "no parent" messages, I would treat that as a "slightly suspicious" point that MIGHT give a clue where to look if nothing else pans out. You might bring it to the attention of the application developer, that it's generally good practice for a /Kid to declare their /Parent, although I don't think it's strictly required.

If the application is going to output a PDF with 1.5 features (such as a cross reference stream), it's desirable to make the version at the top 1.5 rather than 1.4, although setting the /Version in the Root object is legal. The orig.pdf is perfectly legal; I don't know if changing to 1.5 up top will make any difference to PDF::API2 (since I can't seem to test it). PDF::API2 is supposed to accept XRef Streams.

I find it suspicious that PDF::API2 created a new XRef Stream, but unlike PDF::Builder, it didn't seem to provide a stream for the object! Builder, like the original file's object 63 XRef Stream, provided a data stream for object 64 (the updated XRef Stream) AND made it an object, while API2 didn't do either. That may be a bug in API2; Steve will have to look at it.
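The difference described above can be made concrete with two small checks: what kind of cross reference section starts at a given offset (a classic "xref" table or an "N G obj" cross reference stream object), and whether a classic trailer dictionary contains keys that normally belong to a cross reference stream dictionary (/Type, /Filter, /Length, /W, /Index, /DecodeParms). This is a sketch in core Perl with hypothetical function names. Whether these stray keys are what made Adobe Reader repair the file is an assumption that this thread does not confirm.

```perl
use strict;
use warnings;

# Keys that normally appear in a cross reference *stream* dictionary,
# not in the trailer dictionary that follows a classic xref table.
my @STREAM_ONLY_KEYS = qw(Type Filter Length W Index DecodeParms);

# Classify the cross reference section beginning at $offset in $bytes:
# 'table' for the "xref" keyword, 'stream' for "N G obj", else 'unknown'.
sub classify_xref_at {
    my ($bytes, $offset) = @_;
    return 'unknown' if $offset < 0 || $offset >= length $bytes;
    my $head = substr($bytes, $offset, 64);
    return 'table'  if $head =~ /\Axref\s/;
    return 'stream' if $head =~ /\A\d+\s+\d+\s+obj\b/;
    return 'unknown';
}

# Return the stream-only keys found in the text of a trailer dictionary.
sub stream_only_keys_in_trailer {
    my ($trailer_dict) = @_;
    return grep { $trailer_dict =~ m{/\Q$_\E\b} } @STREAM_ONLY_KEYS;
}
```

Applied to the trailer text PDF::API2 wrote in the first report, stream_only_keys_in_trailer would return Type, Filter, Length and W, which matches the observation that the appended section looks like a stream dictionary without being an object or carrying a stream.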

PhilterPaper commented 3 years ago

Sat May 29 13:54:51 2021 chrispitude@gmail.com - Correspondence added

Phil, thanks for your suggestions! I'll forward this to the application developer and let them feed it into the Apache FOP support machine (if they choose to).

Just curious - why did PDF::Builder append a new object at all, if we're just opening and rewriting the same content?

PhilterPaper commented 3 years ago

Sat May 29 14:10:40 2021 PMPERRY@cpan.org - Correspondence added

> Just curious - why did PDF::Builder append a new object at all, if we're just opening and rewriting the same content?

That's behavior inherited from PDF::API2. I presume it has something to do with orig.pdf being declared PDF 1.4 and then a 1.5 feature (cross reference stream) being found, triggering it to add something (a new XRef Stream) at the end, even though nothing was changed. As I said earlier, I think PDF::API2 got it wrong and should have added an object (with new data stream) rather than the non-object with no stream, but Steve will have to determine that.

PhilterPaper commented 3 years ago

Sat May 29 14:35:55 2021 chrispitude@gmail.com - Correspondence added

Hi Phil,

My publishing software has a knob to specify the output PDF version, so I set it to 1.5 (new orig1.5.pdf file attached). I get the same behavior:

Hi Steve,

There might be something in this content at the end that needs to be written differently so that Adobe Reader doesn't feel the need to repair the file.

Also, what is the purpose of this new content appended at the end?

%%EOF
xref
0 1
0000000000 65535 f 
trailer
<< /Type /XRef /Filter /FlateDecode /ID [ <ca19b24f9cf4829de2abd3299bbf3130> <ca19b24f9cf4829de2abd3299bbf3130> ] /Info 1 0 R /Length 144 /Prev 5884 /Root 19 0 R /Size 44 /W [ 1 8 2 ] >>
startxref
6272
%%EOF

Interestingly, if I load the rewritten file and rewrite it again:

my $pdf = PDF::API2->open('rewritten_from_API2_1.5.pdf');
$pdf->saveas('rewritten_from_API2_twice_1.5.pdf');

...an additional content chunk is appended at the end:

%%EOF
xref
0 1
0000000000 65535 f 
trailer
<< /Type /XRef /Filter /FlateDecode /ID [ <ca19b24f9cf4829de2abd3299bbf3130> <ca19b24f9cf4829de2abd3299bbf3130> ] /Info 1 0 R /Length 144 /Prev 5884 /Root 19 0 R /Size 44 /W [ 1 8 2 ] >>
startxref
6272
%%EOF
xref
0 1
0000000000 65535 f 
trailer
<< /Type /XRef /Filter /FlateDecode /ID [ <ca19b24f9cf4829de2abd3299bbf3130> <ca19b24f9cf4829de2abd3299bbf3130> ] /Info 1 0 R /Length 144 /Prev 6272 /Root 19 0 R /Size 44 /W [ 1 8 2 ] >>
startxref
6517
%%EOF
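
Each appended section points back to the previous one through /Prev (5884, then 6272 in the dump above), so the number of accumulated updates can be counted by walking that chain from the last startxref. The following is a rough sketch in core Perl, with a hypothetical function name, that works on raw bytes and assumes the dictionary at each offset ends just before a "stream" or "startxref" keyword.

```perl
use strict;
use warnings;

# Starting from the last startxref in the file, follow /Prev entries and
# return the list of cross reference section offsets, newest first.
sub prev_chain {
    my ($bytes) = @_;
    my @starts = $bytes =~ /startxref\s+(\d+)/g;
    return () unless @starts;

    my @chain = ($starts[-1]);
    my %seen  = ($starts[-1] => 1);
    while (1) {
        my $offset = $chain[-1];
        last if $offset >= length $bytes;
        my $section = substr($bytes, $offset);
        # Take everything up to the ">>" that closes the trailer or
        # stream dictionary (the one followed by stream/startxref).
        my ($dict) = $section =~ /\A(.*?>>)\s*(?:stream|startxref)\b/s;
        last unless defined $dict;
        my ($prev) = $dict =~ m{/Prev\s+(\d+)};
        last unless defined $prev;
        last if $seen{$prev}++;    # guard against a looping chain
        push @chain, $prev;
    }
    return @chain;
}
```

For the twice-rewritten file above, the chain would be expected to start 6517, 6272, 5884, one extra entry for every open/save cycle.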

PhilterPaper commented 3 years ago

Wed Jun 16 14:00:06 2021 PMPERRY@cpan.org - Correspondence added

I certainly agree that if the original PDF is correct, there should be no reason to add on more content at the end in either PDF::API2 or PDF::Builder. Note that the original should be absolutely correct -- if any reader asks permission to "save" it after opening, that means it thinks it found an error. If Builder is otherwise acting correctly (produces a legitimate, working PDF), I'll consider that to be a very minor bug, and probably won't get to it for a long time. I see that like API2, Builder adds more content (another xref stream) on a second open, even though the line 1 version was "1.5" this time around. In contrast, it looks like API2 may be botching its "repair", so Steve will have to address that one.

PhilterPaper commented 2 years ago

Claimed to be fixed in PDF::API2 2.041. At least, any (unnecessary?) updates to the PDF should now be safe -- I need to wade through some 119 commits (as of today) in the last 6 weeks, including the ones for this item, and see what the fix was. Hopefully it eliminates the unnecessary update, rather than just making it "legal".

PhilterPaper commented 1 year ago

I have put in the latest PDF::API2 changes (as of 2.043), and there is no real difference from PDF::Builder. Each save operation still adds a new XRef stream and trailer, which is not desirable. At least, nothing blows up, and no sign that Adobe Reader is attempting a "repair". I am going to leave this open (but low priority) in hopes of someday getting to the bottom of why the XRef stream still gets added (a second time).