PDF looks broken when saved through HexaPDF

earthlingworks commented 11 months ago

Hi Thomas, we have a PDF that looks fine in Chrome and other PDF tools but when we run it through this script it looks broken.

(I'll follow up through email with the PDF)

require 'hexapdf'
path = 'output.pdf'
document = HexaPDF::Document.open(ARGV[0])
document.write(path, validate: false, optimize: true)
document2 = HexaPDF::Document.open(path)

gettalong commented 11 months ago

Thanks for the PDF - I can reproduce the problem. Will let you know when I have more information.

earthlingworks commented 11 months ago

Sounds good, thanks!

gettalong commented 10 months ago

@earthlingworks Sorry for the long wait! I have found the reason why this is happening but still need to devise a solution.

gettalong commented 10 months ago

@earthlingworks So... this happens because of a combination of things.

When HexaPDF modifies the file, it produces a valid (as far as I can discern) PDF file. However, the single page object of the provided PDF gets the additional entries /Kids and /Count. Those two entries are not defined by the PDF spec for page objects but for page tree node objects (the objects used to organize pages in a tree structure). This shouldn't be a problem since PDF objects can be extended. However, some PDF readers won't render the PDF due to this while others do the right thing.

So next was to find out why HexaPDF adds those two entries. Oh my... :grin: The goal is to optimize the provided PDF file. If optimize: false is chosen when writing, everything works. So the problem lies with the optimization task. Drilling further down I got to the code where HexaPDF automatically adds missing fields to dictionary objects that are both required and have a default value (like most /Type fields).

In case of page tree objects, the /Kids and /Count are required fields as per the PDF spec but have no default value there. HexaPDF, however, does define default values for these two fields ([] and 0 respectively) for convenience reasons. So when a page tree object is instantiated those entries are automatically added. But why is this done to a page object?

And that's where things are a bit strange with the provided PDF file because it actually contains two catalog objects instead of one. The crucial point is that the second one references the single page object of the first one as the root of the page tree, i.e. as a page tree node! So the page object is interpreted as a page tree object and the fields /Kids and /Count are added, which leads to the problem with some viewers.
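To make the mechanism above concrete, here is a toy Ruby model. It is not HexaPDF's code: the FIELD_DEFAULTS table, the fill_required_defaults helper and the /MediaBox value are made up for illustration. It only sketches how filling in required fields that have default values leaves a page object carrying /Kids and /Count once it is treated as a page tree node.

```ruby
# Toy model only: hypothetical names, not HexaPDF internals.
# Required fields that have a default value, keyed by how an object is interpreted.
FIELD_DEFAULTS = {
  pages: { Type: :Pages, Kids: [], Count: 0 }, # page tree node
  page:  { Type: :Page }                       # page (leaf)
}.freeze

# Adds every missing required field that has a default; existing entries are kept.
def fill_required_defaults(dict, interpreted_as)
  FIELD_DEFAULTS.fetch(interpreted_as).each do |key, default|
    dict[key] = default.dup unless dict.key?(key)
  end
  dict
end

# The single page object of the file (the MediaBox value is invented).
page = { Type: :Page, MediaBox: [0, 0, 612, 792] }

# Reached through the first catalog's page tree, it is treated as a page:
# nothing is missing, so nothing changes.
fill_required_defaults(page, :page)

# The second catalog points at the same object as the root of its page tree,
# so it is interpreted as a page tree node and gets the extra entries.
fill_required_defaults(page, :pages)

p page
```

Note that /Type stays /Page because existing entries are never overwritten, so the result is a page object with two extra entries rather than a real page tree node.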

One way to fix this is to run the optimization task separately, before writing, with the compact: true option. This removes the second catalog object since it is not referenced by anything that can be reached from the trailer.
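Why compacting helps can be sketched with a small reachability pass. This is a toy model under stated assumptions, not how HexaPDF implements the optimization task: the Ref struct, the reachable helper, the object numbers and the object layout are all invented. It only shows that an object which cannot be reached from the trailer, like the second catalog, is dropped.

```ruby
require 'set'

# Toy model only: hypothetical stand-in for an indirect reference.
Ref = Struct.new(:oid)

# Invented object layout: objects 1 to 3 form the real document,
# object 4 is the stray second catalog that uses the page as page tree root.
objects = {
  1 => { Type: :Catalog, Pages: Ref.new(2) },
  2 => { Type: :Pages, Kids: [Ref.new(3)], Count: 1 },
  3 => { Type: :Page, Parent: Ref.new(2) },
  4 => { Type: :Catalog, Pages: Ref.new(3) }
}
trailer = { Root: Ref.new(1) }

# Collects the numbers of all objects reachable from +value+.
def reachable(value, objects, seen = Set.new)
  case value
  when Ref
    # Set#add? returns nil for already visited objects, which stops cycles.
    reachable(objects[value.oid], objects, seen) if seen.add?(value.oid)
  when Hash
    value.each_value { |v| reachable(v, objects, seen) }
  when Array
    value.each { |v| reachable(v, objects, seen) }
  end
  seen
end

live = reachable(trailer, objects)
compacted = objects.select { |oid, _| live.include?(oid) }

p compacted.keys
```

With the second catalog gone, nothing refers to the page object as a page tree root any more, so it is only ever interpreted as a page.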

Additionally, I have added a fix that doesn't require any code changes on your side. That fix may not be ideal, but this problem touches a central part of how HexaPDF works. So not breaking things is essential here, and I think the fix achieves that.

earthlingworks commented 10 months ago

Got it (I think). Interesting. So many things to account for in PDFs, it's nuts. Ok, I'll pass this on to the team. As always, thank you!

gettalong commented 10 months ago

Yeah, there are a million+ ways to create PDFs that are valid or only slightly invalid but might trip up one library or another :)

The Arlington PDF model tries to improve the situation by providing a machine readable format for validating PDF files.

earthlingworks commented 10 months ago

We tried it out and it looks like it works if it's separate/before the write like this:

HexaPDF::Task::Optimize.call(document3, compact: true)
document3.write("3#{path}", validate: false)

But it doesn't work if the optimize is passed into the write like this:

document2.write("2#{path}", validate: false, optimize: true)

gettalong commented 10 months ago

@earthlingworks Have you tried the HexaPDF version from the devel branch? The changes are not yet released but probably will be this week, on Friday or Saturday.

earthlingworks commented 10 months ago

Ah that’s probably it. I’ll check with the team and let you know tomorrow if that was indeed the case. Thanks!

earthlingworks commented 10 months ago

Hmm...ok, Nelson on our team said he tried it on that branch and same behavior. Same PDF file

gettalong commented 10 months ago

@NelsonDocsketch @earthlingworks Hmm...

When I run the following code:

require 'hexapdf'
path = 'output.pdf'
document = HexaPDF::Document.open(ARGV[0])
document.write(path, validate: false, optimize: true)
document2 = HexaPDF::Document.open(path)
document2.write(path+'2.pdf', validate: false, optimize: true)

with the PDF provided via e-mail I get two output files. Using diffpdf yields no difference between the PDF files and when opening them in a variety of PDF viewers I don't see anything broken (all involved PDFs have one page that looks fine).

Could you describe in more detail what is broken in which viewer?

NelsonDocsketch commented 10 months ago

Oh nvm, i just tried again in the devel branch and it works.

gettalong commented 10 months ago

Thanks!

gettalong / hexapdf

PDF looks broken when saved through HexaPDF #261
