boazsegev / combine_pdf

A Pure ruby library to merge PDF files, number pages and maybe more...
MIT License

Cannot merge pages properly with 1.0.21 #185

Closed JunichiIto closed 3 years ago

JunichiIto commented 3 years ago

I combined two pdf files. One has two pages and another has one page. So the combined pdf should have three pages like this:

[Image: Screen Shot 2021-01-27 at 17 32 49]

However, version 1.0.21 loses the second page:

[Image: Screen Shot 2021-01-27 at 17 32 03]

I got the following warnings, but I don't know why:

Couldn't connect reference for {:is_reference_only=>true, :indirect_generation_number=>0, :indirect_reference_id=>6, :referenced_object=>nil}
couldn't follow reference!!! {:is_reference_only=>true, :referenced_object=>nil} not found!
couldn't follow reference!!! {:is_reference_only=>true, :referenced_object=>nil} not found!
couldn't follow reference!!! {:is_reference_only=>true, :referenced_object=>nil} not found!

You might be able to reproduce the issue with this script: https://github.com/JunichiIto/combine-pdf-pages-sandbox

Could you investigate this?

boazsegev commented 3 years ago

Hi @JunichiIto ,

Thank you for opening this issue.

Exploring the Issue

The core of the issue is that the file "sample.pdf" is malformed, providing the parser with false information. Allow me to explain:

The file "sample.pdf" contains 2 pages. The first page's data is contained in a stream object that claims to be 1074 bytes long. However, that stream object is only 1020 bytes long, which causes the parser to skip past the end of the actual object.

After skipping the first 1074 bytes, the parser doesn't see an endstream keyword, so it assumes that the stream may have been (improperly) extended and seeks the next available endstream keyword (which is after the end of the second page)...
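The behavior described above can be reproduced with a toy reader. This is only a sketch under stated assumptions: `stream_object`, `read_streams`, and `STREAM_HEADER` are hypothetical names, the input is a plain ASCII string rather than a real PDF, and none of this is CombinePDF's actual parsing code. It honors the declared Length, then seeks the next `endstream` keyword from that point onward.

```ruby
# Simplified, hypothetical stream reader (not CombinePDF's actual parser).
STREAM_HEADER = /<<\s*\/Length\s+(\d+)\s*>>\s*stream\n/

# Build a fake stream object whose dictionary declares the given Length.
def stream_object(body, declared_length)
  "<< /Length #{declared_length} >>\nstream\n#{body}endstream\n"
end

# Skip the declared Length, then take everything up to the next
# "endstream" keyword found at or after that position.
def read_streams(data)
  streams = []
  pos = 0
  while (header = data.match(STREAM_HEADER, pos))
    body_start = header.end(0)
    after_length = body_start + header[1].to_i
    stop = data.index("endstream", after_length)
    break unless stop
    streams << data[body_start...stop]
    pos = stop + "endstream".length
  end
  streams
end

page1 = "BT (page one) Tj ET\n"
page2 = "BT (page two) Tj ET\n"

good = stream_object(page1, page1.bytesize) + stream_object(page2, page2.bytesize)
# The first Length overshoots by 12 bytes, past the real endstream keyword.
bad = stream_object(page1, page1.bytesize + 12) + stream_object(page2, page2.bytesize)

puts read_streams(good).length # prints 2
puts read_streams(bad).length  # prints 1
```

With the overstated Length, the first stream swallows the second one, which is the same shape of failure as the missing page in this issue.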

Now the big question is: was it fair for the parser to assume that the Length property was smaller than it should have been, or should the parser have "rewound" itself and attempted to find the endstream keyword again while ignoring the misleading Length property?

This is a bit of a twist and the exact converse of issue #184, where the problem was that PDF stream data might contain the endstream keyword as part of the content. To accommodate this possibility, the Length data was honored rather than ignored.

I'm not sure what the best approach here would be. Your PDF is obviously malformed (you can simply load it and re-save it with a different name to see the issue).

A solution?

The best solution is not to use malformed PDF files... though we can't always control that.

I haven't decided whether I should try an approach where a misleading Length property would be detected and cause the parser to roll back to its previous state, so that it would retry seeking the endstream keyword from the beginning of the stream.

The issue with this approach is that the endstream keyword is allowed within the specified byte length of the stream. If I seek it within the stream data, a match might be valid text or data that's part of the stream.
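A minimal sketch of that trade-off, assuming ASCII input and two hypothetical helpers (`length_plausible?` and `naive_fallback_body`, neither of which exists in CombinePDF): the first checks whether `endstream` really follows the declared Length, and the second is the "rewind and search from the start" fallback, which cuts a valid stream short when the keyword appears inside the data.

```ruby
# Hypothetical check: does "endstream" (optionally preceded by an EOL)
# appear right after the declared Length?
def length_plausible?(data, body_start, declared_length)
  tail = data[body_start + declared_length, 16]
  !tail.nil? && tail.match?(/\A\r?\n?endstream/)
end

# Hypothetical fallback: ignore Length and stop at the first "endstream"
# found after the start of the stream body.
def naive_fallback_body(data, body_start)
  stop = data.index("endstream", body_start)
  stop && data[body_start...stop]
end

header = "stream\n"
# A valid body that happens to contain the keyword as ordinary text.
body = "(the endstream keyword) Tj\n"
data = header + body + "endstream\n"
body_start = header.length

puts length_plausible?(data, body_start, body.length)     # prints true
puts length_plausible?(data, body_start, body.length + 5) # prints false
p naive_fallback_body(data, body_start)                   # prints "(the "
```

Detecting the bad Length is cheap, but the fallback has no way to tell which `endstream` occurrence is the real terminator, so it truncates the valid body here.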

For this reason, I'm inclined to mark this issue as "compatibility - won't fix".

Please let me know what you think.

So, if "sample.pdf" is malformed, why does the PDF reader work and show 2 pages?

PDF readers would overcome the issue by reading the binary offset table at the end of the file (called an XREF).

This is a table that contains the byte offset of each object in the PDF file. When reading a file from top to bottom this data is redundant. However, when displaying page 49 out of 1000 pages (like on-screen PDF readers might do), this data speeds up load times by allowing readers random access to the different PDF streams and objects.

The Ruby parser ignores this redundant data since it doesn't use random access to the PDF data... and for this reason the parser can't recover from the misleading Length property.

However, your on-screen PDF reader reads the overlapping bytes twice using random access and byte offsets, allowing it to access the second PDF page.
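To illustrate the random-access idea, here is a rough sketch that reads a classic cross-reference table and jumps straight to an object by its byte offset. The assumptions are significant: `build_sample`, `xref_offsets`, and `object_at` are hypothetical helpers, the sample is a tiny ASCII-only PDF-like string with a single xref subsection, and cross-reference streams or incremental updates are not handled. It is not how any particular reader, or CombinePDF, is implemented.

```ruby
# Build a tiny PDF-like string with a classic xref table (sample data only).
def build_sample
  out = +"%PDF-1.4\n"
  offsets = []
  ["<< /Type /Catalog >>", "<< /Type /Pages >>", "<< /Type /Page >>"].each_with_index do |dict, i|
    offsets << out.bytesize # byte offset where object i + 1 begins
    out << "#{i + 1} 0 obj\n#{dict}\nendobj\n"
  end
  xref_pos = out.bytesize
  out << "xref\n0 #{offsets.size + 1}\n"
  out << "0000000000 65535 f \n" # entry for the free object 0
  offsets.each { |o| out << format("%010d 00000 n \n", o) }
  out << "trailer\n<< /Size #{offsets.size + 1} >>\nstartxref\n#{xref_pos}\n%%EOF\n"
  out
end

# Return { object id => byte offset } for in-use objects.
def xref_offsets(pdf)
  start = pdf[/startxref\s+(\d+)\s+%%EOF\s*\z/, 1]
  raise ArgumentError, "no startxref found" unless start
  table = pdf[start.to_i..]
  header = table.match(/\Axref\s+(\d+)\s+(\d+)\s*\n/)
  raise ArgumentError, "no classic xref table at that offset" unless header
  first_id = header[1].to_i
  entries = table[header.end(0)..].lines.first(header[2].to_i)
  offsets = {}
  entries.each_with_index do |line, i|
    offset, _generation, kind = line.split
    offsets[first_id + i] = offset.to_i if kind == "n"
  end
  offsets
end

# Jump to an offset and read that object's body, ignoring everything before it.
def object_at(pdf, offset)
  pdf[offset..][/\A\d+ \d+ obj\n(.*?)\nendobj/m, 1]
end

pdf = build_sample
offsets = xref_offsets(pdf)
puts object_at(pdf, offsets[3]) # prints << /Type /Page >>
```

Because each object is located by its own offset, a reader working this way never depends on an earlier stream's Length being correct.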

JunichiIto commented 3 years ago

Hi @boazsegev ,

Thank you for your detailed explanation. I understand the cause now. The "sample.pdf" file is only used in our RSpec tests; the files in production are different. I replaced it with a production file, and all the pages were merged successfully! I won't use "sample.pdf" from now on, so you can close this issue. Thank you!