boazsegev / combine_pdf

A Pure ruby library to merge PDF files, number pages and maybe more...
MIT License

[FIX] Under Some Conditions the indirect_reference_id And indirect_generation_number Seem Incorrect #160

Closed berniechiu closed 5 years ago

berniechiu commented 5 years ago

Description

Our project is about to upgrade from 0.2.x to 1.x.x, but some of our PDFs now fail to parse in our test suite. I'm not familiar with PDF encoding/decoding, so I made a patch to make parsing work as it did before. I hope this is the proper way to modify it. Thanks for the review.

Issues Screenshots

Screenshot 2019-06-20 18 17 10

Screenshot 2019-06-20 18 31 52

boazsegev commented 5 years ago

@berniechiu,

Thank you for exposing this issue and opening the PR.

Before merging the PR, I want to make sure that we aren't fixing a symptom rather than fixing the cause.

From the description of the problem, it seems possible that somewhere in the code, CombinePDF is writing the :referenced_object value into the :indirect_generation_number key.

If this is the reason the PDF fails, then the correct way to fix this is to make sure CombinePDF writes the correct data to :indirect_generation_number and :referenced_object.

Any chance you could send me a PDF file and a short example that will recreate the issue?

Kindly, Bo.

berniechiu commented 5 years ago

Thanks for the quick response!!

This is the raw string we're testing against:

pdf_string.txt

My test is fairly simple:

require 'combine_pdf'

# "abovestring" stands in for the contents of pdf_string.txt
str = "abovestring"
CombinePDF.parse(str)

boazsegev commented 5 years ago

Hi @berniechiu,

Thank you for sending me the PDF string.

I managed to find the root cause of the issue. Sadly, the issue arises because the PDF appears to contain illegal names.

For example, the PDF contains the name /DeviceR<ARAMEX_HK_ACCOUNT_COUNTRY_CODE>.

However, according to the PDF standard, the < and > characters are reserved delimiters and aren't allowed unescaped inside a name. As specified in section 7.2.2:

The delimiter characters (, ), <, >, [, ], {, }, /, and % are special (LEFT PARENTHESIS (28h), RIGHT PARENTHESIS (29h), LESS-THAN SIGN (3Ch), GREATER-THAN SIGN (3Eh), LEFT SQUARE BRACKET (5Bh), RIGHT SQUARE BRACKET (5Dh), LEFT CURLY BRACE (7Bh), RIGHT CURLY BRACE (07Dh), SOLIDUS (2Fh) and PERCENT SIGN (25h), respectively). They delimit syntactic entities such as arrays, names, and comments. Any of these characters terminates the entity preceding it and is not included in the entity. Delimiter characters are allowed within the scope of a string when following the rules for composing strings; see 7.3.4.2, “Literal Strings”. The leading ( of a string does delimit a preceding entity and the closing ) of a string delimits the string’s end.
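To make the rule concrete, here is a minimal sketch of how a name token ends at the first whitespace or delimiter character. The `read_name` helper and the `NAME_TERMINATORS` constant are hypothetical names used for illustration only; this is not CombinePDF's actual parser code.

```ruby
# Characters that end a PDF name token: the whitespace characters
# (NUL, TAB, LF, FF, CR, SPACE) plus the delimiters quoted above.
NAME_TERMINATORS = /[\x00\t\n\f\r ()<>\[\]{}\/%]/

# Hypothetical helper: reads the name that starts at the leading "/" and
# returns [name, rest], where rest is whatever follows the name.
def read_name(input)
  raise ArgumentError, "a name must start with /" unless input.start_with?("/")

  body = input[1..-1]
  stop = body.index(NAME_TERMINATORS) || body.length
  [body[0...stop], body[stop..-1]]
end

name, rest = read_name("/DeviceR<ARAMEX_HK_ACCOUNT_COUNTRY_CODE>")
puts name  # DeviceR
puts rest  # <ARAMEX_HK_ACCOUNT_COUNTRY_CODE>
```

Under this reading the name is only `DeviceR`, and the `<` that follows starts a new syntactic entity, which is why the rest of the object no longer parses as intended.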

I could work around the issue by changing line #234 in the parser, removing \x3c\x3e from the list of disallowed characters...

...however, this will cause parsing to fail for valid PDF files that use these delimiters properly.
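A small sketch of that trade-off, assuming two illustrative patterns (`STRICT` and `LENIENT` are hypothetical and are not the actual expression on line #234): a valid PDF may place a dictionary directly after a name with no whitespace in between, and a pattern that tolerates < and > would swallow the dictionary opener.

```ruby
# STRICT treats < and > as delimiters, as the standard requires.
# LENIENT is the hypothetical relaxed variant that lets them appear in a name.
STRICT  = /[^\x00\t\n\f\r ()<>\[\]{}\/%]+/
LENIENT = /[^\x00\t\n\f\r ()\[\]{}\/%]+/

# Valid PDF syntax: the name /Resources is immediately followed by a dictionary.
valid = "/Resources<</Font 5 0 R>>"
after_slash = valid[1..-1]

strict_name  = after_slash[STRICT]
lenient_name = after_slash[LENIENT]

puts strict_name   # Resources
puts lenient_name  # Resources<<
```

With the relaxed pattern the name absorbs the `<<`, so the dictionary that follows is never opened.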

The correct way to fix this issue is to encode these names using the proper # hex encoding, as specified by the standard (/DeviceR#3cARAMEX_HK_ACCOUNT_COUNTRY_CODE#3e).
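As a rough illustration of that encoding, the sketch below writes each whitespace, delimiter, `#`, or non-printable byte as `#` followed by its two-digit hex code. `escape_pdf_name` and `NEEDS_ESCAPE` are hypothetical names, not part of CombinePDF or Prawn, and the authoring library may already offer its own way to do this.

```ruby
# Printable characters that still need escaping inside a name:
# the number sign itself plus the delimiter characters.
NEEDS_ESCAPE = '#()<>[]{}/%'

# Hypothetical helper: returns the name with a leading "/" and every
# problematic byte written as #xx (two hex digits).
def escape_pdf_name(raw)
  encoded = raw.each_byte.map do |byte|
    char = byte.chr
    if byte < 0x21 || byte > 0x7e || NEEDS_ESCAPE.include?(char)
      format("#%02x", byte)
    else
      char
    end
  end
  "/" + encoded.join
end

puts escape_pdf_name("DeviceR<ARAMEX_HK_ACCOUNT_COUNTRY_CODE>")
# /DeviceR#3cARAMEX_HK_ACCOUNT_COUNTRY_CODE#3e
```

Working byte by byte keeps the sketch independent of the string's encoding, since the `#xx` form describes bytes rather than characters.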

It might be an issue with the authoring library (I'm assuming Prawn?). Maybe it has an update that solves the issue? Or maybe it's something in the way it's used?

Kindly, Bo.

berniechiu commented 5 years ago

Yes, we’re using Prawn. I’ll be looking into that too. Thanks very much~~