Invalid data for decryption, need 32 + 16*n bytes, probably parser error

Houdini commented 8 months ago

Hello,

First, thanks for the great hexapdf gem and its support!

I found a very strange bug that looks like this:

➜ $ hexapdf inspect /tmp/private_file.pdf 411
Object (411,0): Invalid data for decryption, need 32 + 16*n bytes

I investigated the problem a bit. Please check the screenshot:

[Screenshot 2024-03-31 at 01 09 59]

The selected text is exactly the string in object 411 where it fails. The number of bytes between the brackets is 128, but hexapdf thinks it's 127. The difference comes from the method parse_literal_string in tokenizer.rb. It has this code:

data = scan_until(/[()\\\r]/)
...
str << data
...
        when 92 # \\
          str.chop!
          prepare_string_scanner(3)
          byte = @ss.get_byte
          if (data = LITERAL_STRING_ESCAPE_MAP[byte])
            str << data

The backslash (92) is 5C in hex. So hexapdf encounters 5C, splits the data there, enters the loop again and somehow loses that byte. As a result, this chunk of data has a size of 127 bytes, not 128 (minus the 5C, and somehow it also replaces 6E with 0A; I have no idea where the 0A comes from).

I tried to fix this mistake, but a few things really confused me:

It says that this part is /Filter/Standard/Length 256, but it really contains only 128 bytes.
According to the documentation (7.6.4.2), the size of /O depends on R, but there is no R in this chunk, and in any case it is not 128 bytes regardless of R.
It seems that the tokenizer just skips all this math anyway.

Could you please help me with this bug?

P.S. Could you please advise how I can send you the PDF file privately?

gettalong commented 8 months ago

Thanks for reporting! You can send the PDF to info@gettalong.at, I will scrub it from the mail and local storage once analyzed.

The code turns the two-byte sequence \n (0x5C 0x6E) into the newline character (which has the hex code 0x0A), see section 7.3.4.2 for details.
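To illustrate the byte count difference, here is a minimal sketch of that escape handling. It is not HexaPDF's tokenizer: the `ESCAPES` table and the `unescape_literal` helper are hypothetical names, and octal escapes and line continuations are left out for brevity.

```ruby
# Simplified escape table for PDF literal strings (section 7.3.4.2).
# Illustrative only; this is not HexaPDF's LITERAL_STRING_ESCAPE_MAP.
ESCAPES = {
  'n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\b", 'f' => "\f",
  '(' => '(', ')' => ')', '\\' => '\\'
}.freeze

# Decodes the single-character backslash escapes found in the raw bytes
# between the parentheses. Works on a binary copy of the input.
def unescape_literal(raw)
  raw.b.gsub(/\\([nrtbf()\\])/) { ESCAPES[$1] }
end

# 128 bytes as stored in the file: 126 data bytes, then 0x5C 0x6E.
raw = "A" * 126 + "\\n"
decoded = unescape_literal(raw)

puts raw.bytesize                         # 128
puts decoded.bytesize                     # 127
puts format('%02X', decoded.bytes.last)   # 0A
```

So a string that occupies 128 bytes in the file legitimately decodes to 127 bytes: the two-byte escape collapses into a single 0x0A byte.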

gettalong commented 8 months ago

If you look directly under the last byte of the selection in the picture, you see that /R has a value of 6. Furthermore, the /O entry indeed has many more bytes but only the first 48 bytes are actually not 0x00. Not sure if the file works correctly, though.

gettalong commented 7 months ago

Thanks for providing the file! I can reproduce the problem and will investigate.

gettalong commented 7 months ago

So, the problem here is twofold:

The values of the /O and /U keys in the encryption dictionary are invalid because they contain more bytes than allowed. When decoded from the in-file representation, they are 127 bytes instead of 48 bytes.
The file has multiple revisions and the initial revision was already encrypted. This means that there may also be multiple encryption dictionaries stored in the file, which is the case here. Usually, those dictionaries have the same object number and are just repeated (they need to contain the same data, otherwise the prior revisions cannot be decrypted correctly). However, in the case of this file the initial encryption dictionary had object number 411,0 and the newer ones 488,0. Additionally, the initial revision, in which the encryption dictionary with object number 411,0 was used, was rewritten (the file was signed, and the signature was not appended in a new revision; instead the initial revision was modified accordingly), so all revisions actually use the encryption dictionary with object number 488,0. And 411,0 is just another object.

So while the errors in the /O and /U keys of the newer and actually used encryption dictionary are corrected before it is used, this is not so for the initial one. When HexaPDF resolves object 411, it is just another dictionary and nothing special in its eyes. Therefore it tries to decrypt it (although the result would not be correct) and runs into the problem of those two keys having only 127 bytes.
There are provisions in HexaPDF to detect the encryption dictionaries of all revisions. However, because the initial encryption dictionary 411,0 was not removed when the file was rewritten using another encryption dictionary with object number 488,0, it cannot be easily detected. On top of that, the writer of the initial revision (Acrobat Distiller 8.1.0 for Windows) didn't encrypt that now ordinary dictionary but left it as is, as if it were indeed an encryption dictionary.
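The length rule behind the error message can be sketched as follows. This is an assumption-laden illustration, not HexaPDF code: `plausible_aes_cbc_length?` is a hypothetical helper, and it assumes the usual layout for AES-encrypted PDF strings of a 16-byte initialization vector followed by at least one 16-byte block.

```ruby
# Hypothetical helper mirroring the "32 + 16*n bytes" rule from the error
# message: 16 bytes of IV plus one or more 16-byte cipher blocks (assumed layout).
def plausible_aes_cbc_length?(data)
  data.bytesize >= 32 && (data.bytesize % 16).zero?
end

puts plausible_aes_cbc_length?("\0".b * 128)  # true
puts plausible_aes_cbc_length?("\0".b * 127)  # false
puts plausible_aes_cbc_length?("\0".b * 16)   # false
```

A 127-byte string fails this check, which matches the situation described for the /O and /U values of object 411.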

As you can see, multiple small things come together here to make HexaPDF choke on this file. Maybe the best way forward in this and similar cases would be to just ignore the decryption error, because doing so would not make the document more corrupt.
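A rough sketch of that idea, under stated assumptions: `DecryptionError`, `decrypt_string` and `decrypt_relaxed` are hypothetical names invented for illustration, the "decryption" only models the length check, and none of this reflects how HexaPDF implements it.

```ruby
# Hypothetical error type and stand-in decryption routine; only the
# length rule is modelled, no actual cryptography takes place.
class DecryptionError < StandardError; end

def decrypt_string(data)
  unless data.bytesize >= 32 && (data.bytesize % 16).zero?
    raise DecryptionError, "Invalid data for decryption, need 32 + 16*n bytes"
  end
  data # a real implementation would return the decrypted bytes here
end

# Relaxed mode: record the problem and keep the raw bytes instead of aborting.
def decrypt_relaxed(data, warnings)
  decrypt_string(data)
rescue DecryptionError => e
  warnings << e.message
  data
end

warnings = []
result = decrypt_relaxed("\0".b * 127, warnings)

puts result.bytesize   # 127
puts warnings.first    # Invalid data for decryption, need 32 + 16*n bytes
```

The string stays as it was in the file, so nothing gets worse, while the warning still records that something was off.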

gettalong commented 7 months ago

@Houdini I have updated HexaPDF and added a new configuration option to deal with decryption problems. The default behavior is now more relaxed when ignoring a decryption problem would not make the situation worse, as is the case here.

So once the new HexaPDF version is released, HexaPDF will handle the file in question without problems.

Houdini commented 7 months ago

@gettalong Do you know when the new version of HexaPDF will be released?

gettalong commented 7 months ago

@Houdini Sorry for the delay. I wanted to include another fix/feature in the new release that took a bit longer than anticipated. However, that code is now in and I will release a new version this week.

gettalong / hexapdf

Invalid data for decryption, need 32 + 16*n bytes, probably parser error #297