gettalong / hexapdf

Versatile PDF creation and manipulation for Ruby
https://hexapdf.gettalong.org

Invalid data for decryption, need 32 + 16*n bytes, probably parser error #297

Closed Houdini closed 4 months ago

Houdini commented 5 months ago

Hello,

First, thanks for the great hexapdf gem and its support!

I found a very strange bug that looks like this:

➜ $ hexapdf inspect /tmp/private_file.pdf 411
Object (411,0): Invalid data for decryption, need 32 + 16*n bytes

I investigated the problem a bit. Please check the screenshot:

Screenshot 2024-03-31 at 01 09 59

The selected text is exactly the string in object 411 where it fails. The number of bytes in the brackets is 128, but hexapdf thinks it's 127. The difference comes from the parse_literal_string method in tokenizer.rb, which contains this code:

data = scan_until(/[()\\\r]/)
...
str << data
...
        when 92 # \\
          str.chop!
          prepare_string_scanner(3)
          byte = @ss.get_byte
          if (data = LITERAL_STRING_ESCAPE_MAP[byte])
            str << data

The backslash (92) is 0x5C. So hexapdf encounters 5C, splits there, enters the loop again, and somehow loses that byte. As a result, this chunk of data is 127 bytes long, not 128 (minus the 5C, and it also somehow replaces 6E with 0A; I have no idea where the 0A comes from).
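To make the 127 versus 128 difference concrete, here is a minimal sketch of the length rule that the error message spells out. It assumes the message means a 16-byte initialization vector followed by at least one 16-byte AES block; `plausible_aes_cbc_length?` is a hypothetical helper written for illustration and is not part of HexaPDF.

```ruby
# Hypothetical helper, not HexaPDF's API. It only mirrors the
# "32 + 16*n bytes" wording of the error message, assuming a 16-byte IV
# followed by at least one 16-byte block of ciphertext.
def plausible_aes_cbc_length?(data, block_size: 16)
  data.bytesize >= 2 * block_size && (data.bytesize % block_size).zero?
end

counted_in_file = "\x00".b * 128  # bytes counted between the brackets
after_tokenizer = "\x00".b * 127  # bytes left once one escape is collapsed

puts plausible_aes_cbc_length?(counted_in_file)  # true  (128 = 32 + 16*6)
puts plausible_aes_cbc_length?(after_tokenizer)  # false (127 is not a multiple of 16)
```

Under that assumption, a single collapsed escape sequence is enough to push the string off a block boundary, which would match the reported error.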

I tried to fix this mistake, but a few things really confused me:

  1. It says that this part is /Filter/Standard/Length 256, but it really contains only 128 bytes.
  2. According to the documentation (7.6.4.2), the size of /O depends on R, but there is no R in this chunk; in any case, it is not 128 bytes for any value of R.
  3. It seems that the tokenizer just skips all this math anyway.

Could you please help me with this bug?

P.S. Please advise me how to send you the PDF file privately.

gettalong commented 5 months ago

Thanks for reporting! You can send the PDF to info@gettalong.at, I will scrub it from the mail and local storage once analyzed.

The code turns the two-byte sequence \n (0x5C 0x6E) into the newline character (which has the hex code 0x0A), see section 7.3.4.2 of the PDF specification for details.
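The escape handling described above can be sketched as follows. This is a simplified illustration, not HexaPDF's tokenizer: `decode_simple_escapes` is a hypothetical name, and only the single-character escapes are covered (octal escapes and backslash line continuations are left out).

```ruby
# Simplified sketch of literal-string escape decoding; not HexaPDF's code.
# Only single-character escapes are handled. Octal escapes (\ddd) and
# backslash + end-of-line continuations are deliberately omitted.
def decode_simple_escapes(raw)
  escapes = {
    0x6E => 0x0A, # \n -> line feed
    0x72 => 0x0D, # \r -> carriage return
    0x74 => 0x09, # \t -> horizontal tab
    0x62 => 0x08, # \b -> backspace
    0x66 => 0x0C, # \f -> form feed
    0x28 => 0x28, # \( -> (
    0x29 => 0x29, # \) -> )
    0x5C => 0x5C  # \\ -> \
  }
  bytes = raw.bytes
  out = []
  i = 0
  while i < bytes.length
    if bytes[i] == 0x5C && i + 1 < bytes.length
      following = bytes[i + 1]
      # For an unknown escape the backslash is dropped and the byte is kept.
      out << escapes.fetch(following, following)
      i += 2
    else
      out << bytes[i]
      i += 1
    end
  end
  out.pack("C*")
end

raw = "\xAB\\n\xCD".b                 # four bytes: AB 5C 6E CD
decoded = decode_simple_escapes(raw)
puts raw.bytesize                     # 4
puts decoded.bytesize                 # 3
puts decoded.unpack1("H*")            # ab0acd
```

The two bytes 5C 6E collapse into the single byte 0A, so any binary string that happens to contain an unescaped backslash comes out shorter after tokenizing than a raw byte count of the file suggests.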

gettalong commented 5 months ago

If you look directly under the last byte of the selection in the picture, you see that /R has a value of 6. Furthermore, the /O entry indeed has many more bytes but only the first 48 bytes are actually not 0x00. Not sure if the file works correctly, though.

gettalong commented 4 months ago

Thanks for providing the file! I can reproduce the problem and will investigate.

gettalong commented 4 months ago

So, the problem here is twofold:

As you can see, multiple small things come together here to make HexaPDF choke on this file. Maybe the best way forward in this and similar cases would be to just ignore the decryption error because doing so would not make the document more corrupt.

gettalong commented 4 months ago

@Houdini I have updated HexaPDF and added a new configuration option to deal with decryption problems. The default has been changed to be more relaxed if a decryption problem might not make the situation worse, as is the case here.

So once the new HexaPDF version is released, HexaPDF will handle the file in question without problems.

Houdini commented 3 months ago

@gettalong Do you know when the new version of HexaPDF will be released?

gettalong commented 3 months ago

@Houdini Sorry for the delay. I wanted to include another fix/feature in the new release that took a bit longer than anticipated. However, that code is now in and I will release a new version this week.