gettalong / hexapdf

Versatile PDF creation and manipulation for Ruby
https://hexapdf.gettalong.org

Invalid data for decryption #233

Closed: earthlingworks closed this issue 1 year ago

earthlingworks commented 1 year ago

Hey Thomas, we ran into an issue when flattening a PDF (without that part of the script it works fine).

The specific error: Invalid data for decryption, need 32 + 16*n bytes (HexaPDF::EncryptionError)

Will send the PDF through email. Script to reproduce here:

require 'hexapdf'

# Open the PDF given as the first command-line argument.
d = HexaPDF::Document.open(ARGV[0])
# Flatten the annotations of every page.
d.pages.each do |p|
  p.flatten_annotations
end
path = 'output.pdf'
d.write(path, validate: false, optimize: true)

gettalong commented 1 year ago

Thanks for the bug report - I can confirm the problem.

gettalong commented 1 year ago

Ah, found the problem and it's a "funny" one.

When running your script, it shows that object 946 has an error while decrypting. However, when running hexapdf inspect gh233.pdf 946 the object gets successfully decrypted. Hmm... after adding some debug statements I found the problem.

HexaPDF doesn't load the whole PDF file while parsing; it only loads chunks. This is needed because HexaPDF internally uses StringScanner, and this class only works on strings and not, alas, on an IO.

When running hexapdf inspect gh233.pdf 946, the parser positions the read pointer at the start position of object 946 and loads 8192 bytes, enough to cover the whole serialized object. So far so good.
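To make the chunking concrete, here is a minimal sketch of reading tokens from an IO through a StringScanner window. The ChunkedScanner class, its method names and the whitespace-separated token grammar are all made up for illustration; this is not HexaPDF's tokenizer, only the general technique.

```ruby
require 'strscan'
require 'stringio'

# Hypothetical helper (not HexaPDF's real tokenizer): returns
# whitespace-separated tokens from an IO while keeping only a small
# window of the data in memory.
class ChunkedScanner
  def initialize(io, chunk_size: 8192)
    @io = io
    @chunk_size = chunk_size
    @scanner = StringScanner.new(String.new)
  end

  # Returns the next token, or nil once the input is exhausted.
  def next_token
    loop do
      @scanner.skip(/\s+/)
      start = @scanner.pos
      token = @scanner.scan(/\S+/)
      # A token that ends before the end of the buffer is complete.
      return token if token && !@scanner.eos?

      # Otherwise the token may continue in the next chunk (or there is
      # no token yet): rewind, load more data and try again.
      @scanner.pos = start
      next if load_more

      # Nothing left to load, so the end of the buffer is the real end.
      @scanner.terminate
      return token
    end
  end

  private

  # Drops the consumed part of the buffer and appends the next chunk.
  # Returns false if the IO has no more data.
  def load_more
    chunk = @io.read(@chunk_size)
    return false if chunk.nil?

    @scanner.string = @scanner.rest + chunk
    true
  end
end

# A tiny chunk size forces tokens to straddle chunk boundaries.
scanner = ChunkedScanner.new(StringIO.new('1 0 obj 946 endobj'), chunk_size: 4)
tokens = []
while (token = scanner.next_token)
  tokens << token
end
p tokens
```

The important detail is the rewind before loading more data: a token that touches the end of the buffer cannot be trusted until either more bytes have arrived or the IO is known to be exhausted.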

When running your script, the objects get loaded serially due to how optimization works: first object 1, then 2, 3, 4, ... until 946. At this point the parser doesn't load additional bytes from the file since the last read already covers at least the start of object 946. Then it encounters the encrypted string, and right in the middle of this string the loaded bytes run out. Normally this isn't much of a problem since the code just loads more data from the file and continues. But here the cut point was right in the middle of an escape sequence, and the code did not ensure that enough bytes were left to read the escape sequence in full. So only a part of it was read, and on the next iteration the rest was read as ordinary digits, leading to the problem of the invalid number of bytes (e.g. 2 more than expected).
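As an illustration of why two stray bytes are fatal: an AES-encrypted string in a PDF starts with a 16-byte initialization vector followed by padded 16-byte blocks, which is where the "32 + 16*n" in the error message comes from. A small sketch of such a length check follows; the helper and constant names are hypothetical, and HexaPDF's own check may well look different.

```ruby
# Hypothetical helper mirroring the "32 + 16*n" rule from the error message.
IV_SIZE = 16     # initialization vector stored in front of the data
BLOCK_SIZE = 16  # AES block size; the padded data has at least one block

def plausible_aes_length?(byte_count)
  byte_count >= IV_SIZE + BLOCK_SIZE && (byte_count % BLOCK_SIZE).zero?
end

p plausible_aes_length?(48)  # IV plus two blocks
p plausible_aes_length?(50)  # the same string with two extra bytes
```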

Making sure that enough bytes are loaded for correct operation fixes this problem.
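The effect can be reproduced in a few lines. This is only a sketch under assumptions: it handles nothing but octal escapes of the form \ddd (one to three digits), the method name and the guard switch are invented for the demonstration, and it is not the actual code or patch in HexaPDF.

```ruby
require 'strscan'
require 'stringio'

# Decodes octal escapes (\ddd, one to three digits) from io, loading
# chunk_size bytes at a time. With guard: false, an escape cut off by the
# end of a chunk is decoded from the digits seen so far (the bug). With
# guard: true, more data is loaded whenever fewer than four bytes remain.
def decode_octal_escapes(io, chunk_size:, guard:)
  scanner = StringScanner.new(String.new)
  out = String.new
  min_bytes = guard ? 4 : 1 # "\ddd" is at most four bytes long
  loop do
    while scanner.rest_size < min_bytes && (chunk = io.read(chunk_size))
      scanner.string = scanner.rest + chunk
    end
    break if scanner.eos?

    if scanner.scan(/\\([0-7]{1,3})/)
      out << (scanner[1].oct & 0xFF).chr # escape sequence -> single byte
    else
      out << scanner.getch               # any other character is copied
    end
  end
  out
end

data = 'ab\053cd' # \053 is the octal escape for "+"

# The chunk boundary falls after "\0", in the middle of the escape.
buggy = decode_octal_escapes(StringIO.new(data), chunk_size: 4, guard: false)
fixed = decode_octal_escapes(StringIO.new(data), chunk_size: 4, guard: true)

p buggy.bytesize # 7: a NUL byte plus the literal digits "5" and "3"
p fixed          # "ab+cd", 5 bytes
```

In the unguarded run the escape yields one wrong byte and the two leftover digits are copied as-is, so the result is two bytes longer than it should be, the same kind of off-by-two that makes the decryption length check fail.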

earthlingworks commented 1 year ago

Interesting. I actually think I followed most of that (glad you're the one figuring this stuff out vs me, hah). Thank you!