Closed redfast00 closed 3 months ago
I can confirm that I can decrypt the document with the code in https://github.com/py-pdf/pypdf/pull/2537
Interesting... can you try with the previous release? 0.23.1
I can't test myself right now, unfortunately, but I'll take a look later.
Yes, with pip install 'pyHanko[pkcs11,image-support,opentype,xmp]==0.23.1', I get the same error.
Additionally, while trying to decrypt another document (that I also successfully decrypted before with my patch to pypdf), I got:
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 780, in _import_object
imported = self._import_object(refd, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 793, in _import_object
raw_dict = {
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 794, in <dictcomp>
k: self._import_object(v, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 780, in _import_object
imported = self._import_object(refd, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 793, in _import_object
raw_dict = {
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 794, in <dictcomp>
k: self._import_object(v, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 780, in _import_object
imported = self._import_object(refd, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 793, in _import_object
raw_dict = {
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 794, in <dictcomp>
k: self._import_object(v, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 780, in _import_object
imported = self._import_object(refd, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 793, in _import_object
raw_dict = {
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 794, in <dictcomp>
k: self._import_object(v, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 780, in _import_object
imported = self._import_object(refd, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 793, in _import_object
raw_dict = {
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 794, in <dictcomp>
k: self._import_object(v, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 780, in _import_object
imported = self._import_object(refd, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 793, in _import_object
raw_dict = {
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 794, in <dictcomp>
k: self._import_object(v, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 780, in _import_object
imported = self._import_object(refd, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 793, in _import_object
raw_dict = {
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 794, in <dictcomp>
k: self._import_object(v, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 780, in _import_object
imported = self._import_object(refd, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 793, in _import_object
raw_dict = {
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 794, in <dictcomp>
k: self._import_object(v, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 780, in _import_object
imported = self._import_object(refd, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 793, in _import_object
raw_dict = {
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 794, in <dictcomp>
k: self._import_object(v, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 780, in _import_object
imported = self._import_object(refd, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 793, in _import_object
raw_dict = {
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 794, in <dictcomp>
k: self._import_object(v, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 780, in _import_object
imported = self._import_object(refd, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 793, in _import_object
raw_dict = {
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 794, in <dictcomp>
k: self._import_object(v, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 780, in _import_object
imported = self._import_object(refd, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 793, in _import_object
raw_dict = {
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 794, in <dictcomp>
k: self._import_object(v, reference_map, obj_stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/writer.py", line 774, in _import_object
refd = obj.get_object()
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/generic.py", line 529, in get_object
obj = self.reference.get_object()
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/generic.py", line 204, in get_object
return self.pdf.get_object(self).get_object()
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/reader.py", line 425, in get_object
obj = self._read_object(
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/reader.py", line 478, in _read_object
retval = self._get_object_from_stream(
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/reader.py", line 291, in _get_object_from_stream
obj = generic.read_object(
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/generic.py", line 246, in read_object
result = DictionaryObject.read_from_stream(
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/generic.py", line 1273, in read_from_stream
key = read_object(stream, container_ref)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/generic.py", line 240, in read_object
result = NameObject.read_from_stream(stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/generic.py", line 1124, in read_from_stream
name_bytes = read_until_delimiter(stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/misc.py", line 108, in read_until_delimiter
result = _read_until_class(PDF_WHITESPACE + PDF_DELIMITERS, stream)
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/misc.py", line 125, in _read_until_class
return b''.join(_build())
File "/home/user/Documents/secure/venv-oldpyhanko/lib/python3.10/site-packages/pyhanko/pdf_utils/misc.py", line 120, in _build
tok = stream.read(1)
RecursionError: maximum recursion depth exceeded while calling a Python object
Error: Generic processing error.
The full stacktrace is attached: stacktrace.txt
Hi @redfast00,
Those are all pretty interesting... There seems to be some kind of regression in 0.23.1 -> 0.23.2 regarding permission flags in files using the PubSec handler, but so far that appears to be unrelated to what you're seeing.
I see _get_object_from_stream appear in both stack traces, so I suspect this to be a funny interaction between object streams (AKA "compressed objects") and file encryption. I'll investigate further along those lines...
Thank you for looking at it; I understand that this can be frustrating if you don't have the files, but I unfortunately really can't share them. If you need me to try some things out, feel free to ask :smile:
Hmmm, I tried quite a few permutations but could not reproduce what you're seeing. Can you tell me which software (purportedly) created your PDFs? And can you post the content of the /Encrypt dictionary?
For the first document, can you try breakpointing the failure and check if (a) pyHanko is able to decrypt any strings / streams at all in that document, and (b) if it fails only on a specific one, whether that string "looks" encrypted? I'm starting to think that perhaps there's a string somewhere that has not been encrypted by mistake (I can think of some scenarios that could result in that kind of thing).
In the second issue, it appears that the public-key encryption part is working as it should: the bottom of the call stack seems to indicate that we're successfully looking inside an object stream somewhere, which requires said object stream to have been decrypted without issues. Rather, from the very high object ids in your stack trace, I think there's just a very deep path in the tree somewhere that is blowing up the stack (a particularly convoluted structure tree could do that, I suppose...). I didn't spot any loops in the stack trace (which _import_object defends against, FWIW). If the issue is genuinely due to the object graph just being too big, the fix would be to rewrite _import_object to not use recursion, which is of course possible but it requires some refactoring.
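To illustrate the non-recursive approach mentioned above, here is a minimal sketch that copies a deeply nested structure using an explicit work stack instead of recursive calls. It operates on plain Python dicts and lists rather than pyHanko's object types, `copy_iteratively` is a hypothetical name, and this is not the actual `_import_object` code; the `memo` table only loosely mirrors the role a reference map plays in guarding against loops.

```python
def copy_iteratively(root):
    """Copy nested dicts/lists without recursion; tolerates cycles."""
    def new_container(obj):
        return {} if isinstance(obj, dict) else []

    # Leaf values need no traversal at all.
    if not isinstance(root, (dict, list)):
        return root

    result = new_container(root)
    memo = {id(root): result}   # source container id -> its copy
    stack = [(root, result)]    # containers whose children still need copying

    def translate(value):
        # Leaves are reused as-is; containers are copied exactly once.
        if not isinstance(value, (dict, list)):
            return value
        copied = memo.get(id(value))
        if copied is None:
            copied = new_container(value)
            memo[id(value)] = copied
            stack.append((value, copied))
        return copied

    while stack:
        src, dst = stack.pop()
        if isinstance(src, dict):
            for key, value in src.items():
                dst[key] = translate(value)
        else:
            for value in src:
                dst.append(translate(value))
    return result
```

Because pending work lives in a list on the heap, the nesting depth of the input is limited by available memory rather than by the interpreter's recursion limit.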
In the PDF metadata, there is:
For the small document (4 pages) that crashes with the value-error:
'/Creator': 'Acrobat PDFMaker 9.1 for Word',
'/Producer': 'Adobe PDF Library 9.0',
For the big document, that crashes with the recursion error:
{'/Creator': 'FrameMaker 16.0.4',
'/Producer': 'Acrobat Distiller 23.0 (Windows)'}
(identifying information removed)
By breakpointing, I was able to verify that the shared key pyHanko computes is the same one that my working pypdf patch computes. It is indeed able to decrypt strings: most of the plaintexts look like hex garbage, but there is a pattern in them (they start with the same character); and some plaintexts contain readable metadata.
The place where it fails in aes_cbc_decrypt, data is b''; resulting in plaintext also being b'', and there likely not being any more padding. I fixed this by adding a check for zero-length encrypted data (it will then just return b'', just like pypdf does), and then indeed saw that it was trying to decrypt something that looks like a certificate. I added a check to return this as-is instead of decrypting; the new error I get is:
2024-03-26 10:48:33,584 - root - DEBUG - Running with --verbose
2024-03-26 10:48:33,584 - root - DEBUG - There was no configuration to parse.
Key passphrase:
2024-03-26 10:48:39,451 - tzlocal - DEBUG - /etc/timezone found, contents:
Europe/Brussels
2024-03-26 10:48:39,451 - tzlocal - DEBUG - /etc/localtime found
2024-03-26 10:48:39,452 - tzlocal - DEBUG - 2 found:
{'/etc/timezone': 'Europe/Brussels', '/etc/localtime is a symlink to': 'Europe/Brussels'}
2024-03-26 10:48:39,454 - cli - ERROR - Generic processing error.
Traceback (most recent call last):
File "/home/user/Documents/secure/pyHanko/pyhanko/cli/runtime.py", line 50, in pyhanko_exception_manager
yield
File "/home/user/Documents/secure/pyHanko/pyhanko/cli/commands/crypt.py", line 189, in _decrypt_pubkey
w.write(outf)
File "/home/user/Documents/secure/pyHanko/pyhanko/pdf_utils/writer.py", line 608, in write
self._write(stream)
File "/home/user/Documents/secure/pyHanko/pyhanko/pdf_utils/writer.py", line 624, in _write
self._write_objects(stream, object_positions)
File "/home/user/Documents/secure/pyHanko/pyhanko/pdf_utils/writer.py", line 512, in _write_objects
obj.write_to_stream(stream, handler, container_ref)
File "/home/user/Documents/secure/pyHanko/pyhanko/pdf_utils/generic.py", line 1249, in write_to_stream
value.write_to_stream(stream, handler, container_ref)
File "/home/user/Documents/secure/pyHanko/pyhanko/pdf_utils/generic.py", line 487, in write_to_stream
data.write_to_stream(
File "/home/user/Documents/secure/pyHanko/pyhanko/pdf_utils/generic.py", line 1249, in write_to_stream
value.write_to_stream(stream, handler, container_ref)
AttributeError: 'Reference' object has no attribute 'write_to_stream'
Error: Generic processing error.
This document is indeed pretty large (334 pages, 16M), so it's possible that the object graph is indeed too big.
/Encrypt dictionary: This is identifying information, I'm afraid.
Thanks! That helps a lot, I'm now pretty confident that the issues you're seeing are not related to public-key encryption as such (that seems to work), it's got more to do with decrypting the actual objects in the file(s).
The place where it fails in aes_cbc_decrypt, data is b''; resulting in plaintext also being b'', and there likely not being any more padding.
Hmm, this sounds like a bug in the software that produced the document, to be honest. For disambiguation purposes, PKCS#7 padding requires adding a full block of padding bytes even when the message length is an integer multiple of the block size (and AFAIK an empty message is not an exception to this rule). But OK, that's something we can try to tolerate.
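The padding rule described above can be checked with a small pure-Python sketch. The helper names `pkcs7_pad` and `pkcs7_unpad` are made up for this illustration; they are not pyHanko or `cryptography` functions.

```python
BLOCK_SIZE = 16  # AES block size in bytes


def pkcs7_pad(message: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    # The pad length is always between 1 and block_size, so a message that
    # already fills whole blocks (including the empty message) gets a full
    # extra block of padding.
    pad_len = block_size - (len(message) % block_size)
    return message + bytes([pad_len]) * pad_len


def pkcs7_unpad(padded: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    # Correctly padded data is never empty, which is why an empty plaintext
    # after decryption cannot be unpadded.
    if not padded or len(padded) % block_size != 0:
        raise ValueError("padded data must be a non-empty multiple of the block size")
    pad_len = padded[-1]
    if not 1 <= pad_len <= block_size or padded[-pad_len:] != bytes([pad_len]) * pad_len:
        raise ValueError("invalid padding")
    return padded[:-pad_len]


print(len(pkcs7_pad(b"")))        # 16
print(len(pkcs7_pad(b"A" * 16)))  # 32
```

So a conforming producer would emit at least one block of ciphertext (plus the IV) even for an empty string, and tolerating an empty payload is a leniency rather than something the padding scheme allows for.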
AttributeError: 'Reference' object has no attribute 'write_to_stream'
This error is quite curious. Out of curiosity: where did you put that workaround to check for empty strings?
This document is indeed pretty large (334 pages, 16M), so it's possible that the object graph is indeed too big.
Mmkay, I'll look into rewriting _import_object to not work recursively, then.
This is identifying information, I'm afraid.
Fair enough; I think we've ruled out the Encrypt dictionary as a source of problems anyway :).
By the way: are these the only two public-key encrypted documents you have tried, or are there others?
Hi @redfast00,
I think I managed to hunt down all the issues you saw: the object graph thing, the padding issue, and also the AttributeError: 'Reference' object has no attribute 'write_to_stream' error you encountered (coincidentally, I stumbled upon that one while fixing the object importing logic).
Can you try again with the code in #414?
Out of curiosity: where did you put that workaround to check for empty strings?
In aes_cbc_decrypt; but I just checked if len(data) == 0: return data. That's also where I put the workaround for the certificate-looking data that got decrypted but shouldn't be (I think).
By the way: are these the only two public-key encrypted documents you have tried, or are there others?
Correct, these are the only two I've tried in pyHanko.
Can you try again with the code in https://github.com/MatthiasValvekens/pyHanko/pull/414?
Yes! I'll do that tomorrow; thanks for trying to fix this :+1:
@MatthiasValvekens good news and bad news!
With the code from your PR, I still get:
Traceback (most recent call last):
File "/home/user/Documents/secure/pyHanko/pyhanko/cli/runtime.py", line 50, in pyhanko_exception_manager
yield
File "/home/user/Documents/secure/pyHanko/pyhanko/cli/commands/crypt.py", line 187, in _decrypt_pubkey
w = copy_into_new_writer(r)
File "/home/user/Documents/secure/pyHanko/pyhanko/pdf_utils/writer.py", line 1293, in copy_into_new_writer
new_root_dict = importer.import_object(input_handler.root)
File "/home/user/Documents/secure/pyHanko/pyhanko/pdf_utils/writer.py", line 1173, in import_object
imported = self._ingest(source_obj)
File "/home/user/Documents/secure/pyHanko/pyhanko/pdf_utils/writer.py", line 1195, in _ingest
raw_dict = {
File "/home/user/Documents/secure/pyHanko/pyhanko/pdf_utils/writer.py", line 1196, in <dictcomp>
k: self._ingest(v) for k, v in obj.items() if k != '/Metadata'
File "/home/user/Documents/secure/pyHanko/pyhanko/pdf_utils/writer.py", line 1191, in _ingest
obj = obj.decrypted
File "/home/user/Documents/secure/pyHanko/pyhanko/pdf_utils/generic.py", line 2067, in decrypted
decrypted = pdf_string(cf.decrypt(local_key, obj.original_bytes))
File "/home/user/Documents/secure/pyHanko/pyhanko/pdf_utils/crypt/filter_mixins.py", line 134, in decrypt
return aes_cbc_decrypt(key, data, iv)
File "/home/user/Documents/secure/pyHanko/pyhanko/pdf_utils/crypt/_util.py", line 16, in aes_cbc_decrypt
plaintext = decryptor.update(data) + decryptor.finalize()
File "/home/user/Documents/secure/venv/lib/python3.10/site-packages/cryptography/hazmat/primitives/ciphers/base.py", line 184, in finalize
data = self._ctx.finalize()
File "/home/user/Documents/secure/venv/lib/python3.10/site-packages/cryptography/hazmat/backends/openssl/ciphers.py", line 223, in finalize
raise ValueError(
ValueError: The length of the provided data is not a multiple of the block length.
but I already got this error before; if I change the AES decrypt function to:
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def aes_cbc_decrypt(key, data, iv, use_padding=True):
    # ad-hoc workaround: this exact length matches the blob that turned out not to be encrypted
    if len(data) == 8636:
        print("WARNING: special case for somehow not encrypted data")
        print(data)
        return data
    cipher = Cipher(algorithms.AES(key), modes.CBC(iv))
    decryptor = cipher.decryptor()
    plaintext = decryptor.update(data) + decryptor.finalize()
    # we tolerate empty messages that don't have padding
    if use_padding and len(plaintext) > 0:
        unpadder = padding.PKCS7(128).unpadder()
        return unpadder.update(plaintext) + unpadder.finalize()
    else:
        return plaintext
then the small PDF decrypts fine, and seems to look okay :tada:
I also tested the large PDF with the same code, this also decrypts now, and also hits the same 'somehow not encrypted data' path as the small PDF (with the exact same length).
If it is useful to you, I think the 'somehow not encrypted data' is a certificate, padded with null-bytes. It is clearly plaintext, since I can see the string "This certificate has been issued in accordance with the GlobalSign CPS located at https://www.globalsign.com" in the data. pypdf with my patch does not error on this, so I think it should be possible to somehow see the difference between data that needs to be decrypted and data that doesn't (with a better heuristic than just looking at the ciphertext length and seeing if it is a multiple of the block length).
Hi @redfast00,
Aha, I suspect that your input files contain digital signatures... :D. Does the object reside in a key named /Contents?
That would also immediately explain a lot of things:
The AttributeError you ran into before only manifested when a part of the PDF object graph points back to the document catalog. This almost never happens in "regular" PDF documents, unless... the document contains a digital signature with DocMDP.
Anyway, if that's what's going on, then the mystery as to why you're hitting all those wonky corner cases is also solved ;).
I can probably make the importer aware of DigSig data in the input file and make it deal with those things gracefully (but the signatures will not be valid for the decrypted output either way, so be mindful of that).
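As a rough sketch of what "aware of DigSig data" could mean, the following walks a structure of plain dicts, lists and byte strings and applies a decryption callback to every string except the /Contents entry of anything that looks like a signature dictionary. This is not pyHanko's importer: `decrypt_strings` is a hypothetical name, detecting signatures by the presence of /ByteRange next to /Contents is an assumption made for this example, and the function is recursive only for brevity.

```python
def decrypt_strings(obj, decrypt):
    """Return a copy of obj with byte strings decrypted, except signature /Contents."""
    if isinstance(obj, bytes):
        return decrypt(obj)
    if isinstance(obj, list):
        return [decrypt_strings(item, decrypt) for item in obj]
    if isinstance(obj, dict):
        # Assumption: /ByteRange together with /Contents marks a signature dictionary.
        is_signature = "/ByteRange" in obj and "/Contents" in obj
        return {
            key: value
            if (is_signature and key == "/Contents")
            else decrypt_strings(value, decrypt)
            for key, value in obj.items()
        }
    # Numbers, names and other non-string values pass through unchanged.
    return obj
```

For example, with a stand-in "cipher" that merely reverses bytes, `decrypt_strings({"/V": {"/ByteRange": [0, 1, 2, 3], "/Contents": b"sig"}}, lambda b: b[::-1])` leaves `b"sig"` untouched, while strings elsewhere in the structure are transformed.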
@MatthiasValvekens the input file indeed contains a digital signature: it is signed + encrypted.
I don't know what key this object resides in; I couldn't immediately find a way to get the key from the AES decryption function.
but the signatures will not be valid for the decrypted output either way, so be mindful of that
I really don't mind, I'd already be very happy to be able to read my document in an open-source PDF reader (instead of having to boot Windows just to use Adobe Reader).
As feedback from a user, for me it was not immediately clear from the README that pyHanko could be used to decrypt PDF documents; maybe this is useful to add to the README/documentation?
I can probably make the importer aware of DigSig data in the input file and make it deal with those things gracefully (but the signatures will not be valid for the decrypted output either way, so be mindful of that).
I've implemented, tested and merged this (see #414). Can you give it a try on your files? If that works, then we can close this issue as well.
As feedback from a user, for me it was not immediately clear from the README that pyHanko could be used to decrypt PDF documents; maybe this is useful to add to the README/documentation?
Right, the README mentions encryption support but does not explicitly call out encryption/decryption of files as a feature. That's indeed worth some clarification. Thanks!
Works perfectly, thanks for fixing this!
Describe the bug
Crash when trying to decrypt Adobe.PubSec encrypted file
To Reproduce
I'm afraid this will be very hard to reproduce, since I can't share the files used to reproduce this. This is the output:
Expected behavior
The document decrypts.
Environment (please complete the following information):
Python 3.10.12
Additional context
I've had problems with the 40 bit RC2 in the past, so I upgraded the key by following https://www.docuseal.co/docs/convert-legacy-p12-pfx-files-to-support-openssl-3, but I still have the same problem.