MatthiasValvekens / pyHanko

pyHanko: sign and stamp PDF files
MIT License
483 stars 71 forks source link

Expose encryption dictionary in PdfFileReader as instance variable #332

Closed benjamin-awd closed 9 months ago

benjamin-awd commented 10 months ago

Is your feature request related to a problem? Please describe. I'd like to be able to access the encryption dictionary after creating an instance of PdfFileReader, without making an additional call to _get_encryption_params(). https://github.com/MatthiasValvekens/pyHanko/blob/e13f7c1c803b0ea753a8b34a44f2c12018d79d06/pyhanko/pdf_utils/reader.py#L162)

Instead of:

pdf = PdfFileReader("foo.pdf")
bar = pdf._get_encryption_params()

I'd like to have:

pdf = PdfFileReader("foo.pdf")
bar = pdf.encrypt_dict

Describe alternatives you've considered It's possible to use the security_handler instance variable to access certain values as represented in encryption dictionary, but it's nice to be able to still access to the "raw" dictionary.

Additional context For full context, I'm hoping to improve this part of my codebase, which makes an extra call to _get_encryption_params() (which is already done by PyHanko when building the security handler)

MatthiasValvekens commented 10 months ago

Hi,

This seems like a reasonable request at first sight, but I have some maintenance concerns: you only care about being able to read this info, right? The reason why I ask is because if this dictionary were to be exposed as an instance variable, it would have to be done in a way that does not interfere with incremental updates and the like. For now, I think a big fat "don't even think about trying to write to this dictionary" in the docs is enough, but I figured I'd ask.

benjamin-awd commented 10 months ago

you only care about being able to read this info, right?

That is correct -- I'm only interested in reading the dictionary.

I think you have a good point re: maintenance. Besides updating the docs like you said, you could set it as a private class variable

self._encrypt_dict = self._get_encryption_params()

Otherwise, I think using a cached property here could be nice, since that suggests that the encryption dictionary is immutable, is only generated once and shouldn't be modified.

class PdfFileReader(PdfHandler):
    def __init__(self, stream, strict: bool = True):
    ...
        if (encrypt_dict := self.encrypt_dict) is not None:
            self.security_handler = SecurityHandler.build(encrypt_dict)

    @cached_property
    def encrypt_dict(self) -> Optional[generic.DictionaryObject]:
        return self._get_encryption_params()

(this is Python 3.8, so maybe just a regular property without the walrus operator if you want to maintain 3.7 support)

MatthiasValvekens commented 10 months ago

OK, this is probably not hard to do. I have some more pressing things to get out there first, but this sounds like something doable in the short term.

I don't remember off the top of my head whether the result of _get_encryption_params is already cached automatically. It calls get_object under the hood, which implicitly caches a lot of things in "normal" operation, but the never_decrypt=True thing might actually bypass that. I'd have to look into that too.

benjamin-awd commented 10 months ago

No worries -- there's no strong urgency on this, just a nice to have.

I was curious about the caching, so I went to take a look by running this:

from pyhanko.pdf_utils.reader import PdfFileReader
import pysnooper

@pysnooper.snoop("out.log", depth=5)
def main():
    with open("protected.pdf", "rb") as file:
        pdf = PdfFileReader(file)

    second_call= pdf._get_encryption_params()

main()

It seems there is indeed some caching going on under the hood -- on the first run while passing the file to PdfFileReader, the cache_get_indirect_object returns None

Source path:... /home/ben/dev/pyHanko/pyhanko/pdf_utils/reader.py
New var:....... encrypt_ref = IndirectObject(12, 0)
21:25:02.402659 line       288         if isinstance(encrypt_ref, generic.IndirectObject):
21:25:02.402890 line       289             return self.get_object(encrypt_ref.reference, never_decrypt=True)
    Starting var:.. self = <pyhanko.pdf_utils.reader.PdfFileReader object at 0x7f0c1b5cc450>
    Starting var:.. ref = Reference(idnum=12, generation=0)
    Starting var:.. revision = None
    Starting var:.. never_decrypt = True
    Starting var:.. transparent_decrypt = True
    Starting var:.. as_metadata_stream = False
    21:25:02.402994 call       327     def get_object(
    21:25:02.403355 line       366         if ref in self.xrefs.xref_stream_refs:
        Source path:... <string>
        Starting var:.. self = Reference(idnum=12, generation=0)
        21:25:02.403463 call         2 Implementation of PDF object types and other generic functionality.
        21:25:02.404057 return       3 The internals were imported from PyPDF2, with modifications.
        Return value:.. 4031436574962064298
    Source path:... /home/ben/dev/pyHanko/pyhanko/pdf_utils/reader.py
    21:25:02.404222 line       368         if revision is None:
    21:25:02.404391 line       369             obj = self.cache_get_indirect_object(ref.generation, ref.idnum)
        Starting var:.. self = <pyhanko.pdf_utils.reader.PdfFileReader object at 0x7f0c1b5cc450>
        Starting var:.. generation = 0
        Starting var:.. idnum = 12
        21:25:02.404494 call       470     def cache_get_indirect_object(self, generation, idnum):
        21:25:02.404802 line       471         out = self.resolved_objects.get((generation, idnum))
        New var:....... out = None
        21:25:02.404900 line       472         return out
        21:25:02.405072 return     472         return out
        Return value:.. None

but when I call the _get_encryption_params() again via the second_call variable, it returns the encryption dictionary.

    New var:....... encrypt_ref = IndirectObject(12, 0)
    21:25:02.437057 line       288         if isinstance(encrypt_ref, generic.IndirectObject):
    21:25:02.437296 line       289             return self.get_object(encrypt_ref.reference, never_decrypt=True)
        Starting var:.. self = <pyhanko.pdf_utils.reader.PdfFileReader object at 0x7f0c1b5cc450>
        Starting var:.. ref = Reference(idnum=12, generation=0)
        Starting var:.. revision = None
        Starting var:.. never_decrypt = True
        Starting var:.. transparent_decrypt = True
        Starting var:.. as_metadata_stream = False
        21:25:02.437395 call       327     def get_object(
        21:25:02.438047 line       366         if ref in self.xrefs.xref_stream_refs:
            Source path:... <string>
            Starting var:.. self = Reference(idnum=12, generation=0)
            21:25:02.438177 call         2 Implementation of PDF object types and other generic functionality.
            21:25:02.438612 return       3 The internals were imported from PyPDF2, with modifications.
            Return value:.. 4031436574962064298
        Source path:... /home/ben/dev/pyHanko/pyhanko/pdf_utils/reader.py
        21:25:02.438823 line       368         if revision is None:
        21:25:02.439037 line       369             obj = self.cache_get_indirect_object(ref.generation, ref.idnum)
            Starting var:.. self = <pyhanko.pdf_utils.reader.PdfFileReader object at 0x7f0c1b5cc450>
            Starting var:.. generation = 0
            Starting var:.. idnum = 12
            21:25:02.439166 call       470     def cache_get_indirect_object(self, generation, idnum):
            21:25:02.439529 line       471         out = self.resolved_objects.get((generation, idnum))
            New var:....... out = {'/CF': {'/StdCF': {'/AuthEvent': '/DocOpen', '/...xbb\xa7\x05g\xea9\xa5f[\x827(x\x8e\x8e', '/V': 5}
            21:25:02.439663 line       472         return out
            21:25:02.439881 return     472         return out
            Return value:.. {'/CF': {'/StdCF': {'/AuthEvent': '/DocOpen', '/...xbb\xa7\x05g\xea9\xa5f[\x827(x\x8e\x8e', '/V': 5}
MatthiasValvekens commented 9 months ago

Coming back to this now (apologies for the delay): now that we've established that _get_encryption_params() hits the object cache anyhow, what would be the added value of exposing it as an underscored property?

benjamin-awd commented 9 months ago

I would say that there's some value in exposing it as a regular (instead of underscored) property.

I think it's desirable as an end-user to use a public property (where available) instead of a private method. Mostly because then there isn't a need to worry about the underlying logic of "getting" the encryption dictionary. Calling _get_encryption_params() directly can also cause linters like Pylint to complain about protected access.

If the security handler is also cached, you could also just make it a property as well, which would simplify the __init__ clause of the PdfFileReader class.

If you like, I could try to take a stab at this.

MatthiasValvekens commented 9 months ago

OK, I see your point if it's a "public" property. The implementation of the property would basically be the same as _get_encryption_params(), though (potentially with some extra safeguards to prevent it from messing with incremental updates). I'll have another look :).

I'm a bit hesitant to mess with the way the security handler is loaded; that's a pretty delicate thing, and it's in an annoying position where it needs some bare minimum PDF reading primitives for bootstrapping purposes, but is also essential to being able to interact with the rest of the PDF document (assuming it's encrypted, that is), so setting it up in a "suprise-free" way is not easy. I'm open to discussion, though: I'm also not a fan of the fact that it currently involves doing nontrivial work in __init__. But the "chicken or egg"-ish nature of the current setup is largely due to the fact that the encryption dictionary is, quite frankly, badly encapsulated in the PDF specification as such.

benjamin-awd commented 9 months ago

Yeah, I tried playing around with the security handler to try to make the __init__ statement cleaner, but basically just ended up shifting complexity (i.e. the if condition) to the SecurityHandler.build() method. It would probably be non-trivial to make into a property of some kind.

MatthiasValvekens commented 9 months ago

See linked PR, comments welcome :).