MatthiasValvekens / pyHanko

pyHanko: sign and stamp PDF files
MIT License

How to get custom text in sign #423

Closed Asterix45 closed 2 months ago

Asterix45 commented 3 months ago

Hello,

I'm trying to get the custom text of a signature from a PDF file. I opened the PDF file like this:

from pyhanko.pdf_utils.reader import PdfFileReader

with open('file.pdf', 'rb') as doc:
    r = PdfFileReader(doc)

Then, inspecting runtime variables, I saw that the signature I want to read is in this variable:

r.embedded_signatures[1]

(My document has two signatures; I want to work on the second one.)

Inside embedded_signatures[1] I can find all the information about the signing certificate, the provider, and the owner of the signature, but I can't find the text shown inside the signature, and I would like to understand how to get it.

To be as clear as possible: in the documentation, this feature is used here to generate this signature (image: text-stamp-basic).

In this case, after opening the signed PDF file, I would like to get the text

"This is custom text! Signed by: Alice alice@example.com Time: 2021-06-24 08:00:00 CEST"

MatthiasValvekens commented 2 months ago

Hi @Asterix45,

That is a surprisingly hard problem ;). Text extraction from general PDFs is a whole domain in itself, and very far outside the scope of pyHanko. Whether the text appears in a signature appearance or not makes things slightly easier (in that you know which content streams to analyse), but actually extracting the text and decoding it into something readable is not always trivial.

It would be doable to hack something together that works (somewhat) reliably on pyHanko output, because pyHanko is quite reasonable by default (it supplies a ToUnicode map for embedded fonts, reading order matches content stream order, etc. etc.). But even for that, I think pulling in a library that properly supports text extraction is better. iText does this, among many others: https://github.com/itext/itext-java.
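To give a feel for what such a hack involves, here is a minimal sketch that pulls literal strings out of an already-decoded content stream. It assumes the text is shown with plain `(...) Tj` operators and a single-byte encoding; hex strings, `TJ` arrays, nested unescaped parentheses, and fonts that need a ToUnicode lookup are not handled. The function name `naive_tj_strings` is made up for illustration and is not part of pyHanko.

```python
import re

# Matches a literal string operand followed by the Tj operator, e.g. "(Hello) Tj".
# Escaped characters are allowed inside; nested unescaped parentheses are not.
_TJ_LITERAL = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj", re.DOTALL)

# Single-character escapes defined for PDF literal strings.
_ESCAPES = {
    b"n": b"\n", b"r": b"\r", b"t": b"\t", b"b": b"\b", b"f": b"\f",
    b"(": b"(", b")": b")", b"\\": b"\\",
    b"\n": b"",  # backslash + newline is a line continuation
}


def _unescape(raw: bytes) -> bytes:
    """Resolve backslash escapes (including octal codes) in a literal string."""
    def repl(match):
        body = match.group(1)
        if re.fullmatch(rb"[0-7]{1,3}", body):
            return bytes([int(body, 8) & 0xFF])
        # Unknown escapes: drop the backslash and keep the character.
        return _ESCAPES.get(body, body)

    return re.sub(rb"\\([0-7]{1,3}|.)", repl, raw, flags=re.DOTALL)


def naive_tj_strings(content: bytes) -> list:
    """Return the text operands of all literal-string Tj operators, in stream order."""
    # Assumption: a single-byte encoding close enough to Latin-1 for display.
    return [_unescape(m.group(1)).decode("latin-1")
            for m in _TJ_LITERAL.finditer(content)]


if __name__ == "__main__":
    sample = b"BT /F1 10 Tf (This is custom text!) Tj T* (Signed by: Alice) Tj ET"
    print(naive_tj_strings(sample))
```

Even on friendly input this only recovers the strings in content stream order; line breaks, spacing, and anything drawn through an embedded font still need real text-extraction logic.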

That said, I suspect that your question is actually an X/Y problem. Are you really trying to extract text, or do you simply want access to metadata about the signature and/or the signer? Because there are easier ways to go about that.
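If metadata is what you are after, something along these lines may be enough. It is a sketch, not a tested recipe: it assumes the objects in `r.embedded_signatures` expose `field_name`, `signer_cert`, `self_reported_timestamp`, and `sig_object` (the signature dictionary), and that `/Reason` and `/Location` are present only if the signer filled them in. The helper names `describe_signature` and `describe_all` are made up for illustration.

```python
def describe_signature(sig) -> dict:
    """Collect signer metadata from an embedded-signature object.

    Assumption: `sig` looks like a pyHanko embedded signature, i.e. it has
    the attributes used below. Missing dictionary entries come back as None.
    """
    sig_dict = sig.sig_object  # the signature dictionary from the PDF
    return {
        "field_name": sig.field_name,
        # signer_cert is assumed to be an asn1crypto certificate
        "signer": sig.signer_cert.subject.human_friendly,
        # time claimed by the signer, not a trusted timestamp
        "claimed_time": sig.self_reported_timestamp,
        "reason": sig_dict.get("/Reason"),
        "location": sig_dict.get("/Location"),
    }


def describe_all(path: str) -> list:
    """Describe every embedded signature in the PDF at `path`."""
    # Imported lazily so describe_signature can be used without pyHanko.
    from pyhanko.pdf_utils.reader import PdfFileReader

    with open(path, "rb") as doc:
        r = PdfFileReader(doc)
        # Read everything while the file is still open.
        return [describe_signature(s) for s in r.embedded_signatures]
```

For the document in the question, `describe_all('file.pdf')[1]` would then describe the second signature. Note that none of these values are validated here; they are just what the file claims.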

EDIT: I'm also converting this to a discussion.