MatthiasValvekens / pyHanko

pyHanko: sign and stamp PDF files
MIT License

Converting existing parts of a document to empty signature field #8

Open ofcaah opened 3 years ago

ofcaah commented 3 years ago

Hi! First of all, this is not an issue with the toolkit itself. So far it works great for my uses. I'd love to see more "alpha quality" code like this... ;)

To the point: I'm currently using headless LibreOffice to convert a template to a final PDF file, with some replacements along the way. The problem is that I have to manually "target" the coordinates for signature fields by trial and error. Do you have an idea how I would go about automating the process in pyHanko? The number of required signatures is dynamic, so I'm thinking of some kind of search & replace type solution here.

MatthiasValvekens commented 3 years ago

Hi there!

You're right that targeting a region on the page by trial and error is not really a good way to go about this. That being said, going about it any other way is generally pretty hard. As you may or may not know, a content stream in PDF is just a concatenation of a bunch of graphical operators that draw stuff, so even something as simple as "find the location of all rectangles on the page" is a nontrivial task.

Since PDF 1.4, it's possible to tag content & structure in a PDF file. Essentially, this works by inserting content markers into the content streams in the file, and then arranging these markers into a DOM-like tree. If the document you're operating on is decently tagged, you would then (for example) be able to do things such as finding a particular table cell, grab its bounding box, and put in a signature field that fills the table cell. I know that there's an option in the LibreOffice GUI to output tagged PDF files, so probably the same is true for headless LibreOffice as well.

Right now, only some parts of the validation code in pyHanko are tag-aware at all (and only minimally so). I have plans to expand that functionality in the future, but I'm not sure when that is going to happen. I'll definitely keep your use case in mind, though, and try to think of a sensible API to "convert" PDF structure elements to signature fields that doesn't make too many assumptions about the input.

In the meantime: I thought LibreOffice also had some basic signing functionality. Couldn't you use that to set up the fields? I haven't really used it much myself, though, so I wouldn't be able to tell you how it works.

ofcaah commented 3 years ago

There are some open requests about this in the wild, but there was no way to export empty signature fields from LibreOffice as of two months ago. One such request is here: https://bugs.documentfoundation.org/show_bug.cgi?id=126207 - my message from last year being the last.

Actually, to my knowledge that's quite a rare feature in the whole PDF ecosystem, especially when taking only open solutions into account. I'm using this to generate contracts that are pre-stamped with a certification signature to prevent unwelcome alterations, and that have predefined empty signature fields. It works as a PoC, and the only problem I've found so far is adding the new fields in a predictable and visually appealing fashion. Currently I'm adding a plain rectangle in LO and then filling it with a sigfield using pyHanko. I'm in full control of the template documents, so I can modify them in whatever way is most sensible and approachable, once you find the time to suggest one.

MatthiasValvekens commented 3 years ago

OK, I'll keep you posted! With some luck, I might be able to get this into 0.6.0 in some form, but I can't promise anything quite yet :)

MatthiasValvekens commented 3 years ago

I did some exploratory testing using LibreOffice, and it seems that this is going to be even harder than I thought.... I tried adding a text box (to serve as a signature placeholder) to a one-page, fairly well-structured ODF file. The text box was anchored to a paragraph in a table. I also gave the text box a visible red border, just to make it easier to find when scrolling through the graphics operators in the content stream. I then exported the file to tagged PDF.

This is what happened structurally:

Graphically, it's even more of a mess. This is the PDF graphics code generated by LibreOffice to render the text box:

1 0 0 RG
q
1.4 w
0 J
1 j
325.55 468.139 m
214.4 468.139 l
214.4 504.339 l
436.65 504.339 l
436.65 468.139 l
325.55 468.139 l
h
S
Q
/P <</MCID 13 >> BDC
q
0 0 0 rg
BT
215.1 493.589 Td
/F3 12 Tf
<0102030405060708090a0b0c0d060e0a0f100d110a0912> Tj
ET
Q
EMC

As you can see, the marked content sequence (MCID 13) that identifies the text box doesn't even cover the border of the text box. Hence, even if you have the right structure element, figuring out where it lives on the page is hard. The graphics operators drawing the border happen to be positioned close by in this case, but that's obviously not something we can rely on in general. Also, instead of using the rectangle operator, LibreOffice draws the text box's border as a piecewise linear path (in fact, with dotted line styles, it gets even more ridiculous).
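
To make the point about the border concrete, here is a minimal sketch (not pyHanko code) that recovers the bounding box of a stroked path from a decoded content stream like the one above. It assumes an identity transformation matrix, whitespace-separated tokens, no string literals containing spaces, and it only understands the `m`, `l`, `re`, `S`/`s`, `q`/`Q` and `RG` operators. The name `stroked_path_boxes` is hypothetical, and a real parser would have to handle far more than this.

```python
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]
RGB = Tuple[float, float, float]

# Border-drawing part of the content stream quoted in this thread.
SAMPLE = """
1 0 0 RG
q
1.4 w
0 J
1 j
325.55 468.139 m
214.4 468.139 l
214.4 504.339 l
436.65 504.339 l
436.65 468.139 l
325.55 468.139 l
h
S
Q
"""


def stroked_path_boxes(content: str, stroke_rgb: Optional[RGB] = None) -> List[BBox]:
    """Return bounding boxes of stroked paths, optionally filtered by stroke colour."""
    operands: List[float] = []
    points: List[Tuple[float, float]] = []
    colour: RGB = (0.0, 0.0, 0.0)  # the default stroke colour in PDF is black
    saved: List[RGB] = []          # simplified graphics state stack (colour only)
    boxes: List[BBox] = []

    for token in content.split():
        try:
            operands.append(float(token))
            continue
        except ValueError:
            pass  # not a number, so treat the token as an operator

        if token == "RG" and len(operands) >= 3:
            colour = (operands[-3], operands[-2], operands[-1])
        elif token == "q":
            saved.append(colour)
        elif token == "Q" and saved:
            colour = saved.pop()
        elif token in ("m", "l") and len(operands) >= 2:
            points.append((operands[-2], operands[-1]))
        elif token == "re" and len(operands) >= 4:
            x, y, w, h = operands[-4:]
            points += [(x, y), (x + w, y + h)]
        elif token in ("S", "s"):
            # Only plain strokes are reported; fill-and-stroke is ignored here.
            if points and (stroke_rgb is None or colour == stroke_rgb):
                xs, ys = zip(*points)
                boxes.append((min(xs), min(ys), max(xs), max(ys)))
            points = []
        elif token in ("n", "f", "F", "f*", "B", "B*", "b", "b*"):
            points = []  # path ended without a plain stroke
        operands = []    # every operator consumes its operands
    return boxes


red_boxes = stroked_path_boxes(SAMPLE, stroke_rgb=(1, 0, 0))
```

On the snippet from this thread, filtering for the red stroke colour yields a single box, `(214.4, 468.139, 436.65, 504.339)`. Note that the colour is set before the `q`, which is why even this toy version has to track saved graphics state.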

All this doesn't even begin to touch the issue of compatibility between rendering applications... So yeah, this is going to be a tricky one to get right ;)

ofcaah commented 3 years ago

Yup, that pretty much sums up the research I did with pdf2py some time ago. From my point of view, I'd accept conversion of pretty much any object that doesn't collide with the "normal" contract text, such as buttons or other objects. Help! :)

MatthiasValvekens commented 3 years ago

Here's an idle thought (haven't had coffee yet, so take this with a grain of salt). Any solution that relies on the actual content stream of the page to identify a region to be signed will be tricky to implement, since it requires a parser that understands the various geometric operators in PDF with a fairly high degree of generality.

Part of the problem here is that form fields are fundamentally different from page content in a way: they are rendered as widget annotations that live "outside" the page's main content, both in terms of file structure and how they're rendered graphically. So one possible way to get around the problem might be to rely on placeholder annotations instead. These require much less effort to manipulate into a form field widget. Of course, it assumes that whatever you're using to produce your PDFs has the ability to output annotations, but if I recall correctly, that's the case for LibreOffice, right?

Anyway, you could then convert LibreOffice output (potentially with form fields already embedded) into a "signable" form by (for example) replacing all text annotations with a particular content string with form fields.

I'll toy around with that idea a bit when I find the time...

tuelle commented 3 years ago

Hi,

I've also been working on a method to replace placeholders in PDFs generated from a docx file. The following code works fine for me, but my use case also has documents that are more or less standardized. I am using pdfplumber to search for table cells or rectangles that contain a certain string, e.g. "signature field", in the PDF.

Perhaps it gives you some inspiration. I haven't yet checked it with PDFs generated by programs other than MS Word.

You can find the code attached.

Best regards Thorsten

extract.zip example.pdf

ofcaah commented 3 years ago

Thanks Matthias and Thorsten, I'll take another deep look at this tomorrow and figure something out from your suggestions. In the meantime I've moved signature boxes to a static location relative to bottom of last page, which seems like a workable workaround for the time being.
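
A sketch of that kind of workaround: computing a row of evenly spaced boxes above the bottom margin for a variable number of signers. All dimensions (an A4 width of 595 pt, the margins, the box height) and the name `bottom_row_boxes` are illustrative assumptions, not values taken from this thread.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (ll_x, ll_y, ur_x, ur_y) in PDF points


def bottom_row_boxes(
    count: int,
    page_width: float = 595,  # assumed A4 portrait width in points
    margin: float = 40,       # assumed left/right margin
    bottom: int = 60,         # assumed distance from the bottom edge
    height: int = 50,         # assumed box height
    gap: float = 15,          # assumed horizontal gap between boxes
) -> List[Box]:
    """Lay out `count` signature boxes side by side near the bottom of a page."""
    if count < 1:
        raise ValueError("need at least one signature box")
    usable = page_width - 2 * margin - gap * (count - 1)
    width = usable / count
    boxes: List[Box] = []
    for i in range(count):
        x0 = margin + i * (width + gap)
        boxes.append((round(x0), bottom, round(x0 + width), bottom + height))
    return boxes


two_signers = bottom_row_boxes(2)
```

Each tuple could then be used as the `box` of a pyHanko `SigFieldSpec`; as far as I know, `on_page=-1` refers to the last page there.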

MatthiasValvekens commented 3 years ago

It doesn't really solve the exact issue that you're having, but if you're willing to migrate your templating work to TeX, there are LaTeX packages that are capable of producing forms (including signature fields) out there: https://tex.stackexchange.com/questions/51090/how-do-i-create-a-pdf-file-that-can-be-digitally-signed.

(I still haven't found a good way to solve the general problem, though)

FernandoJCabral commented 3 years ago

Well, I have been working on a similar problem. I want to position the visible signature somewhere on the last page of an A4 document. Positioning on the last page is not a problem. Tick. Positioning horizontally is not a problem. Tick. The problem is finding the vertical displacement. It varies from document to document. A page may have a single, short paragraph, or various short and long paragraphs. So, the visible signature should appear perhaps 2 cm or so after the last line of the last paragraph. Finding this position automagically has not been easy.

What I have tried to do is to get the vertical displacement using a macro in Basic (but called by a function in python) that returns the visual cursor position (x, y).

Now, before calling pyhanko, I convert the LO cursor position X, Y into "--field -1/x,y,x+a,y+a/Sig". Not particularly elegant, but it works. I am still working to perfect this solution and make it foolproof.
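
A minimal sketch of such a conversion, under some loud assumptions: the cursor position is given in 1/100 mm relative to the top-left corner of the target page, the page is unrotated, and the box should hang below the cursor. The helper names `hmm_to_pt` and `field_spec` are hypothetical. Coordinates are rounded to whole points, which is the safe choice if the command line expects integers.

```python
def hmm_to_pt(value_hmm: float) -> float:
    """Convert 1/100 mm (assumed LibreOffice unit) to PDF points (1/72 inch)."""
    return value_hmm * 72.0 / 2540.0


def field_spec(
    x_hmm: float,
    y_hmm: float,
    page_height_pt: float,
    width_pt: float,
    height_pt: float,
    name: str = "Sig",
    page: int = -1,  # -1 means the last page, as in the example above
) -> str:
    """Build a PAGE/X1,Y1,X2,Y2/NAME string for `pyhanko sign addfields --field`."""
    left = hmm_to_pt(x_hmm)
    # PDF y coordinates grow upwards, so flip the top-down cursor position.
    top = page_height_pt - hmm_to_pt(y_hmm)
    box = (left, top - height_pt, left + width_pt, top)
    coords = ",".join(str(round(v)) for v in box)
    return f"{page}/{coords}/{name}"


spec = field_spec(2540, 2540, page_height_pt=842, width_pt=150, height_pt=50)
```

Here a cursor 25.4 mm from the left and top edges of an 842 pt high page gives `-1/72,720,222,770/Sig`, which would be passed as `pyhanko sign addfields --field "-1/72,720,222,770/Sig" input.pdf output.pdf`.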

I would guess you could apply a similar trick but using your template as reference. Perhaps counting the number of lines in the page where you want the signature to be placed.

MatthiasValvekens commented 3 years ago

Thanks; interesting approach! I can imagine that idea working quite well if you have access to the LO template. That said, I don't think it works out of the box if all you have is a PDF file. The reason is that there is no (universal) concept of paragraphs / lines of text in raw PDF graphics, so you'd have to implement a line detector first. That still requires parsing PDF graphics operators. Perhaps there's a better way if the input document has particularly good tagging, but that's a very unreliable assumption.

Anyway, once you're at the point where you have to parse content streams in a PDF file, I actually think that finding rectangular shapes with a particular colour in the page content is easier than trying to count lines of text.

This is a tricky one for sure....

FernandoJCabral commented 3 years ago

Right. My suggestion is to resort to the LO template (or to the ODT itself) and leave PDF alone. It is too messy to work with for this purpose. Since an ODT is in fact a ZIP archive of XML files, reading it could work too. (I have not tried this approach because I sign the document immediately after finishing it, so I have it open in front of me: I see where I need the signature to be, put the cursor there and get the coordinates. But reading the ODT file or using the template as a reference should work better.)

ofcaah commented 3 years ago

Unfortunately, working from the LO template or the ODT won't help me much, as the end result has to be a PDF file, and working out coordinates for pyHanko from the ODT automatically is definitely beyond my expertise. That being said, I could remove the signature bounding boxes and the signing person's name below each box from the template itself, and leave adding them to pyHanko. Is this something that could be easily accomplished from the command line?

For example, something like pyhanko sign addfields --field 4/41,141,293,178/FieldName=DigSig1/FieldCaption="John Smith"/FieldStyle=box, with an alternative FieldStyle being, say, 'line'.

This would only require making sure that there's enough room for signatures at the bottom of the desired page, and it could possibly be useful in many more cases.

FernandoJCabral commented 3 years ago

Well, I have finally put together a working solution for my own problem that is similar to yours. Similar, but not equal. What I do is to put the cursor where I want the signature to appear (if visible) and call my "signPdf" macro. It grabs the cursor position on the document and calculates the rectangle coordinates where to place the signature. If not visible, it just calls pyhanko with the field name, without any coordinates. The macro starts pyhanko with all command line options.

I've tested it on many different documents and in many different ways. So far, it has worked to my satisfaction.

But, it seems you are running LO in the headless mode, so, figuring out where to place the signature has to resort to a different method. This might be a mark in the document. This way your program could open the document in the background, search for the mark, save the position, remove the mark, generate the PDF, call pyhanko. pyhanko will sign the PDF and place the signed copy where you want it. You will end up with three documents: ODT/PDF/signedPDF (unless you also remove some of them after signing).

The difference between this solution and my solution is that, in my case, signature is placed where the cursor is; in your case, it would be placed where a certain mark is.

My code is in Python. If you want to check it up, I can provide you with the source code. Comments, variable and function names are in Portuguese, but the code is simple enough to be easily understood even without understanding the docstrings and variable names.

EDIT: Perhaps I am wrong about you using the headless mode. If you have the document before your eyes, then you can use the same macro I am using.

ofcaah commented 3 years ago

No, you are not wrong; it's full auto headless. I'm already using python in the pipeline, so extending it a bit shouldn't be a problem. If your script/macro isn't overly complicated, then please do share it.

FernandoJCabral commented 3 years ago

I will share it with you. I'll do it a little bit later because I am busy now and also because I think I should translate into English at least the docstrings and the most important variable names. This will make it easier for you to understand the code. But, rest assured it is quite simple. It has some magic numbers, but on the macro itself I'll explain how I found them and why they are there.

FernandoJCabral commented 3 years ago

Here is the Python macro that converts an ODT file to PDF and signs it using pyhanko. I've added a lot of comments in the hope that they make it easier for you to understand what each step does.

This file is a sample pyhanko.yml. It must be renamed to pyhanko.yml and updated accordingly.

[pyhanko.txt](https://github.com/MatthiasValvekens/pyHanko/files/7262854/pyhanko.txt)

This is the macro file. The txt extension was added because GitHub does not accept a file with the extension .py.

signFile.py.txt

ofcaah commented 3 years ago

Thank you, at first glance it should integrate nicely with what I'm already doing for a headless conversion from template to pdf - I'll just need to figure out finding coordinates of rectangles instead of non-existent mouse cursor, but it shouldn't be that hard.

FernandoJCabral commented 3 years ago

I'd guess that if you are working with the XML file it may be harder. On the other hand, if you are working with the ODT file open in the background, it should be easy to move to the last page, find the last line and place the signature a certain distance below it. If it is on a different page (not the last one), you could place a well-chosen string where you want the signature to be (say: "#PutSignatureHere#"). Then your macro could search for this string and replace it with the visible signature. In this case it will probably be much easier to use the line number to find the vertical position (line number times character height + spacing + top margin + etc.).
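
As a rough illustration of the marker idea on the XML side, here is a stdlib-only sketch that lists which paragraphs of an ODT contain a marker string. It assumes the usual ODT layout (a ZIP archive with a `content.xml` using the OpenDocument `text` namespace); the function name `paragraphs_with_marker` is hypothetical. It only tells you where in the text flow the markers sit and how many there are, not their position on the page.

```python
import zipfile
from typing import BinaryIO, List, Union
from xml.etree import ElementTree

# OpenDocument namespace for text elements such as <text:p>.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def paragraphs_with_marker(
    odt: Union[str, BinaryIO],
    marker: str = "#PutSignatureHere#",
) -> List[int]:
    """Return the indexes (in document order) of paragraphs containing `marker`."""
    with zipfile.ZipFile(odt) as archive:
        root = ElementTree.fromstring(archive.read("content.xml"))

    hits: List[int] = []
    for index, paragraph in enumerate(root.iter(f"{{{TEXT_NS}}}p")):
        # itertext() joins text split across <text:span> children, which
        # word processors tend to introduce in the middle of a string.
        if marker in "".join(paragraph.itertext()):
            hits.append(index)
    return hits
```

The number of hits gives the number of signature fields to create, which covers the "dynamic number of signatures" part; working out page coordinates would still need the open document or a layout estimate.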

ag-gaphp commented 8 months ago

I also have a repo of docs that are made with LibreOffice and need signature fields. I took a slightly different solution to what has been talked about, in case it helps anyone.

LibreOffice also does not mark fields as required for some reason, so this script takes any field name that starts with r_ and sets the flags to be required in the PDF.

  1. Use PyMuPDF to locate all field names that start with sig
  2. Store the rectangle coords and page number in a dict
  3. Remove the field from the PDF and save to a temp file
  4. Open with pyHanko and add new signature fields using the rectangle coords from PyMuPDF

It works really, really well and you can export the LibreOffice doc via command line to keep everything in python.

Here are the two functions. Note that pyHanko and PyMuPDF treat y coordinates differently (PyMuPDF measures from the top of the page, pyHanko from the bottom), so you can't use the Rect coords from PyMuPDF directly; you need to subtract them from the page height so pyHanko places things correctly.

Also, fitz == PyMuPDF

import fitz
from os import remove
from os.path import exists
from shutil import copyfile  # copyfile() is used in convert_fields()
from pyhanko.pdf_utils.incremental_writer import IncrementalPdfFileWriter
from pyhanko.sign.fields import SigFieldSpec, append_signature_field

# store signature field data, and convert required fields, using PyMuPDF
def convert_fields(_old_path : str, _new_path : str) -> dict:
    print("Setting up new file...")
    if exists(_new_path):
        remove(_new_path)

    # make a copy
    copyfile(_old_path, _new_path)

    _boxes = {}
    print("Getting signature field data and removing placeholders...")

    # iterate the pages
    _pn = 0
    _fdoc = fitz.open(_old_path)
    for page in _fdoc:
        # store the page's height for placement
        _page_rect = page.bound()
        _page_height = _page_rect.y1

        # iterate the fields on this page
        for field in page.widgets():
            n = field.field_name            
            print(f"Found field '{n}' on page {_pn}...")

            # if it's a signature, store the dimensions of the box
            if n.startswith("sig") or n.startswith("init"):
                print("...storing info for signature and removing...")
                _type = 1 if n.startswith("sig") else 2
                # PyMuPDF y coords go top-to-bottom, but pyHanko goes bottom-to-top
                # Subtract the y coords from the current page height for pyHanko
                _boxes[n] = {
                    "page": _pn,
                    "type": _type,
                    "box": (
                        field.rect.x0,
                        _page_height-field.rect.y0,
                        field.rect.x1,
                        _page_height-field.rect.y1
                    )
                }

                # rename field and set to read-only in case removal fails
                field.field_name = field.field_name + "_orig"
                field.field_flags = 1
                field.update()
                # mark the field for removal on save
                page.delete_widget(field)
                print("...marked for removal!")

            # if it's a required field, mark it as such the PDF-way
            elif n.startswith("r_"):
                print("...marking as required...")
                field.field_name = n.replace("r_", "")
                field.field_flags = 2
                if field.field_type in [5, 2]:
                    field.field_value = "Off"
                field.update()

        _pn += 1

    # save the document updates
    _fdoc.save(_new_path, garbage=1)
    _fdoc.close()

    # return the box dimensions for add_signatures()
    return _boxes

# add proper PDF signature fields based on placeholder data using pyHanko
def add_signatures(_new_path : str, _boxes : dict) -> bool:
    print("Adding new signature fields to document...")
    try:
        with open(_new_path, 'rb+') as _doc:
            _d = IncrementalPdfFileWriter(_doc, strict=False)
            for name in _boxes.keys():
                _dict = _boxes[name]
                append_signature_field(_d, SigFieldSpec(
                                            sig_field_name=name,
                                            on_page=_dict["page"],
                                            box=_dict["box"]
                                        ))
            _d.write_in_place()
        print("...done!")
        return True
    except Exception as e:
        print(f"...failed to add signature fields.\nERROR: {e}")
        return False
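
The coordinate flip used in `convert_fields()` can be isolated in a small helper, sketched below. The name `fitz_rect_to_pdf_box` is hypothetical, and the sketch assumes an unrotated page whose origin is at (0, 0). Unlike the tuple built above, it returns the corners in lower-left / upper-right order, which should not matter for a PDF rectangle but is easier to read.

```python
from typing import Tuple


def fitz_rect_to_pdf_box(
    x0: float, y0: float, x1: float, y1: float, page_height: float
) -> Tuple[float, float, float, float]:
    """Convert a top-down rectangle (y0 is the top edge) to (ll_x, ll_y, ur_x, ur_y).

    PyMuPDF reports y growing downwards from the top of the page, whereas
    PDF user space (and therefore pyHanko) has y growing upwards.
    """
    return (x0, page_height - y1, x1, page_height - y0)


box = fitz_rect_to_pdf_box(100, 700, 300, 750, page_height=842)
```

For a field whose PyMuPDF rect is (100, 700, 300, 750) on an 842 pt high page, this gives (100, 92, 300, 142).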