ajrcarey / pdfium-render

A high-level idiomatic Rust wrapper around Pdfium, the C++ PDF library used by the Google Chromium project.
https://crates.io/crates/pdfium-render

Support for filling form fields programmatically #132

Open liammcdermott opened 10 months ago

liammcdermott commented 10 months ago

I know there are usually better solutions than pdfium-render for this (pdftk for example). However, a convoluted set of circumstances has led me to attempt filling PDF forms using pdfium-render.

Specifically, I have a PDF file that includes a form with text fields, and I am attempting to fill those text fields with values programmatically, then save a copy of the PDF with those fields filled.

Here is what I have so far:

fn pdf_fill_test(pdf_path: &(impl AsRef<Path> + ?Sized)) -> Result<(), Box<dyn std::error::Error>> {
    let pdfium: Pdfium = Pdfium::default();
    let document = pdfium.load_pdf_from_file(pdf_path, None)?;
    let pages = document.pages();

    for page in pages.iter() {
        let page_handle = document.bindings().get_handle_from_page(&page);
        let annotations = page.annotations();
        let mut form_fill_info = annotations.bindings().create_formfillinfo(1);
        let form_fill_info_ptr = &mut form_fill_info as *mut _;
        let b = annotations.bindings();

        // See: https://pdfium.googlesource.com/pdfium/+/refs/heads/main/samples/simple_no_v8.c
        let form_handle = annotations.bindings().FPDFDOC_InitFormFillEnvironment(
            document.bindings().get_handle_from_document(&document),
            form_fill_info_ptr,
        );
        b.FORM_OnAfterLoadPage(page_handle, form_handle);

        let c_int_index = (0..).map(|i| i as c_int);
        for (annotation_index, annotation) in c_int_index.zip(annotations.iter()) {
            if annotation.annotation_type() == PdfPageAnnotationType::Widget {
                let annotation_handle = b.FPDFPage_GetAnnot(page_handle, annotation_index);

                b.FPDFAnnot_SetStringValue_str(
                    annotation_handle,
                    "M",
                    &date_time_to_pdf_string(Utc::now()),
                );
                b.FPDFAnnot_SetStringValue_str(annotation_handle, "V", "TEST");

                b.FPDFPage_CloseAnnot(annotation_handle);
            }
        }

        b.FORM_OnBeforeClosePage(page_handle, form_handle);

        b.FPDFDOC_ExitFormFillEnvironment(form_handle);
        b.FPDFPage_GenerateContent(page_handle);
    }

    document.save_to_file("modified_document.pdf")?;

    Ok(())
}
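The date_time_to_pdf_string() helper called above is not shown in this thread. A minimal sketch of what such a helper might look like follows, assuming the "M" entry takes a PDF date string of the form D:YYYYMMDDHHmmSSZ in UTC. To stay dependency-free it takes a Unix timestamp rather than the Utc::now() value passed in the original call; the names unix_to_pdf_date_string and civil_from_days are hypothetical and not part of pdfium-render.

```rust
/// Converts days since 1970-01-01 into (year, month, day) in the proleptic
/// Gregorian calendar, using the usual "civil from days" arithmetic.
pub fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468; // shift the epoch to 0000-03-01
    let era = z.div_euclid(146_097); // number of whole 400-year cycles
    let doe = z.rem_euclid(146_097); // day within the cycle, 0..=146_096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // 0..=399
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, counted from 1 March
    let mp = (5 * doy + 2) / 153; // month index, where 0 means March
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    // January and February belong to the following civil year.
    let year = yoe + era * 400 + i64::from(month <= 2);
    (year, month, day)
}

/// Formats a Unix timestamp (seconds, UTC) as a PDF date string.
/// Hypothetical stand-in for the thread's date_time_to_pdf_string().
pub fn unix_to_pdf_date_string(unix_secs: i64) -> String {
    let days = unix_secs.div_euclid(86_400);
    let secs = unix_secs.rem_euclid(86_400);
    let (year, month, day) = civil_from_days(days);
    format!(
        "D:{:04}{:02}{:02}{:02}{:02}{:02}Z",
        year,
        month,
        day,
        secs / 3_600,
        (secs % 3_600) / 60,
        secs % 60
    )
}
```

With chrono available, the same string could presumably be produced with a single format call; the manual arithmetic is only here to keep the sketch self-contained.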

Right now this code just fills the value of every widget field with 'TEST', but it does fill the values of the form widgets successfully. However, when opening the PDF, programmatically set values do not appear on the page until the user sets the focus on the form field containing that value. Meaning, if the form field is a text box, the user has to set the focus inside it before the value will appear.

I did some debugging, and found that while this code successfully updates form field values in the obj /Type /Annot, when I checked the corresponding appearance stream, the text and related text drawing commands were absent.

At that point I realised it's time to stop trying to figure this out myself, and get some help. Here are my questions:

  1. Do you have any pointers for getting the appearance streams updated?
  2. Is programmatically filling PDF forms something you would like the idiomatic Rust API to support?
  3. It wasn't my intention, but nearly all my code uses Pdfium bindings and not the idiomatic Rust API of pdfium-render; as you probably already noticed, I forked the library and added a couple of functions from the bindings (FORM_OnBeforeClosePage() for example); so with that in mind, would you be interested in a PR?

Obviously, any code I submitted in a PR would be less messy than the example above! Thank you for your consideration.

liammcdermott commented 10 months ago

The following is some debugging information, in case it's helpful.

# Text form field with text added using Evince prior to loading the PDF file, after pdf_fill_test():
field name `outside_closing_txt`
annotation > as_form_field() > as_text_field() >
appearance_mode_value(PdfAppearanceMode::Normal): Some("/Tx BMC\nq\nBT\n/courier-bold 10.0 Tf 0 g 1 0 0 1 3.00 3.50 Tm \n(TESTY MCTEST) Tj\nET\nQ\nEMC\n")
appearance_stream(): Some("N")

# Text form field that is empty in the PDF file, after pdf_fill_test():
field name `termination_period_txt`
annotation > as_form_field() > as_text_field() >
appearance_mode_value(PdfAppearanceMode::Normal): Some("/Tx BMC\nq\n2 14 m\n172 14 l\n172 1 l\n2 1 l\n2 14 l\nh\nW\nn\nQ\nEMC\n")
appearance_stream(): None
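
The difference between the two streams above can be checked mechanically: the first contains a BT ... ET text object with a Tj operator, the second only draws a clipping rectangle. A rough sketch of such a check follows. The function name draws_text is hypothetical, and the whitespace tokenizer is a simplification that would be fooled by operator names appearing inside a string literal; it also ignores the less common ' and " show-text operators.

```rust
/// Returns true if the content stream appears to draw text, that is, it
/// contains a Tj or TJ show-text operator between a BT and its matching ET.
/// This is a heuristic over whitespace-separated tokens, not a PDF parser.
pub fn draws_text(content: &str) -> bool {
    let mut in_text_object = false;
    for token in content.split_whitespace() {
        match token {
            "BT" => in_text_object = true,
            "ET" => in_text_object = false,
            "Tj" | "TJ" if in_text_object => return true,
            _ => {}
        }
    }
    false
}
```

Run against the two appearance_mode_value() strings above, this should report text for outside_closing_txt and none for termination_period_txt.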

Calling annotation.objects().iter() on outside_closing_txt's annotation, and getting the objects collection's length and type of each object yielded this:

Objects len: 1
Obj: Text

However, for termination_period it yielded this:

Objects len: 0

ajrcarey commented 10 months ago

Hi @liammcdermott , thank you for reporting the issue. I am happy to help you with this. The appearance streams for the form fields likely need to be created manually - assuming Pdfium offers a sufficient interface to do so, I need to check that - and the work could potentially be related to #89, which also touches appearance streams.

liammcdermott commented 10 months ago

Great to hear you'd like to help with this, thanks @ajrcarey!

Regarding whether Pdfium supports appearance streams for form fields: I'm not sure I'm looking in the right place, but there is CPDF_GenerateAP::GenerateFormAP(), is that the interface we're looking for?

ajrcarey commented 10 months ago

That interface is private, unfortunately. We're limited to the public FPDF_* functions. The FPDFAnnot_SetAP() function looks like a promising place to start, although the docs are light on details: https://pdfium.googlesource.com/pdfium/+/refs/heads/main/public/fpdf_annot.h#617

ajrcarey commented 10 months ago

(That said, the GenerateFormAP() function you linked to may give some hints as to how to generate the appearance stream code programmatically, particularly the GenerateEditAP() and GenerateColorAP() functions.)
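
As a rough illustration of what generating the appearance stream code programmatically could involve, the sketch below rebuilds the stream that was observed for outside_closing_txt in the debugging output earlier in this thread. It is an assumption-laden sketch, not Pdfium's algorithm: the function names are hypothetical, the font name, size, and text position are simply passed in, and it only handles text that can be written as an ASCII PDF literal string. Whether a stream built this way is accepted by FPDFAnnot_SetAP() and rendered correctly has not been tested here; a real appearance stream would also need the font to be available in the annotation's resources.

```rust
/// Escapes the characters that are special inside a PDF literal string:
/// both parentheses and the backslash itself.
pub fn escape_pdf_literal(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        if matches!(ch, '(' | ')' | '\\') {
            out.push('\\');
        }
        out.push(ch);
    }
    out
}

/// Builds a text-field appearance stream in the same shape as the one seen
/// in the debugging output above. All parameters are supplied by the caller;
/// nothing here is read from the PDF.
pub fn text_field_appearance(font: &str, size: f32, x: f32, y: f32, text: &str) -> String {
    format!(
        "/Tx BMC\nq\nBT\n/{} {:.1} Tf 0 g 1 0 0 1 {:.2} {:.2} Tm \n({}) Tj\nET\nQ\nEMC\n",
        font,
        size,
        x,
        y,
        escape_pdf_literal(text)
    )
}
```

Calling text_field_appearance("courier-bold", 10.0, 3.0, 3.5, "TESTY MCTEST") reproduces the stream reported for outside_closing_txt.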

liammcdermott commented 10 months ago

I followed the usages of that private GenerateFormAP() function, up until I reached a public interface, and found FPDFPage_TransformAnnots().

Maybe we could:

  1. Clear the existing appearance stream, by passing a null pointer to FPDFAnnot_SetAP()
  2. Call FPDFPage_TransformAnnots(page, PDFMatrix::IDENTITY) to trigger a rebuild of appearance streams for annotations on the page.

Side note: (1) is necessary, since (2) only generates appearance streams for annotations that don't already have them.

This solution relies on Pdfium's internal implementation recreating the appearance streams when they don't exist. However, I'm sanguine about that, since page 678 of the PDF spec says:

If the widget annotation has no appearance dictionary, the viewer application must create one and store it in the annotation dictionary’s AP entry.

The spec is pretty clear: Pdfium must regenerate the appearance stream if it finds there isn't one.

Then we could send a patch upstream, adding a function like FPDFPage_GenerateContent() but for annotations (FPDFPage_GenerateAnnotations()?). Then we won't be relying on implicitly defined behaviour.

What do you think?

liammcdermott commented 10 months ago

Huh, I just did an experiment, adding this to my code above, just before the call to FPDFPage_CloseAnnot() and it works:

b.FPDFAnnot_SetAP(annotation_handle, PdfAppearanceMode::Normal as i32, null());

By 'it works' I mean, when I open the filled PDF in Chrome or Evince, the form fields are filled out. So, thanks to you pointing out FPDFAnnot_SetAP() I might actually meet this work deadline! Thank you so much @ajrcarey!

(I still want to make a PR for this)

ajrcarey commented 10 months ago

Superb work. Did you need to use FPDFPage_TransformAnnots() in the end?

To answer your earlier question: yes, mutation of form field values would be a valuable addition to pdfium-render, presumably by adding some functionality to the PdfPageAnnotationCommon and PdfPageAnnotationPrivate traits. I'm assuming it's the call to FPDFAnnot_SetStringValue_str() with a key of "V" in your sample that actually sets the form field value?

I'm curious as to whether the FORM_*() function calls are actually necessary, or if your successful experiment still works without them. If you wanted to submit a PR to add bindings for the new FORM_*() functions, that'd be swell. I am happy to work on the trait implementations, unless you especially wanted to; but given you already have a solution using raw FPDF_* functions, I certainly wouldn't expect you to rewrite it at this point.

liammcdermott commented 10 months ago

> Superb work. Did you need to use FPDFPage_TransformAnnots() in the end?

Nope! I just needed to add that one line. It's something of a hack, since AFAICT it leads to the annotations having no appearance streams in the resulting PDF. That forces clients (like Chrome and Evince) to generate the appearance streams themselves.

I'm thinking a better implementation would be triggering Pdfium to regenerate the appearance streams.

> To answer your earlier question: yes, mutation of form field values would be a valuable addition to pdfium-render, presumably by adding some functionality to the PdfPageAnnotationCommon and PdfPageAnnotationPrivate traits. I'm assuming it's the call to FPDFAnnot_SetStringValue_str() with a key of "V" in your sample that actually sets the form field value?

That's great to hear. Those are the traits I was looking at and taking ideas from, so yes, I'm assuming the functionality should be added to them. FPDFAnnot_SetStringValue_str() with a key of "V" is indeed what sets the value of the form fields.

> I'm curious as to whether the FORM_*() function calls are actually necessary, or if your successful experiment still works without them. If you wanted to submit a PR to add bindings for the new FORM_*() functions, that'd be swell. I am happy to work on the trait implementations, unless you especially wanted to; but given you already have a solution using raw FPDF_* functions, I certainly wouldn't expect you to rewrite it at this point.

I suspect the FORM_*() calls aren't strictly necessary; I just noticed them in Pdfium's sample code (https://pdfium.googlesource.com/pdfium/+/refs/heads/main/samples/simple_no_v8.c and https://pdfium.googlesource.com/pdfium/+/refs/heads/main/samples/pdfium_test.cc#1633), although I'm really not sure what's going on in much of that code.

If you don't mind, I'd like to submit PRs for both bindings and trait updates. The only reason I used the bindings directly was to avoid forking pdfium-render (eventually I failed even at that!), and while working on the problem I got a good feel for how mutation of form field values could be added to the traits.

If I can't get both done by the end of next week, I'll pass it back to you. Does that sound okay? Let me know if you have any particular requirements for the implementation of this (beyond the usual, like code style and whatnot).

ajrcarey commented 10 months ago

Many thanks. By all means, take as much time as you like. I'm not in any hurry and could not realistically start work on this for a couple of weeks anyway.

liammcdermott commented 9 months ago

Well I said 'by the end of next week', but work had other plans. My hope is to work on it this weekend.

BTW: I did meet the deadline, and the form filling is working well in testing, so I'm hoping this will be a good addition. However, there is one major caveat: my itch was text fields, and that's what I've scratched. I'd love to implement filling check boxes and so forth, but that will have to come later (AFAICT it's not that different from filling text fields).

ajrcarey commented 9 months ago

Great that you met your deadline. Text boxes only is perfectly fine; I'm happy to implement the rest based on your template.

ajrcarey commented 8 months ago

Hi @liammcdermott , any updates on this?

liammcdermott commented 8 months ago

@ajrcarey I have made a start, and have blocked off some more time tomorrow to work on this. I'll let you know how I get on!

liammcdermott commented 8 months ago

Made good progress on this today. I should have a PR for you sometime tomorrow @ajrcarey

ajrcarey commented 8 months ago

Merged pull request. Made some small adjustments to doc comments and imports. Updated README. Added new examples/fill_form_field.rs example. Began work on applying same basic approach to filling checkbox and radio button form fields.