J-F-Liu / lopdf

A Rust library for PDF document manipulation.
MIT License
1.67k stars 176 forks

Insert image and flatten form #268

Open vekonylaszlo opened 9 months ago

vekonylaszlo commented 9 months ago

Hey there,

I've been digging into the docs and GitHub discussions but I'm a bit stuck. I'm trying to figure out how to add an image from memory to a specific rect in a PDF and then flatten the form. Any pointers on how to tackle this would be awesome.

For reference, I'm currently trying this:

#[derive(Debug, Clone, Default)]
pub struct Rectangle {
    left: f32,
    bottom: f32,
    width: f32,
    height: f32,
}

pub fn add_image_on_coordinates(&mut self, coord: Rectangle) {
    let image_path = r#"barcode.png"#;
    let pages = self.document.get_pages();
    let page_obj_id = pages.iter().nth(0);
    let stream_dictionary = dictionary! {};
    let (mut x, mut y) = (0.0, 0.0);
    if let Some(page_oid) = page_obj_id {
        if let Ok(stream) = xobject::image(image_path) {
            self.document
                .insert_image(*page_oid.1, stream, (x + 10., y + 10.), (50., 50.))
                .unwrap();
            self.document.save("output.pdf").expect("should have saved");
        }
    }
}

But the PDF is corrupted after save. Thanks in advance for any help!

chriskyndrid commented 4 months ago

Flattening is a bit complicated. You need to identify all the form fields:

let catalog = document
    .trailer
    .get(b"Root")
    .and_then(|obj| obj.as_reference())?;
let catalog_dict = document.get_object(catalog)?.as_dict()?;
let acroform = catalog_dict
    .get(b"AcroForm")
    .and_then(|obj| obj.as_reference())?;
let acroform_dict = document.get_object(acroform)?.as_dict()?;
let fields_list = acroform_dict.get(b"Fields")?.as_array()?;

You then need to iterate over the fields, and potentially their Kids, and extract the coordinates of each field's bounding box. Then you need to write your text into those bounding boxes. You will also need to calculate the available space within each bounding box and measure the width of your words in pixels to determine where the next line has to start (also paying attention to the height of the words in pixels), and then truncate, downsize (reduce the font size), etc., depending on how the text fits in your boxes and on your desired use case.
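To make the fitting step a bit more concrete, here is a minimal sketch of wrapping text into a box and shrinking the font until it fits. It is not taken from my library or from lopdf: `FittedText`, `text_width`, `wrap` and `fit_text` are names made up for illustration, the 0.5 em per glyph width and the 1.2 line-height factor are placeholder assumptions, and real code would read per-glyph advances from the font's metrics.

```rust
/// Result of fitting text into a box: the chosen font size and the wrapped lines.
#[derive(Debug, PartialEq)]
pub struct FittedText {
    pub font_size: f32,
    pub lines: Vec<String>,
}

/// ASSUMPTION: every glyph is 0.5 em wide. Real code would sum the
/// per-glyph advance widths from the font's metrics instead.
fn text_width(text: &str, font_size: f32) -> f32 {
    text.chars().count() as f32 * font_size * 0.5
}

/// Greedy word wrap: start a new line whenever the next word would overflow.
/// A single word that is wider than the box still gets a line of its own.
fn wrap(text: &str, max_width: f32, font_size: f32) -> Vec<String> {
    let mut lines = Vec::new();
    let mut current = String::new();
    for word in text.split_whitespace() {
        let candidate = if current.is_empty() {
            word.to_string()
        } else {
            format!("{} {}", current, word)
        };
        if current.is_empty() || text_width(&candidate, font_size) <= max_width {
            current = candidate;
        } else {
            lines.push(std::mem::replace(&mut current, word.to_string()));
        }
    }
    if !current.is_empty() {
        lines.push(current);
    }
    lines
}

/// Shrink the font in 0.5 steps until the wrapped text fits the box.
/// Returns None if it does not fit even at `min_size`, in which case the
/// caller has to truncate or handle the overflow some other way.
pub fn fit_text(
    text: &str,
    box_w: f32,
    box_h: f32,
    start_size: f32,
    min_size: f32,
) -> Option<FittedText> {
    let mut size = start_size;
    while size >= min_size {
        let lines = wrap(text, box_w, size);
        let line_height = size * 1.2; // ASSUMPTION: fixed leading factor
        let fits_w = lines.iter().all(|l| text_width(l, size) <= box_w);
        let fits_h = lines.len() as f32 * line_height <= box_h;
        if fits_w && fits_h {
            return Some(FittedText { font_size: size, lines });
        }
        size -= 0.5;
    }
    None
}
```

You would call something like `fit_text(value, rect_width, rect_height, 12.0, 6.0)` once per field, with the width and height taken from the field's bounding box, and then emit one text-showing operation per returned line.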

I just spent the last week putting together a fully functioning library for my codebase on top of lopdf. It is designed to generate documents from templates we design, map data from our system onto the forms, flatten them (technically, if we flatten, we don't fill the forms first; we just extract the coordinates and render the content), and then optimize the document, which includes deduplicating fonts, moving all text objects into XObjects, etc., to reduce the size of the resulting PDF. Many of our use cases involve duplicating document templates and merging them (say, a 10-page invoice), and merges don't inherently include optimizations with this library. That said, after I wrote a bunch of code to parallelize the form fill, flattening, and optimization components, lopdf under the hood is very fast.
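As a rough illustration of the deduplication idea (again not the code from my library), one simple approach is to group objects by their raw bytes and remap every duplicate to the first copy seen. The `(u32, Vec<u8>)` pairs and the name `dedup_by_content` are hypothetical stand-ins for object ids and font stream contents; a real pass would also have to rewrite the references in the document and drop the orphaned objects.

```rust
use std::collections::HashMap;

/// Maps each duplicate object id to the id of the first object that has
/// byte-identical content. Objects that are already unique do not appear
/// in the returned map.
pub fn dedup_by_content(fonts: &[(u32, Vec<u8>)]) -> HashMap<u32, u32> {
    // Content -> id of the first object seen with that content.
    let mut first_seen: HashMap<&[u8], u32> = HashMap::new();
    let mut remap = HashMap::new();
    for (id, bytes) in fonts {
        let canonical = *first_seen.entry(bytes.as_slice()).or_insert(*id);
        if canonical != *id {
            remap.insert(*id, canonical);
        }
    }
    remap
}
```

Comparing full byte contents keeps the sketch simple; for large documents, hashing the contents first would avoid holding every stream as a map key.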

For a 1200-page PDF with over 10,000 fields mapped, fully optimized, with lots of vector graphics, etc., it takes about 3 seconds to generate on my development machine, including all the logic to go back and forth from our API (actix driven). The optimization process has the largest impact in production, but it is vital for us.

Although this crate is low level, I'm pretty impressed with its speed at parsing and manipulating PDFs.