ajrcarey / pdfium-render

A high-level idiomatic Rust wrapper around Pdfium, the C++ PDF library used by the Google Chromium project.
https://crates.io/crates/pdfium-render

Add text objects editing functionality #17

Closed AbdesamedBendjeddou closed 2 years ago

AbdesamedBendjeddou commented 2 years ago

IMO the easiest way to edit a text object in a PDF file would be to edit the raw object directly:

BT /F13 12 Tf 288 720 Td (ABC) Tj ET

As a side effect, we could easily edit many other properties of the text object. However, I don't know how easy this would be to implement, or whether it's even possible, especially since I don't fully understand how document-level objects, cross-reference tables, and other things I may not know about relate to the content of the text object.

ajrcarey commented 2 years ago

Hi Abdesamed, yes, I'm guessing that's what Pdfium does under the hood. However, when working with Pdfium you do not edit the raw objects directly, but instead use Pdfium's collection of FPDF_* functions to create, read, or update PDF objects in a more controlled manner. pdfium-render builds on this further by providing a more idiomatic Rust interface on top of the FPDF_* functions provided by Pdfium, although you don't have to use that interface; you can access the FPDF_* functions directly if you prefer.

Currently it's easy enough to read the text from objects on a page, using something like:

Pdfium::new(bindings)
    .load_pdf_from_file("test/text-test.pdf", None)
    .unwrap()
    .pages()
    .iter()
    .for_each(|page| {
        page.objects()
            .iter()
            .filter_map(|object| object.as_text_object())
            .map(|object| object.text())
            // Use for_each rather than map here: iterator adapters are lazy,
            // so a trailing map would never run its closure.
            .for_each(|text| {
                // ... Do something with the text in the object
            });
    });

This iterates over every PdfPageTextObject in every PdfPage, extracting the text from each using the PdfPageTextObject::text() getter function.

To edit the text, I suggest adding a PdfPageTextObject::set_text() function. This would call a Pdfium FPDF_* function to update the text in place. You would need to save the updated in-memory structure to a new file to persist the changes.

Did you just want to update the text, or other object properties (render mode, font size, font face) as well? Are you after a simple search-and-replace function, or something more complex?

AbdesamedBendjeddou commented 2 years ago

Yes, I want to be able to update some properties like boldness, but I don't know if this is possible within the same object. For example, can the same text object have some bold words and other words that are regular? If not, then we'll need a way to replace one text object with multiple text objects, and I don't know if that is feasible.

ajrcarey commented 2 years ago

I'm not an expert, but the impression I have from reading the PDF Reference Manual is that each text object has a single font specification. The size of the text fragment (specified by TJ) can be adjusted because the text rendering matrix can be used to apply a scaling effect, but the weight of the font - i.e. whether it is bold or not - is a property of the font, not the text, so cannot change within a single text object.

It certainly seems that's what the Pdfium authors thought as well, because each page text object is linked to a single font specification. If you want to edit some text and change the font at the same time (even if it's just the font weight you are changing), I think you'll need to split the text object into two and position them appropriately.

I don't think it's necessarily a problem to add new text objects to a page - certainly Pdfium lets you do that - but you might find it fiddly to position them correctly, that's all.
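
Since each text object carries a single font, mixed formatting means one text object per run of same-weight words. A minimal sketch of that grouping step, using plain Rust types rather than anything from pdfium-render (the `Run` struct and `group_into_runs` function are hypothetical names, for illustration only):

```rust
/// One run of consecutive words that share a font weight.
/// Hypothetical type for illustration; not part of pdfium-render.
#[derive(Debug, PartialEq)]
struct Run {
    text: String,
    bold: bool,
}

/// Groups consecutive words with the same weight into runs.
/// Each resulting run would become one text object on the page.
fn group_into_runs(words: &[(&str, bool)]) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();

    for &(word, bold) in words {
        match runs.last_mut() {
            // Same weight as the previous word: extend the current run.
            Some(run) if run.bold == bold => {
                run.text.push(' ');
                run.text.push_str(word);
            }
            // Weight changed (or this is the first word): start a new run.
            _ => runs.push(Run {
                text: word.to_string(),
                bold,
            }),
        }
    }

    runs
}
```

For instance, the input `[("Hello", true), ("big", false), ("World!", false)]` yields two runs: a bold "Hello" and a regular "big World!". Positioning the runs next to each other is a separate step.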

ajrcarey commented 2 years ago

(For reference, the Pdfium functions for editing/creating page text objects are FPDFPageObj_CreateTextObj() and FPDFText_SetText(), if you want to look at them. Documentation isn't great outside just reading the source code at https://pdfium.googlesource.com/pdfium/+/HEAD/public/fpdf_edit.h.)

AbdesamedBendjeddou commented 2 years ago

It seems positioning the text will be a problem, like you said. Let's say we had the object

BT /F13 12 Tf 288 720 Td (Hello World!) Tj ET

Splitting it into a bold word and a regular word will give us

BT /F1 12 Tf 288 720 Td (Hello) Tj ET

and

BT /F13 12 Tf X Y Td (World!) Tj ET

Now we need to figure out how to calculate the new Td operands, taking into account the text size and font.
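
One way to estimate X is to sum the advance widths of the glyphs in the first word, scale by the font size, and add the width of a space. The sketch below assumes glyph widths are expressed in thousandths of the font size; the width table is an illustrative placeholder rather than data read from a real font, and the function names are hypothetical:

```rust
/// Advance width of a glyph in thousandths of the font size.
/// These values are illustrative placeholders, not read from a font file;
/// a real implementation would look widths up in the font's own metrics.
fn glyph_width_milli(c: char) -> f32 {
    match c {
        ' ' => 278.0,
        'H' => 722.0,
        'e' => 556.0,
        'l' => 278.0,
        'o' => 611.0,
        _ => 600.0, // fallback for glyphs missing from this toy table
    }
}

/// Width of `text` in points at the given font size.
fn text_width(text: &str, font_size: f32) -> f32 {
    text.chars().map(glyph_width_milli).sum::<f32>() * font_size / 1000.0
}

/// X coordinate for the object that follows `first`, separated by one space.
fn next_x(start_x: f32, first: &str, font_size: f32) -> f32 {
    start_x + text_width(first, font_size) + text_width(" ", font_size)
}
```

With these placeholder widths, `next_x(288.0, "Hello", 12.0)` comes to roughly 320.68, which would be the X operand of the second object. Y would stay at 720, since both words sit on the same baseline.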

ajrcarey commented 2 years ago

Once a text object is committed to a page, it's possible to retrieve its computed size and boundaries. So aligning one text object next to another shouldn't be too difficult, it's just a bit more work than having everything in a single text object.

Let me make some progress on adding new objects and editing the text of existing objects. There's a bit of infrastructure to build in terms of making pages and documents mutable (everything is read-only at the moment), and this week is a busy week on other projects, so give me a few days to work on that.

AbdesamedBendjeddou commented 2 years ago

Okay. In the meantime I will tinker more with your crate to become more familiar with it, and probably read the PDF specification in more depth. Can you point me to how I can retrieve the size and boundaries of an object? I'd like to make some progress on my project while waiting for you to finish your work.

ajrcarey commented 2 years ago

Sure, you want the PdfPageObjectCommon::bounds() trait function:

https://docs.rs/pdfium-render/0.6.0/pdfium_render/page_object/trait.PdfPageObjectCommon.html

It's available on every PdfPageObject. See https://github.com/ajrcarey/pdfium-render/blob/master/examples/objects.rs for an example: after line 30, you can add the following line to output the bounds of each object:

println!("bounds: {:#?}", object.bounds());

AbdesamedBendjeddou commented 2 years ago

Thanks, I very much appreciate your help.

ajrcarey commented 2 years ago

No problem. I'm nearly done with the initial editing infrastructure. Can you provide a one-page PDF sample that demonstrates what you're trying to achieve? It would help me design the functions that can best fit what you're trying to do.

Pdfium does not provide functions for changing the font or font size of a text object after it's initially created, but I think it's possible to provide some functions that handle cloning the source object with updated properties, removing the original from the underlying page, and adding the cloned replacement to the page at the original position.
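
As a rough sketch of that clone-and-replace idea, with plain Rust types standing in for Pdfium's objects (the `TextObject` struct and `replace_with_restyled` function are hypothetical, and the page is modelled as a simple `Vec`):

```rust
use std::mem;

/// Simplified stand-in for a page text object.
/// Hypothetical type for illustration; not pdfium-render's own.
#[derive(Debug, Clone, PartialEq)]
struct TextObject {
    text: String,
    font: String,
    font_size: f32,
    x: f32,
    y: f32,
}

/// Replaces the object at `index` with a copy that uses a different font and
/// size, keeping the text and position. Returns the removed original, or
/// `None` if `index` is out of range.
fn replace_with_restyled(
    page: &mut Vec<TextObject>,
    index: usize,
    font: &str,
    font_size: f32,
) -> Option<TextObject> {
    // Clone the source object, overriding only the font properties.
    let replacement = TextObject {
        font: font.to_string(),
        font_size,
        ..page.get(index)?.clone()
    };

    // Put the clone in the original's slot and hand back the original.
    Some(mem::replace(&mut page[index], replacement))
}
```

The caller never mutates an object in place: it ends up with a new object at the original position and gets the untouched original back.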

AbdesamedBendjeddou commented 2 years ago

Here is the PDF you requested. The first paragraph is what the original PDF would be, the second paragraph shows what the edited PDF should look like. Let me know if you need anything else.

sample.pdf

ajrcarey commented 2 years ago

Thanks for that. So it's literally just the formatting you want to change, then?

Running the examples/objects.rs example across this file gives us a report on how the text objects are defined in the file. We can see there's at least one text object for each line of text, and then where the font weight changes, there's a change in text object. So this confirms our earlier theory that each text object is limited to a single font setting which includes the weight.

The old iText library, from which the current OpenPDF Java library is derived, includes the concept of a Paragraph object which can wrap multiple elements (including individual text objects), all of which are positioned and manipulated as a group rather than individually. I think we'll need to have something similar to make your job as simple as possible. The basic idea would be:

My proposal is to publish all the infrastructure stuff, including all the mutable bindings, as crate version 0.7.0, then add the Paragraph feature for you in version 0.7.1. I think it's realistic to get it all done in the next week or so if you can wait that long.

AbdesamedBendjeddou commented 2 years ago

I was thinking of:

but I think your idea is better, it makes things really simple. Please take your time. I'm here if you want help testing the new features

ajrcarey commented 2 years ago

Yes, that approach sounds like it would work. You'd have to position everything manually, that's all.

I'm going to try to get 0.7.0 out some time tomorrow (Monday), and that should include everything you need for your approach, if you want to try it. In the meantime, I'll work on 0.7.1.

Pdfium lets you mutate objects willy-nilly once you have pointers to them. It's a big reason, I think, why Pdfium isn't thread-safe in its current form and I'd like for pdfium-render's high-level interface to try to avoid that. So I am trying to adopt an approach where the high-level interface doesn't let you mutate any existing page object - you can only set a page object's properties once, at the time you create it. Once a page object has been created, you can read its properties and add it to a page and remove it from a page, but otherwise it's immutable. It's stricter but it is much easier to reason about, I think.

It shouldn't affect your approach too much, it just means instead of applying changes to any existing text object, you'd have to create new ones. You may have been going to do that anyway.
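
To illustrate the set-once design in general terms, here is a small sketch in plain Rust. It is not pdfium-render's actual interface; the module, type, and method names are hypothetical. Keeping the fields private and exposing only getters means any "edit" has to produce a new object:

```rust
mod page_object {
    /// Text object whose properties are fixed at creation.
    /// Hypothetical sketch; not pdfium-render's real type.
    #[derive(Debug)]
    pub struct TextObject {
        text: String,
        font_size: f32,
    }

    impl TextObject {
        /// The only place where properties can be set.
        pub fn new(text: &str, font_size: f32) -> Self {
            TextObject {
                text: text.to_string(),
                font_size,
            }
        }

        /// Read-only access to the text.
        pub fn text(&self) -> &str {
            &self.text
        }

        /// Read-only access to the font size.
        pub fn font_size(&self) -> f32 {
            self.font_size
        }

        /// "Editing" builds a new object and leaves this one untouched.
        pub fn with_text(&self, text: &str) -> Self {
            TextObject::new(text, self.font_size)
        }
    }
}
```

Code outside the module cannot assign to `text` or `font_size` directly, so the compiler enforces the rule rather than relying on convention.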

ajrcarey commented 2 years ago

Ok, I'm going to work a bit more on the documentation before I publish a new crate version, but I've pushed a commit now that should let you make some progress. You can access it by referencing pdfium-render as a git dependency in your Cargo.toml.

Here's some sample code to get you underway. It just alternates the font of every word - you want something a bit more specific than that, I think. But it should give you the general idea.

use pdfium_render::prelude::*;
use std::fs::File;

fn main() {
    let bindings = Pdfium::bind_to_library(Pdfium::pdfium_platform_library_name_at_path("./"))
        .or_else(|_| Pdfium::bind_to_system_library());

    match bindings {
        Ok(bindings) => {
            let pdfium = Pdfium::new(bindings);

            // Create a new blank document containing a single page with the desired sample text.

            let document = pdfium.create_new_pdf().unwrap();

            let mut pages = document.pages();

            let mut page = pages.create_page_at_start(PdfPagePaperSize::a4()).unwrap();

            // Desired sample text:

            let sample_text = vec!(
                "TITLE HERE",
                "",
                "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum",
                "has been the industry's standard dummy text ever since the 1500s, when an unknown",
                "printer took a galley of type and scrambled it to make a type specimen book. It has",
                "survived not only five centuries, but also the leap into electronic typesetting, remaining",
                "essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets",
                "containing Lorem Ipsum passages, and more recently with desktop publishing software like",
                "Aldus PageMaker including versions of Lorem Ipsum.",
            );

            // Add the sample text to the page. We create a separate text object
            // for each line in the sample text.

            let regular_font = PdfFont::helvetica(&document);

            let bold_font = PdfFont::helvetica_bold(&document);

            let font_size = PdfPoints::new(12.0);

            let line_spacing = font_size * 1.5;

            let line_left = PdfPoints::new(50.0);

            let mut line_top = PdfPoints::new(700.0);

            for (index, line) in sample_text.iter().enumerate() {
                let font = {
                    // Make the first line bold, all other lines regular.

                    if index == 0 {
                        &bold_font
                    } else {
                        &regular_font
                    }
                };

                page.objects_mut()
                    .create_text_object(line_left, line_top, line, font, font_size)
                    .unwrap();

                line_top -= line_spacing;
            }

            // Save the initial state to a file.

            document
                .save_to_writer(File::create("before.pdf").unwrap())
                .unwrap();

            // Now we transform the text objects on the page. We retrieve all objects on the page,
            // filtering out just the text objects. For each retrieved text object we split the
            // text into separate words, then iterate through each word, creating a separate text
            // object for each and applying bold font to alternate words. We place the newly
            // created text objects beneath the original set of text objects.

            // We cannot iterate over the text objects in the page while simultaneously holding
            // a mutable reference to that page. Instead, we build a list of new text objects
            // separately, then add them all to the page in a single operation.

            let mut new_objects = Vec::new();

            let mut bold = true;

            let word_separation = PdfPoints::new(4.0);

            for object in page.objects().iter() {
                if let Some(line) = object.as_text_object() {
                    let line_left = object.bounds().unwrap().left;

                    let line_top = object.bounds().unwrap().top - PdfPoints::new(400.0);

                    let mut word_left = line_left;

                    line.text()
                        .split(" ")
                        .map(|word| {
                            // Create a new text object for this word.

                            let mut object = PdfPageTextObject::new(
                                &document,
                                word,
                                if bold { &bold_font } else { &regular_font },
                                font_size,
                            )
                            .unwrap();

                            // Set the position of this word.

                            object.translate(word_left, line_top);

                            // Set the start position of the next word.

                            word_left += object.width().unwrap() + word_separation;

                            // Switch the bold status for the next word.

                            bold = !bold;

                            object
                        })
                        .for_each(|object| new_objects.push(object));
                }
            }

            // Add all new objects to the page.

            for new_object in new_objects.drain(..) {
                page.objects_mut().add_text_object(new_object).unwrap();
            }

            // Save the result to a file.

            document
                .save_to_writer(File::create("after.pdf").unwrap())
                .unwrap();
        }
        Err(err) => eprintln!("Error loading pdfium library: {:#?}", err),
    }
}

It creates both a before.pdf and an after.pdf. Attached is an example of the after.pdf.

after.pdf

AbdesamedBendjeddou commented 2 years ago

Hi, I really can't thank you enough; I think this is all I need. I'm testing it now. By the way, there is a small error when compiling the library:

src\utils.rs, line 130: mismatched types, expected u32, found u64

ajrcarey commented 2 years ago

No problem, happy to help!

Regarding that error: can I ask what architecture you're compiling on?

I noticed the same thing when compiling to WASM. (I'm working on the WASM bindings at the moment.)

I think you can fix it for all architectures by changing line 127 to this:

let content_length = reader.seek(SeekFrom::End(0)).unwrap_or(0) as c_ulong;
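
The cast matters because `Seek::seek()` returns a `u64`, while `c_ulong` is only 32 bits wide on some targets (Windows and wasm32 among them). A standalone sketch of the same idea follows; `content_length` and `content_length_checked` are hypothetical helper names, and the checked variant is an alternative for cases where silent truncation of very large lengths would be a concern:

```rust
use std::convert::TryFrom;
use std::io::{Seek, SeekFrom};
use std::os::raw::c_ulong;

/// Returns the length of a seekable stream as a C `unsigned long`.
fn content_length<R: Seek>(reader: &mut R) -> c_ulong {
    // `seek` returns a `u64`. `c_ulong` may be `u32` or `u64` depending on
    // the target, so the explicit cast compiles everywhere. Note that `as`
    // truncates silently if the value does not fit.
    reader.seek(SeekFrom::End(0)).unwrap_or(0) as c_ulong
}

/// Checked alternative: returns `None` on a seek error or if the length
/// does not fit into `c_ulong` on the current target.
fn content_length_checked<R: Seek>(reader: &mut R) -> Option<c_ulong> {
    let len = reader.seek(SeekFrom::End(0)).ok()?;
    c_ulong::try_from(len).ok()
}
```

For example, with a `std::io::Cursor` over ten bytes, both helpers report a length of 10.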

AbdesamedBendjeddou commented 2 years ago

The fix worked, thanks. I'm compiling on an x86-64 machine.

ajrcarey commented 2 years ago

Updated and reviewed generated documentation. Updated READMEs. Updated examples. Added WASM binding implementations.

ajrcarey commented 2 years ago

Waiting for completion of #8 before publishing new crate version 0.7.0.

ajrcarey commented 2 years ago

Bumped crate version to 0.7.0 and published to crates.io.