ajrcarey / pdfium-render

A high-level idiomatic Rust wrapper around Pdfium, the C++ PDF library used by the Google Chromium project.
https://crates.io/crates/pdfium-render

WASM: Text occasionally gets trailing garbage when building a pdf in memory #171

Open samsieber opened 2 hours ago

samsieber commented 2 hours ago

Details:

I am using pdfium-render to render both normal PDFs and PDFs I generate myself, which I use to produce text overlays in an image-related application. When I do both, I sometimes get extra data at the end of the text I generate.

So, I might call PdfPageTextObject::new(&document, "new test", font, font_size), but if I later call .text() on that PdfPageTextObject, I could get back "new testÿÿK". My current theory is that memory management is going astray somewhere, but I'm not sure.
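To illustrate the kind of failure this theory describes, here is a minimal standalone sketch. It does not use pdfium-render or Pdfium at all; it only assumes (hypothetically) that the text is read back from a UTF-16LE buffer whose terminator is missing, so the reader runs on into stale bytes left over from earlier work. The buffer layout, the stale bytes, and all function names are made up for illustration.

```rust
/// Decodes UTF-16LE bytes up to (not including) the first NUL code unit.
fn decode_until_nul(bytes: &[u8]) -> String {
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .take_while(|&unit| unit != 0)
        .collect();
    String::from_utf16_lossy(&units)
}

/// Decodes exactly `unit_count` UTF-16LE code units, ignoring whatever follows.
fn decode_exact(bytes: &[u8], unit_count: usize) -> String {
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .take(unit_count)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16_lossy(&units)
}

/// Hypothetical memory region: the text, then stale bytes where the
/// terminator should have been, then a NUL further along.
fn memory_with_missing_terminator(text: &str, stale: &[u8]) -> Vec<u8> {
    let mut memory: Vec<u8> = text
        .encode_utf16()
        .flat_map(|unit| unit.to_le_bytes())
        .collect();
    memory.extend_from_slice(stale); // assumed leftovers from earlier use
    memory.extend_from_slice(&[0, 0]); // the first NUL the reader will find
    memory
}

fn main() {
    let text = "new test";
    // Stale bytes chosen so the result resembles the observed garbage.
    let stale = [0xFF, 0x00, 0xFF, 0x00, 0x4B, 0x00];
    let memory = memory_with_missing_terminator(text, &stale);

    // Trusting the terminator picks up the stale bytes.
    println!("until NUL:    '{}'", decode_until_nul(&memory));
    // Trusting the known length does not.
    println!("exact length: '{}'", decode_exact(&memory, text.encode_utf16().count()));
}
```

If something like this is what is happening, the garbage would depend on what the memory held before, which would fit with the problem only appearing after another PDF has been loaded and rendered. That is a guess, not a diagnosis.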

My reproduction repository is a simplified version of the code in my own application. My application targets both wasm and native; I've only tried to reproduce the issue on wasm because that is the only place I've ever seen it crop up.

Here's the core of the code: it creates the text objects and, after adding them all, reads them all back. The text read back sometimes, but not always, differs from the text that was set.

    // Add text line-by-line, handling newlines
    for line in text.lines() {
        log::info!("Creating text line: '{}'", line);
        let mut text_object = PdfPageTextObject::new(&document, line, font, font_size)?;
        text_object.set_fill_color(PdfColor::new(0, 0, 0, 255))?; // Set text color to black

        // Position the text on the page
        text_object.translate(PdfPoints::new(0.0), y_offset)?;
        y_offset -= PdfPoints::new(font_size.value * 1.4); // Adjust `y_offset` with `PdfPoints`

        // Add the text object to the page
        page.objects_mut().add_text_object(text_object)?;
    }
    for obj in page.objects().iter() {
        match &obj {
            PdfPageObject::Text(text) => {
                log::info!("Retrieved text line: {}", text.text());
            }
            PdfPageObject::Path(_) => {}
            PdfPageObject::Image(_) => {}
            PdfPageObject::Shading(_) => {}
            PdfPageObject::XObjectForm(_) => {}
            PdfPageObject::Unsupported(_) => {}
        }
    }

The first loop is where I set the text, and the second loop checks the text. Sometimes the two loops print different sets of strings, but the difference is always that there's more text than I expect.
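Since the mismatch is always extra text at the end, the comparison between the two loops can be automated instead of read off the logs. The following is a minimal standalone sketch: the helper name `trailing_garbage` is made up, and the second sample line is invented sample data, not output from the reproduction repository.

```rust
/// Returns the unexpected suffix when `actual` is `expected` followed by
/// extra characters. Returns `None` when the strings are equal or differ
/// in some other way.
fn trailing_garbage<'a>(expected: &str, actual: &'a str) -> Option<&'a str> {
    match actual.strip_prefix(expected) {
        Some(suffix) if !suffix.is_empty() => Some(suffix),
        _ => None,
    }
}

fn main() {
    // What was passed to PdfPageTextObject::new, line by line.
    let expected = ["new test", "second line"];
    // What .text() returned afterwards (sample data for illustration).
    let retrieved = ["new testÿÿK", "second line"];

    for (want, got) in expected.iter().zip(retrieved.iter()) {
        match trailing_garbage(want, got) {
            Some(extra) => println!("'{}' came back with trailing garbage '{}'", want, extra),
            None => println!("'{}' came back without trailing garbage", want),
        }
    }
}
```

In the real loop, `want` would be each `line` and `got` the result of `text.text()` for the matching page object.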

See the replication repository for more details; it's a modified version of the wasm example from this repository. It loads a normal pdf, renders it, and then tries to build a pdf in memory with text. If I reverse that order, the issue goes away.

samsieber commented 1 hour ago

I have further isolated the issue. It appears to happen when the data is copied over, i.e. when the text object is attached to the page. Here's a new loop I've been using:

    // Add text line-by-line, handling newlines
    for line in text.lines() {
        log::info!("Creating text line: '{}'", line);
        let mut text_object = PdfPageTextObject::new(&document, line, font, font_size)?;
        log::info!("Reading before attaching: '{}'", text_object.text());
        text_object.set_text(line).unwrap();
        log::info!("Reading overwritten before attaching: '{}'", text_object.text());

        text_object.set_fill_color(PdfColor::new(0, 0, 0, 255))?; // Set text color to black

        // Position the text on the page
        text_object.translate(PdfPoints::new(0.0), y_offset)?;
        y_offset -= PdfPoints::new(font_size.value * 1.4); // Adjust `y_offset` with `PdfPoints`

        // Add the text object to the page
        let mut to = page.objects_mut().add_text_object(text_object)?;
        log::info!("Reading after attaching: '{}'", to.as_text_object().unwrap().text());
        to.as_text_object_mut().unwrap().set_text(line).unwrap();
        log::info!("Reading overwritten after attaching: '{}'", to.as_text_object().unwrap().text());
    }

And here's the output for that:

Creating text line: 'new test'
Reading before attaching: ''
Reading overwritten before attaching: ''
Reading after attaching: 'new testÿÿK'
Reading overwritten after attaching: 'new test'

Notably, I can fix the text after noticing that it was added incorrectly: calling set_text() again on the attached object makes .text() return 'new test' as expected.