ajrcarey / pdfium-render

A high-level idiomatic Rust wrapper around Pdfium, the C++ PDF library used by the Google Chromium project.
https://crates.io/crates/pdfium-render

Loading library once for optimum performance #59

Closed hhio618 closed 1 year ago

hhio618 commented 1 year ago

Hi, we are working on a project that needs maximum performance for bulk PDF preview generation over an Android JNI bridge. Currently, we load the library bindings on every call into the library, and we're not sure whether that is efficient. Could you please give me an example that loads the library once and then reuses it on each call? https://github.com/ARK-Builders/ARK-Navigator/pull/271

ajrcarey commented 1 year ago

Hi @hhio618, good to hear from you.

The impact of recreating the library bindings on each call should be minimal (i.e. milliseconds). However, if you want to get it as close to zero as possible, I suggest one of the following approaches:

If you are not currently storing any shared state in your Rust code, then statically binding is the way to go.

I don't have a specific long-form example of this but if you give me an overview of your set of functions (doesn't need to be the entire code), I can write up a sketch for you of how option 2 would look.

ajrcarey commented 1 year ago

PS if you are rendering/manipulating content from the same PDF document over and over again, then there is a performance impact associated with repeatedly opening and closing the PDF document. This will be far more noticeable than loading the library bindings on every call and I would focus your attention on this area. You want to hold onto your PDF objects (document, page, etc.) for as long as you can if you want maximum performance.

hhio618 commented 1 year ago

Hi @ajrcarey, thanks for the info!

I don't have a specific long-form example of this but if you give me an overview of your set of functions (doesn't need to be the entire code), I can write up a sketch for you of how option 2 would look.

Sure, this is what I'm currently doing:

use std::{
    env,
    io::{Read, Seek},
    path::PathBuf,
};

use image::DynamicImage;

use pdfium_render::prelude::*;

pub enum PDFQuality {
    High,
    Medium,
    Low,
}

fn initialize_pdfium() -> Box<dyn PdfiumLibraryBindings> {
    let out_path = env!("OUT_DIR");
    let pdfium_lib_path =
        PathBuf::from(&out_path).join(Pdfium::pdfium_platform_library_name());
    let bindings = Pdfium::bind_to_library(
        #[cfg(target_os = "android")]
        Pdfium::pdfium_platform_library_name_at_path("./"),
        #[cfg(not(target_os = "android"))]
        pdfium_lib_path.to_str().unwrap(),
    )
    .or_else(|_| Pdfium::bind_to_system_library());

    match bindings {
        Ok(binding) => binding,
        Err(e) => {
            panic!("{:?}", e)
        }
    }
}

pub fn render_preview_page<R>(data: R, quality: PDFQuality) -> DynamicImage
where
    R: Read + Seek + 'static,
{
    let render_cfg = PdfBitmapConfig::new();
    let render_cfg = match quality {
        PDFQuality::High => render_cfg.set_target_width(2000),
        PDFQuality::Medium => render_cfg,
        PDFQuality::Low => render_cfg.thumbnail(50),
    }
    .rotate_if_landscape(PdfBitmapRotation::Degrees90, true);
    Pdfium::new(initialize_pdfium())
        .load_pdf_from_reader(data, None)
        .unwrap()
        .pages()
        .get(0)
        .unwrap()
        .get_bitmap_with_config(&render_cfg)
        .unwrap()
        .as_image()
}

ajrcarey commented 1 year ago

Thank you for the sample, that's excellent.

So, the main time cost here is the call to load_pdf_from_reader(), which runs every time your calling code invokes render_preview_page(). Because your PdfiumLibraryBindings are instantiated afresh on every call to render_preview_page(), load_pdf_from_reader() must re-inflate the in-memory representation of the PDF document you want to render. Depending on the size of the document, this can take a noticeable amount of time. My suggestion is to focus on avoiding reloading the document.

Keeping these objects alive in a lazy_static involves more boilerplate, because you must introduce the static and manage it, but it will be faster. How noticeable the gain is depends on the size of your document.

I'm not sure how the rendered bitmap data is transferred from Rust to your calling code, but you want to avoid a copy there if at all possible. (I'm not sure how much control over that you get from JNI.) The larger the bitmap image, the more noticeable the latency introduced by a copy will be.

ajrcarey commented 1 year ago

PS after thinking about it a bit more, you may not even need an initialise_pdfium() function because that will take place automatically during your lazy_static (or once_cell, if you prefer that approach rather than lazy_static) setup.

It's possible lifetimes might be problematic when working with these static initialisers. I am happy to help you with that.
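To make the "initialise automatically on first use" idea concrete, here is a minimal sketch using only the standard library. It assumes Rust 1.70 or later, where std::sync::OnceLock plays the same role as once_cell's OnceCell. The Library struct, the LOAD_COUNT counter, and the library() accessor are hypothetical stand-ins for illustration, not part of pdfium-render.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

// Hypothetical stand-in for an expensive-to-create handle
// (for example, a bound Pdfium instance).
pub struct Library {
    pub name: &'static str,
}

// Counts how many times the expensive setup actually runs.
pub static LOAD_COUNT: AtomicUsize = AtomicUsize::new(0);

// The cell starts empty and is filled on first use.
static LIBRARY: OnceLock<Library> = OnceLock::new();

// Returns the shared handle, running the setup closure on the first call only.
pub fn library() -> &'static Library {
    LIBRARY.get_or_init(|| {
        LOAD_COUNT.fetch_add(1, Ordering::SeqCst);
        Library { name: "pdfium" }
    })
}
```

With this shape there is no separate initialise step to remember: every caller goes through library(), and only the first call pays the setup cost.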

hhio618 commented 1 year ago

PS after thinking about it a bit more, you may not even need an initialise_pdfium() function because that will take place automatically during your lazy_static (or once_cell, if you prefer that approach rather than lazy_static) setup.

It's possible lifetimes might be problematic when working with these static initializers. I am happy to help you with that.

I ran into some problems while trying this. Would you please show me a sample snippet?

ajrcarey commented 1 year ago

Sure. Based on your sample code, I was thinking along the lines of the following:

use image::DynamicImage;
use once_cell::sync::OnceCell;
use pdfium_render::prelude::*;
use std::{
    env,
    io::{Read, Seek},
    path::PathBuf,
};

static PDFIUM: OnceCell<Pdfium> = OnceCell::new(); // static initializers must impl Sync + Send

pub enum PDFQuality {
    High,
    Medium,
    Low,
}

fn initialize_pdfium() {
    let out_path = env!("OUT_DIR");
    let pdfium_lib_path = PathBuf::from(&out_path).join(Pdfium::pdfium_platform_library_name());
    let bindings = Pdfium::bind_to_library(
        #[cfg(target_os = "android")]
        Pdfium::pdfium_platform_library_name_at_path("./"),
        #[cfg(not(target_os = "android"))]
        pdfium_lib_path.to_str().unwrap(),
    )
    .or_else(|_| Pdfium::bind_to_system_library())
    .unwrap();

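    // Instead of returning the bindings, we cache the Pdfium instance in the static cell.
    // set() returns Err if the cell is already filled; ignoring it keeps the first instance.
    let _ = PDFIUM.set(Pdfium::new(bindings));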
}

pub fn render_preview_page<R>(data: R, quality: PDFQuality) -> DynamicImage
where
    R: Read + Seek + 'static,
{
    let render_cfg = PdfBitmapConfig::new();
    let render_cfg = match quality {
        PDFQuality::High => render_cfg.set_target_width(2000),
        PDFQuality::Medium => render_cfg,
        PDFQuality::Low => render_cfg.thumbnail(50),
    }
    .rotate_if_landscape(PdfBitmapRotation::Degrees90, true);

    PDFIUM
        .get() // Retrieves the previously-created Pdfium instance from the static initializer
        .unwrap()
        .load_pdf_from_reader(data, None)
        .unwrap()
        .pages()
        .get(0)
        .unwrap()
        .get_bitmap_with_config(&render_cfg)
        .unwrap()
        .as_image()
}

There's a lot of unwrap()-ping going on here, which isn't great for safety, but for a proof-of-concept I guess it's ok for now.
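As a rough illustration of what replacing the unwrap() calls could look like, here is a small standard-library-only sketch of propagating errors with the ? operator. PreviewError, Document, load(), and first_page_width() are hypothetical stand-ins invented for this example; they are not pdfium-render types or functions.

```rust
use std::fmt;

// Hypothetical error type standing in for whatever a PDF library might return.
#[derive(Debug, PartialEq)]
pub enum PreviewError {
    LoadFailed,
    NoSuchPage(usize),
}

impl fmt::Display for PreviewError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PreviewError::LoadFailed => write!(f, "could not load document"),
            PreviewError::NoSuchPage(i) => write!(f, "document has no page {i}"),
        }
    }
}

impl std::error::Error for PreviewError {}

// Hypothetical stand-in for a loaded document: just a list of page widths.
pub struct Document {
    pages: Vec<u32>,
}

// Pretends to parse a document; empty input is treated as a load failure.
pub fn load(data: &[u8]) -> Result<Document, PreviewError> {
    if data.is_empty() {
        return Err(PreviewError::LoadFailed);
    }
    // Placeholder "parse": each byte becomes the width of one page.
    Ok(Document { pages: data.iter().map(|b| u32::from(*b)).collect() })
}

impl Document {
    // Returns the width of the requested page, or an error if it does not exist.
    pub fn page(&self, index: usize) -> Result<u32, PreviewError> {
        self.pages.get(index).copied().ok_or(PreviewError::NoSuchPage(index))
    }
}

// Each `?` returns early with the error instead of panicking as unwrap() would.
pub fn first_page_width(data: &[u8]) -> Result<u32, PreviewError> {
    let document = load(data)?;
    let width = document.page(0)?;
    Ok(width)
}
```

The caller then decides what a failure means, for example returning a placeholder image or reporting the error back across the bridge.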

Creating a static instance of Pdfium requires that struct to implement both the Sync and Send traits, which it currently does not do. I have made a commit that adds a new feature, sync, that adds this. Set pdfium-render as a git dependency in your Cargo.toml, and activate both the sync and thread_safe features. You should now be able to compile the example above.

I have confirmed that the sample above compiles, which is not quite the same as confirming that it works :) Whether it is safe for the Pdfium struct to implement Sync and Send is open to some debate, since Pdfium itself is not thread-safe. But this is the approach I would take to start with. As long as the thread_safe feature is enabled, pdfium-render marshals calls to Pdfium in a thread-safe manner, even in multi-threaded code, so in theory it should work :)
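For readers unfamiliar with the general idea, one common way to make a non-thread-safe library usable from several threads is to funnel every call through a single global lock. The sketch below shows only that general pattern, using the standard library (Rust 1.63 or later for the const Mutex::new). It is not a description of how pdfium-render implements its thread_safe feature, and Engine, render_serialized(), and render_from_threads() are hypothetical names.

```rust
use std::sync::Mutex;
use std::thread;

// Hypothetical stand-in for a library that must never be entered
// by two threads at the same time.
pub struct Engine {
    calls: u64,
}

impl Engine {
    // Pretends to do some work and returns the running call count.
    pub fn render(&mut self) -> u64 {
        self.calls += 1;
        self.calls
    }
}

// One global lock guards every call into the engine.
static ENGINE: Mutex<Engine> = Mutex::new(Engine { calls: 0 });

// Takes the lock for the duration of a single call.
pub fn render_serialized() -> u64 {
    // unwrap() here panics only if another thread panicked while holding the lock.
    let mut engine = ENGINE.lock().unwrap();
    engine.render()
}

// Spawns several threads and returns how many calls they made in total.
pub fn render_from_threads(threads: usize, per_thread: usize) -> u64 {
    let before = ENGINE.lock().unwrap().calls;
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                for _ in 0..per_thread {
                    render_serialized();
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    ENGINE.lock().unwrap().calls - before
}
```

The trade-off is that calls never overlap, so extra threads do not speed up the library work itself; they only make it safe to call from anywhere.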

If this does work without segfaulting Pdfium, and you are repeatedly reading from the same document in render_preview_page(), then the next step would be to try to get that PdfDocument reference into a static cell as well. This would save you from repeatedly opening and closing your document, which I think is likely to be the biggest source of noticeable performance lag.
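A rough sketch of that next step, again using only the standard library: a process-wide cache that opens each document once and hands out shared references afterwards. Everything here is a hypothetical stand-in (Document, open_document(), cached_document(), and keying the cache by a path string are all assumptions for illustration). With pdfium-render itself, the document borrows from the Pdfium instance, so the lifetimes would need more care than this owned stand-in suggests.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, OnceLock};

// Hypothetical stand-in for a parsed document; parsing is the expensive step.
pub struct Document {
    pub page_count: usize,
}

// Counts how many times a document is actually opened.
pub static OPEN_COUNT: AtomicUsize = AtomicUsize::new(0);

fn open_document(path: &str) -> Document {
    OPEN_COUNT.fetch_add(1, Ordering::SeqCst);
    // Placeholder "parse": derive a page count from the path length.
    Document { page_count: path.len() }
}

// Process-wide cache, created lazily on first use.
static CACHE: OnceLock<Mutex<HashMap<String, Arc<Document>>>> = OnceLock::new();

// Returns the cached document for `path`, opening it only on the first request.
pub fn cached_document(path: &str) -> Arc<Document> {
    let cache = CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut map = cache.lock().unwrap();
    map.entry(path.to_string())
        .or_insert_with(|| Arc::new(open_document(path)))
        .clone()
}
```

A real cache would also need an eviction policy, since holding every document open indefinitely trades memory for speed.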

Your sample code only reads from a document; it doesn't change the document. If you did want to change an existing document, then another way you could improve performance would be to change pdfium-render's default approach to regeneration of content streams. (The default setting is very convenient, but not optimal for performance when making many changes to a document.) But if you're only rendering existing documents, then you won't need to worry about that.

hhio618 commented 1 year ago

@ajrcarey Thanks for your time, this helps us solve our performance problem!

ajrcarey commented 1 year ago

That's great - so it does work, then? Pdfium doesn't segfault or otherwise complain?

hhio618 commented 1 year ago

Works great! I didn't see any problem.

ajrcarey commented 1 year ago

Excellent. I will run some more tests here, but all going well the new sync feature will be released in crate version 0.7.26. If I don't detect any problems in my tests, I may even make it a default feature.

ajrcarey commented 1 year ago

Tests showed no problems with enabling the sync feature by default. Updated README.md. Scheduled for inclusion in crate version 0.7.26.