ajrcarey / pdfium-render

A high-level idiomatic Rust wrapper around Pdfium, the C++ PDF library used by the Google Chromium project.
https://crates.io/crates/pdfium-render

Idiomatic Rust bindings for Pdfium

pdfium-render provides an idiomatic high-level Rust interface to Pdfium, the C++ PDF library used by the Google Chromium project. Pdfium can render pages in PDF files to bitmaps, load, edit, and extract text and images from existing PDF files, and create new PDF files from scratch.

    use pdfium_render::prelude::*;
    use std::path::Path;

    fn export_pdf_to_jpegs(path: &impl AsRef<Path>, password: Option<&str>) -> Result<(), PdfiumError> {
        // Renders each page in the PDF file at the given path to a separate JPEG file.

        // Bind to a Pdfium library in the same directory as our Rust executable.
        // See the "Dynamic linking" section below.

        let pdfium = Pdfium::default();

        // Load the document from the given path...

        let document = pdfium.load_pdf_from_file(path, password)?;

        // ... set rendering options that will be applied to all pages...

        let render_config = PdfRenderConfig::new()
            .set_target_width(2000)
            .set_maximum_height(2000)
            .rotate_if_landscape(PdfPageRenderRotation::Degrees90, true);

        // ... then render each page to a bitmap image, saving each image to a JPEG file.

        for (index, page) in document.pages().iter().enumerate() {
            page.render_with_config(&render_config)?
                .as_image() // Renders this page to an image::DynamicImage...
                .into_rgb8() // ... then converts it to an image::RgbImage...
                .save_with_format(
                    format!("test-page-{}.jpg", index),
                    image::ImageFormat::Jpeg
                ) // ... and saves it to a file.
                .map_err(|_| PdfiumError::ImageError)?;
        }

        Ok(())
    }

pdfium-render binds to a Pdfium library at run-time, allowing for flexible selection of system-provided or bundled Pdfium libraries and providing idiomatic Rust error handling in situations where a Pdfium library is not available. A key advantage of binding to Pdfium at run-time rather than compile-time is that a Rust application using pdfium-render can be compiled to WASM for running in a browser alongside a WASM-packaged build of Pdfium.

Examples

Short, commented examples that demonstrate all the major Pdfium document handling features are available at https://github.com/ajrcarey/pdfium-render/tree/master/examples. These examples demonstrate:

What's new

Note: upcoming release 0.9.0 will remove all deprecated items. For a complete list of deprecated items, see https://github.com/ajrcarey/pdfium-render/issues/36.

Release 0.8.26 relaxes the minimum supported Rust version to 1.61 based on user feedback, increments the pdfium_latest feature to pdfium_6721 to match new Pdfium release 6721 at https://github.com/bblanchon/pdfium-binaries, adds new crate features image_025, image_024, and image_023 to allow explicitly pinning the version of image that should be used by pdfium-render, sets image to image_025, and adjusts bookmark traversal so that bookmarks are returned in a more natural order, thanks to an excellent contribution from https://github.com/mlaiosa.

Release 0.8.25 establishes a minimum supported Rust version of 1.60 for pdfium-render, increments the pdfium_latest feature to pdfium_6666 to match new Pdfium release 6666 at https://github.com/bblanchon/pdfium-binaries, adds new crate features pdfium_use_skia, pdfium_use_win32, pdfium_enable_xfa, and pdfium_enable_v8 to make available certain Pdfium functions that require Pdfium to be built with specific compile-time flags, and adds bindings for all remaining FPDF_* functions in the Pdfium API to PdfiumLibraryBindings, an important milestone leading up to release 0.9.0.

Release 0.8.24 fixes a bug in certain string handling operations in the WASM bindings implementation, and introduces the ability to control the version of the Pdfium API used by pdfium-render. By default pdfium-render uses the latest released version of the Pdfium API, potentially requiring you to upgrade your Pdfium library if the latest release contains breaking changes. This can be inconvenient! To explicitly use an older API version, select one of the crate's Pdfium version feature flags when taking pdfium-render as a dependency in your project's Cargo.toml. See the "Crate features" section below for more information.

Release 0.8.23 updates the Pdfium bindings to the latest upstream release, adds new function PdfPageTextChar::text_object() for retrieving the page object containing a specific character in a text page, deprecates the PdfFont::name() function in favour of PdfFont::family() to match changes in upstream naming, adds new functions PdfFont::is_embedded() and PdfFont::data() for retrieving embedded font data, updates the examples/fonts.rs example to demonstrate the new functionality, and adjusts the implementation of some internal functions in response to upstream changes. Deprecated items will be removed in release 0.9.0.

Binding to Pdfium

pdfium-render does not include Pdfium itself. You have several options:

When compiling to WASM, packaging an external build of Pdfium as a separate WASM module is essential.

Dynamic linking

Binding to a pre-built Pdfium dynamic library at runtime is the simplest option. On Android, a pre-built libpdfium.so is packaged as part of the operating system (although recent versions of Android no longer permit user applications to access it); alternatively, you can package a dynamic library appropriate for your operating system alongside your Rust executable.

Pre-built Pdfium dynamic libraries suitable for runtime binding are available from several sources:

If you are compiling a native (i.e. non-WASM) build, and you place an appropriate Pdfium library in the same folder as your compiled application, then binding to it at runtime is as simple as:

    use pdfium_render::prelude::*;

    let pdfium = Pdfium::new(
        Pdfium::bind_to_library(Pdfium::pdfium_platform_library_name_at_path("./")).unwrap()
    );

A common pattern used in the examples at https://github.com/ajrcarey/pdfium-render/tree/master/examples is to first attempt to bind to a Pdfium library in the same folder as the compiled example, and attempt to fall back to a system-provided library if that fails:

    use pdfium_render::prelude::*;

    let pdfium = Pdfium::new(
        Pdfium::bind_to_library(Pdfium::pdfium_platform_library_name_at_path("./"))
            .or_else(|_| Pdfium::bind_to_system_library())
            .unwrap() // Or use the ? unwrapping operator to pass any error up to the caller
    );

This pattern is used to provide an implementation of the Default trait, so the above can be written more simply as:

    use pdfium_render::prelude::*;

    let pdfium = Pdfium::default();
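
The fallback pattern described above can be sketched without Pdfium at all. The following is a minimal illustration using only the standard library; Bindings, BindError, bind_to_local_library(), and the stubbed bind_to_system_library() are hypothetical stand-ins, not pdfium-render types or functions. It shows how or_else() chains the second attempt and how the ? operator passes a final failure up to the caller instead of panicking with unwrap().

```rust
// Hypothetical stand-ins for a bound library and a binding error.
// These are NOT pdfium-render types.
#[derive(Debug, PartialEq)]
struct Bindings {
    source: &'static str,
}

#[derive(Debug, PartialEq)]
struct BindError(String);

// Stub: pretend that no library sits next to the executable.
fn bind_to_local_library() -> Result<Bindings, BindError> {
    Err(BindError("no library found in ./".to_string()))
}

// Stub: pretend that a system-provided library is available.
fn bind_to_system_library() -> Result<Bindings, BindError> {
    Ok(Bindings { source: "system" })
}

// Try the local library first, then fall back to the system library.
// The error from the first attempt is discarded, as in the pattern above.
fn bind() -> Result<Bindings, BindError> {
    bind_to_local_library().or_else(|_| bind_to_system_library())
}

fn main() -> Result<(), BindError> {
    // The ? operator returns the error to the caller if both attempts fail.
    let bindings = bind()?;
    println!("bound to the {} library", bindings.source);
    Ok(())
}
```

Returning the error rather than unwrapping lets an application report a missing library to the user in its own way.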

Static linking

The static crate feature offers an alternative to dynamic linking if you prefer to link Pdfium directly into your executable at compile time. This enables the Pdfium::bind_to_statically_linked_library() function which binds directly to the Pdfium functions compiled into your executable:

    use pdfium_render::prelude::*;

    let pdfium = Pdfium::new(Pdfium::bind_to_statically_linked_library().unwrap());

As a convenience, pdfium-render can instruct cargo to link to either a dynamically-built or a statically-built Pdfium library for you. To link to a dynamically-built library, set the PDFIUM_DYNAMIC_LIB_PATH environment variable when you run cargo build, like so:

    PDFIUM_DYNAMIC_LIB_PATH="/path/containing/your/dynamic/pdfium/library" cargo build

pdfium-render will pass the following flags to cargo:

    cargo:rustc-link-lib=dylib=pdfium
    cargo:rustc-link-search=native=$PDFIUM_DYNAMIC_LIB_PATH

To link to a statically-built library, set the path to the directory containing your library using the PDFIUM_STATIC_LIB_PATH environment variable when you run cargo build, like so:

    PDFIUM_STATIC_LIB_PATH="/path/containing/your/static/pdfium/library" cargo build

pdfium-render will pass the following flags to cargo:

    cargo:rustc-link-lib=static=pdfium
    cargo:rustc-link-search=native=$PDFIUM_STATIC_LIB_PATH

These two environment variables save you writing a custom build.rs yourself. If you have your own build pipeline that links Pdfium statically into your executable, simply leave these environment variables unset.
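
For readers who do want a custom build.rs, the following is a rough sketch of how a build script could emit the same directives. It is not pdfium-render's actual build script; the function names are hypothetical, and the choice to prefer the dynamic path when both variables are set is an assumption made only for this example.

```rust
use std::env;

// Builds the two cargo directives for one library kind ("dylib" or "static").
fn link_directives(kind: &str, dir: &str) -> Vec<String> {
    vec![
        format!("cargo:rustc-link-lib={}=pdfium", kind),
        format!("cargo:rustc-link-search=native={}", dir),
    ]
}

// Chooses directives from the two optional directory paths.
// Assumption for this sketch: the dynamic path wins if both are set.
fn directives_for(dynamic_dir: Option<&str>, static_dir: Option<&str>) -> Vec<String> {
    match (dynamic_dir, static_dir) {
        (Some(dir), _) => link_directives("dylib", dir),
        (None, Some(dir)) => link_directives("static", dir),
        (None, None) => Vec::new(),
    }
}

fn main() {
    // Cargo reads directives from the standard output of a build script.
    let dynamic_dir = env::var("PDFIUM_DYNAMIC_LIB_PATH").ok();
    let static_dir = env::var("PDFIUM_STATIC_LIB_PATH").ok();

    for line in directives_for(dynamic_dir.as_deref(), static_dir.as_deref()) {
        println!("{}", line);
    }
}
```

With neither variable set, the sketch prints nothing, which matches the case where your own build pipeline handles linking.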

Note that the path you set in either PDFIUM_DYNAMIC_LIB_PATH or PDFIUM_STATIC_LIB_PATH should not include the filename of the library itself; it should just be the path of the containing directory. You must make sure your library is named in the appropriate way for your target platform (for example, libpdfium.so or libpdfium.a on Linux, and libpdfium.dylib or libpdfium.a on macOS) in order for the Rust compiler to locate it.
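
If you are unsure which dynamic library filename your platform expects, the standard library exposes the platform's prefix and suffix conventions. The sketch below uses std::env::consts to derive the filename for the platform the program is compiled for; the helper names are hypothetical and the directory is the placeholder path used earlier. It covers dynamic libraries only.

```rust
use std::env::consts::{DLL_PREFIX, DLL_SUFFIX};
use std::path::{Path, PathBuf};

// Applies the current platform's dynamic library naming convention to a stem,
// for example "lib" + stem + ".so" on Linux.
fn dynamic_library_name(stem: &str) -> String {
    format!("{}{}{}", DLL_PREFIX, stem, DLL_SUFFIX)
}

// Joins the containing directory with the expected Pdfium library filename.
fn dynamic_library_path(dir: &Path) -> PathBuf {
    dir.join(dynamic_library_name("pdfium"))
}

fn main() {
    let dir = Path::new("/path/containing/your/dynamic/pdfium/library");
    println!("{}", dynamic_library_path(dir).display());
}
```

Note that the constants describe the platform the code is compiled for, so when cross-compiling a build script they describe the host rather than the target.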

Depending on how your Pdfium library was built, you may need to also link against a C++ standard library. To link against the GNU C++ standard library (libstdc++), use the optional libstdc++ feature. pdfium-render will pass the following additional flag to cargo:

    cargo:rustc-link-lib=dylib=stdc++

To link against the LLVM C++ standard library (libc++), use the optional libc++ feature. pdfium-render will pass the following additional flag to cargo:

    cargo:rustc-link-lib=dylib=c++

Alternatively, use the link-cplusplus crate to link against a C++ standard library. link-cplusplus offers more options for deciding which standard library should be selected, including automatically selecting the build platform's installed default.

pdfium-render will not build Pdfium for you; you must build Pdfium yourself, source a pre-built static archive from elsewhere, or use a dynamically built library downloaded from one of the sources listed above in the "Dynamic linking" section. If you wish to build a static library yourself, an overview of the build process - including a sample build script - is available at https://github.com/ajrcarey/pdfium-render/issues/53.

Compiling to WASM

See https://github.com/ajrcarey/pdfium-render/tree/master/examples for a full example that shows how to bundle a Rust application using pdfium-render alongside a pre-built Pdfium WASM module for inspection and rendering of PDF files in a web browser.

Certain functions that access the file system are not available when compiling to WASM. In all cases, browser-specific alternatives are provided, as detailed at the link above.

At the time of writing, the WASM builds of Pdfium at https://github.com/bblanchon/pdfium-binaries/releases are compiled with a non-growable WASM heap memory allocator. This means that attempting to open a PDF document longer than just a few pages will result in an unrecoverable out of memory error. The WASM builds of Pdfium at https://github.com/paulocoutinhox/pdfium-lib/releases are recommended as they do not have this problem.

Multi-threading

Pdfium makes no guarantees about thread safety and should be assumed not to be thread safe. The Pdfium authors specifically recommend that parallel processing, not multi-threading, be used to process multiple documents simultaneously.

pdfium-render achieves thread safety by locking access to Pdfium behind a mutex; each thread must acquire exclusive access to this mutex in order to make any call to Pdfium. This has the effect of sequencing all calls to Pdfium as if they were single-threaded, even when using pdfium-render from multiple threads. This approach offers no performance benefit, but it ensures that Pdfium will not crash when running as part of a multi-threaded application.
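
The effect of this locking strategy can be illustrated with the standard library alone. In the sketch below, FakeLibrary is a hypothetical stand-in for a library that is not thread safe; it is not part of pdfium-render, and the crate's real locking is internal. Every call goes through a shared Mutex, so calls from different threads run one at a time.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for a library that is not thread safe.
struct FakeLibrary {
    calls: u32,
}

impl FakeLibrary {
    // Requires exclusive access, like any call into a non-thread-safe library.
    fn render_page(&mut self) -> u32 {
        self.calls += 1;
        self.calls
    }
}

// Spawns worker threads that all share one library behind a mutex and
// returns the total number of calls made.
fn render_from_threads(threads: u32, pages_per_thread: u32) -> u32 {
    let library = Arc::new(Mutex::new(FakeLibrary { calls: 0 }));

    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let library = Arc::clone(&library);
            thread::spawn(move || {
                for _ in 0..pages_per_thread {
                    // The lock is held only for the duration of this call.
                    library.lock().unwrap().render_page();
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    let total = library.lock().unwrap().calls;
    total
}

fn main() {
    println!("total calls: {}", render_from_threads(4, 25));
}
```

Because every call waits for the lock, the threads gain correctness but no parallel speed-up, which is the trade-off described above.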

An example of safely using pdfium-render as part of a multi-threaded parallel iterator is available at https://github.com/ajrcarey/pdfium-render/tree/master/examples.

Crate features

This crate provides the following optional features:

Crate features for selecting image versions

Release 0.8.26 introduced new features to explicitly control the version of the image crate used by pdfium-render:

Crate features for selecting Pdfium API versions

Release 0.8.24 introduced new features to explicitly control the version of the Pdfium API used by pdfium-render:

A small number of functions in the Pdfium API are gated behind compile-time flags when compiling Pdfium. pdfium-render release 0.8.25 introduced new crate features to control whether these functions are included in the PdfiumLibraryBindings trait:

Default features

The image, thread_safe, and pdfium_latest features are enabled by default. All other features are disabled by default.

Minimum supported Rust version

With the image feature enabled, the minimum supported Rust version of pdfium-render will align with the minimum supported Rust version of image (at the time of writing, Rust 1.79 for image version 0.25). With the image feature disabled, the minimum supported Rust version of pdfium-render is 1.61.

Porting existing Pdfium code from other languages

The high-level idiomatic Rust interface provided by pdfium-render is built on top of raw FFI bindings to the Pdfium API defined in the PdfiumLibraryBindings trait. You can use these raw FFI bindings directly if you wish, which makes porting existing code that uses the Pdfium API straightforward while still gaining the benefits of late binding and WASM compatibility. For instance, the following code snippet (taken from a C++ sample):

    string test_doc = "test.pdf";

    FPDF_InitLibrary();
    FPDF_DOCUMENT doc = FPDF_LoadDocument(test_doc.c_str(), NULL);
    // ... do something with doc
    FPDF_CloseDocument(doc);
    FPDF_DestroyLibrary();

would translate to the following Rust code:

    let pdfium = Pdfium::default();
    let bindings = pdfium.bindings();
    let test_doc = "test.pdf";

    bindings.FPDF_InitLibrary();
    let doc = bindings.FPDF_LoadDocument(test_doc, None);
    // ... do something with doc
    bindings.FPDF_CloseDocument(doc);
    bindings.FPDF_DestroyLibrary();

Pdfium's API uses three different string types: classic C-style null-terminated char arrays, UTF-8 byte arrays, and a UTF-16LE byte array type named FPDF_WIDESTRING. For functions that take a C-style string or a UTF-8 byte array, pdfium-render's binding will take the standard Rust &str type. For functions that take an FPDF_WIDESTRING, pdfium-render exposes two functions: the vanilla FPDF_*() function that takes an FPDF_WIDESTRING, and an additional FPDF_*_str() helper function that takes a standard Rust &str and converts it internally to an FPDF_WIDESTRING before calling Pdfium. Examples of functions with additional _str() helpers include FPDFBookmark_Find(), FPDFText_SetText(), FPDFText_FindStart(), FPDFDoc_AddAttachment(), FPDFAnnot_SetStringValue(), and FPDFAttachment_SetStringValue().

The PdfiumLibraryBindings::get_pdfium_utf16le_bytes_from_str() and PdfiumLibraryBindings::get_string_from_pdfium_utf16le_bytes() utility functions are provided for converting to and from FPDF_WIDESTRING in your own code.
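
To make the encoding concrete, here is a standalone sketch of a UTF-16LE round trip using only the standard library. The function names are hypothetical and this is not the crate's implementation; in particular, appending a two-byte null terminator and stopping at the first null code unit when decoding are assumptions of this example.

```rust
// Encodes a Rust string as UTF-16LE bytes.
// Assumption: a two-byte null terminator is appended for the C side.
fn utf16le_bytes_from_str(text: &str) -> Vec<u8> {
    let mut bytes: Vec<u8> = text
        .encode_utf16()
        .flat_map(|unit| unit.to_le_bytes())
        .collect();
    bytes.extend_from_slice(&[0, 0]);
    bytes
}

// Decodes UTF-16LE bytes back into a Rust string, stopping at the first
// null code unit. Returns None if the data is not valid UTF-16.
fn string_from_utf16le_bytes(bytes: &[u8]) -> Option<String> {
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .take_while(|&unit| unit != 0)
        .collect();
    String::from_utf16(&units).ok()
}

fn main() {
    let bytes = utf16le_bytes_from_str("test.pdf");
    println!(
        "{} bytes, round trip: {:?}",
        bytes.len(),
        string_from_utf16le_bytes(&bytes)
    );
}
```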

Some Pdfium functions return classic C-style integer boolean values, aliased as FPDF_BOOL. The PdfiumLibraryBindings::TRUE(), PdfiumLibraryBindings::FALSE(), PdfiumLibraryBindings::is_true(), PdfiumLibraryBindings::to_result(), and PdfiumLibraryBindings::bool_to_pdfium() utility functions are provided for converting to and from FPDF_BOOL in your own code.
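
The conversions involved are small. The sketch below shows one plausible shape for them using only the standard library; FpdfBool and the three functions are hypothetical names for illustration, and treating any non-zero value as true follows the usual C convention rather than anything confirmed about the crate's helpers.

```rust
use std::os::raw::c_int;

// Hypothetical alias mirroring a C-style integer boolean.
type FpdfBool = c_int;

// Converts a Rust bool into a C-style integer boolean.
fn bool_to_fpdf(value: bool) -> FpdfBool {
    if value {
        1
    } else {
        0
    }
}

// Assumption: any non-zero value counts as true, as in C.
fn is_true(value: FpdfBool) -> bool {
    value != 0
}

// Maps a C-style success flag onto a Rust Result so that ? can be used.
fn to_result(value: FpdfBool) -> Result<(), String> {
    if is_true(value) {
        Ok(())
    } else {
        Err("the call reported failure".to_string())
    }
}

fn main() {
    println!("{:?}", to_result(bool_to_fpdf(true)));
    println!("{:?}", to_result(bool_to_fpdf(false)));
}
```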

Image pixel data in Pdfium is encoded in either three-channel BGR or four-channel BGRA. The PdfiumLibraryBindings::bgr_to_rgba(), PdfiumLibraryBindings::bgra_to_rgba(), PdfiumLibraryBindings::rgb_to_bgra(), and PdfiumLibraryBindings::rgba_to_bgra() utility functions are provided for converting between RGB and BGR image data in your own code.
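
As an illustration of what such a conversion involves, the following standalone sketch swaps the blue and red channels of tightly packed pixel data. These functions are written for this example only and are not the crate's implementations; they assume no row padding, ignore any trailing partial pixel, and fill the alpha channel with 255 (opaque) when the source has none.

```rust
// Converts tightly packed four-channel BGRA pixels to RGBA.
fn bgra_to_rgba(bgra: &[u8]) -> Vec<u8> {
    bgra.chunks_exact(4)
        .flat_map(|p| [p[2], p[1], p[0], p[3]])
        .collect()
}

// Converts tightly packed three-channel BGR pixels to RGBA.
// Assumption: pixels without an alpha channel are fully opaque.
fn bgr_to_rgba(bgr: &[u8]) -> Vec<u8> {
    bgr.chunks_exact(3)
        .flat_map(|p| [p[2], p[1], p[0], 255])
        .collect()
}

fn main() {
    // One blue pixel in BGRA order: blue = 255, green = 0, red = 0, alpha = 255.
    let blue_bgra = [255u8, 0, 0, 255];
    println!("{:?}", bgra_to_rgba(&blue_bgra));
}
```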

Development status

As at Pdfium release pdfium_6721 there are 426 FPDF_* functions in the Pdfium API. Bindings to these functions are available in the PdfiumLibraryBindings trait.

The initial focus of this crate was on rendering pages in a PDF file; consequently, high-level implementations of FPDF_* functions related to page rendering were prioritised. By 1.0, the functionality of all FPDF_* functions exported by all Pdfium modules will be available, with the exception of certain functions specific to interactive scripting, user interaction, and printing.

Some functions and type definitions have been renamed or revised since their initial implementations. The initial implementations are still available but are marked as deprecated. These deprecated items will be removed in release 0.9.0.

Version history