Add example demonstrating image extraction from an existing document.

richcanvas commented 2 years ago

hi, can you add examples on extracting embedded fonts and images?

ajrcarey commented 2 years ago

Hi @sunnyawake6, thank you for raising the issue.

Pdfium (and therefore pdfium-render) does not support extracting embedded fonts. There are two reasons for this. First, only the font glyphs actually required by the document are included in the document; this is a feature called font subsetting. For instance, if you use a custom font for a drop-cap, and thus use only a single letter from the font, only the glyph for that single letter will be embedded in the document; all other font information will be discarded. So even if you could extract font data from a document, it would probably be incomplete. The second reason you cannot directly extract embedded fonts concerns licensing. If it were easy to extract complete font resources, it would be trivial to violate licensing agreements made with font creators.
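To make the subsetting point concrete, here is a toy sketch in plain Rust. It does not use Pdfium or pdfium-render, and it is not how PDF fonts are actually encoded; the `GlyphTable` type and `subset_font` function are hypothetical names invented for illustration. It only shows why a subsetted font cannot be turned back into the complete font: glyphs for characters that never occur in the text are simply not kept.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a font: maps each character to a string
/// that represents its glyph outline data.
type GlyphTable = BTreeMap<char, String>;

/// Keeps only the glyphs for characters that actually occur in `text`.
/// Characters with no glyph in `full` are skipped.
fn subset_font(full: &GlyphTable, text: &str) -> GlyphTable {
    text.chars()
        .filter_map(|c| full.get(&c).map(|glyph| (c, glyph.clone())))
        .collect()
}
```

With a full table covering the letters O, n, c and e, calling `subset_font(&full, "O")` for a drop-cap returns a table holding only the glyph for O; the other three glyphs are gone from the result.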

Instead, you may wish to consider using one of the methods suggested at https://stackoverflow.com/questions/3488042/how-can-i-extract-embedded-fonts-from-a-pdf-as-valid-font-files. Please be aware that this may violate the licenses of font creators. I cannot assist you further with this.

Extracting images is usually straightforward, since (unlike fonts) images are stored intact in the PDF file. Unless the PDF's security settings have specifically been set to prevent image extraction, you can use a recipe like the following to extract images from a document:

use image::ImageFormat;
use pdfium_render::prelude::*;

fn main() -> Result<(), PdfiumError> {
    // Bind to a Pdfium library in the current directory, falling back to the system library.
    Pdfium::new(
        Pdfium::bind_to_library(Pdfium::pdfium_platform_library_name_at_path("./"))
            .or_else(|_| Pdfium::bind_to_system_library())?,
    )
    .load_pdf_from_file("test/image-test.pdf", None)?
    .pages()
    .iter()
    .enumerate()
    .for_each(|(page_index, page)| {
        // For each page in the document, output the images on the page to separate files.

        page.objects()
            .iter()
            .enumerate()
            .for_each(|(object_index, object)| {
                if let Some(image) = object.as_image_object() {
                    // This page object contains an image. Attempt to extract the raw image data.

                    if let Ok(image) = image.get_raw_image() {
                        // Export the raw image data as a JPG, named by page and object index.

                        assert!(image
                            .save_with_format(
                                format!(
                                    "image-test-page-{}-image-{}.jpg",
                                    page_index, object_index
                                ),
                                ImageFormat::Jpeg,
                            )
                            .is_ok());
                    }
                }
            });
    });

    Ok(())
}

You have two options when extracting images. You can extract the raw image data, exactly as it was embedded in the PDF file; this is what the code above does. The extracted image will ignore any transforms (e.g. rotation, skew, etc.) and image filters that may have been applied to the page object containing the image data, so it may look different to how the image appears when viewed as part of the PDF document. Alternatively, you can extract the processed image data, with any transforms and image filters applied during rendering; an image extracted in this way will look exactly the same as it does when viewed as part of the PDF document. The two functions you want are PdfPageImageObject::get_raw_image() and PdfPageImageObject::get_processed_image().
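The difference between the two options can be sketched with a toy model in plain Rust. This does not call Pdfium or pdfium-render; `Bitmap`, `ToyImageObject`, `raw()` and `processed()` are hypothetical names used only for illustration, and a single 90-degree rotation stands in for whatever transforms and filters a real page object might carry. The raw data comes back exactly as stored, while the processed data has the page-level transform applied.

```rust
/// A tiny grayscale bitmap stored row by row.
#[derive(Debug, Clone, PartialEq)]
struct Bitmap {
    width: usize,
    height: usize,
    pixels: Vec<u8>,
}

impl Bitmap {
    /// Returns a copy rotated 90 degrees clockwise.
    fn rotated_cw(&self) -> Bitmap {
        let mut pixels = Vec::with_capacity(self.pixels.len());
        // The rotated bitmap has `self.width` rows and `self.height` columns.
        for new_row in 0..self.width {
            for new_col in 0..self.height {
                // Pixel (new_row, new_col) comes from (height - 1 - new_col, new_row).
                let old_row = self.height - 1 - new_col;
                let old_col = new_row;
                pixels.push(self.pixels[old_row * self.width + old_col]);
            }
        }
        Bitmap { width: self.height, height: self.width, pixels }
    }
}

/// Hypothetical stand-in for a page image object: the embedded data
/// plus a flag saying the page displays it rotated.
struct ToyImageObject {
    embedded: Bitmap,
    rotate_on_page: bool,
}

impl ToyImageObject {
    /// The data exactly as embedded; the page transform is ignored.
    fn raw(&self) -> Bitmap {
        self.embedded.clone()
    }

    /// The data as it appears on the page, with the transform applied.
    fn processed(&self) -> Bitmap {
        if self.rotate_on_page {
            self.embedded.rotated_cw()
        } else {
            self.embedded.clone()
        }
    }
}
```

For an embedded 3x2 bitmap that the page displays rotated, `raw()` still returns the 3x2 data, whereas `processed()` returns a 2x3 bitmap matching what a reader of the document sees.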

The code above is available as a new example in examples/image_extract.rs.

richcanvas commented 2 years ago

Thanks very much. As for font extraction, I'll explore further. Several months ago I experimented with PDFBox and was able to extract various font data successfully. Maybe pdfium-render could provide an API to get font data, with a similar license warning mechanism?

ajrcarey commented 2 years ago

That's an interesting idea. PDFBox does that by directly processing the document's resource dictionaries and extracting the raw font data from the content streams embedded in the PDF file. Pdfium doesn't provide an API to directly access resource dictionaries or raw content streams, so pdfium-render cannot take the same approach. If Pdfium ever adds an API for accessing resource dictionaries I will revisit this, but for now you will need to use another approach; Pdfium (and therefore pdfium-render) cannot do what you are asking.

ajrcarey commented 2 years ago

Packaged examples/image_extract.rs for inclusion in release 0.7.15. Updated examples/README.md.

ajrcarey commented 3 months ago

Hi @richcanvas, I thought you might like to know that Pdfium upstream has added the ability to extract embedded font data from a document. The PdfFont::data() function has been added to pdfium-render in response. The examples/fonts.rs example has been updated, showing you how to access the embedded data. You can track further changes to this feature in #152.

The changes will be released as crate version 0.8.23 and are available now by taking pdfium-render as a git dependency in your Cargo.toml file.
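For reference, a git dependency in Cargo.toml would look roughly like the fragment below. The repository URL is an assumption inferred from the ajrcarey / pdfium-render repository name, and no branch or revision is pinned, so Cargo would track the repository's default branch.

```toml
[dependencies]
# Assumed URL, based on the ajrcarey / pdfium-render repository name.
pdfium-render = { git = "https://github.com/ajrcarey/pdfium-render" }
```

Once the release is published to crates.io, an ordinary version requirement can replace the git entry.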

ajrcarey / pdfium-render

Add example demonstrating image extraction from an existing document. #41