ajrcarey / pdfium-render

A high-level idiomatic Rust wrapper around Pdfium, the C++ PDF library used by the Google Chromium project.
https://crates.io/crates/pdfium-render

Support extracting grayscale images #74

Closed: stephenjudkins closed this 1 year ago

stephenjudkins commented 1 year ago

I've confirmed this works with a PDF that I have but could add a test case if you'd like!

ajrcarey commented 1 year ago

Brilliant, thank you so much for plugging this gap! Do you have a sample PDF file containing a grey-scale image?

stephenjudkins commented 1 year ago

Here's one! gray.pdf

ajrcarey commented 1 year ago

Great, thank you. But when I look at the image format of the image on that page, Pdfium tells me it's BGRA, not grayscale, and rendering it doesn't exercise your code path (in fact, I can delete your code entirely and the image renders perfectly fine).

Can you provide an example for which pdfium-render previously generated an error during image page object rendering, thus necessitating your code change?

stephenjudkins commented 1 year ago

Confirmed that this (very large!) image does exercise the problem. Sorry, resizing it down to a smaller image converted the colorspace.

_Slingin_SammyBaugh,_Washington,_D.C.,_Sept._11.Slinging_Sammy__Baugh,_new_addition_to_the_Washington_Redskins,_the_Texas_Christian_U._star_is_rated_as_one_of_the_greatest_of_this_LCCN2016877914 (1).jpg.pdf

ajrcarey commented 1 year ago

Great, thank you. When I open the file in PdfExplorer it does seem to confirm that the image colorspace is DeviceGray, but again Pdfium identifies it (rightly or wrongly) as BGRA and rendering the image object to an image doesn't exercise your code path. Perhaps Pdfium is doing something clever in the background that is obfuscating things.

Are you able to share the document you were working with that caused you to initially discover the missing PdfBitmapFormat::Gray handler in PdfPageImageObject::get_image_from_bitmap_handle()?

stephenjudkins commented 1 year ago

Here's a reduced example of the code I'm using to exercise this:

use pdfium_render::prelude::*;

fn go() -> Result<(), PdfiumError> {
    // Prefer a Pdfium library in the current directory; fall back to a system-wide install.
    let pdfium = Pdfium::new(
        Pdfium::bind_to_library(Pdfium::pdfium_platform_library_name_at_path("./"))
            .or_else(|_| Pdfium::bind_to_system_library())?,
    );

    let doc = pdfium.load_pdf_from_file("big.pdf", None)?;

    // Visit every image object on every page and try to extract its raw (unprocessed) image.
    for page in doc.pages().iter() {
        for object in page.objects().iter() {
            if let Some(image) = object.as_image_object() {
                match image.get_raw_image() {
                    Ok(i) => println!("{} x {}", i.width(), i.height()),
                    // A grayscale image hits this arm when the Gray format is unhandled.
                    Err(e) => println!("{:?}", e),
                };
            }
        }
    }

    Ok(())
}

fn main() {
    go().unwrap();
}

When I run this with my branch of pdfium-render:

stephen@boris-godunov image-handler % cargo run --release
    Finished release [optimized] target(s) in 0.09s
     Running `target/release/image_handler`
8104 x 10140

When I run with the latest cargo release (0.7.29) of pdfium-render:

stephen@boris-godunov image-handler % cargo run --release
    Finished release [optimized] target(s) in 0.09s
     Running `target/release/image_handler`
ImageError

I've gone and added some println! debugging to my local pdfium-render source code and verified that, if I remove the new match, we are hitting the codepath where we match on PdfBitmapFormat::Gray.

Perhaps there is a different version of pdfium we're using?

stephen@boris-godunov image-handler % shasum -a 256 libpdfium.dylib 
5b28effbe31b7327e3e6485acc1d999cccf52c21815154f0a50779221daac3c3  libpdfium.dylib

I got the prebuilt library from https://github.com/bblanchon/pdfium-binaries/releases/tag/chromium%2F5579. Let me try the more recent version and see what happens....

stephenjudkins commented 1 year ago

I tried the latest release from that repo (https://github.com/bblanchon/pdfium-binaries/releases/tag/chromium%2F5619) and I'm still seeing the ImageError.

stephenjudkins commented 1 year ago

(also, if it's not clear, I'm on macOS/arm64)

ajrcarey commented 1 year ago

Ah, I see the problem, and it's totally PEBKAC on my part. I was using PdfPageImageObject::get_processed_image() rather than PdfPageImageObject::get_raw_image() in my test code. It makes perfect sense that get_processed_image() would, y'know, process the color space :)

With get_raw_image(), I can indeed reproduce your original problem and I can confirm your code change resolves it. Many thanks again for plugging this gap. Your fix will be released as part of crate version 0.7.32 shortly.
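For anyone reading along later, the gap can be sketched using only the standard library. Everything named here (the `BitmapFormat` enum, the `ImageError` struct, the `decode` function and its `handle_gray` flag) is a hypothetical stand-in for illustration, not pdfium-render's actual internals. The idea is that a raw bitmap arrives in its native format, so a format with no match arm falls through to an error, while a grayscale arm only has to read one byte per pixel:

```rust
// Hypothetical stand-ins for illustration; not pdfium-render's real types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BitmapFormat {
    Gray, // 1 byte per pixel
    Bgr,  // 3 bytes per pixel
    Bgra, // 4 bytes per pixel
}

#[derive(Debug, PartialEq)]
struct ImageError;

/// Converts a tightly packed pixel buffer into RGBA bytes.
/// `handle_gray` switches the grayscale arm on or off, mimicking the
/// behaviour after and before a Gray handler is added (an assumption
/// made purely for this sketch).
fn decode(
    format: BitmapFormat,
    buffer: &[u8],
    handle_gray: bool,
) -> Result<Vec<u8>, ImageError> {
    match format {
        // Swap the blue and red channels, keep alpha.
        BitmapFormat::Bgra => Ok(buffer
            .chunks_exact(4)
            .flat_map(|p| [p[2], p[1], p[0], p[3]])
            .collect()),
        // Swap the blue and red channels, add an opaque alpha.
        BitmapFormat::Bgr => Ok(buffer
            .chunks_exact(3)
            .flat_map(|p| [p[2], p[1], p[0], 255])
            .collect()),
        // Replicate the single gray value into R, G, and B.
        BitmapFormat::Gray if handle_gray => Ok(buffer
            .iter()
            .flat_map(|&g| [g, g, g, 255])
            .collect()),
        // Any format without a handler surfaces as an error.
        _ => Err(ImageError),
    }
}

fn main() {
    let gray = [0u8, 128, 255];
    // Without the grayscale arm the caller only sees an error.
    println!("{:?}", decode(BitmapFormat::Gray, &gray, false));
    // With the arm in place the pixels decode normally.
    println!("{:?}", decode(BitmapFormat::Gray, &gray, true));
    // Formats that already had handlers are unaffected either way.
    println!("{:?}", decode(BitmapFormat::Bgra, &[1, 2, 3, 4], false));
}
```

A processed path that has already converted the pixels to BGRA would never reach the grayscale arm, which would be consistent with the earlier tests not exercising the new code.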

stephenjudkins commented 1 year ago

Great! Thank you so much for your work here.