ajrcarey / pdfium-render

A high-level idiomatic Rust wrapper around Pdfium, the C++ PDF library used by the Google Chromium project.
https://crates.io/crates/pdfium-render
Other
364 stars 59 forks source link

`get_processed_image` gives reversed byte order #50

Closed Victor-N-Suadicani closed 2 years ago

Victor-N-Suadicani commented 2 years ago

So I tried the following:

    let document = pdfium.load_pdf_from_file(&args.file, None)?;

    for (page_n, page) in document.pages().iter().enumerate() {
        for (object_n, object) in page.objects().iter().enumerate() {
            match &object {
                PdfPageObject::Text(_) => println!("Got Text object"),
                PdfPageObject::Path(_) => println!("Got Path object"),
                PdfPageObject::Image(image_obj) => {
                    println!("Got Image object");
                    let image = image_obj.get_processed_image(&document)?;
                    image.save(format!("{page_n:0>3}_{object_n:0>3}.jpeg"))?;
                }
                PdfPageObject::Shading(_) => println!("Got Shading object"),
                PdfPageObject::FormFragment(_) => println!("Got FormFragment object"),
                PdfPageObject::Unsupported(_) => println!("Got Unsupported object"),
            }
        }
    }

This worked on the first image of a PDF I tried it on, after which I got an error Error: Pdfium(PdfiumLibraryInternalError(Unknown)). But the important point is that the image that did manage to get extracted had the colours all wrong - a mostly red image turned blue. I'm guessing this is due to #9 and that this doesn't use the set_reverse_byte_order flag?

ajrcarey commented 2 years ago

Hi @Victor-N-Suadicani, thanks for raising this. Yes, this sounds a lot like #9, doesn't it? Pdfium internally uses BGR8 byte ordering throughout; the upstream image crate got rid of their support for BGR8 (or, at the very least, made working with it a lot harder) in their move from 0.23 to 0.24. For rendering complete pages, it is possible to tell Pdfium to flip the byte order, as you noted, by setting a flag during rendering. For getting the processed image from an image object, Pdfium provides the function FPDFImageObj_GetRenderedBitmap(), but there is no provision for passing any rendering flags.

It may be necessary to do some byte flipping manually within pdfium-render. I'm happy to implement this, preferably by using functionality provided by the upstream image crate if possible, or by writing it myself if necessary.

Can you attach the sample PDF you are working with?

I'm curious as to what might have caused the error you mentioned, but perhaps it will become clear upon examining your PDF.

Victor-N-Suadicani commented 2 years ago

I can't share the PDF unfortunately but I get the same error on the image-test.pdf you have yourself in the test directory.

Also interestingly the images there seem to turn very tiny after extraction...

ajrcarey commented 2 years ago

Ok, that's fine. I'm going to work on this tomorrow.

ajrcarey commented 2 years ago

Ok. There's at least three problems here :/

Created utils::pixels module for pixel color channel conversion. Added utils::pixels::bgr_to_rgba(), utils::pixels::bgra_to_rgba(), utils::pixels::rgb_to_rgba(), and utils::pixels::rgba_to_bgra() functions. Added correct color conversions when assigning a new image to, or extracting an image from, a PdfPageImageObject.

I'll continue tomorrow with the third problem noted above, with a view to pushing all changes by the end of tomorrow.

ajrcarey commented 2 years ago

Added new PdfPageImageObject::set_bitmap() function, to allow setting an object's image data from a BGR bitmap without performing a double color conversion to and from image::DynamicImage first. Exposed `utils::pixels::conversion functions inPdfiumLibraryBindings`. Updated README.md.

Victor-N-Suadicani commented 2 years ago

Does FPDFImagObj_GetImageDataDecoded take into account the mask (as I believe FPDFImageObj_GetRenderedBitmap is designed to do, despite being buggy)? I was hoping to be able to extract images with transparency.

ajrcarey commented 2 years ago

No idea! My initial interest is in seeing whether FPDFImageObj_GetImageDataDecoded() has the same runtime behaviour as FPDFImageObj_GetRenderedBitmap(), i.e. does _GetImageDataDecoded() fail in the same places as _GetRenderedBitmap()? And the answer is no, it does not. _GetImageDataDecoded() always returns a valid byte stream. This suggests that bitmap rendering inside _GetRenderedBitmap() is the problem.

You're not the only person to encounter this problem. Based on the tiny hint at https://groups.google.com/g/pdfium/c/V-H9LpuHpPY, I looked at the transformation matrices of every image object in image-test.pdf. And lo and behold, where either of the matrix values a and d are negative values, _GetRenderedBitmap() fails. Where those matrix values are both positive, _GetRenderedBitmap() succeeds.

This suggests that this is indeed a bug in Pdfium, and you may wish to open a bug upstream. I am open to implementing a work-around in pdfium-render, if possible. For example, simply flipping the matrix a and d values from negative to positive results in a successful call to _GetRenderedBitmap() every time. Presumably, however, this affects the processed output.

I think at this point I'm going to push the changes I've made so far, and post my test code here, so you can play around with it and have a think about what you want to do.

ajrcarey commented 2 years ago

In fact, since this _GetRenderedBitmap() problem is separate from the original reason you opened the issue - the mismatched coloring - I'm going to create a new issue. This issue will remain focussed on the color byte ordering.

Spawned new issue #52 to handle the upstream issue. Implemented unit tests for utils::pixels::* color conversion functions. Scheduled for inclusion in crate version 0.7.21.

ajrcarey commented 2 years ago

You can experiment with my changes to PdfPageImageObject by adding pdfium-render as a git dependency in your project's Cargo.toml.

My sample code is basically the same as yours, but for completeness, here it is:

use pdfium_render::prelude::*;

fn main() -> Result<(), PdfiumError> {
    let bindings =
        Pdfium::bind_to_library(Pdfium::pdfium_platform_library_name_at_path("./"))
            .or_else(|_| Pdfium::bind_to_system_library())?;

    let pdfium = Pdfium::new(bindings);

    let document = pdfium.load_pdf_from_file("image-test.pdf", None)?;

    for (page_n, page) in document.pages().iter().enumerate() {
        for (object_n, object) in page.objects().iter().enumerate() {
            match &object {
                PdfPageObject::Text(_) => println!("Got Text object"),
                PdfPageObject::Path(_) => println!("Got Path object"),
                PdfPageObject::Image(image_obj) => {
                    println!("Got Image object");

                    let image = image_obj.get_processed_image(&document)?;
                    image
                        .save(format!(
                            "{page_n:0>3}_{object_n:0>3}-processed.jpeg"
                        ))
                        .map_err(|_| PdfiumError::ImageError)?;
                }
                PdfPageObject::Shading(_) => println!("Got Shading object"),
                PdfPageObject::FormFragment(_) => println!("Got FormFragment object"),
                PdfPageObject::Unsupported(_) => println!("Got Unsupported object"),
            }
        }
    }

    Ok(())
}
Victor-N-Suadicani commented 2 years ago

Thanks for the quick work! Using the latest master, the color problem has definitely been fixed. I'm able to extract many more images now as well, though I still run into an error later on in my own PDF (image-test.pdf works however). The biggest problem I think is that the images extracted in this way are quite small for some PDFs, much smaller than the highest possible resolution as far as I can tell.

ajrcarey commented 2 years ago

Ok, that's good news about the colors. Pdfium returning the wrong image size may be related to #52; in any case, it is definitely a problem in Pdfium, rather than pdfium-render. I propose we mark this issue - which was with the color conversion (or lack thereof) between RGBA and BGRA, and which definitely was a bug in pdfium-render, not Pdfium - as complete, and let's discuss options for working around upstream problems in Pdfium as part of #52.

ajrcarey commented 2 years ago

As there have been no further comments on this, and I believe the original problem has been fixed, I am closing this issue. Let's continue discussion about your upstream problems in #52.

Fix for reversed byte order released as part of 0.7.21. Published to crates.io.