J-F-Liu / lopdf

A Rust library for PDF document manipulation.
MIT License

Content decoding does not handle inline images #78

Open misos1 opened 5 years ago

misos1 commented 5 years ago

Example PDF file: bi.pdf


Content stream contains:

100 0 0 100 0 0 cm
BI /W 4 /H 4 /CS /RGB /BPC 8
ID
00000z0z00zzz00z0zzz0zzzEI aazazaazzzaazazzzazzz
EI

Inline images are described in section 4.8.6 of the PDF Reference.

extern crate lopdf;

fn main()
{
    let doc = lopdf::Document::load("bi.pdf").unwrap();
    let cont = doc.get_and_decode_page_content(doc.get_pages()[&1]);
    println!("{:#?}", cont);
}

Output:

Ok(
    Content {
        operations: [
            Operation {
                operator: "cm",
                operands: [
                    100,
                    0,
                    0,
                    100,
                    0,
                    0,
                ],
            },
            Operation {
                operator: "BI",
                operands: [],
            },
            Operation {
                operator: "ID",
                operands: [
                    /W,
                    4,
                    /H,
                    4,
                    /CS,
                    /RGB,
                    /BPC,
                    8,
                ],
            },
            Operation {
                operator: "z",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "z",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "zzz",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "z",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "zzz",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "zzzEI",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "aazazaazzzaazazzzazzz",
                operands: [],
            },
            Operation {
                operator: "EI",
                operands: [],
            },
        ],
    },
)

To handle this properly, the parser needs to calculate the size of the decoded image data from parameters such as width, height, bits per component and color space, and to decode the data using the filters. Note the "EI " byte sequence in the middle of the image data: any byte sequence can appear there. Unfortunately, there is no required "Length" key that could be used to skip the image data, as there is for normal PDF streams.
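As a rough illustration of the size calculation for the unfiltered case, here is a minimal sketch. The helper names (`components`, `inline_image_len`, `split_image_data`) are hypothetical and not part of lopdf; the sketch assumes no filters are applied and only covers the device color spaces. The byte string is the image data from the content stream above.

```rust
/// Number of color components per sample for the device color spaces,
/// accepting both the full names and the inline-image abbreviations.
/// Hypothetical helper, not a lopdf function.
fn components(cs: &str) -> Option<usize> {
    match cs {
        "G" | "DeviceGray" => Some(1),
        "RGB" | "DeviceRGB" => Some(3),
        "CMYK" | "DeviceCMYK" => Some(4),
        _ => None, // Indexed and other color spaces are not handled here.
    }
}

/// Size in bytes of unfiltered image data. Each row is padded to a whole byte.
fn inline_image_len(width: usize, height: usize, bpc: usize, ncomp: usize) -> usize {
    let bits_per_row = width * ncomp * bpc;
    let bytes_per_row = (bits_per_row + 7) / 8;
    bytes_per_row * height
}

/// Split the bytes that follow `ID` and its single whitespace byte into
/// (image data, remainder). Returns None if the stream is too short.
fn split_image_data(after_id: &[u8], len: usize) -> Option<(&[u8], &[u8])> {
    if after_id.len() < len {
        return None;
    }
    Some(after_id.split_at(len))
}

fn main() {
    // Bytes following "ID\n" in the example content stream.
    let after_id = b"00000z0z00zzz00z0zzz0zzzEI aazazaazzzaazazzzazzz\nEI\n";

    // BI /W 4 /H 4 /CS /RGB /BPC 8
    let ncomp = components("RGB").unwrap();
    let len = inline_image_len(4, 4, 8, ncomp);
    println!("expected data length: {}", len); // 48

    let (data, rest) = split_image_data(after_id, len).unwrap();
    // The embedded "EI " at offset 24 is part of the data, not a terminator.
    assert_eq!(&data[24..27], &b"EI "[..]);
    // The real end marker follows the 48 data bytes.
    assert_eq!(rest, &b"\nEI\n"[..]);
}
```

With a filter present, the encoded length cannot be derived this way; the data would have to be decoded until the expected number of bytes has been produced.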

This also affects other lopdf functionality that depends on content decoding, such as text extraction. For example, a false-positive "Tj" can appear inside the image data. In some circumstances lopdf might also return an error, for instance when a byte sequence in the image data is not a valid UTF-8 string.
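To make the two concerns concrete, here is a small sketch using made-up image bytes (not taken from bi.pdf). It only shows that raw image data can contain an operator-like token and can fail UTF-8 validation; it does not reproduce lopdf's actual code path. `contains_token` is a hypothetical helper.

```rust
use std::str;

/// Naive byte search: does `data` contain `needle` anywhere?
/// Hypothetical helper, not a lopdf function.
fn contains_token(data: &[u8], needle: &[u8]) -> bool {
    // `windows(0)` would panic, so guard against an empty needle.
    !needle.is_empty() && data.windows(needle.len()).any(|w| w == needle)
}

fn main() {
    // Made-up raw sample bytes that happen to look like a text-showing operation.
    let image_data: &[u8] = b"\x10(ab) Tj\xff\xfe\x00";

    // A tokenizer that does not skip image data would see a "Tj" operator here.
    assert!(contains_token(image_data, b"Tj"));

    // 0xFF can never appear in valid UTF-8, so string conversion fails.
    assert!(str::from_utf8(image_data).is_err());
}
```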

Heinenen commented 1 week ago

Although #356 handles the given file correctly, I'd like to keep this issue open until the most common filter types in inline images are handled as well.