Content decoding does not handle inline images

Example PDF file: bi.pdf

Content stream contains:

100 0 0 100 0 0 cm
BI /W 4 /H 4 /CS /RGB /BPC 8
ID
00000z0z00zzz00z0zzz0zzzEI aazazaazzzaazazzzazzz
EI

Inline images are described in section 4.8.6 of the PDF Reference.

extern crate lopdf;

fn main() {
    // Load the example file and decode the content stream of page 1.
    let doc = lopdf::Document::load("bi.pdf").unwrap();
    let cont = doc.get_and_decode_page_content(doc.get_pages()[&1]);
    println!("{:#?}", cont);
}

Ok(
    Content {
        operations: [
            Operation {
                operator: "cm",
                operands: [
                    100,
                    0,
                    0,
                    100,
                    0,
                    0,
                ],
            },
            Operation {
                operator: "BI",
                operands: [],
            },
            Operation {
                operator: "ID",
                operands: [
                    /W,
                    4,
                    /H,
                    4,
                    /CS,
                    /RGB,
                    /BPC,
                    8,
                ],
            },
            Operation {
                operator: "z",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "z",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "zzz",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "z",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "zzz",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "zzzEI",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "aazazaazzzaazazzzazzz",
                operands: [],
            },
            Operation {
                operator: "EI",
                operands: [],
            },
        ],
    },
)

To handle this properly, the parser needs to calculate the size of the decoded image data from parameters such as width, height, bits per component, and color space, and to decode the data using any specified filters. Note the "EI " byte sequence in the middle of the image data above: the data can contain any byte sequence, so searching for the end marker is not reliable. Unfortunately, inline images have no required "Length" key that could be used to skip the data, as there is for normal PDF streams.
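As a rough illustration of the size calculation, here is a minimal sketch for the unfiltered case only. The function names `components_for` and `inline_image_data_len` are hypothetical and not part of lopdf. It assumes a device color space named directly in the image dictionary, and it does not cover Indexed color spaces, named resources, image masks, or filtered data.

```rust
/// Components per pixel for the device color spaces an inline image can name
/// directly (full names and their abbreviations). Hypothetical helper.
fn components_for(color_space: &str) -> Option<usize> {
    match color_space {
        "DeviceGray" | "G" => Some(1),
        "DeviceRGB" | "RGB" => Some(3),
        "DeviceCMYK" | "CMYK" => Some(4),
        // Indexed color spaces and named resources need an extra lookup,
        // which this sketch does not attempt.
        _ => None,
    }
}

/// Byte length of unfiltered image data. Each row is padded to a whole byte.
fn inline_image_data_len(width: usize, height: usize, bpc: usize, components: usize) -> usize {
    let bits_per_row = width * components * bpc;
    let bytes_per_row = (bits_per_row + 7) / 8; // round up to a byte boundary
    bytes_per_row * height
}

fn main() {
    // Parameters from the example: /W 4 /H 4 /CS /RGB /BPC 8
    let components = components_for("RGB").expect("unsupported color space");
    let len = inline_image_data_len(4, 4, 8, components);
    println!("expected image data length: {} bytes", len);
}
```

For the example above this gives 48 bytes, which matches the 48 bytes between "ID" plus one whitespace character and the final "EI".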

This also affects other lopdf functionality that depends on content decoding, such as text extraction. For example, a false-positive "Tj" operator could appear inside image data. In some circumstances lopdf might also return an error, for instance when a byte sequence in the image data is not a valid UTF-8 string.
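To make the false-positive problem concrete, the sketch below compares two ways of finding the end of the image data in the example stream: scanning for the first "EI", and skipping a known number of bytes and then checking for the marker. Both functions (`naive_end` and `end_by_length`) are hypothetical helpers written for this illustration, not lopdf APIs. The 48-byte length is assumed to be already computed from the image parameters, and the whitespace check covers only ASCII whitespace, not every PDF whitespace character.

```rust
/// Naive approach: treat the first "EI" in the data as the end marker.
fn naive_end(data: &[u8]) -> Option<usize> {
    data.windows(2).position(|w| w == b"EI")
}

/// Length-based approach: skip `len` bytes of image data, then require
/// optional whitespace followed by "EI". Returns the data length on success.
fn end_by_length(data: &[u8], len: usize) -> Option<usize> {
    let rest = data.get(len..)?; // None if the stream is shorter than `len`
    let ws = rest.iter().take_while(|b| b.is_ascii_whitespace()).count();
    if rest[ws..].starts_with(b"EI") {
        Some(len)
    } else {
        None
    }
}

fn main() {
    // Bytes that follow "ID" and a single whitespace character in the example.
    let data: &[u8] = b"00000z0z00zzz00z0zzz0zzzEI aazazaazzzaazazzzazzz\nEI\n";

    // The naive scan stops at the "EI" embedded in the pixel data.
    println!("naive end offset: {:?}", naive_end(data));

    // Skipping the 48 bytes implied by /W 4 /H 4 /CS /RGB /BPC 8 reaches the real marker.
    println!("length-based end offset: {:?}", end_by_length(data, 48));
}
```

With the example data, the naive scan reports offset 24 (inside the image), while the length-based check reports offset 48, where the real "EI" follows.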
