J-F-Liu / lopdf

A Rust library for PDF document manipulation.
MIT License

Not all decoded elements seem to be returned when reading a PDF generated by Ghostscript 9.27 #221

Open BXHlixiaodong opened 1 year ago

BXHlixiaodong commented 1 year ago

Hi all.

I have a PDF generated by Ghostscript on Debian Buster. I tried to parse its page contents:

let mut doc = Document::load("./test04.pdf")?;
let page_content_ids: Vec<ObjectId> = doc
    .page_iter()
    .flat_map(|page_id| doc.get_page_contents(page_id))
    .collect();

for id in page_content_ids.into_iter() {
    let stream = doc.get_object_mut(id)?.as_stream_mut()?;
    stream.decompress();

    let content = stream.decode_content()?;

    content
        .operations
        .iter()
        .for_each(|op| println!("{:?}", op));    // print every operation
    stream.set_content(content.encode()?); // write the unmodified content back into the stream
    stream.compress()?;
}
doc.save("./test04.output.pdf")?;  // save as a separate PDF file

Here is test04.pdf.

I found no Tj operator in the stdout output, and test04.output.pdf is missing some elements, so it is not the same as test04.pdf.

Does anyone know how to fix this? Or should I use other methods based on Stream or Object? Thanks.

Heinenen commented 15 hours ago

You don't see a Tj operator because the PDF you provided doesn't contain one. It seems that the PDF is solely made up of images (of text).
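
One way to check this without reading the debug output by eye is to tally the operator names and compare text-showing operators against XObject paints. The following is only a minimal sketch using the standard library: the helper names (`tally_operators`, `count_text_showing`, `count_xobject_paints`) are hypothetical and not part of lopdf, and it assumes the operator names have already been collected, for example from the `operator` field of each entry in `content.operations`. The operator sets come from the PDF specification: `Tj`, `TJ`, `'` and `"` show text, while `Do` paints an XObject (which may be an image or a form).

```rust
use std::collections::BTreeMap;

/// Text-showing operators defined by the PDF specification.
const TEXT_SHOWING: [&str; 4] = ["Tj", "TJ", "'", "\""];

/// Count how often each operator name occurs.
/// (Hypothetical helper, not part of lopdf.)
fn tally_operators<'a>(names: impl IntoIterator<Item = &'a str>) -> BTreeMap<&'a str, usize> {
    let mut counts = BTreeMap::new();
    for name in names {
        *counts.entry(name).or_insert(0) += 1;
    }
    counts
}

/// Total number of text-showing operators in the tally.
fn count_text_showing(counts: &BTreeMap<&str, usize>) -> usize {
    TEXT_SHOWING
        .iter()
        .map(|op| counts.get(*op).copied().unwrap_or(0))
        .sum()
}

/// Number of `Do` operators, i.e. XObject paints (images or forms).
fn count_xobject_paints(counts: &BTreeMap<&str, usize>) -> usize {
    counts.get("Do").copied().unwrap_or(0)
}
```

If `count_text_showing` is zero while `count_xobject_paints` is not, the page content draws XObjects only, which would be consistent with a scanned or rasterized document. Confirming that a given XObject really is an image would still require looking up its `Subtype` in the page resources.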

That test04.output.pdf differs from the original is probably due to #78.