J-F-Liu / lopdf

A Rust library for PDF document manipulation.
MIT License

Something weird with some pdf (valid for opening and printing) in lopdf 0.20.0 #134

Open stephaneworkspace opened 3 years ago

stephaneworkspace commented 3 years ago

Hello, I found something weird in some PDF files generated with PDFCreator 2.1. The project is at pdfforge/PDFCreator (github.com), but that is version 4.2.0, and updating PDFCreator is not a solution anyway, because all the old PDFs already have this problem. So I attached to this issue 3 files generated with LibreOffice → print → PDFCreator, and files generated with LibreOffice → export as PDF.

After analysing this in Rust, it seems to me that the format produced by PDFCreator is wrong. I worked around the problem with the solution shown below in this issue.

I added two flags, sw_debug and sw_pdf_creator, to the main function of the example at lopdf/examples/merge.rs:

    let sw_debug = true; // display println! output
    let sw_pdf_creator = true; // enable the PDFCreator workaround

And:

    let mut j = 0;
    for mut document in documents {
        document.renumber_objects_with(max_id);

        max_id = document.max_id + 1;

        if sw_debug {
            for d in document.get_pages().iter() {
                println!("{:?}", d);
            }
        }

        // PDFCreator workaround: find the smallest page object number in this
        // document. Skipped for the first document (j == 0), where it stays 0.
        let mut min_value = 0;
        if j > 0 {
            for d in document.get_pages().iter() {
                // The first page seen sets the initial minimum; later pages
                // lower it whenever their object number is smaller.
                if min_value == 0 || d.1.0 < min_value {
                    min_value = d.1.0;
                    if sw_debug {
                        println!("debug: {:?}", d.1.0);
                    }
                }
            }
        }
        j += 1;
        if sw_debug {
            println!("j:{} {}", j, min_value);
        }
        // Use the number just below the smallest page object number.
        if min_value > 0 {
            min_value -= 1;
        }

        documents_pages.extend(
            document
                .get_pages()
                .into_iter()
                .enumerate()
                .map(|(i, (_, object_id))| {
                    // Only the first page of each document is stored under
                    // the modified key; all other pages keep their own ID.
                    let object_id_mod = if i == 0 && sw_pdf_creator {
                        (min_value, object_id.1)
                    } else {
                        object_id
                    };
                    if sw_debug {
                        println!("mod: {:?} i: {}", object_id_mod, i);
                    }
                    (
                        object_id_mod,
                        document.get_object(object_id).unwrap().to_owned(),
                    )
                })
                .collect::<BTreeMap<ObjectId, Object>>(),
        );
        documents_objects.extend(document.objects);
    }

When merging pdf_creator_a.pdf, pdf_creator_b.pdf and pdf_creator_c.pdf, the following debug block prints the (page number, object ID) pairs listed after it:

        if sw_debug {
            for d in document.get_pages().iter() {
                println!("{:?}", d);
            }
        }

pdf_creator_a.pdf : (1, (21, 0)) (2, (1, 0)) (3, (6, 0)) (4, (11, 0))

pdf_creator_b.pdf : (1, (52, 0)) (2, (32, 0)) (3, (37, 0)) (4, (42, 0))

pdf_creator_c.pdf : (1, (83, 0)) (2, (63, 0)) (3, (68, 0)) (4, (73, 0))

With the workaround enabled (sw_pdf_creator = true), the following debug line prints the output listed after it:

                    if sw_debug {
                        println!("mod: {:?} i: {}", object_id_mod, i);
                    }

pdf_creator_a.pdf : mod: (0, 0) i: 0 mod: (1, 0) i: 1 mod: (6, 0) i: 2 mod: (11, 0) i: 3

pdf_creator_b.pdf : mod: (31, 0) i: 0 mod: (32, 0) i: 1 mod: (37, 0) i: 2 mod: (42, 0) i: 3

pdf_creator_c.pdf : mod: (62, 0) i: 0 mod: (63, 0) i: 1 mod: (68, 0) i: 2 mod: (73, 0) i: 3
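
To make the remapping above easier to follow, here is a minimal standalone sketch that reproduces it on the printed data without depending on lopdf. The names `workaround_keys` and `pdf_creator_b_pages` are hypothetical helpers invented for this illustration, and the `ObjectId` alias only mirrors the (object number, generation) shape seen in the debug output.

```rust
use std::collections::BTreeMap;

/// Same shape as the IDs in the debug output: (object number, generation).
type ObjectId = (u32, u16);

/// Hypothetical helper (not part of lopdf): returns the keys under which the
/// workaround stores each page, in page order.
fn workaround_keys(pages: &BTreeMap<u32, ObjectId>, is_first_document: bool) -> Vec<ObjectId> {
    // Smallest page object number minus one; 0 for the first document,
    // matching the `j > 0` guard in the modified merge loop.
    let min_value = if is_first_document {
        0
    } else {
        pages
            .values()
            .map(|id| id.0)
            .min()
            .map_or(0, |m| m.saturating_sub(1))
    };

    // Only the first page (i == 0) gets the modified key.
    pages
        .values()
        .enumerate()
        .map(|(i, &id)| if i == 0 { (min_value, id.1) } else { id })
        .collect()
}

/// Pages of pdf_creator_b.pdf exactly as printed in the debug output.
fn pdf_creator_b_pages() -> BTreeMap<u32, ObjectId> {
    BTreeMap::from([(1, (52, 0)), (2, (32, 0)), (3, (37, 0)), (4, (42, 0))])
}
```

Calling `workaround_keys(&pdf_creator_b_pages(), false)` yields the same keys as the "mod" line for pdf_creator_b.pdf. Note that page 1 ends up stored under (31, 0) rather than its own ID (52, 0); since a BTreeMap iterates in key order, that key sorts before the document's other pages.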

There is no problem with PDFs generated by LibreOffice without PDFCreator (merging libre_office_a.pdf, libre_office_b.pdf and libre_office_c.pdf):

libre_office_a.pdf : (1, (1, 0)) (2, (4, 0)) (3, (7, 0)) (4, (10, 0))

libre_office_b.pdf : (1, (23, 0)) (2, (26, 0)) (3, (29, 0)) (4, (32, 0))

libre_office_c.pdf : (1, (45, 0)) (2, (48, 0)) (3, (51, 0)) (4, (54, 0))

Attached files: libre_office_c.pdf, libre_office_a.pdf, libre_office_b.pdf, pdf_creator_c.pdf, pdf_creator_b.pdf, pdf_creator_a.pdf

Thanks

genusistimelord commented 3 years ago

PDFCreator does not renumber the IDs before saving, which is what a proper PDF builder should do: page 1 should not have ID 21 while page 2 has ID 1. The only way to fix this would be to add an ID swap feature that, on load, checks which IDs the pages have and swaps them into the correct order, so that merging them does not produce madness.
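
As a rough illustration of what such an ID swap might compute, the sketch below derives an old-to-new mapping in which the page object IDs ascend in page order. This is only an assumption about how the idea could look: `page_id_swaps` is a hypothetical function, not an existing lopdf API, and a real feature would also have to move the objects themselves and rewrite every reference to them.

```rust
use std::collections::BTreeMap;

/// Same shape as the IDs in the debug output: (object number, generation).
type ObjectId = (u32, u16);

/// Hypothetical helper: maps each page's current object ID to the ID it
/// would get if page object IDs had to ascend in page order.
fn page_id_swaps(pages: &BTreeMap<u32, ObjectId>) -> BTreeMap<ObjectId, ObjectId> {
    // IDs in page order (the map is keyed by page number).
    let in_page_order: Vec<ObjectId> = pages.values().copied().collect();

    // The same set of IDs, sorted ascending.
    let mut sorted = in_page_order.clone();
    sorted.sort();

    // Page N receives the N-th smallest ID.
    in_page_order.into_iter().zip(sorted).collect()
}
```

For the pdf_creator_a.pdf pages reported earlier in the thread, (21, 0), (1, 0), (6, 0), (11, 0) in page order, the mapping is (21, 0) → (1, 0), (1, 0) → (6, 0), (6, 0) → (11, 0) and (11, 0) → (21, 0). Only the IDs already used by pages are permuted, so no other object number is touched.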

Heinenen commented 1 month ago

Maybe I'm misunderstanding something, but what is the problem? Having randomly assigned IDs isn't really a bad thing in PDFs, we have the Xref table to look them up in constant time anyway.
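
To illustrate this point, here is a toy sketch in which page order comes from the page tree's Kids list while the object IDs are used only as lookup keys. Everything in it is an assumption made for illustration: `ToyPdf`, `page_offsets` and `toy_pdf_creator_a` are hypothetical, the byte offsets are made up, and a real cross-reference table is more involved than a plain hash map.

```rust
use std::collections::HashMap;

/// Same shape as the IDs in the debug output: (object number, generation).
type ObjectId = (u32, u16);

/// Toy stand-in for a parsed PDF (hypothetical, heavily simplified).
struct ToyPdf {
    /// Cross-reference lookup: object ID -> byte offset in the file.
    xref: HashMap<ObjectId, u64>,
    /// Page tree children, listed in reading order.
    kids: Vec<ObjectId>,
}

/// Resolves each page, in page order, to its byte offset.
/// The numeric value of an ID never influences the order.
fn page_offsets(pdf: &ToyPdf) -> Vec<Option<u64>> {
    pdf.kids.iter().map(|id| pdf.xref.get(id).copied()).collect()
}

/// Page IDs as reported for pdf_creator_a.pdf; the offsets are invented.
fn toy_pdf_creator_a() -> ToyPdf {
    ToyPdf {
        xref: HashMap::from([
            ((1, 0), 15),
            ((6, 0), 2400),
            ((11, 0), 5100),
            ((21, 0), 9000),
        ]),
        kids: vec![(21, 0), (1, 0), (6, 0), (11, 0)],
    }
}
```

In this model page 1 is simply whichever object `kids[0]` names, here (21, 0), and finding it is a single map lookup no matter how the IDs were assigned.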