J-F-Liu / lopdf

A Rust library for PDF document manipulation.
MIT License

Lopdf reports incorrect number of pages and incorrect catalog for certain PDFs #139

Open misos1 opened 3 years ago

misos1 commented 3 years ago

Lopdf fails to notice updated objects for certain PDFs.
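For context, one common way a PDF ends up with updated objects is an incremental update: a new revision is appended to the file and redefines some object numbers, and a reader is expected to use the newest definition. Whether doc.pdf is built this way is an assumption, not something confirmed in this thread. The standalone Rust sketch below does not use lopdf; `Revision`, `resolve`, `example_revisions`, and the object number `1 0` are hypothetical, chosen only to illustrate the idea with the two catalog dictionaries quoted in this report.

```rust
use std::collections::BTreeMap;

/// Object identifier: (object number, generation number).
type ObjectId = (u32, u16);

/// One revision of the file: the objects written in that section.
/// Hypothetical simplified model; a real reader locates revisions
/// through cross-reference tables or streams.
type Revision = Vec<(ObjectId, String)>;

/// Apply revisions from oldest to newest, so that a later definition
/// of an object id replaces any earlier one.
fn resolve(revisions: &[Revision]) -> BTreeMap<ObjectId, String> {
    let mut objects = BTreeMap::new();
    for revision in revisions {
        for (id, body) in revision {
            objects.insert(*id, body.clone());
        }
    }
    objects
}

/// Two revisions that both define object `1 0` (an assumed id).
/// The second one adds the /AcroForm entry to the catalog.
fn example_revisions() -> Vec<Revision> {
    vec![
        vec![((1, 0), "<</Type /Catalog/Pages 11 0 R>>".to_string())],
        vec![((1, 0), "<</Type/Catalog/AcroForm 64 0 R/Pages 11 0 R>>".to_string())],
    ]
}
```

In this model, a reader that stops after the first revision still sees the old catalog, which would be consistent with the output reported below.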

Sample pdf file: doc.pdf

fn main() {
    // Load the sample file, then print the page count and the document catalog.
    let doc = lopdf::Document::load("doc.pdf").unwrap();
    println!("{}", doc.get_pages().len());
    println!("{:?}", doc.catalog().unwrap());
}

Output:

4
<</Type /Catalog/Pages 11 0 R>>

Expected output (second line is approximate):

5
<</Type/Catalog/AcroForm 64 0 R/Pages 11 0 R>>

williamdes commented 9 months ago

AcroForm does not seem to be implemented.

Heinenen commented 3 months ago

I cannot reproduce the error with the current version (commit 2cf9f9827959529255a855e263d4e638ef860e91). The output I get from lopdf matches your expected output exactly, apart from a single space character:

5
<</Type /Catalog/AcroForm 64 0 R/Pages 11 0 R>>

williamdes commented 3 months ago

@J-F-Liu why can I not find any code references to the word "AcroForm"? Is it implemented or not?

Heinenen commented 3 months ago

@williamdes There probably isn't anything special implemented, but it doesn't need to be in order to count the correct number of pages.

Most things in PDFs just rely on very basic data types (called Objects), and from a quick look it seems that this is also the case for AcroForms.
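To illustrate why no AcroForm-specific code is needed for counting pages, here is a minimal standalone sketch. It does not use lopdf's actual types: the `Object` enum, `pages_root`, `count_pages`, and the object numbers other than `11 0` and `64 0` are assumptions made for the example. The page count comes from following /Pages and /Kids through generic dictionaries, and the unknown /AcroForm key is simply never looked at.

```rust
use std::collections::BTreeMap;

type ObjectId = (u32, u16);

/// A deliberately small subset of PDF object types (hypothetical model).
#[derive(Debug, Clone, PartialEq)]
enum Object {
    Integer(i64),
    Name(String),
    Array(Vec<Object>),
    Dictionary(BTreeMap<String, Object>),
    Reference(ObjectId),
}

type Objects = BTreeMap<ObjectId, Object>;

/// Follow the catalog's /Pages reference to the root of the page tree.
fn pages_root(objects: &Objects, catalog: ObjectId) -> Option<ObjectId> {
    match objects.get(&catalog) {
        Some(Object::Dictionary(dict)) => match dict.get("Pages") {
            Some(Object::Reference(id)) => Some(*id),
            _ => None,
        },
        _ => None,
    }
}

/// Count leaf pages below `node` by following /Kids references.
/// No cycle detection: a malformed tree could recurse forever.
fn count_pages(objects: &Objects, node: ObjectId) -> usize {
    let dict = match objects.get(&node) {
        Some(Object::Dictionary(dict)) => dict,
        _ => return 0,
    };
    match dict.get("Type") {
        Some(Object::Name(kind)) if kind == "Page" => 1,
        Some(Object::Name(kind)) if kind == "Pages" => match dict.get("Kids") {
            Some(Object::Array(kids)) => kids
                .iter()
                .map(|kid| match kid {
                    Object::Reference(id) => count_pages(objects, *id),
                    _ => 0,
                })
                .sum(),
            _ => 0,
        },
        _ => 0,
    }
}

/// Helper that builds a dictionary object from key/value pairs.
fn dict(entries: Vec<(&str, Object)>) -> Object {
    Object::Dictionary(entries.into_iter().map(|(k, v)| (k.to_string(), v)).collect())
}

/// A two-page document whose catalog carries an extra /AcroForm entry.
/// Object 64 0 is intentionally absent: the page walk never needs it.
fn example_document() -> Objects {
    let mut objects = Objects::new();
    objects.insert((1, 0), dict(vec![
        ("Type", Object::Name("Catalog".to_string())),
        ("AcroForm", Object::Reference((64, 0))),
        ("Pages", Object::Reference((11, 0))),
    ]));
    objects.insert((11, 0), dict(vec![
        ("Type", Object::Name("Pages".to_string())),
        ("Count", Object::Integer(2)),
        ("Kids", Object::Array(vec![
            Object::Reference((12, 0)),
            Object::Reference((13, 0)),
        ])),
    ]));
    for number in 12..=13u32 {
        objects.insert((number, 0), dict(vec![("Type", Object::Name("Page".to_string()))]));
    }
    objects
}
```

With this model, `pages_root(&objects, (1, 0))` yields `(11, 0)` and `count_pages` returns 2 for the example document.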

Are you missing a feature in lopdf, or what is your use case?

williamdes commented 3 months ago

I wanted to build a PDF sanitizer and needed to know what is implemented.

The workflow would be: read the file, filter by object type, and write the result back to disk.

Heinenen commented 3 months ago

That could be possible, but it would require a lot of work on your side.

As I said, most things in PDFs rely on Objects (most often Dictionaries), which lopdf can read. However, lopdf does not enforce or check in any way which keys and values must be present in those dictionaries (with the exception of some special dictionaries, like the document catalog).
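Because the library does not judge dictionary contents, a sanitizer has to bring its own rules. The following is a minimal standalone sketch of the filtering step only, not lopdf's API: the `Object` enum, `strip_keys`, `example_catalog`, and the choice of denied keys are assumptions for illustration. It walks an object recursively and removes every dictionary entry whose key is on a deny list.

```rust
use std::collections::BTreeMap;

/// A small subset of PDF object types (hypothetical model).
#[derive(Debug, Clone, PartialEq)]
enum Object {
    Name(String),
    Text(String),
    Array(Vec<Object>),
    Dictionary(BTreeMap<String, Object>),
    Reference((u32, u16)),
}

/// Recursively remove every dictionary entry whose key is in `denied`.
/// Returns the number of entries removed. Entries nested inside a
/// removed value are dropped with it and are not counted separately.
fn strip_keys(object: &mut Object, denied: &[&str]) -> usize {
    match object {
        Object::Dictionary(dict) => {
            let before = dict.len();
            dict.retain(|key, _| !denied.contains(&key.as_str()));
            let removed = before - dict.len();
            removed + dict.values_mut().map(|value| strip_keys(value, denied)).sum::<usize>()
        }
        Object::Array(items) => items.iter_mut().map(|item| strip_keys(item, denied)).sum(),
        _ => 0,
    }
}

/// A catalog with an /OpenAction that runs a JavaScript action.
fn example_catalog() -> Object {
    let mut action = BTreeMap::new();
    action.insert("S".to_string(), Object::Name("JavaScript".to_string()));
    action.insert("JS".to_string(), Object::Text("app.alert(1)".to_string()));

    let mut catalog = BTreeMap::new();
    catalog.insert("Type".to_string(), Object::Name("Catalog".to_string()));
    catalog.insert("Pages".to_string(), Object::Reference((11, 0)));
    catalog.insert("OpenAction".to_string(), Object::Dictionary(action));
    Object::Dictionary(catalog)
}
```

Calling `strip_keys(&mut catalog, &["OpenAction", "AA", "JS"])` on the example leaves only /Type and /Pages in the catalog. A real sanitizer would apply such a pass to every object read from the file before writing it back, and would also need to decide what to do with objects that are no longer referenced.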