J-F-Liu / lopdf

A Rust library for PDF document manipulation.
MIT License

Trailer data removed #159

Open ralpha opened 2 years ago

ralpha commented 2 years ago

When a PDF is parsed, some properties are removed from the trailer (root) dictionary. These values are never added back, not even when the file is saved. Is there a reason for this?

NOTE: While writing this issue and researching further, I found that this might be because the input file uses a cross-reference stream rather than a classic cross-reference table.

The value following the startxref keyword shall be the offset of the cross-reference stream rather than the xref keyword. For files that use cross-reference streams entirely (that is, files that are not hybrid-reference files; see 7.5.8.4, "Compatibility with Applications That Do Not Support Compressed Reference Streams"), the keywords xref and trailer shall no longer be used. Therefore, with the exception of the startxref address %%EOF segment and comments, a file may be entirely a sequence of objects.

Source: PDF 1.7 specification, p. 49
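
To make the quoted passage concrete, here is a minimal sketch of how a reader could follow `startxref` and tell the two cases apart. The names `find_startxref`, `xref_kind` and `XrefKind` are made up for this illustration and are not lopdf's API. The check is deliberately naive: it assumes the offset is correct and does not handle hybrid-reference files.

```rust
/// What sits at the offset that `startxref` points to.
#[derive(Debug, PartialEq)]
enum XrefKind {
    /// Classic section introduced by the `xref` keyword.
    Table,
    /// An indirect object (`N G obj`), assumed to be a cross-reference stream.
    Stream,
}

/// Returns the number that follows the last `startxref` keyword, if any.
fn find_startxref(data: &[u8]) -> Option<usize> {
    const KEYWORD: &[u8] = b"startxref";
    // Search from the end so that the last section in the file wins.
    let pos = data.windows(KEYWORD.len()).rposition(|w| w == KEYWORD)?;
    let rest = &data[pos + KEYWORD.len()..];
    // Skip whitespace after the keyword, then collect the decimal digits.
    let digits: Vec<u8> = rest
        .iter()
        .copied()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .collect();
    std::str::from_utf8(&digits).ok()?.parse().ok()
}

/// Naive classification: `xref` means a table, a leading digit means an object.
fn xref_kind(data: &[u8], offset: usize) -> Option<XrefKind> {
    let section = data.get(offset..)?;
    if section.starts_with(b"xref") {
        Some(XrefKind::Table)
    } else if section.first()?.is_ascii_digit() {
        Some(XrefKind::Stream)
    } else {
        None
    }
}
```

In the `Stream` case there is no separate `trailer` keyword to read, so the trailer entries (`/Root`, `/Info`, `/Size`) have to come from the stream's own dictionary.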

Here is an example: https://github.com/J-F-Liu/lopdf/blob/master/src/parser_aux.rs#L244

Input (example):

<</Type /XRef/Index [0 1 2 1 18 1 23 6]/Size 29/W [1 2 1]/Root 18 0 R/Info 19 0 R/Prev 40388/Filter /FlateDecode/Length 46>>

Output (after reading):

<</Type /XRef/Size 29/Root 18 0 R/Info 19 0 R>>

But why are they explicitly removed?

While writing this post I found out that this probably has to do with PDFs that use only cross-reference streams. (It would be nice if someone could confirm this.) Even though the question is basically solved for me, I'll still post it for future readers who might look for it.

J-F-Liu commented 2 years ago

XRef is saved in a different format, so the properties are different. Currently only reading is implemented for Cross-Reference Streams.
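
For future readers, the before/after dictionaries from the issue can be reproduced with a small standalone sketch. It does not use lopdf. The function name `to_trailer` and the `DROPPED_KEYS` list are assumptions inferred purely from the example input and output above, not copied from the library's source. The rationale in the comments is also an interpretation: `/W`, `/Index`, `/Filter` and `/Length` describe how the stream itself is encoded, and `/Prev` is a byte offset that presumably no longer holds once the file is rewritten.

```rust
use std::collections::BTreeMap;

/// Keys missing from the "after reading" dictionary in this thread.
/// Assumption: inferred from the example, not taken from lopdf's code.
const DROPPED_KEYS: [&str; 5] = ["Index", "W", "Prev", "Filter", "Length"];

/// Builds a trailer-like dictionary from a cross-reference stream dictionary
/// by keeping every entry whose key is not in `DROPPED_KEYS`.
/// Values are kept as plain strings to keep the sketch self-contained.
fn to_trailer(xref_stream_dict: &BTreeMap<String, String>) -> BTreeMap<String, String> {
    xref_stream_dict
        .iter()
        .filter(|(key, _)| !DROPPED_KEYS.contains(&key.as_str()))
        .map(|(key, value)| (key.clone(), value.clone()))
        .collect()
}
```

Feeding it the nine entries of the example input leaves `/Type`, `/Size`, `/Root` and `/Info`, which matches the output shown in the issue.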