fschutt / printpdf

A fully-featured PDF library for Rust, WASM-ready
https://fschutt.github.io/printpdf/
MIT License

Don't include unused fonts in the PDF document #134

Closed dnlmlr closed 1 year ago

dnlmlr commented 1 year ago

So I have seen some talk about using allsorts to do actual subsetting, which would significantly reduce the PDF size. This PR does not actually remove glyphs, but it at least omits completely unused fonts from the PDF output.

I am using the PdfLayerReference::set_font function to mark fonts as used and then skip unmarked fonts in FontList::into_with_document. This is a rather simple hack, but it allows for adding all the fonts you want without wasting space on fonts / font variants that are not used at all. The main use case for me is with the genpdf crate, where it is required to add all 4 variants of a font (regular, italic, bold, bold-italic) even if not all of them are used.
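The mark-and-skip idea can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `FontRegistry`, `add_font`, `mark_used` and `into_embedded` are hypothetical names standing in for the roles that FontList, PdfLayerReference::set_font and FontList::into_with_document play above, and fonts are identified by plain string names for simplicity.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for a font registry; this is NOT printpdf's real `FontList`.
#[derive(Default)]
pub struct FontRegistry {
    /// Font name -> raw font bytes, kept in a stable (sorted) order.
    fonts: BTreeMap<String, Vec<u8>>,
    /// Names of fonts that were selected for text at least once.
    used: BTreeSet<String>,
}

impl FontRegistry {
    /// Register a font. Registering alone does not mark it as used.
    pub fn add_font(&mut self, name: &str, bytes: Vec<u8>) {
        self.fonts.insert(name.to_string(), bytes);
    }

    /// Called wherever text selects a font (the role `set_font` plays in the PR).
    /// Unknown names are ignored.
    pub fn mark_used(&mut self, name: &str) {
        if self.fonts.contains_key(name) {
            self.used.insert(name.to_string());
        }
    }

    /// Consume the registry and keep only the fonts that were marked as used.
    pub fn into_embedded(self) -> Vec<(String, Vec<u8>)> {
        let FontRegistry { fonts, used } = self;
        fonts
            .into_iter()
            .filter(|(name, _)| used.contains(name))
            .collect()
    }
}
```

With this shape, a caller can register all four variants up front and only the ones that were actually selected end up in the output.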

Let me know if there is something I missed here that could cause problems.

dnlmlr commented 1 year ago

So I actually looked a bit into subsetting with allsorts and managed to subset the fonts externally before adding them to the PDF. This is not optimal, since it does not use the PDF properties for subsetting and instead basically just creates a new font, but as long as it reliably works for reducing the file size I'd say it is an option.

There is currently a problem with the allsorts subsetting implementation that causes the output font to be missing a few of the data tables. I found that the subset font still works in PDF files on all devices and programs I tested so far, as long as the font uses the Unicode type for the cmap entries.

I'll look a bit more into this and do some more testing for my own project. Let me know if you would be interested in an implementation of this into the crate. A pretty simple implementation could be to save all chars that get printed into a hashset linked to the current font. Then the font could be trimmed before writing it into the output PDF. That's probably not the most performant way to do this, but it might be fine. Especially if it is an optional feature.
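As a rough sketch of the "hashset linked to the current font" idea: the tracker below records every character printed with a given font so the font could later be trimmed to that set. `UsedChars`, `record` and `chars_for` are hypothetical names, fonts are identified by plain strings, and the actual trimming step (which would go through a subsetter such as allsorts) is deliberately left out.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical tracker: font name -> set of characters printed with that font.
#[derive(Default)]
pub struct UsedChars {
    per_font: HashMap<String, HashSet<char>>,
}

impl UsedChars {
    /// Record all characters of `text` as used by `font`.
    pub fn record(&mut self, font: &str, text: &str) {
        self.per_font
            .entry(font.to_string())
            .or_default()
            .extend(text.chars());
    }

    /// Return the characters used by `font`, sorted so the result is deterministic.
    /// A font that never printed anything yields an empty list.
    pub fn chars_for(&self, font: &str) -> Vec<char> {
        let mut chars: Vec<char> = self
            .per_font
            .get(font)
            .map(|set| set.iter().copied().collect())
            .unwrap_or_default();
        chars.sort_unstable();
        chars
    }
}
```

The sorted list returned by `chars_for` is what would be handed to the subsetting step just before the font is written out.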

Edit: I just saw that the PDF text operations mostly use the codepoints directly and not the characters. That of course makes this more difficult. In theory it would still be possible to remap the codepoints before finally writing the output PDF, but that seems a bit more tedious.

dnlmlr commented 1 year ago

Ok, I did in fact manage to get automatic subsetting working in printpdf using allsorts. I implemented this on the main branch of my fork since I needed the feature for my own program.

My implementation runs when ExternalFont::into_with_document is executed. It works by scanning all layers to collect the glyphs used with the current font. Then the font is subset, which also produces a mapping table from old to new GIDs. That mapping table is then applied to all texts in all layers where the font was used. And of course the subset font gets saved to the PDF file.
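The collect / map / rewrite steps can be illustrated with the sketch below. It is not the code from the fork: text runs are modeled as plain `Vec<u16>` glyph-ID lists, all function names are hypothetical, and the mapping is built by simply numbering the used glyphs consecutively in ascending order. In the real implementation the mapping comes from the subsetter, so its exact numbering is an assumption here. Glyph 0 (`.notdef`) is always kept.

```rust
use std::collections::{BTreeSet, HashMap};

/// Collect the glyph IDs used across all text runs of one font.
/// Glyph 0 (.notdef) is always included.
pub fn collect_used_gids(runs: &[Vec<u16>]) -> BTreeSet<u16> {
    let mut used = BTreeSet::new();
    used.insert(0);
    for run in runs {
        used.extend(run.iter().copied());
    }
    used
}

/// Build an old-GID -> new-GID table by numbering the used glyphs consecutively.
/// ASSUMPTION: a real subsetter would hand back its own table instead.
pub fn build_gid_map(used: &BTreeSet<u16>) -> HashMap<u16, u16> {
    used.iter()
        .enumerate()
        .map(|(new_gid, &old_gid)| (old_gid, new_gid as u16))
        .collect()
}

/// Rewrite every run in place using the mapping table.
/// Glyphs missing from the table fall back to 0 (.notdef).
pub fn remap_runs(runs: &mut [Vec<u16>], map: &HashMap<u16, u16>) {
    for run in runs.iter_mut() {
        for gid in run.iter_mut() {
            *gid = map.get(&*gid).copied().unwrap_or(0);
        }
    }
}
```

For example, runs containing the old GIDs 7, 40 and 300 would be rewritten to use 1, 2 and 3, matching the positions of those glyphs in the much smaller subset font.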

Since it was honestly kind of a hassle to work with the current codebase, I merged the currently open PR #131 into my fork before implementing this feature. If that gets merged here, it would be pretty easy to integrate.

fschutt commented 1 year ago

@dnlmlr merged, can you rebase / fix? thanks

dnlmlr commented 1 year ago

I rebased my current allsorts-based subsetting on master and pushed that for this PR. Please check if this implementation is OK for you, as it is quite a bit more complex than the previous one that just removed completely unused fonts. The whole subsetting, including the allsorts dependency, is locked behind a feature flag. If the feature is not enabled, there should be no impact on performance or any other metric.

This also contains another cargo fmt pass and an update to the current Rust 2021 edition.

fschutt commented 1 year ago

lgtm, although I think I should slowly work towards a proper data model for PdfPage, so that manipulation becomes easier