[RT 123470] Embedding Fonts

PhilterPaper commented 6 years ago

Tue Oct 31 16:43:54 2017 NHORNE [...] cpan.org - Ticket created Subject: Embedding Fonts

Some systems, such as prestoprint, require embedding of fonts. How do I embed these fonts into the output PDF?

Times-Bold
Times-Italic
Times-Roman

PhilterPaper commented 6 years ago

I haven't tried it out yet, but TrueType support includes a -noembed option. Presumably it embeds a TT font unless you ask it not to. Perhaps this could be extended to other font types (Type1, at least). It's not clear whether core fonts can be embedded, as these fonts are normally installed with any reader.

The ability to embed fonts is an important part of PDF/A (RT 120375/#52).

Also discussed in ~~[/forum/pdf-builder-general-discussions/font-families/]~~

PhilterPaper commented 6 years ago

Note that all fonts, not just core fonts, suffer from the issue of having different versions on the PDF producer's (writer) machine and the consumer's (reader) machine. There may be different glyph widths, affecting justification, and even missing or added glyphs. In addition, the use of single byte font planes (via the automap method) cannot be guaranteed to produce consistent results (even for core fonts), as there might be a different set of glyphs between the two machines' font files and metrics information. Thus, embedded fonts can solve this problem, at the cost of much larger PDF files.

Add: note that "core" fonts vary by implementation. There are 3 typefaces (Times, Helvetica, Courier) x 4 variants (Roman, Italic, Bold, Bold-Italic) plus 2 symbol typefaces for 14 basic core font files. Windows implementations may add another 15 or so font files. Implementations are free to replace a basic core font with another (with similar metrics). Thus, for a consistent look, it would indeed be very good to embed core fonts in a PDF document, unless there are copyright/licensing issues or space is very tight.

PhilterPaper commented 6 years ago

With core fonts, in particular, only the CP-1252 (Latin-1+) glyphs and their widths are defined. This leads to non-Latin-1 glyphs getting the "missing width" assigned width, which is much too narrow and causes character overlap (see #7). Reading the font file for widths, and even better, embedding it in the PDF, should fix this problem.

PhilterPaper commented 6 years ago

Sat Jun 02 18:02:37 2018 The RT System itself - Status changed from 'new' to 'open'

Sat Jun 02 18:02:38 2018 steve [...] deefs.net - Status changed from 'open' to 'resolved'

PhilterPaper commented 5 years ago

Note that this ticket is closed as resolved on the PDF::API2 queue, but is still open here. I think the whole subject of embedding fonts in a better manner (code and documentation) is a good one, so I will leave this ticket open as a placeholder, even though the original question has already been answered (more or less).

PhilterPaper commented 4 years ago

Note that .ttf and .otf fonts are automatically embedded (unless suppressed with the -noembed flag). Furthermore, only the subset of glyphs used is embedded (unless suppressed with the -nosubset flag).

Unfortunately, core fonts, PS (Type1) fonts, CJK fonts (even if .ttf or .otf file), and probably all other non-TrueType/OpenType fonts are not embedded at this time. This ticket is being kept open as a reminder to think about expanding font embedding in the future.

PhilterPaper commented 3 years ago

Besides natively caching (embedding) fonts as TTF/OTF does, there is the possibility of attaching a font file, although I don't know if this will satisfy PDF/A or indeed, will be usable as an "embedded" font (one that any PDF reader will know how to detach and use). It also remains to be seen whether a subset of the font could be embedded. In the end, it might be better to do something similar to the TTF/OTF embedding.

terefang commented 1 year ago

> Besides natively caching (embedding) fonts as TTF/OTF does, there is the possibility of attaching a font file, although I don't know if this will satisfy PDF/A or indeed, will be usable as an "embedded" font (that any PDF reader will know how to detach and use it). It also remains to be seen whether a subset of the font could be embedded. In the end, it might be better to do something similar to the TTF/OTF embedding.

no, attaching is not the same as embedding, and the reader simply does not know how to use an attached font file.

PhilterPaper commented 8 months ago

Embedding the actually-used subset of characters will at least allow read-only usage of a document. If the document is intended to be editable (I'm not sure if, and how well, PDF::Builder implements this), more glyphs may be needed, but you may not want to embed the entire font due to its size (and its license may not allow it). Thus, a means to embed the used subset and a complete subset of some portion of the font might be in order. For example, "embed the full ASCII" or "embed the full Latin-1" might be done by initializing (preloading) the "subset" glyph array to be 7-bit ASCII or 8-bit Latin-1, and then in the course of processing, add in anything else needed. This would be under user control (ASCII might be the default).

If you plan to use character sets beyond Latin-1, you would need a way to specify the extra sections of glyphs (not necessarily just Unicode points). For instance, you are writing something in Devanagari (Indic) script, and you want someone to be able to type in other Devanagari characters (as well as all the ligatures and stuff that HarfBuzz::Shaper pulls in). Or it's a document in Greek, and you want users to be able to enter a reasonable [sub]set of other Greek characters. After specifying a font, the author would need to specify in some way the character set range (both Unicode and glyphs) they wish to be embedded.

Of course, they need to be able to mix multiple ranges if they are mixing a number of writing systems. It should probably be tied to the font in use, rather than globally pulling in the same subsets for all fonts, and authors will probably want to be able to define named subsets they would like to have at hand (e.g., "Greek"), rather than having to explicitly define them each time.
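A rough Python sketch of the preloading idea described above (not PDF::Builder code). The class name `SubsetTracker`, its methods, and the `DEFAULT_RANGES` table are all hypothetical; the Greek and Devanagari ranges are the Unicode block boundaries. It tracks code points only, so glyphs that have no code point of their own (ligatures pulled in by a shaper) would need separate handling.

```python
# Named ranges an author could preload; each is a list of inclusive (first, last) pairs.
DEFAULT_RANGES = {
    "ascii": [(0x20, 0x7E)],
    "latin1": [(0x20, 0x7E), (0xA0, 0xFF)],
    "greek": [(0x0370, 0x03FF)],        # Greek and Coptic block
    "devanagari": [(0x0900, 0x097F)],   # Devanagari block
}


class SubsetTracker:
    """Collects the code points to embed for ONE font (hypothetical helper)."""

    def __init__(self, preload=("ascii",), named_ranges=None):
        # Start from the built-in names and let the author add or override some.
        self.named_ranges = dict(DEFAULT_RANGES)
        self.named_ranges.update(named_ranges or {})
        self.codepoints = set()
        for name in preload:
            self.add_range(name)

    def add_range(self, name):
        """Preload every code point of a named range."""
        for first, last in self.named_ranges[name]:
            self.codepoints.update(range(first, last + 1))

    def note_text(self, text):
        """Add whatever the document actually uses on top of the preload."""
        self.codepoints.update(ord(ch) for ch in text)


# One tracker per font: this one preloads ASCII plus Greek, then sees some text.
tracker = SubsetTracker(preload=("ascii", "greek"))
tracker.note_text("naïve")
```

Here 'q' and 'z' are available for later editing even though the text never used them, 'ï' is added because the text did use it, and Devanagari stays out because nobody asked for it. A real implementation would also have to drop code points the font has no glyph for.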

terefang commented 8 months ago

there are actually several points to this:

- the platform's regional/local/language settings
- the encoding the developer specified for loading the font (if any)
- the actual encoding of the source text (if different from utf8)
- the language tagging of the source text (if any)

let me just give myself as an example:

- i use "en_IE.utf-8", since it is the most EU-ish english locale, which is overridden for some language settings by "de_AT", my native Austrian-German language setting.
- my default setting for embedding fonts is either "unicode" (ie all) or "pdfdocext" (which actually does not exist, but is a custom encoding starting from "pdfdoc" adding missing glyphs from Latin-15, the euro-sign and some from TeX-Ansi)
- my source texts are always utf-8, i simply refuse to work with any other.
- my source texts are almost never language tagged, and i have also never seen usage of RFC2482 in UTF-8 documents.

for explanation: "pdfdocext" only includes latin-ish glyphs well enough supported from the base14 set, disregard Symbol and Dingbats for that matter.

terefang commented 8 months ago

i have since played around with the idea of using some other font collections as substitutes.

- ADF – primarily focused on Linux and TeX, not many glyphs outside Latin1/15 – try to stay away from them, unless you like the design and can live with the limited glyphset.
- URW aka Base35 – still the base fonts for Ghostscript, i would consider it as LGCM (Latin+Greek+Cyrillic+Math) with its 800-ish glyphs.
- DejaVu – if you can live with non-CJK and odd straying from base14 looks, glyph support is an awesome 3000+ (6000+ in sans).
- CrosCore – Arimo, Cousine, Tinos, and Noto ... the looks of Android, i would consider it as LGCEHM (Latin+Greek+Cyrillic+Extended+Hebrew+Math) with its 3000-ish glyphs and Noto has every other language supported as a fallback.

If you need to satisfy PDF/A and/or PDF/X and need Greek or Cyrillic support in the Base14 set, the Base35 set is the next best thing to be considered as a substitution embedding.

the CrosCore fonts have distinct looks and cannot be considered look-alike substitutes for the Base14, but Noto makes an awesome fallback font.

PhilterPaper commented 8 months ago

Thank you for your comments, but I'm not sure we're on the same page here. What I'm looking for is ensuring that if a document is editable (which, as I mentioned, I'm not sure about PDF::Builder's capability here), a user has a reasonable selection of characters and other glyphs to type in any new content desired, given the font(s) used. I'm not really looking to have an enormous selection for a large number of writing systems. For example, input might be in English, but your initial text (and thus the subset embedded) doesn't include letters 'q' or 'z'. Someone wishing to write some names or text using those letters may be out of luck, unless most Readers have some mechanism for a local fallback font. It would be a good idea to fill out (at a minimum) the ASCII set, and maybe Latin-1 or even Windows.

You certainly don't want to bloat the document size with an embedded full UTF-8 set! Some users who want to jump into other writing systems (e.g., an Indian-American who wants to write the name of their home town in Devanagari) may just be out of luck here, and have to stick to Western alphabets. A Reader that permits arbitrary additions to the embedded fonts would be great, if such a thing exists. Don't forget that the updated PDF needs to be saved, and readable by others, so the embedded font subset would need to be expanded. And if the local font used to expand the embedded font isn't quite the same as the original font...

terefang commented 8 months ago

> Thank you for your comments, but I'm not sure we're on the same page here. What I'm looking for is ensuring that if a document is editable [...]

so you always need to embed more than the set of glyphs used in the text.

> I'm not really looking to have an enormous selection for a large number of writing systems. For example, input might be in English, but your initial text (and thus the subset embedded) doesn't include letters 'q' or 'z'. [...]

as for the additional glyphs to be embedded, you need to have a set of rules on how to choose those additional glyphs.

- the standard subsetting chosen by PDF::API2 is doing the bare minimum, just embedding the glyphs used.
- one hard and fast rule could be that the "pdfdoc" glyphset should always be available (ASCII+Latin1/15) → works ok for T1 and TT.
- the next rule would be all glyphs for the character set the user specified (even if all the text is in UTF8) → starts to cause problems with T1 fonts, still working in TT, but quirky for CJK.
- the next rule would be all glyphs for the language the user specified (even if all the text is in UTF8) → even more problems in T1 fonts, still working in TT, but causing bloat for CJK.
- a purely technical solution (without having definitions of character-set and language) would be to track all characters used in the text and mark all glyphs to be embedded contained in the related 256 character segment → impossible with T1, still working with TT, quirky but working with CJK.
  Example: if a text uses ASCII and the characters 0x0134, 0x2034 the following unicode sets will be embedded: 0x0000-0x00ff, 0x0100-0x01ff, 0x2000-0x20ff.
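The last rule (the 256-character segment one) can be sketched in a few lines of Python. This is only an illustration of the rule as described, not code from any library; the function name `segments_to_embed` is made up.

```python
def segments_to_embed(text):
    """Return the 256-code-point segments touched by `text` as (first, last) pairs."""
    # The high bits of a code point (everything above the low byte) pick its segment.
    pages = {ord(ch) >> 8 for ch in text}
    return [(page << 8, (page << 8) | 0xFF) for page in sorted(pages)]


# ASCII text plus the two characters from the example above.
ranges = segments_to_embed("abc\u0134\u2034")
formatted = [f"0x{first:04x}-0x{last:04x}" for first, last in ranges]
```

For this input, `formatted` is `['0x0000-0x00ff', '0x0100-0x01ff', '0x2000-0x20ff']`, matching the example. Mapping those code points to the glyphs the font actually contains is a separate step not shown here.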

> You certainly don't want to bloat the document size with [...]

yes, bloat needs to be avoided, but sometimes needs to be balanced against usability.

terefang commented 8 months ago

let me describe what i do in my current pdflib, i develop on the java platform.

- Unicode Mode – TrueType or OpenType fonts having TrueType or CFF outlines, using CID objects, no subsetting, full embedding
- TrueType Mode
  - TrueType or OpenType fonts having TrueType outlines, using TrueType objects with 8bit Encoding, subsetting to the specified Charset
  - TrueType or OpenType fonts having CFF outlines, using TrueType objects with 8bit Encoding, no subsetting, full embedding
- Type3 Mode
  - SVG Fonts, using Type3 objects with 8bit Encoding, embedding glyph outlines as xobjects
  - AWT Fonts, using Type3 objects with 8bit Encoding, embedding glyph outlines as xobjects
- Type1 Mode
  - Postscript Fonts (PFB/PFA), using Type1 objects with 8bit Encoding, no subsetting, full embedding
  - AFM Font Metrics (AFM), using Type1 objects with 8bit Encoding, no embedding

i have never looked back going back to tracking glyphs in text – too much of a hassle, not enough bang for the buck.

PhilterPaper commented 8 months ago

Some interesting thoughts and approaches. Unless a font is quite small, I am loath to embed the whole thing (assuming licensing permits). It's of course possible to embed nothing, in which case you're dependent on the Reader having the correct font available locally (is it possible to specify a list of acceptable substitute fonts to be used by the Reader? Character codes would have to be in a given encoding, and I don't know what you would do about ligatures, differing metrics, etc.). You can already embed just the subset of glyphs used, which permits display of fixed text, but of course this presents the danger that a user can't update or add to the text if they need additional glyphs to do so.

What I'm aiming for here is to provide a large enough subset of a given font that it is likely that a user, working in the same (or a compatible-alphabet) language, has a reasonable chance of having sufficient glyphs on hand in the embedded font to enter the text they want to. Straight ASCII is enough for most English-language users, while most European languages could find Latin-1 or Win ANSI sufficient (and there's still the issue of ligatures and the like for both). Anything beyond that will have to be specified by the document creator as a different 8-bit encoding or some subset of UTF-8. It would be unusual to require all of Viking runes, Canadian aboriginal script, and Greek text to be enterable, but those ranges could be specified. I think that a full 7-bit ASCII would be the bare minimum, but 8-bit Win ANSI (Latin-1 plus "smart quotes") would be a reasonable default.
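For a sense of what each candidate default would pull in, here is a Python sketch that enumerates the code points reachable through a single-byte encoding. The helper name `charset_codepoints` is made up, and Python's `cp1252` codec stands in for Win ANSI; none of this is PDF::Builder code.

```python
def charset_codepoints(encoding):
    """Code points reachable through a single-byte `encoding`, minus control characters."""
    points = set()
    for byte in range(0x20, 0x100):
        try:
            ch = bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            continue  # this byte is not defined in the encoding
        cp = ord(ch)
        if cp < 0x20 or 0x7F <= cp <= 0x9F:
            continue  # skip C0/C1 control characters and DEL
        points.add(cp)
    return points


ascii_set = charset_codepoints("ascii")
latin1_set = charset_codepoints("latin-1")
winansi_set = charset_codepoints("cp1252")

# What Win ANSI adds on top of Latin-1: smart quotes, dashes, the euro sign, etc.
extras = sorted(winansi_set - latin1_set)
```

The difference between the two 8-bit candidates is small (a couple of dozen glyphs such as the curly quotes and the euro sign), so choosing Win ANSI over Latin-1 as the default costs little in file size.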

> i have never looked back going back to tracking glyphs in text – too much of a hassle, not enough bang for the buck.

If I understand what you mean by "tracking glyphs", you mean just embedding the actually used glyphs (embed yes, subset yes in the current product)? Well, yes, it is more work than simply embedding a fixed, selected set (not expanded by any new glyphs, although that can be done by treating the fixed set as just the starting point for the cache of glyphs to subset). Since the existing product already has code in place to track glyphs and add them to a subset list (at least for TTF), other than a minor performance gain, it would probably not be worthwhile pulling that out.

terefang commented 8 months ago

> [...] is it possible to specify a list of acceptable substitute fonts to be used by the Reader? [...]

not directly

the FontDescriptor contains a bit-field attribute (Flags) specifying various characteristics for substitute fonts, but its usage is highly implementation dependent.
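For reference, a Python sketch of how that Flags value is composed. The bit positions are given as I understand them from the font descriptor flags table in the PDF specification and should be checked against the spec; the `FontFlags` class name is made up for illustration.

```python
from enum import IntFlag


class FontFlags(IntFlag):
    """FontDescriptor /Flags bits (the spec numbers bits from 1; bit n is 1 << (n - 1))."""
    FIXED_PITCH = 1 << 0    # bit 1: all glyphs have the same width
    SERIF = 1 << 1          # bit 2: glyphs have serifs
    SYMBOLIC = 1 << 2       # bit 3: uses glyphs outside the standard Latin set
    SCRIPT = 1 << 3         # bit 4: glyphs resemble cursive handwriting
    NONSYMBOLIC = 1 << 5    # bit 6: uses the standard Latin set or a subset
    ITALIC = 1 << 6         # bit 7: glyphs have dominant slanted strokes
    ALL_CAP = 1 << 16       # bit 17: no lowercase letters
    SMALL_CAP = 1 << 17     # bit 18: lowercase drawn as small capitals
    FORCE_BOLD = 1 << 18    # bit 19: force bold rendering at small sizes


# A serif italic text font, roughly what a Times-Italic look-alike would declare.
flags = FontFlags.SERIF | FontFlags.NONSYMBOLIC | FontFlags.ITALIC
value = int(flags)

# Decoding a value found in an existing FontDescriptor.
decoded = FontFlags(34)
```

A reader without the font can only use these hints (serif or not, fixed pitch or not, italic or not) to pick something vaguely similar, which is why this is not a substitute list in any real sense.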

> What I'm aiming for here is to provide a large enough subset of a given font that it is likely that a user, working in the same (or a compatible-alphabet) language, has a reasonable chance of having sufficient glyphs on hand [...] I think that a full 7-bit ASCII would be the bare minimum, but 8-bit Win ANSI (Latin-1 plus "smart quotes") would be a reasonable default.

that is along the lines what i try to do with the pdfdoc glyphset.

> i have never looked back going back to tracking glyphs in text – too much of a hassle, not enough bang for the buck.

> If I understand what you mean by "tracking glyphs", you mean just embedding the actually used glyphs (embed yes, subset yes in the current product)? Well, yes, it is more work than simply embedding a fixed, selected set (not expanded by any new glyphs, although that can be done by treating the fixed set as just the starting point for the cache of glyphs to subset). Since the existing product already has code in place to track glyphs and add them to a subset list (at least for TTF), other than a minor performance gain, it would probably not be worthwhile pulling that out.

i meant this in the context of writing a new implementation and not ripping out old working code, although not supporting subsetting would make much of the code a lot shorter and simpler.

PhilterPaper commented 8 months ago

OK, I think we're in violent agreement here. It sounds like we're working towards more-or-less the same end goal, with possibly some minor differences (a fixed subset of the font, rather than just using that subset as the starting point and adding in any additional characters/glyphs encountered?).

Nothing's going to happen on the PDF::Builder end until I get around to checking whether (and how well) it supports user editing (forms, etc.) of text, rather than just read-only documents. I really haven't gotten into this area, and it's possible that there's a lot of work to be done before it's worth even thinking about font subsets!

PhilterPaper / Perl-PDF-Builder

[RT 123470] Embedding Fonts #80