[CTS 5] Fallback fonts (font glyph substitution)

PhilterPaper commented 7 years ago

Subject: CTS 5 - Fallback fonts (font glyph substitution)

March 17, 2017, 04:40:39 AM by sciurius

I'm not sure if this is the right thread, but anyway...

In this modern Unicode era we run into the problem that there are many more symbols to be shown than are present in a particular font. For example, when dealing with music I want to show a sharp ♯, flat ♭, delta Δ, and so on. I notice that many applications, like the web browser I'm typing this in, use font glyph substitution using fallback fonts if necessary. The glyphs are borrowed from a different font. Google has developed free fonts (the Noto-fonts) to complement its standard font (Roboto) with virtually all glyphs defined in Unicode.

I think supporting fallback fonts is a must for a future-proof PDF::Builder.

March 17, 2017, 09:24:13 AM by Phil

Indeed, this would be the right place to request this feature, and it's a good one. Browsers have done this for quite a while, first looking for a glyph in the first requested font-family, then the second, etc. until the default fallback. Actually, come to think of it, I'm not sure what happens if you explicitly list one or more font-families in CSS and the glyph is not found in any of them — does a browser try its own list of fonts in a desperate attempt to find something instead of a tofu? For PDF generation, that might be something configurable.

What is the industry standard for mapping a requested (but missing) glyph to something else, especially if they don't happen to have the same CId or Unicode code point or a standard name? The original encoding would say what the character is, but in a fallback font it could be at a different code point or under a different name or ID.

March 17, 2017, 03:00:43 PM by sciurius

Industry standard:

HarfBuzz is an OpenType text shaping engine.

The current HarfBuzz codebase [...] is stable and under active maintenance. This is what is used in latest versions of Firefox, GNOME, ChromeOS, Chrome, LibreOffice, XeTeX, Android, and KDE, among other places.

PhilterPaper commented 7 years ago

See also #94. This is a somewhat related discussion for PDF::Builder knowing what font variants are available for each supported font, and being able to specify font variants (bold, italic, small caps, etc.) in a simple and consistent way. It might even involve going to another font, or building a "synthetic font", to get unsupported variants. Someone would have to specify a list of fallback fonts to look at, before either crudely synthesizing a variant (e.g., font size change for small caps, or overprinting (with offset) for bold) or just giving up.

PhilterPaper commented 7 years ago

One thing to keep in mind is that a requested glyph is going to be known only by its code point and encoding (xFF in Latin-1 is different than xFF in CP1253, and isn't valid in UTF-8). At that point (knowing the output encoding of a text string), each character could be checked against a list of supported glyphs for a font, and a request to generate an alternate font could be made. Unfortunately, this sounds like it could be quite costly in time and/or space, but that may be the price to pay, especially if not all operating systems offer utilities to do this. Another complication would be that for some glyphs, you might want one particular fallback font, while for others, you might desire a different fallback font -- a global list of fonts may not produce the best results.
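To illustrate why a byte value means nothing without its encoding, here is a minimal Python sketch. It uses the byte 0xE1 purely as an illustration, since that byte is assigned in both single-byte tables; the variable names are made up for the example.

```python
# The same byte decodes to different characters depending on the encoding.
raw = bytes([0xE1])

as_latin1 = raw.decode("latin-1")  # U+00E1, a with acute accent
as_cp1253 = raw.decode("cp1253")   # U+03B1, Greek small alpha

# On its own, the byte is not a complete UTF-8 sequence, so decoding fails.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(hex(ord(as_latin1)), hex(ord(as_cp1253)), valid_utf8)
```

Only after this decoding step is there a Unicode code point that could be checked against a font's list of supported glyphs.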

Anyway, before proceeding with anything, we need to give this matter a lot of careful thought, to make sure we're covering as many wishlist items as possible with the minimum effort and code.

terefang commented 4 years ago

have you looked into unifont.pm ? this is what this module is for.

PhilterPaper commented 4 years ago

I don't think this is what unifont() is intended for. As far as I can tell, unifont specifies that a certain opened font is to be used for a certain range of Unicode points. The request made by Johan is for font fallback, if a requested Unicode point does not have a glyph available in the current font. A browser does this, looking down a list of desired typefaces (font families) until a glyph is found.

I suppose that if you know in advance that a desired glyph will be unavailable in your chosen font, you could select (via unifont) an alternate font to use for that Unicode point, but that's not quite what was requested (and involves a lot more manual work).
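The range-based approach described here could be sketched roughly as follows in Python. This is not the unifont() API; the table, the font names, and `font_for_range` are hypothetical and only model the idea of "this font for this range of Unicode points".

```python
from typing import Optional

# Hypothetical range table: (first code point, last code point, font name).
# The caller has to decide all of this in advance.
RANGES = [
    (0x0000, 0x024F, "TextSerif"),
    (0x0370, 0x03FF, "GreekSans"),
    (0x2600, 0x26FF, "SymbolsFont"),
]

def font_for_range(codepoint: int) -> Optional[str]:
    """Return the font assigned to the range containing codepoint, if any."""
    for first, last, font in RANGES:
        if first <= codepoint <= last:
            return font
    return None
```

The difference from fallback is visible in the table itself: nothing here asks a font whether it actually has the glyph, so a wrong guess about coverage still produces a missing glyph.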

@terefang thanks for joining in the discussion, and I hope to see more input from you!

terefang commented 4 years ago

@PhilterPaper

yes, you are right that (with plain unifont.pm) you have to know in advance the codepoints to fall back to.

nonetheless, the introspection capabilities of the font objects are good enough for this, so the code to be written won't get too complicated.

the real question would be which font to fallback to.

PhilterPaper commented 4 years ago

The original request (from Johan) is for HTML/CSS style to go through a list of fonts (let's assume italic and bold are dealt with properly), in the indicated order, until a font is found that has a glyph for the desired Unicode point. In other words, give a list of most desirable to least desirable typefaces/fonts and the Unicode text to set, and hopefully most of the time you'll get glyphs from the most desired typeface, with occasional fallbacks to the less desired typefaces.

Anyway, it shouldn't be terribly hard to do such a thing. The code just needs to detect that there is no glyph (CId) for the Unicode character, and (rather than outputting a blank or an empty box) go down the list of alternate fonts, opening up the appropriate ones and checking if the desired glyph exists. This might be a good extension to Johan's Text::Layout, which already concerns itself with keeping track of the desired typeface and whether it's bold and/or italic (among other things). IIRC you have to pre-open all the fonts you'll use, so worst case you'd have to open some more for alternates.
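As a rough sketch of the list-walking idea, assuming glyph coverage for each font is already known as a set of code points (how that set is obtained is left open), the logic might look like this in Python. `pick_font`, `split_runs`, and the font names are hypothetical, not part of PDF::Builder or Text::Layout.

```python
from typing import Dict, List, Optional, Set, Tuple

Coverage = Dict[str, Set[int]]

def pick_font(codepoint: int, fonts: List[str], coverage: Coverage) -> Optional[str]:
    """Return the first font in preference order that has the code point."""
    for name in fonts:
        if codepoint in coverage.get(name, set()):
            return name
    return None  # no font in the list has a glyph

def split_runs(text: str, fonts: List[str], coverage: Coverage) -> List[Tuple[Optional[str], str]]:
    """Split text into consecutive runs that can each be set in one font."""
    runs: List[Tuple[Optional[str], str]] = []
    for ch in text:
        font = pick_font(ord(ch), fonts, coverage)
        if runs and runs[-1][0] == font:
            runs[-1] = (font, runs[-1][1] + ch)
        else:
            runs.append((font, ch))
    return runs
```

A run whose font is `None` marks characters that no listed font can show, which is where a caller would decide between an empty box and some other last resort.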

While we're at it, the unifont() method of specifying different lists of specific opened fonts for ranges of Unicode characters (single-byte encodings too?) might be blended in.

PhilterPaper commented 2 years ago

Something related would be a way to query the system as to whether a given Unicode item is available in the given font, variant, and weight. For example, \iiint (for math use) should be a triple integral, but a math rendering system might have to build it up out of three integrals (long s's) if a triple isn't in the font. It would probably be too much trouble to have the system automatically build a missing glyph out of pieces it has on hand, as many hundreds (if not thousands) of combinations (including the infamous n-umlaut used by Spin\:al Tap) would need to be covered. Try typesetting Vietnamese some time!
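For the query side of this, Unicode's own decomposition data gives a cheap way to ask whether a missing character could in principle be assembled from pieces that are present. The Python sketch below assumes the font's coverage is available as a set of code points; `pieces` is a hypothetical helper, and it says nothing about positioning the parts well, which is the hard part.

```python
import unicodedata
from typing import Optional, Set

def pieces(ch: str, available: Set[int]) -> Optional[str]:
    """Return what to set for ch: the character itself if the font has it,
    otherwise its Unicode decomposition if every part is present,
    otherwise None."""
    if ord(ch) in available:
        return ch
    # Try canonical decomposition first, then compatibility decomposition.
    for form in ("NFD", "NFKD"):
        parts = unicodedata.normalize(form, ch)
        if parts != ch and all(ord(p) in available for p in parts):
            return parts
    return None
```

With only U+222B (integral) available, the triple integral U+222D comes back as three integrals, and a Vietnamese letter such as U+1EBF comes back as a base letter plus two combining marks, provided all three are in the font.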

I don't know if it is worth dealing with specific glyph IDs, as these are likely to change from font file to font file, and even different release versions of a font, meaning that code would be tightly tied to a specific font file.

terefang commented 2 years ago

you can basically query a truetype font if a particular unicode-point is available (via cmap).

since there are so many bad or broken fonts out there, i never bothered to implement it in unifont(), but rather have the user specify what she needs.

as for assembling missing glyphs from a combination, this only works for a very systematically constructed font but tends to break down very rapidly otherwise. (i used the equivalent in fontforge and 90% of the time needed to do tweaking)

PhilterPaper commented 2 years ago

(Let me cc @sciurius on this in case he has anything to add to the discussion. He mentioned that HarfBuzz::Shaper might have some utility for finding a font with a given Unicode point. I'm not sure he's put that into Text::Layout.)

For a font which has already been loaded, a method such as glyphByUni($UnicodePoint) might work (returns .notdef if this code point is missing). The question is whether you want to go ahead and load possibly a large number of other fonts, to find the first with the desired code point present. It's not a horrendous burden on the system, but not a great thing to have happen. We would have to make sure that a rejected font (not used) doesn't get loaded into the PDF.

Can we discover code point availability before loading in a given font? Something would be needed to read a given font file (presumably TTF/OTF) and see what's in it, without formally loading it into PDF::Builder (though that is possible). We already prereq Font::TTF; I wonder if it has any capabilities here?

Add: Let's not forget that it's likely that more than one code point will need to be searched for a page (or document). Presumably we would keep track of a given missing code point so we don't have to keep searching through fonts again and again. But what to do when another missing code point shows up? It would be inefficient to repeat the search process all over again. To some extent, we will have committed to loading a second (or third...) font for previous missing code points, so the chances of finding a code point within an already-loaded font increase, but the supplementary font(s) may not be the most desired font, and the missing code point might be found in a font skipped over before (i.e., an earlier missing code point caused the second or third or fourth choice of fonts to be loaded).

Also keep in mind single-byte encodings (T1, core, etc. fonts) might have the desired glyph (code point) somewhere in the font, but not in the current encoding. Per #81, we might fill up one subfont with 256 entries, and then go on to create another subfont with a further 256 entries (basic ASCII likely to appear in both).
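The subfont idea can be sketched as simple slot allocation: each distinct code point gets the next free byte value, and a new 256-entry subfont is started when the current one is full. This Python sketch is only an illustration; `assign_planes` is a hypothetical name, and it does not repeat basic ASCII in every subfont as suggested above.

```python
from typing import Dict, Iterable, Tuple

PLANE_SIZE = 256  # entries available in one single-byte encoding

def assign_planes(codepoints: Iterable[int]) -> Dict[int, Tuple[int, int]]:
    """Map each distinct code point to (subfont index, byte value),
    filling subfonts in order of first use."""
    slots: Dict[int, Tuple[int, int]] = {}
    for cp in codepoints:
        if cp not in slots:
            # The running count of distinct code points decides the slot.
            slots[cp] = divmod(len(slots), PLANE_SIZE)
    return slots
```

Text that crosses from one subfont to another then has to be split into runs, with a font switch between them.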

sciurius commented 2 years ago

As far as I know, Font::TTF already delays loading the font tables so it should be possible to inspect the relevant tables without 'loading everything'.

PhilterPaper commented 2 years ago

Another thread on the subject: sciurius/perl-Text-Layout#6

PhilterPaper commented 1 year ago

Just thinking about the practical aspects of doing fallbacks (looking in another font if the requested glyph is missing from this one). First, the lookup should be cached so that if a character is requested a second time, its fallback (or primary) can be rapidly selected. Second, can the ASCII character set be assumed to be present in the current font? I suppose that there could be specialty fonts (especially decorative or symbol ones) with only a limited subset of ASCII in them. Third, the glyphs would have to be embedded in the PDF file (probably a good idea anyway), as we can't depend on an external font file to have the same glyphs, or to be in the same place! Where does this leave core and PS/T1 fonts? How about non-UTF8 encodings? Can we depend on non-TTF fonts (not embedded) always having the same glyph set?

Font Manager can be used to easily switch between fonts when looking for the glyph for a Unicode point. I presume that at some point in the search, the font would actually have to be loaded, in order to inquire whether there's a glyph for that point. If a font in the fallback list ends up not being used at all, there should be a way to automatically discard it (assuming it would otherwise be incorporated into the PDF file).
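The caching and "discard what was never used" points could be combined along these lines. This is a Python sketch under the assumption that coverage is known per font as a set of code points; `FallbackResolver` and its attributes are hypothetical and unrelated to the actual Font Manager interface.

```python
from typing import Dict, Iterable, List, Optional, Set

class FallbackResolver:
    """Resolve code points to fonts, remembering answers and usage."""

    def __init__(self, fonts: Iterable[str], coverage: Dict[str, Set[int]]):
        self.fonts: List[str] = list(fonts)   # preference order
        self.coverage = coverage
        self.cache: Dict[int, Optional[str]] = {}
        self.used: Set[str] = set()
        self.lookups = 0                      # searches that missed the cache

    def resolve(self, codepoint: int) -> Optional[str]:
        if codepoint in self.cache:
            font = self.cache[codepoint]
        else:
            self.lookups += 1
            font = next(
                (f for f in self.fonts if codepoint in self.coverage.get(f, ())),
                None,
            )
            self.cache[codepoint] = font
        if font is not None:
            self.used.add(font)
        return font

    def unused(self) -> List[str]:
        """Fonts that never supplied a glyph and need not go into the output."""
        return [f for f in self.fonts if f not in self.used]
```

Because the search always starts from the top of the preference list, a later missing code point still goes to the most desired font that has it, regardless of which fallback fonts earlier lookups happened to pull in.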

terefang commented 1 year ago

let me add my thoughts ... in no particular order.

first and foremost: not embedding a font is bad, as it breaks PDF/A and PDF/X assumptions, and users usually don't share their fonts with their PDFs.
give the user/programmer the option to always embed the fallback fonts (probably subset).
if the fallback font is less than 100kb, embed it with flate compression as nobody will care to be bothered.
if only a few glyphs are used for fallback (<256), you can create a synthetic Type3 font out of the glyphset – this will also allow you to combine glyphs from multiple fallback fonts.
if the fallback font is also referenced as a normal font the need to embed it is clear.
if a fallback font is referenced multiple times, either embed it (subset) or go the Type3 route.
you basically cannot assume a basic ASCII glyphset to be present, except in really old broken 90s ttfs.
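One possible reading of these rules as a decision function is sketched below in Python. The order in which the rules are checked, the interpretation of "100kb" as 100 * 1024 bytes, and the returned labels are all assumptions made for the example, not something stated in the list.

```python
def choose_embedding(file_size_bytes: int, glyphs_used: int,
                     also_primary: bool = False) -> str:
    """Pick an embedding strategy for a fallback font (illustrative only)."""
    if also_primary:
        # Already referenced as a normal font, so it gets embedded anyway.
        return "embed-subset"
    if file_size_bytes < 100 * 1024:
        # Small enough that embedding the whole file (Flate) is not a concern.
        return "embed-whole-flate"
    if glyphs_used < 256:
        # Few glyphs: they fit in one synthetic single-byte Type3 font.
        return "type3-synthetic"
    return "embed-subset"
```

A real implementation would presumably expose these thresholds as options, in line with giving the user the choice to always embed.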

most usecases that i have encountered will :

either combine multiple encoded fonts to form a kind of Unicode font
combine an ASCII/Greek/Cyrillic encoded font with a Unicode fallback font
combine a writing font with an icon font (like dings or awesome)

hope that helped

PhilterPaper commented 1 year ago

Thank you for the thoughts -- they will probably be useful. I'm surprised to hear that I cannot count on ASCII being present in a given font (I guess it's not mandated?).

Don't forget to look at #80 and #81 for a more complete look at the subject of font handling.

terefang commented 1 year ago

T1/T3 fonts are encoded in a single-byte encoding that might or might not include ASCII in 0x20-0x7F (i.e., depending on what's included in the AFM/PFM file).

for TTF/OTF, whether ASCII glyphs are available depends on their embedded CMaps. if a Symbol CMap is present, as it is for WingDings and many newer iconic fonts, chances are pretty high that ASCII glyphs are not present.

in the 90s, people would use various bad font tools to reencode such fonts (including me) to bring their glyphs into a single-byte range via (duplicate) compatibility entries in the CMap. (e.g., mapping 0xE?? or 0xF?? to 0x00??)

it basically becomes a 80:20 gamble ... the standard code path with default options should be sufficient for 80% of the usecases. the remaining 20% will need to set proper options for the processing and maybe need to deep-dive into the fonts they use and determine if they are new/good enough or so crappy that they need to be replaced.

i have also seen the Nerd Fonts Project which simply patches entire font collections to include missing glyphs. (ie in their case icons)

terefang commented 7 months ago

@PhilterPaper i just read the following text in the PDF1.7 spec:

The results of the CMap mapping algorithm are a font number and a character selector. The font number shall be used as an index into the Type 0 font’s DescendantFonts array to select a CIDFont. In PDF, the font number shall be 0 and the character selector shall be a CID; this is the only case described here. The CID shall then be used to select a glyph in the CIDFont. (PDF 32000-1:2008 – PDF 1.7 – page 280)

do i interpret that correctly that previously it was possible to construct a Type0 font with multiple DescendantFonts and one could construct a CMap to select the CIDs from it?

also reading section 6 of 5014.CIDFont_Spec would suggest so.

terefang commented 7 months ago

just verified in the Poppler source code that it only uses the first entry in the array

PhilterPaper commented 7 months ago

> do i interpret that correctly that previously it was possible to construct...

You're asking me? I know nothing about such things :-) -- I'm counting on people like you to be the experts!

terefang commented 7 months ago

> > do i interpret that correctly that previously it was possible to construct...
>
> You're asking me? I know nothing about such things :-) -- I'm counting on people like you to be the experts!

i asked around in my local commune of STEM people and we have come to the conclusion that:

it should be possible in Postscript
it might have been planned for inclusion in PDF
the language of the ISO-32000 spec suggests a deprecation of the feature in PDF
looking at several pdf sources seems to confirm the third point

this is a sad day for me ... that would have been a very useful feature

PhilterPaper / Perl-PDF-Builder

[CTS 5] Fallback fonts (font glyph substitution) #56