Open PhilterPaper opened 6 years ago
Some further thoughts...
The GSUB tables in TTF files may provide information about available non-Unicode ligatures (e.g., ttl) in some fonts, which could be used to properly uppercase such ligatures. However, since they do not have Unicode code points, they will not be ligatures in the raw text in the first place (only dynamically, during output and glyph selection), so uppercasing them should not be a problem. Only those ligatures and special characters (e.g., long s) with defined Unicode code points are likely to be a problem for uppercasing and small caps.
Note that some fonts define "petite caps", which are similar in function to "small caps", but match the x-height of lowercase letters (with small caps being slightly taller).
After exploring the GSUB and GPOS capabilities of OpenType, it appears that good practice is not to use the Unicode ligature code points (except possibly for eszett, which is commonly treated as a letter), but to let the rendering system (glyph production and substitution) build ligatures on the fly. This way, letters are always discrete in the source (e.g., 'f' and 'i') rather than already being a single ligature code point (such as U+FB01), and can be capitalized and small-capped without worrying about dealing with ligatures. In addition, the glyph substitution code can insert the original letters for search purposes.
A possible downside is that figuring the width of a word (advancewidth) could get a bit complicated if some letter sequences are replaced by ligatures on-the-fly. At the least, the code cannot simply look up character widths by Unicode point, but has to ask the output routines if they plan to combine any letters into ligatures.
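To illustrate the width problem, here is a minimal sketch in Python (not PDF::Builder code). The width table, the ligature list, and the function names shape and advance_width are all made up for the example; a real implementation would take these from the font and from the shaper.

```python
# Hypothetical advance widths in font units; real values come from the font.
WIDTHS = {
    "f": 300, "i": 250, "l": 250, "e": 500, "d": 550,
    "fi": 520, "fl": 530,  # ligature glyphs are narrower than their parts
}

# Hypothetical list of ligatures the font offers, longest first.
LIGATURES = ("fi", "fl")


def shape(text, use_ligatures=True):
    """Split text into glyph names, greedily merging known ligature sequences."""
    glyphs = []
    i = 0
    while i < len(text):
        for lig in LIGATURES:
            if use_ligatures and text.startswith(lig, i):
                glyphs.append(lig)
                i += len(lig)
                break
        else:
            # No ligature starts here, so emit the single character.
            glyphs.append(text[i])
            i += 1
    return glyphs


def advance_width(text, use_ligatures=True):
    """Sum glyph widths after shaping, not per input character."""
    return sum(WIDTHS[g] for g in shape(text, use_ligatures))
```

With these invented numbers, "field" measures 1850 units character by character but 1820 once "fi" is merged, which is why the width code has to agree with the output routines about which ligatures will be formed.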
Eszett, long s, and possibly 'n/'s/'t, might still be problematic in capitalization and small caps, and require special treatment. I need to look at whether TrueType/OpenType fonts have any content that helps with determining if a given ligature (such as eszett) has an uppercase or small caps equivalent. I don't think there's any help for ligatures in core or Type1 fonts, although many do have a few ligatures.
I just pushed to the code repository some improvements to synthetic font handling ($pdf->synfont()). Although I was able to take care of eszett (ß) folding to "SS" for small-caps (and dotless i and dotless j to I and J), I was unable to come up with a fix for ligatures and long s (ſ). The problem is that these other characters will be on alternate planes (plane 1+), and so far I have not found a way to access ASCII letters from those planes. Therefore, a ligature such as "ffi" cannot be replaced by small caps "FFI".
A random thought: perhaps at the beginning of TTF/OTF text processing, before doing anything else, detect Unicode ligatures (fi, fl, etc.) and replace them with strings of the regular lowercase letters. This would make capitalization, small caps, and petite caps work fine, and would not break if the desired ligature is missing from a font (e.g., 'ff' Unicode given, but font doesn't have it). Then (when we have support for GSUB), the font support can replace the string of lowercase characters by the matching ligature that it has found (when available and allowed).
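As a rough sketch of that pre-processing step (in Python, not PDF::Builder code), the Latin ligatures in the Unicode block U+FB00 to U+FB06 can be expanded through their compatibility decompositions. The function name expand_ligatures is invented for the example, and restricting the expansion to that one block is an assumption, made so that other characters (eszett, accented letters) are left alone.

```python
import unicodedata

# Latin ligatures in the Alphabetic Presentation Forms block (U+FB00..U+FB06):
# ff, fi, fl, ffi, ffl, long-s-t, st.
LATIN_LIGATURES = {chr(cp) for cp in range(0xFB00, 0xFB07)}


def expand_ligatures(text):
    """Replace precomposed Latin ligature code points with their letters."""
    out = []
    for ch in text:
        if ch in LATIN_LIGATURES:
            # NFKD applies the compatibility decomposition,
            # e.g. U+FB01 becomes "f" followed by "i".
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            out.append(ch)
    return "".join(out)
```

After this step the text contains only discrete letters, so ordinary uppercasing works, and a GSUB-aware output stage remains free to form whatever ligatures the font actually has.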
For other kinds of fonts, we could replace ligatures by strings only if we're going to do capitalization (including small and petite caps). Perhaps it should be left up to the author as to whether to do this, as most single byte-encoded fonts don't have ligatures in the base plane (typically Latin-x), and a ligature may thus usually go missing?
Some odd characters, such as a "long s", are lumped in with this discussion. For purposes of capitalization (regular/small/petite), a long s would become S, but what do you do with it otherwise (many fonts don't support it)? If there's no means of asking the font what it supports, can we justify replacing it by 's'? Similar for eszett: how do we know if the font supports it, or its capital variant, or if we should replace it by 'ss'? This is getting into the realm of font fallbacks (#56).
Note that use of the HarfBuzz::Shaper package (textHS, etc. calls) automatically replaces lowercase letters with appropriate ligatures. This means you would not explicitly type in any ligatures, but just the component letters, and let the system decide whether to replace them with available ligatures. Users can already turn ligatures off globally, but at some point we will need to provide a way to selectively enable specific ligatures (when they are globally off) or suppress them (when they are globally on). And if using Small Caps, ligatures should be suppressed (the entire chunk of text in SC should have the -liga flag).
Ref RT 120048 (#47) discovery that some lower case characters (mostly ligatures) do not appear to have upper case equivalents. When "small caps" are used in synthetic fonts (-caps => 1), these are unchanged. So, a word like "field" would be in small caps "fiELD" (imagine the "fi" is the U+FB01 f+i ligature). Presumably, the desired behavior (since there is no capitalized version of this ligature) would be to replace the "fi" ligature by "f" and "i", and then small caps that. I can only speak for ligatures based on the Latin alphabet, and even then, I'm not sure about some of them (what their capitalization rules are). There are a few single characters too, which currently may or may not properly uppercase, and here they are grouped with ligatures.
Right now, it's up to the user (provider of the text) to be aware of whether they are using these ligatures in their text when they are going to display small caps. One choice would be to provide a translation utility to scan the text for certain code points, and replace them by pairs or triplets of characters. The user of PDF::Builder would have to manually call this utility routine everywhere they are feeding text that might contain such lower case ligatures to output using a small caps synthetic font.
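The "fiELD" effect, and what a translation utility would change, can be reproduced with a small Python sketch. This is not how synfont() works internally; naive_caps only imitates a one-to-one case mapping, and the table and both function names are invented for illustration.

```python
# Hypothetical translation table: ligature code point -> component letters.
LIGATURE_PARTS = {
    "\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl",
    "\ufb03": "ffi", "\ufb04": "ffl",
}


def naive_caps(text):
    """Imitate a one-to-one case mapping: a character is uppercased only if
    its uppercase form is a single code point; otherwise it is left alone."""
    out = []
    for ch in text:
        up = ch.upper()
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


def caps_with_expansion(text):
    """Expand known ligatures into letters first, then apply the same mapping."""
    expanded = "".join(LIGATURE_PARTS.get(ch, ch) for ch in text)
    return naive_caps(expanded)
```

Here naive_caps("\ufb01eld") leaves the U+FB01 ligature in lower case and capitalizes only "ELD", while caps_with_expansion("\ufb01eld") gives "FIELD".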
More automated would be to update the output routines to scan and translate on-the-fly, once they know that a small caps font is being used. We would have to be careful to catch all uses of a font, such as the advancewidth() routine, so that proper character widths are used. Finally, we could look at the actual internal structure of the small caps font, and update the font to use pairs or triplets of characters when such ligatures are encountered in the input text. This is probably the most complicated method, but would have the highest performance, and would treat such ligatures just like any other character. For example, the "ij" ligature already small caps as "IJ", so this has been done before.
There are others, but they are very font and language (orthography) dependent. Note that the Greek "final sigma" (terminal sigma) maps to upper case Sigma, as does sigma. Dotless i and j uppercase to various accented forms of I and J. The Dutch 't and 's, like the Afrikaans 'n, are normally not capitalized. See https://en.wikipedia.org/wiki/Capitalization and similar articles for more than you would ever want to know about capitalization rules. They are inconsistent and complicated enough across languages that it may not be worth trying to fully automate them (e.g., title and sentence caps), but still, it would be jarring to see lower case letters (or ligatures) mixed in with capitals/small caps when you have requested capitalization.
While we're on the subject, we might also want to check if a given font contains the requested ligature, and if not, replace it by the (lower case) appropriate characters. This could blend into the request (#56) for fallback fonts for missing glyphs... to either replace them with separate characters, or use a different font that does contain them. This might even be needed for ligatures (such as "oe" or "ij") that normally have upper case equivalents. Finally, uppercasing in general of such ligatures, not just for small caps, would be of interest.
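One way to picture the "replace if missing" idea is the following Python sketch. It assumes the set of characters a font supports is already known (here just a plain set passed in; how to obtain it from a real font is outside this example), and the replacement strings in FALLBACKS, including "s" for long s and "ss" for eszett, are illustrative choices rather than settled policy.

```python
# Hypothetical fallback strings for characters a font may lack.
FALLBACKS = {
    "\ufb01": "fi",   # fi ligature
    "\ufb02": "fl",   # fl ligature
    "\u0153": "oe",   # oe ligature
    "\u0133": "ij",   # ij ligature
    "\u017f": "s",    # long s
    "\u00df": "ss",   # eszett
}


def substitute_missing(text, supported):
    """Replace characters that the font lacks and that have a known fallback.

    `supported` is a set of characters the font is assumed to provide.
    Characters without a fallback entry are passed through unchanged, so a
    later font-fallback stage could still deal with them.
    """
    out = []
    for ch in text:
        if ch in supported or ch not in FALLBACKS:
            out.append(ch)
        else:
            out.append(FALLBACKS[ch])
    return "".join(out)
```

For example, with a font that has the "fi" ligature but not "fl", substitute_missing("\ufb01sh \ufb02y", supported) keeps the first ligature and spells the second out as "fl".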
This concerns only the uppercasing of certain lower case Latin alphabet ligatures (once the decision has been made to use them in the text source). Whether it is appropriate to replace a pair or triplet of lower case letters with a ligature depends upon the language, the font being used, and the word itself. For example, in English orthography, a "shelfful" of books should not use the "ff" ligature, while a "waffle" could use the "ffl" ligature. PDF::Builder should probably leave this to the author of the text. Even an automated search for ligature candidates would have to have an exclusion list (e.g., shelfful) and be aware of the language being used, subsets of that language and where a ligature would be appropriate, and the font in use (what ligatures are supported).
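A toy version of such a search, in Python, shows where the exclusion list would fit. The sequence list, the single-entry exclusion set, and the function name ligature_candidates are assumptions for the example; excluding whole words is a simplification, and a real tool would also need to know the language and the font.

```python
import re

# Letter sequences that commonly have ligatures, longest first so that
# "ffl" is preferred over "ff" or "fl".
LIGATURE_SEQUENCES = ["ffi", "ffl", "ff", "fi", "fl"]

# Hypothetical exclusion list: words in which no ligature should be formed.
EXCLUSIONS = {"shelfful"}

_PATTERN = re.compile("|".join(LIGATURE_SEQUENCES))


def ligature_candidates(text):
    """Return (word, sequence) pairs for ligature candidates in the text,
    skipping any word that is on the exclusion list."""
    found = []
    for word in re.findall(r"[A-Za-z]+", text):
        if word.lower() in EXCLUSIONS:
            continue
        for match in _PATTERN.finditer(word):
            found.append((word, match.group()))
    return found
```

Run on "a shelfful of waffle fish", this reports "ffl" in "waffle" and "fi" in "fish", and nothing for "shelfful".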