PhilterPaper / Perl-PDF-Builder

Extended version of the popular PDF::API2 Perl-based PDF library for creating, reading, and modifying PDF documents
https://www.catskilltech.com/FreeSW/product/PDF%2DBuilder/title/PDF%3A%3ABuilder/freeSW_full

[CTS 13] Small Caps missing for some ligatures #79

Open PhilterPaper opened 6 years ago

PhilterPaper commented 6 years ago

Ref RT 120048 (#47): some lower case characters (mostly ligatures) do not appear to have upper case equivalents, so when "small caps" are used in synthetic fonts (-caps => 1), they are left unchanged. A word like "field" would come out in small caps as "fiELD" (imagine the "fi" is the U+FB01 f+i ligature). Presumably, since there is no capitalized version of this ligature, the desired behavior would be to replace the "fi" ligature with "f" and "i", and then small-cap those. I can only speak for ligatures based on the Latin alphabet, and even then, I'm not sure what the capitalization rules are for some of them. There are a few single characters too, which currently may or may not uppercase properly; they are grouped with the ligatures here.

Right now, it's up to the user (provider of the text) to be aware of whether they are using these ligatures in their text when they are going to display small caps. One choice would be to provide a translation utility to scan the text for certain code points, and replace them by pairs or triplets of characters. The user of PDF::Builder would have to manually call this utility routine everywhere they are feeding text that might contain such lower case ligatures to output using a small caps synthetic font.
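A minimal sketch of what such a translation utility might look like. The name `expand_ligatures` and the contents of the table are assumptions made for illustration; nothing like this is known to exist in PDF::Builder, and the table covers only a few of the Latin code points discussed in this issue.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of PDF::Builder: expand selected Latin
# ligature code points into their component letters, so that a later
# small-caps (or uppercase) step sees ordinary lower case letters.
my %LIGATURE_EXPANSION = (
    "\x{FB00}" => 'ff',
    "\x{FB01}" => 'fi',
    "\x{FB02}" => 'fl',
    "\x{FB03}" => 'ffi',
    "\x{FB04}" => 'ffl',
    "\x{FB05}" => 'st',    # long s + t
    "\x{FB06}" => 'st',
    "\x{00DF}" => 'ss',    # eszett
    "\x{017F}" => 's',     # long s
);

# Build one character class from the table keys (none of them is
# special inside a bracketed class).
my $LIGATURE_RE = do {
    my $class = join '', keys %LIGATURE_EXPANSION;
    qr/([$class])/;
};

sub expand_ligatures {
    my ($text) = @_;
    $text =~ s/$LIGATURE_RE/$LIGATURE_EXPANSION{$1}/g;
    return $text;
}

# "field" typed with the U+FB01 ligature becomes plain ASCII letters.
print expand_ligatures("\x{FB01}eld"), "\n";
```

The caller would run every string through `expand_ligatures()` before handing it to a small caps font, which is exactly the manual burden described above.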

More automated would be to update the output routines to scan and translate on-the-fly, once they know that a small caps font is being used. We would have to be careful to catch all uses of a font, such as the advancewidth() routine, so that proper character widths are used. Finally, we could look at the actual internal structure of the small caps font, and update the font to use pairs or triplets of characters when such ligatures are encountered in the input text. This is probably the most complicated method, but it would have the highest performance, and would treat such ligatures just like any other character. For example, the "ij" ligature is already small-capped as "IJ", so this has been done before.

| Unicode | ligature | replacement | comment |
|---|---|---|---|
| U+00DF | ss or sz ß | SS | German sharp s (eszett), resembles Greek beta. U+1E9E ẞ may be an acceptable upper case version (still single glyph), rather than using a double-S, but is still uncommon in fonts |
| U+0149 | 'n ŉ | 'N | upper case is 'N (two glyphs U+02BC U+004E) (Afrikaans, use discouraged). Capitalization rule in Afrikaans is a bit complicated |
| U+017F | f ſ | S | actually a "long s" (looks like "f" without the crossbar). Technically not a ligature |
| U+FB00 | ff ﬀ | FF | |
| U+FB01 | fi ﬁ | FI | |
| U+FB02 | fl ﬂ | FL | |
| U+FB03 | ffi ﬃ | FFI | |
| U+FB04 | ffl ﬄ | FFL | |
| U+FB05 | ft ﬅ | ST | actually a long s, not an f |
| U+FB06 | st ﬆ | ST | short s plus t |
| U+A733 | aa ꜳ | AA Ꜳ | |
| U+A735 | ao ꜵ | AO Ꜵ | |
| U+A737 | au ꜷ | AU Ꜷ | |
| U+A739 | av ꜹ | AV Ꜹ | |
| U+A73B | av-bar ꜻ | AV-bar Ꜻ | |
| U+A73D | ay ꜽ | AY Ꜽ | |
| U+1F670 | et 🙰 | ET | |
| U+A74F | oo ꝏ | OO Ꝏ | Massachusett language |
| U+A729 | tz ꜩ | TZ Ꜩ | used in German |
| U+1D6B | ue ᵫ | UE | |
| U+A761 | vy ꝡ | VY Ꝡ | |

There are others, but they are very font- and language- (orthography-) dependent. Note that the Greek "final sigma" (terminal sigma) maps to upper case Sigma, as does sigma. Dotless i and j uppercase to various accented forms of I and J. The Dutch 't and 's, like the Afrikaans 'n, are normally not capitalized. See https://en.wikipedia.org/wiki/Capitalization and similar articles for more than you would ever want to know about capitalization rules. They are inconsistent and complicated enough across languages that it may not be worth trying to fully automate them (e.g., title and sentence caps), but still, it would be jarring to see lower case letters (or ligatures) mixed in with capitals/small caps when you have requested capitalization.

While we're on the subject, we might also want to check if a given font contains the requested ligature, and if not, replace it by the (lower case) appropriate characters. This could blend into the request (#56) for fallback fonts for missing glyphs... to either replace them with separate characters, or use a different font that does contain them. This might even be needed for ligatures (such as "oe" or "ij") that normally have upper case equivalents. Finally, uppercasing in general of such ligatures, not just for small caps, would be of interest.

This concerns only the uppercasing of certain lower case Latin alphabet ligatures (once the decision has been made to use them in the text source). Whether it is appropriate to replace a pair or triplet of lower case letters with a ligature depends upon the language, the font being used, and the word itself. For example, in English orthography, a "shelfful" of books should not use the "ff" ligature, while a "waffle" could use the "ffl" ligature. PDF::Builder should probably leave this to the author of the text. Even an automated search for ligature candidates would have to have an exclusion list (e.g., shelfful) and be aware of the language being used, subsets of that language and where a ligature would be appropriate, and the font in use (what ligatures are supported).

PhilterPaper commented 6 years ago

Some further thoughts...

  1. We need to see if ->{'upper'} is replacement text (a string), or just points to another code point index. If it's a string, it might be possible in the Small Caps code (SynFont.pm) to check whether 'upper' is undefined for certain ligatures, etc., and add one (e.g., for eszett, create 'upper' = 'SS'). If 'upper' merely points to another Unicode point, we would probably have to create a full entry for the new 'upper'. Check to see if it's a single point, or allows an array of points (functionally equivalent to a string).
  2. Don't forget that some fonts have upper case equivalents for various ligatures and other characters, while others have none at all. For those lacking an upper case equivalent, we would need to provide a series of characters (e.g., 'ij' -> 'IJ', or 'oe' -> 'OE'). In some fonts, it is possible that glyphs don't exist for upper case forms of some ligatures, even those with Unicode points.
  3. In addition to the glyphs in the table, Unicode may provide other ligatures and special characters with or without equivalent upper case forms (e.g., long s, 'n, 't, etc.). A given font may or may not provide a glyph for an upper case form (or even, the lower case form). PDF::Builder could run into the situation where the text requests, say, the 'ffl' ligature, and the font doesn't provide it. Separate letters (3) would have to be substituted.
  4. Generic uppercasing of a character or string containing ligatures and special characters is related to Small Caps (and might share code), but there is the complication that it would be font- and encoding-specific. We need to fully understand what Perl's uc() function does on non-ASCII characters (such as accented Latin characters) for various encodings and for UTF-8. It might be possible to offer an extended 'to_upper()' function, given a string, encoding, and font information.
  5. Some fonts contain ligatures (e.g., tt, ttl, etc.) that do not have Unicode points. I'm not sure how a user or application would specify these in text in the first place (giving the CID instead of a Unicode value?). We may find it useful to provide a clean way to give such ligatures in text.
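As a starting point for item 4, a small probe like the following could show what Perl's built-in `uc()` returns for a few of the characters in the table. It is only a sketch: the helper names `code_points` and `upper_points` are made up here, and it says nothing about what a given font or single-byte encoding can actually display. It relies on `use feature 'unicode_strings'` so that U+00DF gets Unicode rather than native 8-bit semantics.

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # Unicode rules even for code points < 256

# Render a string as a list of code points, e.g. "U+0046 U+0049".
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $string;
}

# Code points of the uppercased string (hypothetical helper name).
sub upper_points {
    my ($string) = @_;
    return code_points(uc $string);
}

my @samples = (
    [ "\x{00DF}", 'eszett' ],
    [ "\x{0149}", "'n" ],
    [ "\x{017F}", 'long s' ],
    [ "\x{FB01}", 'fi ligature' ],
    [ "\x{FB04}", 'ffl ligature' ],
    [ "\x{A733}", 'aa ligature' ],
);

for my $sample (@samples) {
    my ($char, $name) = @$sample;
    printf "%-14s %s -> %s\n", $name, code_points($char), upper_points($char);
}
```

Because `uc()` applies Unicode's full case mappings, several of these come back as more than one character (eszett as "SS", the fi ligature as "FI"). That is useful for generic uppercasing, but it is separate from the question of what the font's own 'upper' entries contain.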

PhilterPaper commented 6 years ago

The GSUB tables in TTF files may provide information about available non-Unicode ligatures (e.g., ttl) in some fonts, which could be used to properly uppercase such ligatures. However, since they do not have Unicode points, they will not be ligatures in the raw text code in the first place (only dynamically during output and glyph selection), so uppercasing may not have any problem. Only those ligatures and special characters (e.g., long s) with defined Unicode points will likely be a problem for uppercasing and small caps.

Note that some fonts define "petite caps", which are similar in function to "small caps", but match the x-height of lowercase letters (with small caps being slightly taller).

PhilterPaper commented 6 years ago

After exploring the GSUB and GPOS capabilities of OpenType, it appears that good practice is not to use the Unicode ligature points (except possibly for eszett, which is commonly treated as a letter), but to let the rendering system (glyph production and substitution) build ligatures on the fly. This way, letters are always discrete (e.g., 'f' and 'i') rather than already being ligatures in the source ('fi'), and can be capitalized and small-capped without worrying about dealing with ligatures. In addition, the glyph substitution code can insert the original letters for search purposes.

A possible downside is that figuring the width of a word (advancewidth) could get a bit complicated if some letter sequences are replaced by ligatures on-the-fly. At the least, the code cannot simply look up character widths by Unicode point, but has to ask the output routines if they plan to combine any letters into ligatures.
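To make the width concern concrete, here is a toy sketch. The width numbers, the ligature list, and the function names are all invented for illustration; real values would come from the font, and real substitution decisions from its GSUB data. The ligature glyph is keyed by U+FB01 purely as a label for the substituted glyph.

```perl
use strict;
use warnings;

# Hypothetical glyph widths in 1/1000 em (NOT taken from any real font).
my %WIDTH = (
    f => 333, i => 278, e => 444, l => 278, d => 500,
    "\x{FB01}" => 556,    # fi ligature glyph, narrower than f + i
);

# Hypothetical list of ligatures the output stage would form.
my %LIGATURE = ( fi => "\x{FB01}" );

# Greedy left-to-right substitution, longest ligature first.
sub shaped_glyphs {
    my ($word) = @_;
    my @candidates = sort { length($b) <=> length($a) } keys %LIGATURE;
    my @glyphs;
    my $pos = 0;
    while ($pos < length($word)) {
        my ($hit) = grep { substr($word, $pos, length($_)) eq $_ } @candidates;
        if (defined $hit) {
            push @glyphs, $LIGATURE{$hit};
            $pos += length($hit);
        } else {
            push @glyphs, substr($word, $pos, 1);
            $pos++;
        }
    }
    return @glyphs;
}

# Width of a word in points, measured on the glyphs actually emitted.
sub word_width {
    my ($word, $size) = @_;
    my $units = 0;
    $units += ($WIDTH{$_} // 0) for shaped_glyphs($word);
    return $units * $size / 1000;
}

printf "%.2f\n", word_width('field', 10);
```

With these made-up numbers, summing the five separate letters gives 1833 units, while the shaped glyph run gives 1778, so a per-character lookup would overestimate the width of "field".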

Eszett, long s, and possibly 'n/'s/'t, might still be problematic in capitalization and small caps, and require special treatment. I need to look at whether TrueType/OpenType fonts have any content that helps with determining if a given ligature (such as eszett) has an uppercase or small caps equivalent. I don't think there's any help for ligatures in core or Type1 fonts, although many do have a few ligatures.

PhilterPaper commented 5 years ago

I just pushed to the code repository some improvements to synthetic font handling ($pdf->synfont()). Although I was able to take care of eszett (ß) folding to "SS" for small-caps (and dotless i and dotless j to I and J), I was unable to come up with a fix for ligatures and long s (ſ). The problem is that these other characters will be on alternate planes (plane 1+), and so far I have not found a way to access ASCII letters from those planes. Therefore, a ligature such as "ffi" cannot be replaced by small caps "FFI".

PhilterPaper commented 4 years ago

A random thought: perhaps at the beginning of TTF/OTF text processing, before doing anything else, detect Unicode ligatures (fi, fl, etc.) and replace them with strings of the regular lowercase letters. This would make capitalization, small caps, and petite caps work fine, and would not break if the desired ligature is missing from a font (e.g., 'ff' Unicode given, but font doesn't have it). Then (when we have support for GSUB), the font support can replace the string of lowercase characters by the matching ligature that it has found (when available and allowed).
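One way this pre-processing step might be sketched is with the core Unicode::Normalize module, applying compatibility decomposition only to the Latin ligature block U+FB00 to U+FB06 so that accented letters and other characters are left alone. The function name is hypothetical. Eszett has no Unicode decomposition, so it would still need separate handling.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical pre-processing step: replace only the Latin presentation-form
# ligatures (U+FB00..U+FB06) with their compatibility decomposition,
# leaving all other characters untouched.
sub decompose_latin_ligatures {
    my ($text) = @_;
    $text =~ s/([\x{FB00}-\x{FB06}])/NFKD($1)/ge;
    return $text;
}

# "office" typed with the U+FB03 ffi ligature.
my $source = "o\x{FB03}ce";
my $plain  = decompose_latin_ligatures($source);

# After decomposition, ordinary uppercasing sees only plain letters.
print uc($plain), "\n";
```

Restricting the substitution to a code point range avoids the side effects of running NFKD over the whole string, at the cost of missing ligatures outside that block (such as the U+A7xx entries in the table).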

For other kinds of fonts, we could replace ligatures by strings only if we're going to do capitalization (including small and petite caps). Perhaps it should be left up to the author as to whether to do this, as most single byte-encoded fonts don't have ligatures in the base plane (typically Latin-x), and a ligature may thus usually go missing?

Some odd characters, such as a "long s", are lumped in with this discussion. For purposes of capitalization (regular/small/petite), a long s would become S, but what do you do with it otherwise (many fonts don't support it)? If there's no means of asking the font what it supports, can we justify replacing it by 's'? Similar for eszett: how do we know if the font supports it, or its capital variant, or if we should replace it by 'ss'? This is getting into the realm of font fallbacks (#56).

PhilterPaper commented 3 years ago

Note that use of the HarfBuzz::Shaper package (textHS, etc. calls) automatically replaces lowercase letters with appropriate ligatures. This means you would not explicitly type in any ligatures, but just the component letters, and let the system decide whether to replace them with available ligatures. Users can already turn off ligatures globally, but at some point we will need to provide a way to selectively enable specific ligatures (if ligatures are turned off) or suppress them (if turned on). And if using Small Caps, ligatures should be suppressed (the entire chunk of text in SC should have the -liga flag).