MicrosoftDocs / typography-issues

Creative Commons Attribution 4.0 International

[cmap] Documentation of cmap table lacks information on normalization #591

Open NorbertLindenberg opened 4 years ago

NorbertLindenberg commented 4 years ago

The documentation of the cmap table lacks information on the Unicode normalization that font consumers should apply when using cmap tables to map characters to glyphs, and that font producers can rely on when constructing cmap tables and lookups in GSUB and GPOS tables.

Unicode normalization is relevant in two ways:

– It defines canonical decompositions of characters, such as U+00E4 “ä” to U+0061 “a” + U+0308 “◌̈”. In the absence of information about normalization, font producers have to provide entries for both precomposed and decomposed forms, and then either handle both in subsequent lookups or apply their own normalization.

– It defines canonical ordering of marks, which brings sequences of certain marks (those with a non-zero combining class) into a defined order. In the absence of information about normalization, font producers have to be prepared to handle such marks in any order in ligature or contextual lookups, and font consumers such as shaping engines have to be prepared to handle such marks in any order in cluster validation (see bug #568 for an example of their failure to do so).
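For illustration, both effects can be observed directly with Python's `unicodedata` module (a minimal demonstration of the Unicode behavior, not of any OpenType mechanism):

```python
import unicodedata

# Canonical decomposition: U+00E4 "ä" decomposes to
# U+0061 "a" + U+0308 (combining diaeresis)
assert unicodedata.normalize("NFD", "\u00e4") == "a\u0308"

# Canonical ordering: combining dot below (ccc 220) sorts before
# combining diaeresis (ccc 230), regardless of the input order
assert (unicodedata.normalize("NFD", "q\u0308\u0323")
        == unicodedata.normalize("NFD", "q\u0323\u0308")
        == "q\u0323\u0308")
```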



PeterCon commented 4 years ago

I'm inclined to think this would be better considered a shaping spec issue rather than a font file format issue. If nothing else, it needs to be informed by consideration of shaping specs.

For example, U+09CB has a canonical decomposition to U+09C7 + U+09BE, and appropriate treatment of that needs to be considered in the context of a spec for Bengali shaping.
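The decomposition in question is easy to check with `unicodedata` (an illustration of the Unicode data, not of any shaping behavior):

```python
import unicodedata

# U+09CB BENGALI VOWEL SIGN O canonically decomposes to
# U+09C7 (vowel sign E) + U+09BE (vowel sign AA)
assert unicodedata.normalize("NFD", "\u09cb") == "\u09c7\u09be"

# NFC recomposes the pair, so a renderer may be handed either form
assert unicodedata.normalize("NFC", "\u09c7\u09be") == "\u09cb"
```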

So, in the general case, I think this expands well beyond the scope of the OT spec itself, and it doesn't make sense to treat Latin precomposed characters as special.

I'll leave this open for future consideration as an OT issue. If you'd like the issue remapped to a shaping spec page, that can be done.

NorbertLindenberg commented 4 years ago

A specification for a file format is rather pointless if it leaves the interpretation of its data structures up to the imagination of its readers. How to handle Unicode normalization is an issue for OpenType font producers and font consumers in general, and so it needs to be documented in the core spec.

Shaping engine documentation then needs to be updated to clarify how it handles mismatches between the overall approach to normalization and its own assumptions, but that’s a separate second step.

PeterCon commented 4 years ago

I don't think there's any ambiguity about interpretation of data structures in the cmap table: if a client searches for and finds a mapping for U+00E4, it will get a default glyph ID. That's completely unambiguous.

There may be ambiguity about how to process Unicode character sequences prior to making cmap lookups. And that may result in ambiguity for font developers about what cmap entries should be included. But that's not really separate from ambiguity as to what ccmp or other substitutions might be needed. It's all just one aspect of the need for script-implementation / shaping specs beyond the core OT spec.
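One way to make the ambiguity concrete: nothing in the spec says whether a client should attempt something like the following decomposition fallback before giving up on an unmapped character. A hypothetical sketch, with the cmap subtable modeled as a plain dict (no real font is loaded):

```python
import unicodedata

def map_char(cmap, ch):
    """Map one character to glyph IDs, falling back to its canonical
    decomposition if there is no direct cmap entry. This policy is
    one possibility, not something mandated by the OT spec."""
    gid = cmap.get(ord(ch))
    if gid is not None:
        return [gid]
    decomposed = unicodedata.normalize("NFD", ch)
    if decomposed != ch:
        gids = [cmap.get(ord(c)) for c in decomposed]
        if all(g is not None for g in gids):
            return gids
    return None  # .notdef territory

# Toy cmap: only the decomposed pieces of U+00E4 are mapped
cmap = {0x0061: 10, 0x0308: 42}
assert map_char(cmap, "\u00e4") == [10, 42]  # falls back to NFD
assert map_char(cmap, "b") is None           # no entry, no decomposition
```

Whether a client performs this fallback, normalizes the whole run first, or does neither is exactly what is left unspecified.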

If there were some general principle regarding normalization that would make sense as a default assumption across all scripts, then I could perhaps see that as being valid to include in the OT spec. The only such principle I can think of that would be safe to make at this point would be:

When creating a Unicode encoding 'cmap' subtable, font developers should not make any assumptions regarding how applications will handle Unicode normalization during the text layout process. Thus, 'cmap' entries should be included for precomposed characters and also for each of the characters in the canonically-equivalent decomposition. There may be additional factors that need to be considered on a script-by-script basis, such as whether a font needs to include certain glyph substitutions using 'ccmp' or other features. These are outside the scope of this specification.

If that's along the lines of what you had in mind, that could be added in the Recommendations chapter.
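The recommendation above is mechanically checkable. A hypothetical lint pass over a font's Unicode cmap, again modeled as a dict (with fontTools one could obtain such a mapping via `TTFont(path).getBestCmap()`):

```python
import unicodedata

def missing_decomposed_entries(cmap):
    """For each mapped precomposed character, report any code points
    in its canonical decomposition that the cmap does not also map."""
    report = {}
    for cp in cmap:
        decomposed = unicodedata.normalize("NFD", chr(cp))
        if decomposed != chr(cp):
            missing = [ord(c) for c in decomposed if ord(c) not in cmap]
            if missing:
                report[cp] = missing
    return report

# U+00E4 is mapped, but U+0308 from its decomposition is not
cmap = {0x00E4: 5, 0x0061: 10}
assert missing_decomposed_entries(cmap) == {0x00E4: [0x0308]}
```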

NorbertLindenberg commented 4 years ago

> I don't think there's any ambiguity about interpretation of data structures in the cmap table: if a client searches for and finds a mapping for U+00E4, it will get a default glyph ID. That's completely unambiguous.

This is locally correct, and at the same time useless in the broader context of creating interoperable fonts, font producers, and font consumers.

> When creating a Unicode encoding 'cmap' subtable, font developers should not make any assumptions regarding how applications will handle Unicode normalization during the text layout process. Thus, 'cmap' entries should be included for precomposed characters and also for each of the characters in the canonically-equivalent decomposition. There may be additional factors that need to be considered on a script-by-script basis, such as whether a font needs to include certain glyph substitutions using 'ccmp' or other features. These are outside the scope of this specification.
>
> If that's along the lines of what you had in mind, that could be added in the Recommendations chapter.


No, that’s not what I have in mind. What I have in mind is a specification that’s comprehensive enough that newcomers can develop an interoperable rendering system based on it, and that font producers know what they have to do to create interoperable fonts. The reality is that renderers will be presented with text that’s in NFC, or in NFD, or unnormalized; that some rendering systems normalize and others don’t; and that some shaping engines assume their own normalized forms that are incompatible with Unicode normalization. This reflects that Microsoft and ISO have failed to deliver a spec that enables interoperability.

PeterCon commented 4 years ago

> This reflects that Microsoft and ISO have failed to deliver a spec that enables interoperability.

I agree, with one qualification: I'd say specs, not spec. The OT/OFF specs are definitely not complete specifications for purposes of ensuring interoperable fonts and layout implementations for display of Unicode text. Since the 1990s, there has always been a big gap with regard to shaping and script-specific implementation requirements. You're absolutely right that

> renderers will be presented with text that’s in NFC, or in NFD, or unnormalized

And there is no guarantee of interoperability when

> some rendering systems normalize and others don’t, and some shaping engines assume their own normalized forms that are incompatible with Unicode normalization

with no common specification followed in all rendering implementations. I absolutely agree that this is required and that such specs should be developed.

The whole thing is going to be fairly complex. There are two interfaces to consider:

There will be several details here that will need script-specific treatment.

All such details need a common specification to ensure interoperability. But IMO that's too much to cram into the core OT spec. There are already some snippets (e.g., OMPL and related info), but I'd be more inclined to move those out of the OT spec into some separate companion specs than to add more of this type of content. Because until all of the above details are specified, there is no assurance of interop.

behdad commented 4 years ago

This is an area worth exploring and improving.

As I understand it, HarfBuzz is currently the only implementation that tries to reconcile Unicode normalization with OpenType shaping, by figuring out which canonically equivalent form of the input the font can handle and using that. I think it's a good cause to try to formalize that and encourage other shaping engines to implement it.

behdad commented 4 years ago

The HarfBuzz implementation closely follows the Unicode Normalization Algorithm, but is tailored to work with various script-specific OpenType shapers:

https://github.com/harfbuzz/harfbuzz/blob/master/src/hb-ot-shape-normalize.cc

I can put that in a pseudo-code form if there is interest.
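Pending that, the general idea can be sketched roughly as follows (a toy simplification of the approach, not the actual HarfBuzz algorithm, which is considerably more involved and script-aware): fully decompose, apply canonical ordering, then recompose pairs only when the font actually has a glyph for the composite.

```python
import unicodedata

def font_guided_normalize(text, has_glyph):
    """Toy font-guided normalization: decompose to NFD (which also
    applies canonical ordering), then greedily recompose adjacent
    pairs only if the font can render the composite."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if out:
            composed = unicodedata.normalize("NFC", out[-1] + ch)
            if len(composed) == 1 and has_glyph(composed):
                out[-1] = composed
                continue
        out.append(ch)
    return "".join(out)

# A toy font that has a glyph for "ä" but not for "ö":
# only "ä" gets recomposed, "ö" stays decomposed
glyphs = set("a\u00e4\u0308o")
assert font_guided_normalize("\u00e4\u00f6", glyphs.__contains__) == "\u00e4o\u0308"
```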

mikeday commented 4 years ago

@behdad I would be interested to see that; we were also planning to investigate font-guided normalisation approaches, so it would make sense not to reinvent the wheel.