[cmap] Clarity needed on interaction of multiple cmap subtables

Lorp commented 4 years ago

It would be good to clarify the intended interaction of multiple cmap subtables with the same format/encoding. It appears not to be specified whether an implementation is supposed to choose one of the formats (how should it choose?) or create a union of all those formats that share a particular format/encoding.

One option would be that an implementation maintains an ordered list of Format IDs, from which it uses the first that it finds and discards the others. But Format 14 (Unicode Variation Sequences) talks about handling only a specific type of mapping, so seems suited to acting on only a subset of Unicode, and Format 13 (many-to-one) has obvious use cases as a supplemental mapping to, say, Format 4 for the rest of the characters.

I can also imagine implementors of font creation tools tempted to use format 4 for Unicode IDs < 65536 and format 12 for IDs >= 65536 in the same font, yet the sentence in the Format 12 spec "This is the standard character-to-glyph-index mapping table for the Windows platform for fonts supporting Unicode supplementary-plane characters" hints that a Format 4 subtable also in the font could be ignored.

Assuming subtables are intended to be unified, then handling of overlaps in coverage must be considered too. "Undefined behaviour" is more useful than leaving implementors to figure it out.

Document Details

⚠ Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.

ID: 86923ff4-2b60-9144-e2b4-77cdb36fd698
Version Independent ID: ea221e53-f51a-8233-3c0b-d416e5360623
Content: cmap - Character To Glyph Index Mapping Table - Typography
Content Source: typographydocs/opentype/spec/cmap.md
Product: typography
GitHub Login: @PeterCon
Microsoft Alias: PeterCon

khaledhosny commented 4 years ago

All subtables except format 14 are exclusive: if one is used, the others are discarded. But the spec should be clear. It would also be nice to recommend an ordered list of (platform, format) subtables for font consumers to use, since it is currently up to each implementation. For font producers, it should discourage creating legacy platforms and/or subtable formats.

Lorp commented 4 years ago

It would be good for the spec to make a recommendation on how to prioritize multiple Formats, if more than one exists in the 3/1 platform. A cmap processor would then use the first of these Format combinations that exists in the font. Based on discussion in #269 and @khaledhosny’s comment above I propose:

F12 + F14 if present
F4 + F14 if present
F14
F13
F10 + F14 if present
F6 + F14 if present
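
To make the proposal concrete, here is a minimal Python sketch of how a cmap processor might walk that list. The names `PROPOSED_ORDER` and `choose_formats` are hypothetical, and the input is assumed to be simply the set of format numbers present in the font; it is an illustration of the proposal, not a description of any existing implementation.

```python
# Proposed order from the list above. Each entry is (primary format, may add format 14).
# A primary of None stands for the bare "F14" entry, where only format 14 is used.
PROPOSED_ORDER = [
    (12, True),
    (4, True),
    (None, True),
    (13, False),
    (10, True),
    (6, True),
]


def choose_formats(available):
    """Return the list of formats to use, given the set of formats present."""
    for primary, with_uvs in PROPOSED_ORDER:
        if primary is None:
            # The bare "F14" entry: use format 14 on its own if it exists.
            if 14 in available:
                return [14]
            continue
        if primary in available:
            chosen = [primary]
            # Format 14 supplements the primary subtable when it is present.
            if with_uvs and 14 in available:
                chosen.append(14)
            return chosen
    return []
```

For example, a font with formats 4, 12 and 14 would use 12 plus 14, and the format 4 subtable would be ignored.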

It would be good to know if existing implementations prioritize similarly.

PeterCon commented 4 years ago

Before adding a detailed priority list, I'd want to see broad discussion and consensus.

But Khaled's comment is a conservative statement that is almost certainly consistent with any existing implementations. So, I'll propose this draft text to go in the "Encoding records and encodings" section just before the "Unicode platform (platform ID = 0)" heading:

Apart from a format 14 subtable, all other subtables are exclusive: applications should select and use one and ignore the others. If a Unicode subtable is used (platform 0, or platform 3 / encoding 1 or 10), then a format 14 subtable using platform 0/encoding 5 can also be supplemented for mapping Unicode Variation Sequences.

(Looking at additional changes for this issue.)

PeterCon commented 4 years ago

I'm guessing this will be an uncontroversial revision wrt platform 4:

Platform ID 4 is a legacy platform that was created to provide compatibility of older applications with OpenType fonts that had been adapted from older Type 1 fonts. This platform is not commonly used today, and should not be used in new fonts.

This 'cmap' encoding provides a compatibility mechanism for non-Unicode applications that use the font as if it were Windows ANSI encoded. Non-Windows ANSI Type 1 fonts, such as Cyrillic and Central European fonts, that Adobe shipped in the past had “0” (Windows ANSI) recorded in the CharSet field of the .PFM file; ATM for Windows 9x ignores the CharSet altogether. Adobe provides this compatibility 'cmap' encoding in every OTF converted from a Type1 font in which the Encoding is not StandardEncoding.

If a platform ID 4 (custom), encoding ID 0-255 (OTF Windows NT compatibility mapping) 'cmap' encoding is present in an OpenType font with CFF outlines, then the OTF font driver in Windows NT will: (a) superimpose the glyphs encoded at character codes 0-255 in the encoding on the corresponding Windows ANSI (code page 1252) Unicode values in the Unicode encoding it reports to the system; (b) add Windows ANSI (CharSet 0) to the list of CharSets supported by the font; and (c) consider the value of the encoding ID to be a Windows CharSet value and add it to the list of CharSets supported by the font. Note: The 'cmap' subtable must use Format 0 or 6 for its subtable, and the encoding must be identical to the CFF’s encoding.

This 'cmap' encoding is not required. It provides a compatibility mechanism for non-Unicode applications that use the font as if it were Windows ANSI encoded. Non-Windows ANSI Type 1 fonts, such as Cyrillic and Central European fonts, that Adobe shipped in the past had “0” (Windows ANSI) recorded in the CharSet field of the .PFM file; ATM for Windows 9x ignores the CharSet altogether. Adobe provides this compatibility 'cmap' encoding in every OTF converted from a Type1 font in which the Encoding is not StandardEncoding.

PeterCon commented 4 years ago

I can also imagine implementors of font creation tools tempted to use format 4 for Unicode IDs < 65536 and format 12 for IDs >= 65536 in the same font, yet the sentence in the Format 12 spec "This is the standard character-to-glyph-index mapping table for the Windows platform for fonts supporting Unicode supplementary-plane characters" hints that a Format 4 subtable also in the font could be ignored.

These details are already stated clearly in the "Encoding records and encodings" | "Windows platform (platform ID = 3)" section:

Fonts that support Unicode supplementary-plane characters (U+10000 to U+10FFFF) on the Windows platform must have a format 12 subtable for platform ID 3, encoding ID 10. To ensure backward compatibility with older software and devices, a format 4 subtable for platform ID 3, encoding ID 1 is also required. The characters supported in the format 4 subtable must be a subset of the characters in the format 12 subtable and should include all of the Unicode BMP characters supported by the font.

And the format 12 description already includes a link to that section. To avoid potential confusion, though, some of that could be repeated in the format 12 description:

This is the standard character-to-glyph-index mapping ~~table~~subtable for the Windows platform for fonts supporting Unicode character repertoires that include supplementary-plane characters (U+10000 to U+10FFFF). See [Windows platform (platform ID = 3)][3] above for additional details regarding subtable formats for Unicode encoding on the Windows platform.

Note: For compatibility with older applications, fonts with a format 12 subtable should also include a format 4 subtable. The characters mapped in a format 4 subtable must be a subset of those mapped in the format 12 subtable and should include all Unicode BMP characters supported in the font. The format 12 table should include all BMP and supplementary-plane characters supported by the font.

(The note is worded in a way that's intended to be platform agnostic.)

PeterCon commented 4 years ago

Summarizing proposed changes for the next version (including some changes made for other related issues). This doesn't go as far as we'd all eventually like, which is all platforms supporting the same platform/encoding and a short list of formats. These are reasonably conservative changes in scope for 1.8.4.

Overview: add new paragraph:

Of the seven available formats, not all are commonly used today. Formats 4 or 12 are appropriate for most new fonts, depending on the Unicode character repertoire supported. Format 14 is used in many applications for support of Unicode variation sequences. Some platforms also make use of format 13 for a last-resort fallback font. Application developers should anticipate that other formats may also be used in fonts.

Encoding records and encodings: add new paragraph:

Apart from a format 14 subtable, all other subtables are exclusive: applications should select and use one and ignore the others. If a Unicode subtable is used (platform 0, or platform 3 / encoding 1 or 10), then a format 14 subtable using platform 0/encoding 5 can also be supplemented for mapping Unicode Variation Sequences.

Macintosh platform (platform ID = 1):

~~When building a font that will be used on the Macintosh, the platform ID should be 1 and the encoding ID should be 0.~~Older Macintosh versions required fonts to have a 'cmap' subtable for platform ID 1. For current Apple platforms, use of platform ID 1 is discouraged.

Windows platform (platform ID = 3):

When building a Unicode font for Windows, the platform ID should be 3 and the encoding ID should be 1. When building a symbol font for Windows, the platform ID should be 3 and the encoding ID should be 0.

Microsoft strongly recommends using Unicode 'cmap' subtables for all fonts. However, other non-Unicode encodings are also used in existing fonts with the Windows platform. The following are encoding IDs defined for the Windows platform:

The Windows platform supports several encodings. When creating fonts for Windows, Unicode 'cmap' subtables should always be used—platform ID 3 with encoding ID 1 or encoding ID 10. See below for additional details.

The following encoding IDs are supported on the Windows platform:

... ~~When building a symbol font for Windows, the platform ID should be 3 and the encoding ID should be 0.~~The symbol encoding was created to support fonts with arbitrary ornaments or symbols not supported in Unicode or other standard encodings. A format 4 subtable would be used, typically with up to 224 graphic characters assigned at code positions beginning with 0xF020. This corresponds to a sub-range within the Unicode Private-Use Area (PUA), though this is not a Unicode encoding. In legacy usage, some applications would represent the symbol characters in text using a single-byte encoding, and then map 0x20 to the OS/2.usFirstCharIndex value in the font. In new fonts, symbols or characters not in Unicode should be encoded using PUA code points in a Unicode 'cmap' subtable.
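
As an illustration of the legacy single-byte usage described in that paragraph, a minimal Python sketch follows. The function name `symbol_code_position` is hypothetical, and the default of 0xF020 for `us_first_char_index` is an assumption taken from the "beginning with 0xF020" wording; a real font supplies its own OS/2.usFirstCharIndex value.

```python
def symbol_code_position(byte_value, us_first_char_index=0xF020):
    """Map a single-byte symbol character to its code position in the symbol 'cmap'.

    Byte 0x20 maps to us_first_char_index, 0x21 to the next position, and so on.
    The range 0x20..0xFF gives the 224 graphic characters mentioned in the text.
    """
    if not 0x20 <= byte_value <= 0xFF:
        raise ValueError("expected a graphic character in the range 0x20..0xFF")
    return us_first_char_index + (byte_value - 0x20)
```

The resulting code position would then be looked up in the format 4 subtable as usual.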

...

Custom platform (platform ID = 4):

Platform ID 4 is a legacy platform that was created to provide compatibility of older applications with OpenType fonts that had been adapted from older Type 1 fonts. This platform is not commonly used today, and should not be used in new fonts.

This 'cmap' encoding provides a compatibility mechanism for non-Unicode applications that use the font as if it were Windows ANSI encoded. Non-Windows ANSI Type 1 fonts, such as Cyrillic and Central European fonts, that Adobe shipped in the past had “0” (Windows ANSI) recorded in the CharSet field of the .PFM file; ATM for Windows 9x ignores the CharSet altogether. Adobe provides this compatibility 'cmap' encoding in every OTF converted from a Type1 font in which the Encoding is not StandardEncoding.

If a platform ID 4 (custom), encoding ID 0-255 (OTF Windows NT compatibility mapping) 'cmap' encoding is present in an OpenType font with CFF outlines, then the OTF font driver in Windows NT will: (a) superimpose the glyphs encoded at character codes 0-255 in the encoding on the corresponding Windows ANSI (code page 1252) Unicode values in the Unicode encoding it reports to the system; (b) add Windows ANSI (CharSet 0) to the list of CharSets supported by the font; and (c) consider the value of the encoding ID to be a Windows CharSet value and add it to the list of CharSets supported by the font. Note: The 'cmap' subtable must use Format 0 or 6 for its subtable, and the encoding must be identical to the CFF’s encoding.

This 'cmap' encoding is not required. It provides a compatibility mechanism for non-Unicode applications that use the font as if it were Windows ANSI encoded. Non-Windows ANSI Type 1 fonts, such as Cyrillic and Central European fonts, that Adobe shipped in the past had “0” (Windows ANSI) recorded in the CharSet field of the .PFM file; ATM for Windows 9x ignores the CharSet altogether. Adobe provides this compatibility 'cmap' encoding in every OTF converted from a Type1 font in which the Encoding is not StandardEncoding.

Format 0:

~~This is the Apple standard character to glyph index mapping table.~~Format 0 was the standard mapping subtable used on older Macintosh platforms but is not required on newer Apple platforms.

Format 2:

~~This subtable is useful for~~This subtable format was created for “double-byte” encodings following the national character code standards used for Japanese, Chinese, and Korean characters. These code standards use a mixed 8-/16-bit encoding~~, in which certain byte values signal the first byte of a 2-byte character (but these values are also legal as the second byte of a 2-byte character)~~. This format is not commonly used today.

In these mixed 8-/16-bit encodings, certain byte values signal the first byte of a 2-byte character. (These byte values are also legal as the second byte of a 2-byte character.) In addition, even for the 2-byte characters, the mapping of character codes to glyph index values depends heavily on the first byte. Consequently, the table begins with an array that maps the first byte to a SubHeader record. For 2-byte character codes, the SubHeader is used to map the second byte’s value through a subArray, as described below. When processing mixed 8-/16-bit text, SubHeader 0 is special: it is used for single-byte character codes. When SubHeader 0 is used, a second byte is not needed; the single byte value is mapped through the subArray.

Format 4:

This is the standard character-to-glyph-index mapping table ~~for the Windows platform~~commonly used for fonts that support Unicode BMP characters. ~~See [Windows platform (platform ID = 3)][3] above for additional details regarding subtable formats for Unicode encoding on the Windows platform.~~(To support Unicode supplementary-plane characters, format 12 should be used.)

Format 6: add intro paragraph

Format 6 was designed to map 16-bit characters to glyph indexes when the character codes for a font fall into a single contiguous range.
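
A minimal Python sketch of a format 6 lookup may help show what "a single contiguous range" means in practice. It assumes the usual big-endian layout of the subtable (format, length, language, firstCode, entryCount, then the glyph ID array); the function names are hypothetical.

```python
import struct


def parse_format6(data):
    """Return (first_code, glyph_ids) from the bytes of a format 6 subtable."""
    fmt, _length, _language, first_code, entry_count = struct.unpack_from(">5H", data, 0)
    if fmt != 6:
        raise ValueError("not a format 6 subtable")
    # The glyph ID array starts right after the five uint16 header fields.
    glyph_ids = struct.unpack_from(f">{entry_count}H", data, 10)
    return first_code, glyph_ids


def lookup_format6(first_code, glyph_ids, code):
    """Map a character code to a glyph ID; codes outside the range map to glyph 0."""
    index = code - first_code
    if 0 <= index < len(glyph_ids):
        return glyph_ids[index]
    return 0
```

Because the array is indexed directly by `code - first_code`, the format is compact only when the supported codes have no large gaps.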

Format 8:

Subtable format 8 was designed to support Unicode supplementary-plane characters in UTF-16 encoding, though it is not commonly used. Format 8 is similar to format 2, in that it provides for mixed-length character codes. Instead of allowing for 8- and 16-bit character codes, however, it allows for 16- and 32-bit character codes.

Format 10:

Subtable format 10 was designed to support Unicode supplementary-plane characters, though it is not commonly used. Format 10 is similar to format 6, in that it defines a trimmed array for a tight range of character codes. It differs, however, in that it uses 32-bit character ~~codes:~~codes.

Format 12:

This is the standard character-to-glyph-index mapping ~~table~~subtable for the Windows platform for fonts supporting Unicode character repertoires that include supplementary-plane characters (U+10000 to U+10FFFF). See [Windows platform (platform ID = 3)][3] above for additional details regarding subtable formats for Unicode encoding on the Windows platform.

Note: For compatibility with older applications, fonts with a format 12 subtable should also include a format 4 subtable. The characters mapped in a format 4 subtable must be a subset of those mapped in the format 12 subtable and should include all Unicode BMP characters supported in the font. The format 12 table should include all BMP and supplementary-plane characters supported by the font.

behdad commented 4 years ago

It would be good to know if existing implementations prioritize similarly.

Here's what HarfBuzz / FontTools do: https://github.com/MicrosoftDocs/typography-issues/issues/269#issuecomment-689135816

PeterCon commented 4 years ago

Also related changes in the Recommendations chapter:

'cmap' Table

~~### Windows 'cmap' Table~~

When building a font for Windows, a 'cmap' subtable for platform ID 3 should be included. When building a Unicode font, encoding ID 1 should be used for this subtable. (This subtable must use format 4.) When building a symbol font for Windows, encoding ID 0 should be used for this subtable.

When building a font to support Unicode supplementary characters (U+10000 to U+10FFFF), include a 'cmap' subtable for platform ID 3, encoding ID 10. (This subtable must use format 12.) To provide compatibility with older software, a subtable for platform 3, encoding ID 1 should also be included. Depending on application support and the content of text being displayed, either the 3/1/4 or 3/10/12 subtable may be used. Therefore, glyph mappings for characters in the range U+0000 to U+FFFF must be identical between the 3/1/4 and 3/10/12 subtables. Also note that the characters mapped in the 3/10/12 subtable must be a superset of the characters mapped in the 3/1/4 subtable.

When creating a font to support Unicode supplementary-plane characters (U+10000 to U+10FFFF), a format 12 subtable is required. Older applications might not support a format 12 subtable, and so a format 4 subtable can also be included. Since either the format 4 or format 12 subtable may be used in different contexts, the glyph mappings for characters in the range U+0000 to U+FFFF should be identical.

On the Windows platform, a format 4 subtable should use platform ID 3, encoding ID 1; a format 12 subtable should use platform ID 3, encoding ID 10. Other platforms may also support format 4 or 12 subtables using the Unicode platform (platform ID 0).

~~Remember that encoding records must be stored in sorted order by platform ID, then by encoding ID.~~

~~### Macintosh 'cmap' Table~~

When building a font containing Roman characters that will be used on the Macintosh, an additional subtable is required, specifying platform ID of 1 and encoding ID of 0 (this subtable may use 'cmap' formats 0, 2, 4, or 6).

In order for the Macintosh 'cmap' table to be useful, the glyphs required for the Macintosh must have glyph indices less than 256 (since the 'cmap' subtable format 0 uses uint8 indices and therefore cannot index any glyph above 255).

~~The Apple 'cmap' subtable should be constructed according to Apple guidelines.~~

behdad commented 4 years ago

When creating a font to support Unicode supplementary-plane characters (U+10000 to U+10FFFF), a format 12 subtable is required. Older applications might not support a format 12 subtable, and so a format 4 subtable can also be included. Since either the format 4 or format 12 subtable may be used in different contexts, the glyph mappings for characters in the range U+0000 to U+FFFF should be identical.

How much longer are we going to keep this "Older applications might not..." there? 15 years not long enough?

behdad commented 4 years ago

Please make sure it's legal according to the spec to have only a Format 12 and no Format 4.

PeterCon commented 4 years ago

Revision to draft based on review feedback:

In the 'cmap' chapter, Encoding records and encodings > Windows platform (platform ID = 3) section:

Fonts that support Unicode supplementary-plane characters (U+10000 to U+10FFFF) on the Windows platform must have a format 12 subtable for platform ID 3, encoding ID 10. To ensure backward compatibility with older software and devices, a format 4 subtable for platform ID 3, encoding ID 1 is also required. The characters supported in the format 4 subtable must be a subset of the characters in the format 12 subtable and should include all of the Unicode BMP characters supported by the font.

See the Recommendations chapter for additional information.

In the 'cmap' chapter, Format 12 section:

Removed this addition that was in the previous draft:

Note: For compatibility with older applications, fonts with a format 12 subtable should also include a format 4 subtable. The characters mapped in a format 4 subtable must be a subset of those mapped in the format 12 subtable and should include all Unicode BMP characters supported in the font. The format 12 table should include all BMP and supplementary-plane characters supported by the font.

Instead, added the following:

Fonts that include a format 12 subtable can also include a format 4 subtable for compatibility with older applications. This is not required, however. See the Recommendations chapter for additional information.

In the Recommendations chapter:

When creating a font to support Unicode supplementary-plane characters (U+10000 to U+10FFFF), a format 12 subtable is required. Older applications might not support a format 12 subtable, and so a format 4 subtable can also be included. A format 4 subtable is not required, however. If both are included, either subtable may be used in different contexts, and so the glyph mappings for characters in the range U+0000 to U+FFFF should be identical. The format 12 table should include all BMP and supplementary-plane characters supported by the font.
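
A font tool could check this recommendation along the following lines. This is a minimal Python sketch that assumes both subtables have already been decoded into dictionaries from code point to glyph ID; the function name `check_bmp_consistency` is hypothetical.

```python
def check_bmp_consistency(format4, format12):
    """Return a list of problems; an empty list means the two mappings agree.

    format4 and format12 are dicts mapping Unicode code points to glyph IDs.
    """
    problems = []
    for cp, gid in sorted(format4.items()):
        if cp not in format12:
            problems.append(f"U+{cp:04X} is in format 4 but missing from format 12")
        elif format12[cp] != gid:
            problems.append(f"U+{cp:04X} maps to different glyphs in the two subtables")
    for cp in sorted(format12):
        # Only BMP code points can appear in a format 4 subtable.
        if cp <= 0xFFFF and cp not in format4:
            problems.append(f"U+{cp:04X} is a BMP character missing from format 4")
    return problems
```

Supplementary-plane entries in the format 12 mapping are ignored by the check, since format 4 cannot represent them.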

Lorp commented 4 years ago

Can you clarify this section in the "Encoding records and encodings" intro?

Each platform ID, platform-specific encoding ID, and subtable language combination may appear only once in the 'cmap' table.

I think the intention is that identical combinations of platform ID, platform-specific encoding ID, subtable language and format may appear only once.

I’d like some clarification on the relationship between Format and Encoding, specifically:

May a Format 4 subtable use Windows Encoding 10?
May a Format 12 subtable that covers only BMP use Windows Encoding 1?
May a Format 13 subtable use either Windows Encoding 1 or Windows Encoding 10?

PeterCon commented 4 years ago

I don't think it was ever intended that there could be two different format subtables for the same platform/encoding/language combination. Am I wrong? What's a scenario in which that might be done?
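
Under that reading, a validator would key on platform, encoding and language only, and treat the format as irrelevant to uniqueness. A minimal Python sketch, assuming each subtable is summarized as a hypothetical (platform ID, encoding ID, language, format) tuple:

```python
from collections import Counter


def duplicate_combinations(subtables):
    """Return platform/encoding/language combinations that appear more than once.

    subtables is an iterable of (platform_id, encoding_id, language, format) tuples.
    The format is deliberately left out of the key.
    """
    counts = Counter((p, e, lang) for p, e, lang, _fmt in subtables)
    return sorted(key for key, n in counts.items() if n > 1)
```

A font with 3/1 format 4, 3/10 format 12 and 0/5 format 14 passes, while two subtables of different formats both under 3/1 with language 0 would be reported.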

PeterCon commented 4 years ago

I’d like some clarification on the relationship between Format and Encoding, specifically:

May a Format 4 subtable use Windows Encoding 10?

May a Format 12 subtable that covers only BMP use Windows Encoding 1?

May a Format 13 subtable use either Windows Encoding 1 or Windows Encoding 10?

Here's a proposed revision:

Fonts that support Unicode BMP characters (U+0000 to U+FFFF) on the Windows platform must have a format 4 'cmap' subtable for platform ID 3, platform-specific encoding 1. Only a format 4 subtable should be used for platform 3, encoding 1. This encoding must not be used to support Unicode supplementary-plane characters.

Fonts that support Unicode supplementary-plane characters (U+10000 to U+10FFFF) on the Windows platform must have a format 12 subtable for platform ID 3, encoding ID 10. Only a format 12 subtable should be used for platform 3, encoding 10. To ensure backward compatibility with older software and devices, a format 4 subtable for platform ID 3, encoding ID 1 is also required. The characters supported in the format 4 subtable must be a subset of the characters in the format 12 subtable and should include all of the Unicode BMP characters supported by the font.

This doesn't rule out the possibility of using 3/1 with a format 12 subtable that maps only BMP characters, or the possibility of using 3/10 with a format 4 subtable.

And it doesn't rule out using format 13 with 3/1 or 3/10. But IIRC none of GDI, GDI+, WPF or DWrite support format 13, so I don't know that it would be particularly useful.

Lorp commented 4 years ago

I don't think it was ever intended that there could be two different format subtables for the same platform/encoding/language combination. Am I wrong? What's a scenario in which that might be done?

Sorry, I assumed a font with Format 12 + Format 14 would cause this. Now I realize Format 14 always uses platform 0, encoding 5.

PeterCon commented 4 years ago

The Unicode platform section already states:

A format 14 subtable must only be used under platform ID 0 and encoding ID 5.
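
For tools that want to enforce this, a minimal Python sketch of the check follows. It assumes the raw bytes of the 'cmap' table are available and relies on the table header layout (version, numTables, then 8-byte encoding records of platform ID, encoding ID and a 32-bit offset), with each subtable starting with its uint16 format. The function names are hypothetical.

```python
import struct


def encoding_records(cmap):
    """Return a list of (platform_id, encoding_id, subtable_format) from raw 'cmap' bytes."""
    _version, num_tables = struct.unpack_from(">HH", cmap, 0)
    records = []
    for i in range(num_tables):
        platform_id, encoding_id, offset = struct.unpack_from(">HHI", cmap, 4 + 8 * i)
        # Every subtable begins with its format number.
        (subtable_format,) = struct.unpack_from(">H", cmap, offset)
        records.append((platform_id, encoding_id, subtable_format))
    return records


def misplaced_format14(cmap):
    """Return the (platform, encoding) pairs that use format 14 outside of 0/5."""
    return [(p, e) for p, e, f in encoding_records(cmap) if f == 14 and (p, e) != (0, 5)]
```

An empty result means every format 14 subtable in the font sits under platform 0, encoding 5.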

So that it's less likely to be overlooked, that could be repeated in the Format 14 section.

Lorp commented 4 years ago

Yes, I finally read the text you quote. Indeed, I would have expected it in the Format 14 section.

PeterCon commented 4 years ago

Updated draft:

Format 14: Unicode Variation Sequences

Subtable format 14 specifies the Unicode Variation Sequences (UVSes) supported by the font. A Variation Sequence, according to the Unicode Standard, comprises a base character followed by a variation selector. For example, <U+82A6, U+E0101>.

This subtable format must only be used under platform ID 0 and encoding ID 5.
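
To illustrate what a consumer looks for before consulting a format 14 subtable, here is a minimal Python sketch that pairs base characters with a following variation selector. It only recognizes the two Unicode blocks U+FE00 to U+FE0F and U+E0100 to U+E01EF, which is a simplifying assumption, and the function names are hypothetical.

```python
def is_variation_selector(cp):
    """True for code points in the two variation selector blocks handled here."""
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def split_uvs(text):
    """Return a list of (base, selector) pairs; selector is None when absent."""
    cps = [ord(ch) for ch in text]
    result = []
    i = 0
    while i < len(cps):
        has_selector = (
            i + 1 < len(cps)
            and is_variation_selector(cps[i + 1])
            and not is_variation_selector(cps[i])
        )
        if has_selector:
            result.append((cps[i], cps[i + 1]))
            i += 2
        else:
            result.append((cps[i], None))
            i += 1
    return result
```

The example sequence <U+82A6, U+E0101> comes out as a single pair, which would be looked up in format 14; characters without a selector go to the regular Unicode subtable alone.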

NorbertLindenberg commented 4 years ago

It would be good to know if existing implementations prioritize similarly.

Apple’s documentation shows this order:

platform 0 / encoding 4
platform 0 / encoding < 4
platform 3 / encoding 10
platform 3 / encoding 1
platform 3 / encoding 0

Unfortunately not the same as HarfBuzz.

Note that both HarfBuzz and CoreText prioritize by the combination platform/encoding; the format is not considered.

nedley commented 4 years ago

Apple’s documentation shows this order:

That table is listing supported combinations, not the order of preference.

behdad commented 4 years ago

It would be good to know if existing implementations prioritize similarly.

Apple’s documentation shows this order:

platform 0 / encoding 4
platform 0 / encoding < 4
platform 3 / encoding 10
platform 3 / encoding 1
platform 3 / encoding 0

Unfortunately not the same as HarfBuzz.

Pretty sure their code is closer to HarfBuzz. Eg. I'm sure they consider 0/6 as well.

nedley commented 4 years ago

Apple’s current order is:

3/10 0/4 0/3 3/1 0/2 0/1 0/0 0/6 3/0

NorbertLindenberg commented 4 years ago

Moving comment from #614 here as requested by Peter:

Please remove the recommendations for the cmap table from the Recommendations page. Platforms have converged enough that requirements for new fonts can be clearly specified in the specification for the cmap table, as discussed in #269. The web and most font vendors depend on fonts to work across platforms.

NorbertLindenberg commented 4 years ago

Please reopen this bug. We’re clearly not done discussing it.

behdad commented 4 years ago

Apple’s current order is:

3/10 0/4 0/3 3/1 0/2 0/1 0/0 0/6 3/0

Thanks Ned.

This is very close to what HarfBuzz does. Can we discuss three differences I see:

0/6 seems to have been added to Apple's list later and was just appended at the end. I think it belongs after 3/10 and before 0/4.
We have 0/3 after 3/1; you have it before. That seems arbitrary and insignificant. I put it after 3/1 for the same reason that both of us decided to put 0/ after 3/10. That is, 0/3 is a BMP-only encoding, so I put it after 3/1.
We prefer 3/0 (symbol) over everything else. You seem to put it last. That was done for us as https://github.com/harfbuzz/harfbuzz/issues/1918
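
To see where the two orders diverge, a minimal Python sketch follows. `APPLE_ORDER` is the list quoted above. `HARFBUZZ_LIKE_ORDER` is an assumption: it is derived by applying the three differences just listed to Apple's order, not copied from HarfBuzz source. The `pick` helper is hypothetical.

```python
# Platform/encoding pairs in order of preference.
APPLE_ORDER = [(3, 10), (0, 4), (0, 3), (3, 1), (0, 2), (0, 1), (0, 0), (0, 6), (3, 0)]

# Assumption: Apple's order with the three differences applied
# (0/6 right after 3/10, 0/3 after 3/1, and 3/0 first).
HARFBUZZ_LIKE_ORDER = [(3, 0), (3, 10), (0, 6), (0, 4), (3, 1), (0, 3), (0, 2), (0, 1), (0, 0)]


def pick(present, order):
    """Return the first platform/encoding pair in order that the font provides."""
    for key in order:
        if key in present:
            return key
    return None
```

With a font that has both 0/3 and 0/6, the first order picks 0/3 and the second picks 0/6. With a font that has both 3/1 and 3/0, the first picks 3/1 and the second picks the symbol subtable.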

PeterCon commented 4 years ago

I'll act on review feedback if there's something actionable without reactivating the issue.

nedley commented 4 years ago

This is very close to what HarfBuzz does. Can we discuss three differences I see:

I don’t see the position of 0/6 being very relevant: since it is meant for a format 13 subtable, using it with any other subtable format is missing the point.
I agree, the relative ordering of 0/3 and 3/1 is both arbitrary and insignificant.
We haven’t received any customer feedback regarding symbol fonts other than Wingdings but we can consider it in the future.

behdad commented 4 years ago

I don’t see the position of 0/6 being very relevant: since it is meant for a format 13 subtable, using it with any other subtable format is missing the point.

I disagree. 0/6 says: "Unicode full repertoire ('cmap' subtable formats 0, 4, 6, 10, 12, 13)." So it's not just Format13.

Basically, if a font has a Format 4 (BMP-only) 0/3 and a Format 12 0/6, you seem to pick the wrong one. No?

We haven’t received any customer feedback regarding symbol fonts other than Wingdings but we can consider it in the future.

I'll see if I can get someone to do some tests so we can align one way or another.

nedley commented 4 years ago

I don’t see the position of 0/6 being very relevant: since it is meant for a format 13 subtable, using it with any other subtable format is missing the point.

I disagree. 0/6 says: "Unicode full repertoire ('cmap' subtable formats 0, 4, 6, 10, 12, 13)." So it's not just Format13.

Oy. The AAT spec says: “Full Unicode coverage (used with type 13.0 cmaps by OpenType)” and that’s why we invented 0/6. There’s no reason to use it in the 'name' table when there are already multiple valid encoding IDs, so I think it was a mistake to define 0/6 there as opposed to just the 'cmap' table.

NorbertLindenberg commented 4 years ago

Peter has asked me to review the changes he has proposed in this thread. I see the changes as falling into three buckets:

changes to address Laurence’s original issue.
changes to address the question I raised in #269, which cmap subtables are still relevant today?
general documentation enhancements.

For changes in the second bucket, I think they should be proposed in #269 and contrasted with the proposal I made there.

Comments on Peter’s comment 689214218:

Overview: Should be discussed in #269.
Encoding records and encodings: This is relevant to Laurence’s issue, but needs to be combined with the prioritization still being discussed.
Windows platform: The deletion and rewrite of the initial paragraphs should be discussed in #269. The update to the description of the symbol encoding is a general documentation enhancement; I’m not sure why we need it now when it’s clearly obsolete.
Custom platform: The rewrite of this paragraph only serves to highlight that this platform is obsolete, and should be discussed in #269.
Format 0, format 2, format 4: Should be discussed in #269.
Format 6: General documentation enhancement. Looks good.
Format 8 and format 10: General documentation enhancements. Have these formats ever been commonly used?
Format 12: Should be discussed in #269.

Comments on Peter’s comment 689283433:

Should be discussed in #269.

Comments on Peter’s comment 689771183:

All proposed changes should be discussed in #269.

Comments on Peter’s comment 689783643:

Should be discussed in #269.

Comments on Peter’s comment 689847851:

General documentation enhancement. Looks good.

behdad commented 4 years ago

Thanks @nedley

Oy. The AAT spec says: “Full Unicode coverage (used with type 13.0 cmaps by OpenType)” and that’s why we invented 0/6. There’s no reason to use it in the 'name' table when there are already multiple valid encoding IDs, so I think it was a mistake to define 0/6 there as opposed to just the 'cmap' table.

Ah, 0/6 just for Format 13 makes more sense indeed. OT should be updated to reflect that IMO.

Still, picking 0/6 before others makes sense to me; a last-resort font might choose to have a Format4 and a Format13...

Anyway. About symbol encodings, is it correct, I suppose, that CoreText picks up a MacRoman (1/0) encoding over a Unicode 1.0 (0/0) if available? That might explain what we saw in https://github.com/harfbuzz/harfbuzz/issues/1918

The font in question had:

Format 4 platform 0 enc 0 lang 0 0xFF?? (Unicode 1.0)
Format 6 platform 1 enc 0 lang 0 0x?? (Mac Roman)
Format 4 platform 3 enc 0 lang 0 0xFF?? (Windows Symbol)

We were ignoring MacRoman and picking up Unicode 1.0 and failing. So we made Symbol (3/0) have higher priority.

nedley commented 4 years ago

Still, picking 0/6 before others makes sense to me; a last-resort font might choose to have a Format4 and a Format13...

Adding format 4 would only help platforms that don’t support non-BMP, right? I’m having a hard time envisioning a scenario in which this would be useful.

Anyway. About symbol encodings, is it correct, I suppose, that CoreText picks up a MacRoman (1/0) encoding over a Unicode 1.0 (0/0) if available?

Not for layout, no.

behdad commented 4 years ago

Still, picking 0/6 before others makes sense to me; a last-resort font might choose to have a Format4 and a Format13...

Adding format 4 would only help platforms that don’t support non-BMP, right? I’m having a hard time envisioning a scenario in which this would be useful.

What I'm saying is that if a font does that, your logic seems to pick up the inferior subtable. A weird corner-case indeed.

Anyway. About symbol encodings, is it correct, I suppose, that CoreText picks up a MacRoman (1/0) encoding over a Unicode 1.0 (0/0) if available?

Not for layout, no.

I see. Thanks.

nedley commented 4 years ago

What I'm saying is that if a font does that, your logic seems to pick up the inferior subtable. A weird corner-case indeed.

Fair point.

NorbertLindenberg commented 3 years ago

The current beta draft of the cmap specification answers one question of @Lorp's original issue: If multiple subtables are present, only a format 14 subtable can be combined with another Unicode subtable; other subtables are exclusive.

However, it still does not answer the second question: How should an application choose the right subtable if more than one non-format-14 subtable is present?

I therefore would not consider this issue resolved.

PeterCon commented 3 years ago

However, it still does not answer the second question...

Updated the draft to cover this.

NorbertLindenberg commented 3 years ago

Do you mean this paragraph?

If a font includes encoding records for Unicode subtables of the same format but with different platform IDs, an application may choose which to select, but should make this selection consistently each time the font is used.

That doesn't really answer the question, and it fails to cover cases such as a combination of format 10 and 12 at all.

PeterCon commented 3 years ago

Or a combination of 4 and 6.

Updated the draft for this.

MicrosoftDocs / typography-issues

[cmap] Clarity needed on interaction of multiple cmap subtables #599
