Add check: Unicode ID for diacritic marks

vv-monsalve commented 3 years ago

Related to Alumni #5: even if only a few marks lack the proper Unicode IDs, the composition of the text will be seriously affected, particularly in browsers that don't use the precomposed glyphs and instead compose the glyphs with GPOS, as Firefox and Edge apparently do.

Checking this 'by eye' is error-prone, since cases are easily missed, and the problem is tricky to catch because the text still renders correctly in other browsers such as Chrome or Safari.

@RosaWagner @m4rc1e

bobh0303 commented 3 years ago

Are you talking about Unicode ID, as in where it appears in a cmap? This raises the question of how you can know if a glyph has the correct encoding. Perhaps such a test already exists, but if the glyphs are named in an AGL-compliant way, one could parse the glyph name to determine what Unicode, if any, it should have and compare with what it does have. Is that what you are thinking?
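A minimal sketch of that name-to-Unicode step, following the algorithm in the AGL specification as I read it: drop everything from the first period, split on underscores, then resolve each component through the glyph list or the uni<hex> / u<hex> forms. The `AGL_EXCERPT` table and the function names are illustrative assumptions, not fontbakery code; a real check would load the full glyphlist.txt, and the Zapf Dingbats step is left out.

```python
import re

# Tiny hand-picked excerpt of the Adobe Glyph List, for illustration only.
AGL_EXCERPT = {
    "A": 0x0041,
    "mu": 0x00B5,
    "acutecomb": 0x0301,
    "circumflexcmb": 0x0302,
}

UNI_RE = re.compile(r"uni((?:[0-9A-F]{4})+)")  # uni + groups of 4 uppercase hex digits
U_RE = re.compile(r"u([0-9A-F]{4,6})")         # u + 4 to 6 uppercase hex digits


def _is_scalar(cp):
    """True for Unicode scalar values (in range, surrogates excluded)."""
    return cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF


def component_to_usvs(component):
    """Resolve one underscore-separated component to a list of USVs."""
    if component in AGL_EXCERPT:
        return [AGL_EXCERPT[component]]
    match = UNI_RE.fullmatch(component)
    if match:
        digits = match.group(1)
        cps = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
        return cps if all(_is_scalar(cp) for cp in cps) else []
    match = U_RE.fullmatch(component)
    if match:
        cp = int(match.group(1), 16)
        return [cp] if _is_scalar(cp) else []
    return []  # unknown component: contributes nothing


def name_to_usvs(glyph_name):
    """Return the USVs implied by a glyph name (empty list if none)."""
    base = glyph_name.split(".", 1)[0]  # suffixes such as ".math" are ignored
    usvs = []
    for component in base.split("_"):
        usvs.extend(component_to_usvs(component))
    return usvs
```

With this excerpt, `name_to_usvs("mu.math")` and `name_to_usvs("mu")` both resolve to U+00B5, while `.notdef` resolves to nothing. fontTools also ships an `agl` module with a `toUnicode` helper that could be used instead of a hand-rolled parser.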

vv-monsalve commented 3 years ago

> Are you talking about Unicode ID, as in where it appears in a cmap? This raises the question of how you can know if a glyph has the correct encoding. Perhaps such a test already exists, but if the glyphs are named in an AGL-compliant way, one could parse the glyph name to determine what Unicode, if any, it should have and compare with what it does have. Is that what you are thinking?

Yes, indeed. That could be the mechanics of it.

Comparing the cmap tables of the two fonts (before-bad/after-good) from the mentioned issue shows that the marks responsible for the problem either lacked a Unicode ID or had the wrong one.

[Image: Screen Shot 2021-06-22 at 13 09 44]

However, inspecting the AGL lists here and here, there is no circumflexcomb or circumflexcmb listed, as there is, for example, acutecomb. So I wonder whether that can be the only source of truth.

bobh0303 commented 3 years ago

> However, inspecting the AGL lists here and here, there is no circumflexcomb or circumflexcmb listed, as there is, for example, acutecomb.

You mean like this one:

circumflexcmb;0302

?

The lists+algorithm described in https://github.com/adobe-type-tools/agl-aglfn are indeed "the only source of truth" if what you are wanting to do is confirm that the cmap entry for, and name of, a specific glyph correspond -- there is no other standard.
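For reference, a lookup like the circumflexcmb one can be done mechanically. This is a small sketch that parses text in the glyphlist.txt layout (one `name;HEX` entry per line, `#` for comments, several space-separated code points allowed). The `SAMPLE` text is a two-line stand-in, not the real file, and `parse_glyphlist` is a hypothetical helper name.

```python
def parse_glyphlist(text):
    """Parse glyphlist.txt-style text into {glyph name: [code points]}."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, hexes = line.partition(";")
        mapping[name] = [int(h, 16) for h in hexes.split()]
    return mapping


# Stand-in excerpt; a real check would read the full file from the
# agl-aglfn repository.
SAMPLE = """\
# excerpt in glyphlist.txt format
acutecomb;0301
circumflexcmb;0302
"""

glyphlist = parse_glyphlist(SAMPLE)
```

Here `glyphlist["circumflexcmb"]` gives `[0x0302]`, and a misspelled name simply isn't a key.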

vv-monsalve commented 3 years ago

Oh yes, my bad.

bobh0303 commented 3 years ago

Encouraging use of AGL-compliant names is (imo) a good idea. Here's my take:

There are at least two things to test:

Every glyph name (except those used only as components plus a handful of others such as .notdef, NULL, nonmarkingreturn, etc.) should be parsable using the AGL algorithm to come up with one or more Unicode scalar values (USVs). -- If the exceptions are difficult to determine or agree upon, this test could generate WARNs rather than FAILs.
Every non-zero cmap entry should point to a glyph for which the first test generated a single USV and it is the same USV as the cmap. Not meeting this criterion should result in FAILs.
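The two tests could be sketched roughly as follows. Everything here is an assumption for illustration: the `check_names_and_cmap` name, the `(status, message)` result shape, the cmap as a plain `{code point: glyph name}` dict, and reading "non-zero" as "not mapped to .notdef". The name parser is passed in as `name_to_usvs`, and the component-only exemption is not modelled.

```python
# Names exempt from the first test (the handful mentioned above).
EXEMPT = {".notdef", "NULL", "nonmarkingreturn"}


def check_names_and_cmap(glyph_names, cmap, name_to_usvs):
    """Return a list of (status, message) tuples for the two tests."""
    results = []

    # Test 1: every non-exempt glyph name should parse to at least one USV.
    for name in glyph_names:
        if name in EXEMPT:
            continue
        if not name_to_usvs(name):
            results.append(("WARN", f"{name}: name does not parse to any USV"))

    # Test 2: every cmap entry should agree with what the name implies.
    for codepoint, name in sorted(cmap.items()):
        if name == ".notdef":
            continue  # assumed meaning of "non-zero cmap entry"
        usvs = name_to_usvs(name)
        if usvs != [codepoint]:
            implied = ", ".join(f"U+{cp:04X}" for cp in usvs) or "nothing"
            results.append(
                ("FAIL", f"U+{codepoint:04X} -> {name}: name implies {implied}")
            )
    return results


# Toy resolver standing in for a real AGL parser.
TOY = {"acutecomb": [0x0301], "circumflexcmb": [0x0302]}
results = check_names_and_cmap(
    [".notdef", "acutecomb", "circumflexcmb", "foo"],
    {0x0301: "acutecomb", 0x0303: "circumflexcmb"},
    lambda name: TOY.get(name, []),
)
```

In this toy run, `foo` produces a WARN and the wrongly encoded `circumflexcmb` (mapped at U+0303 instead of U+0302) produces a FAIL.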

thlinard commented 3 years ago

> The lists+algorithm described in https://github.com/adobe-type-tools/agl-aglfn are indeed "the only source of truth" if what you are wanting to do is confirm that the cmap entry for, and name of, a specific glyph correspond -- there is no other standard.

Well, not exactly. RoboFont uses GNFUL. There are some little differences with the AGLFN (therefore glyphs which won't be parsed by Adobe applications, which is a problem, unless the name is in the old list https://github.com/adobe-type-tools/agl-aglfn/blob/master/glyphlist.txt), notably:

increment instead of Delta
ohm instead of Omega
mu.math instead of mu
commaaccentbelowcmb instead of uni0326 (since AGLFN 1.7) or uniF6C3 (https://github.com/adobe-type-tools/agl-aglfn/blob/master/glyphlist.txt) – and all the "commaaccent" names.
dotbelowcmb instead of dotbelowcomb (but dotbelowcmb and all the *cmb names are in https://github.com/adobe-type-tools/agl-aglfn/blob/master/glyphlist.txt)
acutecmb instead of acutecomb
gravecmb instead of gravecomb
tildecmb instead of tildecomb
hookabovecmb instead of hookabovecomb

These inconsistencies could be resolved by the use of uni<hex> names instead.
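Generating such a name from a code point is mechanical. A small sketch, assuming the usual convention of `uni` plus four uppercase hex digits for BMP code points and `u` plus four to six digits beyond the BMP; `uni_name` is a hypothetical helper, not an existing API.

```python
def uni_name(codepoint):
    """Build a uni<hex> / u<hex> glyph name for a Unicode scalar value."""
    if not 0 <= codepoint <= 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {codepoint!r}")
    if codepoint <= 0xFFFF:
        return f"uni{codepoint:04X}"  # BMP: exactly four hex digits
    return f"u{codepoint:04X}"        # beyond the BMP: five or six digits
```

For example, `uni_name(0x0326)` gives the `uni0326` name mentioned above for the comma accent below.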

bobh0303 commented 3 years ago

> Well, not exactly. RoboFont uses GNFUL.

I don't think glyph names used within font editors are the concern, but rather the glyph names in the resulting TTF. For example GlyphsApp has its own glyph name conventions for editing, but writes the TTF using AGL-compatible names.

> There are some little differences with the AGLFN (therefore glyphs which won't be parsed by Adobe applications, which is a problem,

Right. That's why I believe that there is only one "standard" for glyph names in a TTF (if glyph names are provided at all -- they aren't required in TTFs these days) and that standard is AGL.

> unless the name is in the old list https://github.com/adobe-type-tools/agl-aglfn/blob/master/glyphlist.txt), notably:

I'll note that according to AGL:

increment;2206 == Delta;2206
Ohm;2126 == Omega;2126
mu.math == mu;00B5 because glyph name extensions are ignored by the algorithm

so I'm not sure why these are a concern.
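To make the equivalence concrete, here is a small sketch using only the entries quoted above. `AGL_EXCERPT`, `resolve` and `same_character` are illustrative names, and the table is nowhere near the full list.

```python
# Only the mappings quoted above; not the full glyph list.
AGL_EXCERPT = {
    "increment": 0x2206,
    "Delta": 0x2206,
    "Ohm": 0x2126,
    "Omega": 0x2126,
    "mu": 0x00B5,
}


def resolve(name):
    """Look up a name after dropping any suffix from the first period."""
    base = name.split(".", 1)[0]
    return AGL_EXCERPT.get(base)  # None if the name is not in the excerpt


def same_character(a, b):
    """True if both names resolve, and resolve to the same code point."""
    ra, rb = resolve(a), resolve(b)
    return ra is not None and ra == rb
```

So `same_character("increment", "Delta")` and `same_character("mu.math", "mu")` hold, whereas a lowercase `ohm` does not resolve at all, because the lookup is case-sensitive.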

The following would indeed be differences that would prevent Adobe apps from interpreting the GNFUL names:

> commaaccentbelowcmb instead of uni0326 (since AGLFN 1.7) or uniF6C3 (https://github.com/adobe-type-tools/agl-aglfn/blob/master/glyphlist.txt) – and all the "commaaccent" names.

> dotbelowcmb instead of dotbelowcomb (but dotbelowcmb and all the *cmb names are in https://github.com/adobe-type-tools/agl-aglfn/blob/master/glyphlist.txt)

Not sure I understand what you are saying here:

> acutecmb instead of acutecomb

> gravecmb instead of gravecomb

> tildecmb instead of tildecomb

> hookabovecmb instead of hookabovecomb

thlinard commented 3 years ago

> I don't think glyph names used within font editors are the concern, but rather the glyph names in the resulting TTF. For example GlyphsApp has its own glyph name conventions for editing, but writes the TTF using AGL-compatible names.

Of course. It doesn't matter what an editor does internally. If I mentioned RoboFont, it's because of the resulting cmap table that RoboFont generates.

To summarize, Adobe offers two lists:

AGLFN (Adobe Glyph List For New Fonts): a list of base glyph names that are recommended for new fonts. For a new font, all names not in this list must be in the uni<hex> format.
AGL (Adobe Glyph List): an older list, much larger, provided for retro-compatibility.

GNFUL deviates from AGLFN on several names, but for several of them this doesn't matter, since they're in AGL. For example, the names of the combining diacritics are *comb in AGLFN and *cmb in GNFUL, but the *cmb names are also in AGL, so it doesn't matter.

The problem is with names that are not in AGL, because they won't be parsed in Adobe applications, in which case it's best to replace them with uni<hex> names.

GNFUL names not in AGL (problematic names in an Adobe environment):

ohm (case matters)
commaaccentbelowcmb

For the other names, even if they don't follow Adobe's recommendations for a new font, they will be parsed thanks to their presence in AGL.

Is it clearer like this?
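The same summary as a rough sketch: sort a glyph name into "recommended" (in AGLFN), "legacy" (only in AGL), "uni-hex", or "unparsable". The two sets are tiny excerpts based on the names discussed in this thread, `classify` is a hypothetical helper, and underscore-joined ligature names are not handled.

```python
import re

# Excerpts only; a real check would load aglfn.txt and glyphlist.txt.
AGLFN_EXCERPT = {"acutecomb", "gravecomb", "tildecomb"}
AGL_EXCERPT = AGLFN_EXCERPT | {"acutecmb", "gravecmb", "tildecmb", "dotbelowcmb"}

# uni + groups of four uppercase hex digits, or u + four to six digits.
UNI_HEX = re.compile(r"uni(?:[0-9A-F]{4})+|u[0-9A-F]{4,6}")


def classify(name):
    """Classify a glyph name by how an AGL-based parser would treat it."""
    base = name.split(".", 1)[0]  # suffixes are ignored
    if base in AGLFN_EXCERPT:
        return "recommended"      # fine for new fonts
    if base in AGL_EXCERPT:
        return "legacy"           # parsed, but not recommended for new fonts
    if UNI_HEX.fullmatch(base):
        return "uni-hex"          # always parsable
    return "unparsable"           # candidate for renaming to uni<hex>
```

With these excerpts, `acutecmb` comes out as "legacy", while `commaaccentbelowcmb` and lowercase `ohm` come out as "unparsable".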

bobh0303 commented 3 years ago

Absolutely clear. So getting back to the design of an encoding test for fontbakery, do you concur with my original suggestion:

> There are at least two things to test:

> Every glyph name (except those used only as components plus a handful of others such as .notdef, NULL, nonmarkingreturn, etc.) should be parsable using the AGL algorithm to come up with one or more Unicode scalar values (USVs). -- If the exceptions are difficult to determine or agree upon, this test could generate WARNs rather than FAILs.

> Every non-zero cmap entry should point to a glyph for which the first test generated a single USV and it is the same USV as the cmap. Not meeting this criterion should result in FAILs.

thlinard commented 3 years ago

> do you concur with my original suggestion

Completely! My comment on GNFUL was only for the sake of completeness.

fonttools / fontbakery

Add check: Unicode ID for diacritic marks #3347