Pomax / lib-font

This library adds a new Font() object to the JavaScript toolbox, similar to new Image() for images
MIT License
728 stars 72 forks source link

better testing for GSUB #91

Closed Pomax closed 3 years ago

Pomax commented 3 years ago

This adds some better testing, while also revealing that the reverse function for glyph resolution may not be correct for all tables:

    console.log
      latn[SKS ].ccmp[19]: ligature [ A + ̌ ] -> ľ

      at forEach (testing/node/gsub/test-gsub.js:72:27)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)

    console.log
      latn[SKS ].ccmp[19]: ligature [ A + ̨ ] -> ǫ

      at forEach (testing/node/gsub/test-gsub.js:72:27)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)

    console.log
      latn[SKS ].ccmp[19]: ligature [ A + ̌ ] -> ť

      at forEach (testing/node/gsub/test-gsub.js:72:27)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)

    console.log
      latn[SKS ].ccmp[19]: ligature [ A + ̨ ] -> ų

      at forEach (testing/node/gsub/test-gsub.js:72:27)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)

Pretty sure those should all be capital letters A with diacritics attached =D

Pomax commented 3 years ago

@RoelN to run this: check out the branch, run npm install (because it has a few new dependencies) and then run npm test, which will run Jest testing for Node, using the ./testing/node dir, and Puppeteer testing for (headless) browser testing, using the ./testing/browser dir.

Running the tests will run this console log showing that reverse() is not doing the right thing for all tables yet (also, not all cmap subtables actually have implementations for reverse - notably, subtable 14)

RoelN commented 3 years ago

It was a bit of a wild goose chase, but my findings so far:

1.

Not all CMAP tables map to unicode. That's why I got false positives for Lato Bold. This has a fi and a fl ligature. The fi ligature has GlyphID 262. For CMAP 4 ( both platformID="0" platEncID="3" and platformID="3" platEncID="1"), Font.js can't find the matching character (0xfb01). So that might indicate a bug there, because this is properly noted in the CMAP table.

But for CMAP 6, the fi ligature character is mapped to 0xde instead of 0xfb01. As the CMAP 4 matching fails, this is the only match Font.js finds. But, this mapping of fi-to-0xde is an old Mac Roman mapping, as 0xde is "LATIN CAPITAL LETTER THORN" in Unicode.

So two things need to happen: make CMAP 4 work so fi gets mapped to 0xfb01, and don't map non-unicode codepoints.

2.

Regarding the A + ... examples for SourceCodePro-Regular.ttf you posted above, turns out it's not the diacritics or the result that is at fault, but the first character. If you look at the lookup, the first character changes all the time. But Font.js thinks it remains at A:

<LigatureSet glyph="A">
  <Ligature components="uni0328" glyph="Aogonek"/>
</LigatureSet>
<LigatureSet glyph="E">
  <Ligature components="uni0328" glyph="Eogonek"/>
</LigatureSet>
<LigatureSet glyph="G">
  <Ligature components="uni0303" glyph="uni00470303"/>
</LigatureSet>
<LigatureSet glyph="I">
  <Ligature components="uni0328" glyph="Iogonek"/>
</LigatureSet>
<LigatureSet glyph="IJ">
  <Ligature components="uni0301" glyph="uni01320301"/>
</LigatureSet>
<LigatureSet glyph="L">
  <Ligature components="periodcentered,L" glyph="uni004C00B7004C"/>
</LigatureSet>
<LigatureSet glyph="O">
  <Ligature components="uni0328" glyph="uni01EA"/>
</LigatureSet>
<LigatureSet glyph="U">
  <Ligature components="uni0328" glyph="Uogonek"/>
</LigatureSet>
<LigatureSet glyph="a">
  <Ligature components="uni0328" glyph="aogonek"/>
</LigatureSet>
<LigatureSet glyph="d">
  <Ligature components="uni030C" glyph="dcaron"/>
</LigatureSet>
<LigatureSet glyph="e">
  <Ligature components="uni0328" glyph="eogonek"/>
</LigatureSet>
<LigatureSet glyph="g">
  <Ligature components="uni0303" glyph="uni00670303"/>
  <Ligature components="uni0327" glyph="uni0123"/>
</LigatureSet>
<LigatureSet glyph="g.a">
  <Ligature components="uni0327" glyph="uni0123.a"/>
</LigatureSet>
<LigatureSet glyph="i">
  <Ligature components="uni0328" glyph="iogonek"/>
</LigatureSet>
<LigatureSet glyph="ij">
  <Ligature components="uni0301" glyph="uni01330301"/>
</LigatureSet>
<LigatureSet glyph="l">
  <Ligature components="periodcentered,l" glyph="uni006C00B7006C"/>
  <Ligature components="uni030C" glyph="lcaron"/>
</LigatureSet>
<LigatureSet glyph="o">
  <Ligature components="uni0328" glyph="uni01EB"/>
</LigatureSet>
<LigatureSet glyph="t">
  <Ligature components="uni030C" glyph="tcaron"/>
</LigatureSet>
<LigatureSet glyph="u">
  <Ligature components="uni0328" glyph="uogonek"/>
</LigatureSet>

The first line in your example is:

latn[SKS ].ccmp[19]: ligature [ A + ̌ ] -> ľ

Which is in fact l + caron = lcaron (LATIN SMALL LETTER L WITH CARON).

I ended at this point, and haven't dug into whether this is a simple error in the debug statement (accidentally logging A) or if its in the actual code (it actually gets stuck at A and doesn't pick up the next LigatureSet glyph).

Pomax commented 3 years ago

for point 1 it feels like reversing the GlyphID should not just return the subtable-specific code, but also the subtable itself so that it can be asked to turn that into something sensible instead of assuming all things are unicode. E.g. move the unicode logic into 4 and 12 but have other subformats use different logic for yielding the correct unicode character, based on transforming form their internal encoding.

I've updated the code to have a more detailed reveres() function that returns both the subtable's code for a glyphID and its equivalent unicode value, and enriched subtables with an awareness of their platform/encoding values so that we can start refining/implementing reverse() to yield proper unicode values even if the subtable itself is not inherently unicode.

Pomax commented 3 years ago

For point 2, we have the following code:

if (lookup.lookupType === 4) {
  lookup.subtableOffsets.forEach((_, i) => {
    const subtable = lookup.getSubTable(i);
    const coverage = subtable.getCoverageTable();
    subtable.ligatureSetOffsets.forEach((_, i) => {
      const ligatureSet = subtable.getLigatureSet(i);

      ligatureSet.ligatureOffsets.forEach((_, i) => {
        const ligatureTable = ligatureSet.getLigature(i);

        const sequence = [
          coverage.glyphArray[0],
          ...ligatureTable.componentGlyphIDs,
        ];

        console.log(
          `${script}[${lang}].${feature.featureTag}[${id}]: ligature [ ${
            sequence.map(letterFor).join(` + `)
          } ] -> ${
            letterFor(ligatureTable.ligatureGlyph)
          }`
        );
      });
    });
  });
}

So the A comes from the CoverageTable. https://docs.microsoft.com/en-us/typography/opentype/spec/gsub#41-ligature-substitution-format-1 shows that the coverage table is resolved via an offset:

Type Name Description
uint16 substFormat Format identifier: format = 1
Offset16 coverageOffset Offset to Coverage table, from beginning of substitution subtable
uint16 ligatureSetCount Number of LigatureSet tables
Offset16 ligatureSetOffsets[ligatureSetCount] Array of offsets to LigatureSet tables. Offsets are from beginning of substitution subtable, ordered by Coverage index

Which is implemented as:

class LookupType extends ParsedData {
  constructor(p) {
    super(p);
    this.substFormat = p.uint16;
    this.coverageOffset = p.Offset16;
  }
  getCoverageTable() {
    let p = this.parser;
    p.currentPosition = this.start + this.coverageOffset;
    return new CoverageTable(p);
  }
}

So the currentPosition should be getting set correctly, coded relative to the start of the subtable.

If we look at the coverage table code, we see:

class CoverageTable extends ParsedData {
  constructor(p) {
    super(p);

    this.coverageFormat = p.uint16;

    if (this.coverageFormat === 1) {
      this.glyphCount = p.uint16;
      this.glyphArray = [...new Array(this.glyphCount)].map((_) => p.uint16);
    }

    if (this.coverageFormat === 2) {
      this.rangeCount = p.uint16;
      this.rangeRecords = [...new Array(this.rangeCount)].map(
        (_) => new CoverageRangeRecord(p)
      );
    }
  }
}

And if we add the coverage format to the test log, we see:

    console.log
      latn[SKS ].ccmp[19]: ligature (coverage:1) [ A + ̧ ] -> ģ

      at forEach (testing/node/gsub/test-gsub.js:72:27)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)

So let's examine coverage format 1, documented in https://docs.microsoft.com/en-us/typography/opentype/spec/chapter2#coverage-format-1:

Type Name Description
uint16 coverageFormat Format identifier — format = 1
uint16 glyphCount Number of glyphs in the glyph array
uint16 glyphArray[glyphCount] Array of glyph IDs — in numerical order

This is an array of applicable glyphs... but the test code pulls glyphArray[0] at every iteration... let's fix that:

        const sequence = [
          coverage.glyphArray[i], // changed from [0]
          ...ligatureTable.componentGlyphIDs,
        ];

And let's also console log the associated coverage table, ligature set, and ligature tables:

    console.log
      COVERAGE TABLE: CoverageTable {
        coverageFormat: 1,
        glyphCount: 19,
        glyphArray: [
           2,  6,   8,  10,  13, 16, 22,
          28, 31,  32,  34,  36, 39, 42,
          47, 48, 256, 464, 566
        ]
      }

      at forEach (testing/node/gsub/test-gsub.js:62:23)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)

    console.log
      ligature set [0], ligature table [0]: latn[SKS ].ccmp[19]: ligature (coverage:1) [ A + ̨ ] -> Ą

      at forEach (testing/node/gsub/test-gsub.js:75:27)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)

    console.log
      ligature set [1], ligature table [0]: latn[SKS ].ccmp[19]: ligature (coverage:1) [ A + ̨ ] -> Ę

      at forEach (testing/node/gsub/test-gsub.js:75:27)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)

Are we using the specific ligature id rather than the ligature set id? Changing things to coverage.glyphArray[j] where j is the set id rather than the ligature id, we see:

    console.log
      ligature set [0], ligature table [0]: latn[SKS ].ccmp[19]: ligature (coverage:1) [ A + ̨ ] -> Ą

      at forEach (testing/node/gsub/test-gsub.js:73:27)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)

    console.log
      ligature set [1], ligature table [0]: latn[SKS ].ccmp[19]: ligature (coverage:1) [ E + ̨ ] -> Ę

      at forEach (testing/node/gsub/test-gsub.js:73:27)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)
          at Array.forEach (<anonymous>)

And that makes a lot more sense. Turns out there was no error in the code itself, only in the test code.

RoelN commented 3 years ago

Excellent! This seems to report properly on my test fonts! Thanks for your thorough explanation.

For my purpose, I'd like to dismiss all sequences that contain glyphs not mapped to characters (f + i.italic) and keep only glyphs that are directly mapped to characters (f + i). This allows me to do that :-)

I'll be able to properly review tomorrow. Are there any remaining open tasks on this PR?

Pomax commented 3 years ago

I think the only thing here is to make sure that all cmaps have a sensible reverse implementation but aside from that I think mostly "adding tests", which can also be done in parallel if we want to merge this in first (after fixing up the reverse calls).