foliojs / fontkit

An advanced font engine for Node and the browser
1.48k stars 219 forks source link

Thai mark-to-mark positioning issue #134

Open mikeday opened 7 years ago

mikeday commented 7 years ago

Hi, if you try the NotoSansThai-Regular.ttf font with the fontkit-demo and the following text:

น้ำ

The ring mark will be positioned above the other mark, when it should be below. The browser correctly positions the marks in the opposite order, but honestly I don't know how! We have the same bug in Prince and I've checked the font quite carefully and can't see what's going on, have you encountered anything like this before?

mikeday commented 7 years ago

The text in question is: U+0E19 U+0E49 U+0E33

Converted to glyph indices: 30, 94, 56 (uni0E19, uni0E49.small, uni0E33)

After applying a ccmp lookup: 30, 94, 78, 55 (uni0E19, uni0E49.small, uni0E4D, uni0E32)

Then one more ccmp lookup: 30, 74, 78, 55 (uni0E19, uni0E49, uni0E4D, uni0E32)

So the last character has been decomposed and now there is a base followed by two marks and one more base which we can ignore.

There are no more GSUB substitutions and most importantly none that reorder the two marks, which appear to be in the wrong order for convenient positioning.

The E4D mark is the ring that should be above the base and the E49 should be above that, but because they apply in order the E49 ends up underneath the ring.

Yet somehow the browser gets it right??

mikeday commented 7 years ago

HarfBuzz produces the correct output, somehow.

khaledhosny commented 7 years ago

Would it the result of HarfBuzz applying canonical ordering to the marks?

devongovett commented 7 years ago

You could try using string.normalize() before passing to fontkit to test that.

It could also be that Harfbuzz has a dedicated Thai shaper whereas fontkit passes it to USE...

khaledhosny commented 7 years ago

Does not seem to be normalization, but rather decomposing and reordering before applying the features, see this comment in HarfBuzz Thai shaper. There does not seem to be a Thai shaper in fontkit at all.

mikeday commented 7 years ago

Thanks, that's very helpful!

"The following is NOT specified in the MS OT Thai spec, however, it seems to be what Uniscribe and other engines implement."

It would be great to push some of these compatibility issues into the specs so that they don't have to be reverse engineered each time, like trying to implement HTML parsing in the '90s.

khaledhosny commented 7 years ago

I don’t know if the script shaper are formally part of the OpenType spec.

cc @PeterConstable, @behdad

mikeday commented 7 years ago

In that case perhaps we need new specs; it is intolerable that in 2017 an interoperable web user-agent cannot be developed without reverse engineering the Microsoft implementation.

What can we do to assist?

behdad commented 7 years ago

I'm willing to help, if there's any interest from @PeterConstable.

prjnt commented 7 years ago

This behaviour is at least briefly described at the bottom of https://www.microsoft.com/typography/otfntdev/thaiot/shaping.aspx : "The AM[-class] character [i.e. U+0E33] needs to be broken into two glyphs and some reordering might be required so that the ring is the base glyph"

mikeday commented 7 years ago

Okay so there is a preprocessing step where the character is decomposed at the Unicode level instead of waiting until later to apply to the glyph substitution:

MARK SARA_AM => NIKAHIT MARK SARA_AA

SARA_AM := U+0E33 SARA_AA := U+0E32 NIKAHIT := U+0E4D MARK := U+0E31 | U+0E34..0E37 | U+0E47..0E4E

(And similarly for the Lao script).

behdad commented 7 years ago

I think @PeterCon can fix the Thai OpenType spec to reflect this.

mikeday commented 7 years ago

Yes that would be helpful.

A comment in HarfBuzz also notes that Uniscribe "positions U+0E3A after U+0E38 and U+0E39", which HarfBuzz achieves via modifying the combining class before reordering, I think.

Perhaps this can be described like this:

PHINTHU ( SARA_U | SARA_UU )+ => ( SARA_U | SARA_UU )+ PHINTHU

SARA_U := U+0E38 SARA_UU := U+0E39 PHINTHU := U+0E3A