Open mikeday opened 7 years ago
The text in question is: U+0E19 U+0E49 U+0E33
Converted to glyph indices: 30, 94, 56 (uni0E19, uni0E49.small, uni0E33)
After applying a ccmp lookup: 30, 94, 78, 55 (uni0E19, uni0E49.small, uni0E4D, uni0E32)
Then one more ccmp lookup: 30, 74, 78, 55 (uni0E19, uni0E49, uni0E4D, uni0E32)
So the last character has been decomposed and now there is a base followed by two marks and one more base which we can ignore.
There are no more GSUB substitutions and most importantly none that reorder the two marks, which appear to be in the wrong order for convenient positioning.
The E4D mark is the ring that should be above the base and the E49 should be above that, but because they apply in order the E49 ends up underneath the ring.
Yet somehow the browser gets it right??
HarfBuzz produces the correct output, somehow.
Would it the result of HarfBuzz applying canonical ordering to the marks?
You could try using string.normalize() before passing to fontkit to test that.
It could also be that Harfbuzz has a dedicated Thai shaper whereas fontkit passes it to USE...
Does not seem to be normalization, but rather decomposing and reordering before applying the features, see this comment in HarfBuzz Thai shaper. There does not seem to be a Thai shaper in fontkit at all.
Thanks, that's very helpful!
"The following is NOT specified in the MS OT Thai spec, however, it seems to be what Uniscribe and other engines implement."
It would be great to push some of these compatibility issues into the specs so that they don't have to be reverse engineered each time, like trying to implement HTML parsing in the '90s.
I don’t know if the script shaper are formally part of the OpenType spec.
cc @PeterConstable, @behdad
In that case perhaps we need new specs; it is intolerable that in 2017 an interoperable web user-agent cannot be developed without reverse engineering the Microsoft implementation.
What can we do to assist?
I'm willing to help, if there's any interest from @PeterConstable.
This behaviour is at least briefly described at the bottom of https://www.microsoft.com/typography/otfntdev/thaiot/shaping.aspx : "The AM[-class] character [i.e. U+0E33] needs to be broken into two glyphs and some reordering might be required so that the ring is the base glyph"
Okay so there is a preprocessing step where the character is decomposed at the Unicode level instead of waiting until later to apply to the glyph substitution:
MARK SARA_AM => NIKAHIT MARK SARA_AA
SARA_AM := U+0E33 SARA_AA := U+0E32 NIKAHIT := U+0E4D MARK := U+0E31 | U+0E34..0E37 | U+0E47..0E4E
(And similarly for the Lao script).
I think @PeterCon can fix the Thai OpenType spec to reflect this.
Yes that would be helpful.
A comment in HarfBuzz also notes that Uniscribe "positions U+0E3A after U+0E38 and U+0E39", which HarfBuzz achieves via modifying the combining class before reordering, I think.
Perhaps this can be described like this:
PHINTHU ( SARA_U | SARA_UU )+ => ( SARA_U | SARA_UU )+ PHINTHU
SARA_U := U+0E38 SARA_UU := U+0E39 PHINTHU := U+0E3A
Hi, if you try the NotoSansThai-Regular.ttf font with the fontkit-demo and the following text:
น้ำ
The ring mark will be positioned above the other mark, when it should be below. The browser correctly positions the marks in the opposite order, but honestly I don't know how! We have the same bug in Prince and I've checked the font quite carefully and can't see what's going on, have you encountered anything like this before?