MicrosoftDocs / typography-issues

Creative Commons Attribution 4.0 International

[USE] Specify that invalid cluster 25CC is inserted between GSUB and GPOS application #281

Open xadxura opened 5 years ago

xadxura commented 5 years ago

In this section, we need to clarify that U+25CC is inserted by the engine after GSUB and before GPOS:

Handling invalid combining marks

Combining marks and signs that do not occur in conjunction with a valid base are considered invalid. USE treats an invalid mark as a separate cluster and displays the stand-alone mark positioned on a dotted circle (U+25CC). If multiple marks are required to position on a dotted circle, the dotted circle can be explicitly inserted into the text stream followed by any marks in accordance with the standard clustering rules.

To allow for shaping engine implementations that expect to position an invalid mark on a dotted circle, it is recommended that fonts using USE contain glyphs for the dotted circle character, U+25CC. If this character is not supported in the font, such implementations will display invalid signs on the missing glyph shape (white box).
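As a rough illustration of the rule quoted above, the following Python sketch inserts U+25CC before any combining mark that has no preceding base. It is a simplification and not the USE cluster model: it relies only on the Unicode general category from the standard `unicodedata` module, and the function name `insert_dotted_circles` is hypothetical.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"

def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

def insert_dotted_circles(text):
    """Insert U+25CC before each mark that has nothing to attach to.

    Simplification: any character that is not whitespace or a control/format
    character (including a previous mark) lets the next mark attach.
    """
    out = []
    prev_ok = False  # can a mark attach to the previous character?
    for ch in text:
        if is_mark(ch) and not prev_ok:
            out.append(DOTTED_CIRCLE)
        out.append(ch)
        prev_ok = not (ch.isspace() or unicodedata.category(ch).startswith("C"))
    return "".join(out)
```

An explicitly encoded U+25CC counts as a base here, so a sequence such as U+25CC followed by several marks is left untouched, matching the "explicitly inserted into the text stream" case in the quoted text.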

behdad commented 5 years ago

In HarfBuzz we do it at the beginning of reordering phase.

behdad commented 5 years ago

Doing it then makes a lot more sense to me, as reordering without a base makes little sense.

NorbertLindenberg commented 5 years ago

I agree with Behdad that the dotted circle has to be inserted before reordering. In my observation with Balinese, CoreText and DirectWrite agree with HarfBuzz as well: a stand-alone pre-base vowel is rendered before the inserted dotted circle by all three engines.
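A minimal Python sketch of why the order matters for a stand-alone pre-base vowel. The glyph names and the class table `CLASSES` are assumptions made up for this example, and the reordering rule is reduced to "pre-base vowels go to the front of the cluster".

```python
DOTTED_CIRCLE = "dottedCircle"

# Hypothetical glyph classes for this sketch: "B" base, "VPre" pre-base vowel.
CLASSES = {"ka-bali": "B", DOTTED_CIRCLE: "B", "taling-bali": "VPre"}

def add_missing_base(cluster):
    """Prepend a dotted circle if the cluster contains no base glyph."""
    if any(CLASSES[g] == "B" for g in cluster):
        return list(cluster)
    return [DOTTED_CIRCLE] + list(cluster)

def reorder(cluster):
    """Move pre-base vowels to the front, but only if there is a base."""
    if not any(CLASSES[g] == "B" for g in cluster):
        return list(cluster)  # nothing to reorder around
    pre = [g for g in cluster if CLASSES[g] == "VPre"]
    rest = [g for g in cluster if CLASSES[g] != "VPre"]
    return pre + rest
```

In this model, `reorder(add_missing_base(["taling-bali"]))` puts the vowel before the dotted circle, which is the rendering described above, while `add_missing_base(reorder(["taling-bali"]))` leaves the vowel after it.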

A second practical reason why doing it after completing GSUB application is too late: Multi-script fonts may need to replace dotted circles with script-specific versions, because marks may need to attach at different anchor positions per script or because bases in one script have their baseline shifted relative to those of other scripts.

My interpretation of the “Defective clusters” section of the existing document text actually is that the dotted circle is inserted as part of validation, that is, before any feature application.

NorbertLindenberg commented 5 years ago

@xadxura Could you add this footer to your description so that the issue shows up on the USE documentation page? Thanks.


Document Details

Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.

lianghai commented 5 years ago

I agree that dotted circles can be inserted during GSUB as long as that happens after the “basic cluster formation GSUB” stage. “[A]t the beginning of reordering phase” sounds appropriate.

We need to prevent unintentional behavior of the inserted dotted circle like this: https://github.com/n8willis/opentype-shaping-documents/issues/76, but it’s a real glyph in the glyph sequence and should not just avoid any GSUB.

I don’t think the dotted circle “has to be inserted before reordering” though, as if necessary, a shaper can first insert a placeholder/surrogate of the dotted circle for reordering purpose.

A second practical reason why doing it after completing GSUB application is too late: Multi-script fonts may need to replace dotted circles with script-specific versions, because marks may need to attach at different anchor positions per script or because bases in one script have their baseline shifted relative to those of other scripts.

@NorbertLindenberg: Mmm, we need better use cases. These two sound like something that should be done in GPOS. Different scripts’ marks (actually, not necessarily from different scripts), when they need to be positioned on a base differently, should simply have differentiated anchors. A script/languagesystem-controlled GPOS should shift the dotted circle glyph for the needs of a variable baseline. Using GSUB for such needs sounds like a hack…

NorbertLindenberg commented 5 years ago

Validation happens on the input Unicode character sequence, or on a glyph sequence equivalent to it. If it finds that the input is invalid, why should it not insert the dotted circle right then and there to guarantee all feature code a valid glyph sequence? What’s the point of delaying the insertion or messing with invisible placeholders?

behdad commented 5 years ago

If it finds that the input is invalid, why should it not insert the dotted circle right then and there to guarantee all feature code a valid glyph sequence? What’s the point of delaying the insertion or messing with invisible placeholders?

I agree with Norbert. Moving the insertion in the middle of GSUB/GPOS is just wrong.

lianghai commented 5 years ago

Norbert seems to suggest the dotted circle should be inserted right after cluster validation—which strikes me as prone to issues like https://github.com/n8willis/opentype-shaping-documents/issues/76 —unless we make a distinction between encoded dotted circles and inserted dotted circles for the “basic cluster formation GSUB”.

Behdad has pointed out that HB does the insertion between the “basic cluster formation GSUB” and the reordering phase, which does seem appropriate, as long as there’s no more shaper-level Indic-shaping magic happening after reordering. But Behdad also said he agreed with Norbert’s latest comment, which is confusing…

NorbertLindenberg commented 5 years ago

Looking at n8willis/opentype-shaping-documents#76, it seems the issue is primarily that HarfBuzz inserts the dotted circle in the wrong place. Let’s assume that gets fixed, i.e., the dotted circle is inserted after (not before) ra halant ZWJ?. How does it then make a difference whether the dotted circle is inserted before or after GSUB application, or between the Basic Shaping and Presentation Forms phases? I’m not familiar with Bengali shaping, so it’s not obvious to me. And would it make a difference in the USE, which applies less magic?

lianghai commented 5 years ago

Mmm we need to roll back a bit—

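First note there’s a disagreement between the USE spec and HB in terms of what to do with a character sequence <virama, letter> (of a script that has post-base conjoining forms; without the base letter of the cluster):

  • USE spec says <virama, letter> becomes two clusters, <virama> and <letter>, each on its own, and a dotted circle is inserted for <virama> because it’s a dependent sign on its own.
  • HB, meanwhile, inserts the dotted circle so early (it’s not “at the beginning of reordering phase” after all?) that there is only a single cluster with the glyph sequence <dotted circle, virama, letter>; because the dotted circle is a valid base, this cluster then becomes the glyph sequence <dotted circle, post-base conjoining form>.

I’m not sure that HB’s behavior is appropriate (in terms of Unicode and OTL standardization’s concerns)—although it seems handy. And this is certainly not what Uniscribe/DirectWrite does.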

behdad commented 5 years ago

Norbert seems to suggest the dotted circle should be inserted right after cluster validation—which strikes me as prone to issues like n8willis/opentype-shaping-documents#76 —unless we make a distinction between encoded dotted circles and inserted dotted circles for the “basic cluster formation GSUB”.

No. As Norbert points out, that bug is HarfBuzz being wrong. That has nothing to do with the approach in general.

Behdad has pointed out that HB does the insertion between the “basic cluster formation GSUB” and the reordering phase, which does seem appropriate, as long as there’s no more shaper-level Indic-shaping magic happening after reordering. But Behdad also said he agreed with Norbert’s latest comment, which is confusing…

In Indic shaper we do it at the beginning of initial-reordering. That's after ccmp/loca applied. Doesn't have to be. It just happens to be there. In USE it's also at beginning of reordering. I'm fine with moving it earlier. It just happens to be written that way.

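First note there’s a disagreement between the USE spec and HB in terms of what to do with a character sequence <virama, letter> (of a script that has post-base conjoining forms; without the base letter of the cluster):

  • USE spec says <virama, letter> becomes two clusters, <virama> and <letter>, each on its own, and a dotted circle is inserted for <virama> because it’s a dependent sign on its own.
  • HB, meanwhile, inserts the dotted circle so early (it’s not “at the beginning of reordering phase” after all?) that there is only a single cluster with the glyph sequence <dotted circle, virama, letter>; because the dotted circle is a valid base, this cluster then becomes the glyph sequence <dotted circle, post-base conjoining form>.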

I’m not sure that HB’s behavior is appropriate (in terms of Unicode and OTL standardization’s concerns)—although it seems handy. And this is certainly not what Uniscribe/DirectWrite does.

Right. I prefer HB's. I know @jfkthame and @mhosken have also preferred HB's. I'd like that to be the standard, or at least for the standard not to require one way or the other.

behdad commented 5 years ago

I do think HB should move that insertion earlier.

NorbertLindenberg commented 5 years ago

I ran some tests with a modified Balinese font that applies contextual substitutions involving dotted circle in different features:

feature ccmp {
    script bali;
    sub dottedCircle suku-bali' by sukuIlut-bali;
} ccmp;

feature blwf {
    script bali;
    sub dottedCircle' sukuIlut-bali by na-bali;
} blwf;

feature pres {
    script bali;
    sub dottedCircle ulu-bali' by uluSari-bali;
} pres;

Result: Both CoreText and DirectWrite apply all three features when given the dependent vowels suku and ulu without a base. This means, they insert the dotted circle before OpenType feature application I. Only HarfBuzz skips the ccmp and blwf features because it inserts the dotted circle later.

I think we should make insertion before GSUB application the standard.

[Tested with Safari 12.1.2 on macOS 10.14.6; Chrome 76.0.3809.100 on macOS 10.14.6; Edge 44.18362.267.0 on Windows 10 1903.]
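The effect of the insertion point on these three rules can be modelled with a short Python sketch. Everything here is an assumption made for illustration: the stage order (`ccmp`, `blwf`, reordering, `pres`) is a reduced stand-in for a real shaper pipeline, the rule tuples only mirror the feature code above, and the function names are hypothetical.

```python
DOTTED = "dottedCircle"
MARKS = {"suku-bali", "ulu-bali"}

# One contextual single substitution per feature, mirroring the feature code
# above: (feature, glyph before, target glyph, glyph after, replacement).
RULES = [
    ("ccmp", DOTTED, "suku-bali", None, "sukuIlut-bali"),
    ("blwf", None, DOTTED, "sukuIlut-bali", "na-bali"),
    ("pres", DOTTED, "ulu-bali", None, "uluSari-bali"),
]

def insert_dotted_circle(glyphs):
    """Simplified validation: a mark at the start of the run has no base."""
    if glyphs and glyphs[0] in MARKS:
        return [DOTTED] + list(glyphs)
    return list(glyphs)

def apply_rule(glyphs, rule):
    """Apply one rule everywhere it matches; report whether it fired."""
    _, before, target, after, replacement = rule
    out, fired = list(glyphs), False
    for i, glyph in enumerate(out):
        if glyph != target:
            continue
        if before is not None and (i == 0 or out[i - 1] != before):
            continue
        if after is not None and (i + 1 == len(out) or out[i + 1] != after):
            continue
        out[i], fired = replacement, True
    return out, fired

def shape(glyphs, insert_at):
    """insert_at is "validation" (before all features) or "reordering"."""
    fired = []
    if insert_at == "validation":
        glyphs = insert_dotted_circle(glyphs)
    for stage in ("ccmp", "blwf", "reordering", "pres"):
        if stage == "reordering":
            if insert_at == "reordering":
                glyphs = insert_dotted_circle(glyphs)
            continue
        for rule in RULES:
            if rule[0] == stage:
                glyphs, ok = apply_rule(glyphs, rule)
                if ok:
                    fired.append(stage)
    return glyphs, fired
```

In this model, `shape(["suku-bali"], "validation")` fires `ccmp` and `blwf`, whereas `shape(["suku-bali"], "reordering")` fires neither, because the dotted circle is not yet in the glyph sequence when those features run. The `pres` rule for `ulu-bali` fires with either insertion point.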

NorbertLindenberg commented 5 years ago

First note there’s a disagreement between the USE spec and HB in terms of what to do with a character sequence <virama, letter> (of a script that has post-base conjoining forms; without the base letter of the cluster):

  • USE spec says <virama, letter> becomes two clusters, <virama> and <letter>, each on its own, and a dotted circle is inserted for <virama> because it’s a dependent sign on its own.
  • HB, meanwhile, inserts the dotted circle so early (it’s not “at the beginning of reordering phase” after all?) that there is only a single cluster with the glyph sequence <dotted circle, virama, letter>; because the dotted circle is a valid base, this cluster then becomes the glyph sequence <dotted circle, post-base conjoining form>.

I’m not sure that HB’s behavior is appropriate (in terms of Unicode and OTL standardization’s concerns)—although it seems handy. And this is certainly not what Uniscribe/DirectWrite does.

That’s not about when the dotted circle is actually inserted, but about the error recovery that the tokenizer uses once it has decided that a dotted circle needs to be inserted. I filed a separate ticket #289 about that.
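The two recovery strategies quoted above can be contrasted in a small Python sketch. It only handles the literal input <virama, letter>; the glyph names and both function names are hypothetical, and neither function claims to reproduce an actual engine's tokenizer.

```python
DOTTED_CIRCLE = "dottedCircle"

def recover_separate_clusters(glyphs):
    """USE spec reading quoted above: the virama gets its own cluster on an
    inserted dotted circle, and the letter starts a fresh cluster."""
    clusters = []
    for glyph in glyphs:
        if glyph == "virama":
            clusters.append([DOTTED_CIRCLE, glyph])
        else:
            clusters.append([glyph])
    return clusters

def recover_single_cluster(glyphs):
    """HarfBuzz behaviour as described above: the dotted circle acts as the
    base, so virama and letter stay together in one cluster."""
    return [[DOTTED_CIRCLE] + list(glyphs)]
```

Only the second result gives the font a single cluster in which <virama, letter> can later become a post-base conjoining form on the dotted circle.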

behdad commented 5 years ago

Result: Both CoreText and DirectWrite apply all three features when given the dependent vowels suku and ulu without a base. This means, they insert the dotted circle before OpenType feature application I. Only HarfBuzz skips the ccmp and blwf features because it inserts the dotted circle later.

Please file an issue for HarfBuzz to fix this. Would be great if you can also contribute a test font. Thanks!

xadxura commented 5 years ago

If we insert dotted circle after reordering and before per-run features, then the dotted circle indicating a problem can be substituted away very easily. The purpose of the dotted circle is to communicate that the cluster is invalid according to current Unicode properties. If we allow the dotted circle to be removed then there is no point in having properties and the ability to effectively validate script clusters is removed. This may not be what the user of the font wants or intends, even though they may appreciate not having visibly broken clusters. As a result, they may produce text that looks fine with some fonts but is defective with others.

Now I well understand that the ability to substitute out the dotted circle is precisely what some would like in order to be able to work around invalid clusters caused by Unicode properties that are incorrect or where the USE cluster model is incomplete (e.g., Tai Tham). However, working around the issue will not fix the issue and will not guide the community to having conforming fonts, rather it will encourage and allow proliferation of fonts that circumvent the problem. Going for the short term work-around rather than doing the hard work (which, unfortunately can take years), I think, would actually mean solutions take longer because there is no continuing issue to encourage those involved with a particular script to push for a solution that will meet the needs of the script and be conformant with Unicode. So I think the correct place to do U+25CC injection is after GSUB before GPOS.

Positioning multiple marks on a dotted circle is desirable in some cases and is possible by explicitly adding U+25CC to the character run. The exception to that is the case Norbert has raised in which U+25CC (and the GB class) is currently not treated as a Base, and therefore cannot be subjoined. I think that's a reasonable adjustment to make. Therefore, let's update ISC so that:

25CC ; Consonant # So DOTTED CIRCLE

Richard57 commented 5 years ago

The big problem is with the dotted circle that is inserted between combining marks, e.g. in Tibetan in NFC. What in current Unicode properties even declares that at most one canonical class of permutations of a set of combining marks constitutes a valid cluster when placed after a consonant? In the Unicode scheme, if a cluster is valid, then permuting its combining marks does not render it invalid. Remember that the behaviour of the USE does not drive Unicode properties.
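The point about permuted marks can be checked with Python's standard `unicodedata` module. This is only a sketch of canonical equivalence for two Tibetan vowel signs with different non-zero combining classes; the helper name `canonically_equivalent` is hypothetical, and the example says nothing about how any particular shaper validates the sequence.

```python
import unicodedata

KA = "\u0F40"        # TIBETAN LETTER KA
VOWEL_AA = "\u0F71"  # TIBETAN VOWEL SIGN AA, canonical combining class 129
VOWEL_U = "\u0F74"   # TIBETAN VOWEL SIGN U, canonical combining class 132

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

typed_order = KA + VOWEL_U + VOWEL_AA
canonical_order = KA + VOWEL_AA + VOWEL_U
```

Here `canonically_equivalent(typed_order, canonical_order)` is true: normalization sorts the two marks by combining class, so a shaper that accepts only one of the two orders treats canonically equivalent text differently.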

The problem is not non-conforming fonts; it is non-conforming rendering engines.

Taking action to prevent fonts from curing the problem of misapplied dotted circles will prevent people from using their traditional writing system with a standardised encoding, and is therefore contrary to the UK's Equality Act 2010.

mikeday commented 5 years ago

We have been looking at USE for clues about how to specify dotted circle insertion for the Indic2 shaping model (https://github.com/n8willis/opentype-shaping-documents/issues/76) and this thread is helping to clarify the considerations behind how it is supposed to work.

If I am understanding the proposal correctly, the idea is to delay inserting the U+25CC glyph until after all of the GSUB substitutions have been applied, so that the font cannot (helpfully) attempt to remove or modify this glyph and thus change the behaviour of the shaper by "fixing" broken clusters.

This sounds like it could be achieved by inserting an immutable placeholder during cluster analysis, before GSUB, to take the place of the missing base consonant during the subsequent processing and reordering stages. The placeholder could simply be the U+25CC glyph itself, annotated with a flag to indicate that it will not participate in substitutions, although exactly how it is implemented should not matter.
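One way to picture such a placeholder is a per-glyph flag that substitutions respect. The Python sketch below is an assumption-laden illustration: the `Glyph` class, its `locked` flag, and both helper functions are hypothetical and do not correspond to any shaper's actual data structures.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    name: str
    locked: bool = False  # hypothetical flag: shaper-inserted, immune to GSUB

def insert_placeholder(cluster):
    """Put a locked dotted circle in front of a cluster that lacks a base.

    The caller is assumed to have already decided that the base is missing.
    """
    return [Glyph("dottedCircle", locked=True)] + list(cluster)

def substitute(glyphs, target, replacement):
    """Single substitution that leaves locked glyphs untouched."""
    return [
        g if g.locked or g.name != target else Glyph(replacement)
        for g in glyphs
    ]
```

With this flag, a font rule that replaces `dottedCircle` still affects a dotted circle that was encoded in the text, but not the one inserted by `insert_placeholder`.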

@xadxura, does this describe what you are suggesting?

We are also wrestling with some related issues, such as how dotted circle insertion should treat decomposed matras:

Take, for example, <U+09CB ো BENGALI VOWEL SIGN O>, which decomposes into <U+09C7 ে BENGALI VOWEL SIGN E, U+09BE া BENGALI VOWEL SIGN AA>. If I take the spec's recommendation to "form separate clusters for each mark", this would mean I'd get <Sign E, Dotted Circle> and <Dotted Circle, Sign Aa> post-reorder.

and also how dotted circle insertion should interact with explicit/implicit reph formation, so it would be great to discuss this topic further; we would like to have a principled design that isn't simply a side-effect of how we implement the shaping of valid syllables.
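The decomposed-matra case quoted above can be reproduced in a few lines of Python. The decomposition comes from the standard `unicodedata` module; the set `PRE_BASE` and both function names are assumptions for this sketch, and the shared-circle variant is included only for contrast, not because anyone in this thread proposes it.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"
PRE_BASE = {"\u09C7"}  # BENGALI VOWEL SIGN E is placed before its base

def per_mark_clusters(matra):
    """Decompose a base-less matra and give every part its own dotted circle.

    Each cluster is returned in post-reordering order.
    """
    clusters = []
    for part in unicodedata.normalize("NFD", matra):
        if part in PRE_BASE:
            clusters.append(part + DOTTED_CIRCLE)
        else:
            clusters.append(DOTTED_CIRCLE + part)
    return clusters

def shared_base_cluster(matra):
    """Contrast case: all parts share one inserted dotted circle."""
    parts = unicodedata.normalize("NFD", matra)
    pre = "".join(p for p in parts if p in PRE_BASE)
    post = "".join(p for p in parts if p not in PRE_BASE)
    return pre + DOTTED_CIRCLE + post
```

For U+09CB, `per_mark_clusters` yields the two clusters <Sign E, Dotted Circle> and <Dotted Circle, Sign Aa> from the quote, while `shared_base_cluster` yields a single <Sign E, Dotted Circle, Sign Aa>.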

mhosken commented 5 years ago

There are regularly use cases where, due to a problem in the encoding model of a script, dotted circles are inserted by the shaping engine, when they should not. The problem is that after such issues are reported and resolved, previous versions of engines are not updated and along with them the applications that depend on them. If the dotted circle could be substituted away, fonts could be created that fix the problems of older systems. The argument that we can fix all known problems now and so not need this facility, is wrong in that we are never going to get it all completely right and that there are always going to be problems that it would be good to work around. Therefore, I propose that the USE on all implementations should provide the facility to substitute the error dotted circle away where needed.

behdad commented 5 years ago

I agree it's best to get out of the font's way.

NorbertLindenberg commented 5 years ago

The solution I proposed would allow the font to substitute any dotted circle away, but @dscorbett argues in harfbuzz/harfbuzz#1924 that that’s not enough – fonts also need the ability to distinguish between a dotted circle that’s present in the input text and a dotted circle inserted by the shaping engine. @mhosken and @Richard57, is that necessary based on your experience?

Richard57 commented 5 years ago

While intended dotted circles usually occur at the start of a run of word characters, I have seen them follow word characters, for example in the illustration of Thai vowel symbols used only in closed syllables. Such dotted circles are at risk of being misinterpreted as erroneous.

There's also the standards compliance issue. Dotted circles in the backing store should not be deleted by the font, while dotted circles wrongly inserted by the shaping engine are fair game for correction. I therefore consider being able to distinguish the two types of dotted circle highly desirable.

mhosken commented 5 years ago

If dotted circles are inserted during the syllable analysis phase, a font may convert dotted circles in the underlying data into a different glyph ID during one of the pre-analysis features and so provide itself with a separation of concerns for later features.
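A sketch of that idea in Python, under the assumption (taken from the comment above) that some feature runs before syllable analysis. The glyph name `dottedCircle.encoded`, the three function names, and the crude "mark at run start or after a space" validity test are all hypothetical.

```python
ENCODED = "dottedCircle.encoded"  # hypothetical glyph for circles from the text
INSERTED = "dottedCircle"         # glyph the engine inserts for U+25CC

def pre_analysis_feature(glyphs):
    """Font side: retag dotted circles that came from the backing store."""
    return [ENCODED if g == INSERTED else g for g in glyphs]

def syllable_analysis(glyphs, marks):
    """Engine side: insert a dotted circle before a mark with no base."""
    out = []
    for g in glyphs:
        if g in marks and (not out or out[-1] == "space"):
            out.append(INSERTED)
        out.append(g)
    return out

def later_feature(glyphs):
    """Font side: remove only the circles the engine inserted."""
    return [g for g in glyphs if g != INSERTED]
```

Run in this order, the encoded dotted circle survives as `dottedCircle.encoded`, while the one the engine inserts later can be targeted separately by the font.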