There is no such thing as “Invalid Combining Marks” in Arabic

khaledhosny commented 3 years ago

The spec has an “Invalid Combining Marks” section, but it is entirely something invented by Microsoft; there are no Arabic orthographic rules or Unicode encoding requirements that dictate what a “valid consonant base” for a given mark is, or which mark combinations are “invalid”.

I regularly get bug reports from people who try some uncommon mark combination and get the mysterious dotted circles inserted between the marks (in Microsoft products). There are many reasons one would want to use an uncommon or unusual mark combination; it is not the engine’s role to invent its own orthographic rules and impose them upon its users.

This is mostly a bug report against Microsoft’s implementation, since it is the only known implementation that has this misfeature (thankfully), but since this text is present in the spec, I figure here is a good place to report it as well.

Document Details

⚠ Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.

ID: 2296e659-a278-4505-3dbe-54e2f8e51a54
Version Independent ID: 5af9216e-b435-4a4c-3c5a-94756409f8f5
Content: Developing OpenType Fonts for Arabic Script - Typography
Content Source: typographydocs/script-development/arabic.md
Product: typography
GitHub Login: @alib-ms
Microsoft Alias: alib

roozbehp commented 3 years ago

Unicode even recommends against such invalidation for Arabic. From https://unicode.org/reports/tr53/#Dotted_circles:

“Some rendering engines will insert a dotted circle for what it understands to be an invalid sequence. This is a problem in Arabic script because something that appears invalid may actually be valid text in some lesser known orthography of a minority language or in the Quran. For example, the Microsoft Windows text rendering engine, described in [Microsoft], inserts a dotted circle in combinations of certain Quranic marks that are known to appear with each other in the Quran.

Such spell-checking processes are best implemented at a higher level than a rendering engine. Also, a dotted circle insertion algorithm that displays all canonically equivalent sequences identically is hard to design and the result may be counter-intuitive for its users.”
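To make the canonical-equivalence point concrete, here is a minimal sketch using Python's standard `unicodedata` module. It shows two orderings of the same marks that Unicode treats as canonically equivalent, and a check on raw code point order that nevertheless treats them differently. The `naive_order_check` function is purely hypothetical and is not a description of any real engine's algorithm.

```python
import unicodedata

BEH = "\u0628"     # ARABIC LETTER BEH
FATHA = "\u064E"   # ARABIC FATHA, canonical combining class 30
SHADDA = "\u0651"  # ARABIC SHADDA, canonical combining class 33

# The same two marks on the same base, typed in different orders.
seq_a = BEH + SHADDA + FATHA
seq_b = BEH + FATHA + SHADDA


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def naive_order_check(text: str) -> bool:
    """Hypothetical validator (an assumption for illustration only):
    accept the text unless a fatha is directly followed by a shadda
    in the raw, unnormalized code point order."""
    return (FATHA + SHADDA) not in text


print(canonically_equivalent(seq_a, seq_b))  # True
print(naive_order_check(seq_a))              # True  (accepted)
print(naive_order_check(seq_b))              # False (rejected)
```

Any rule of this kind would have to normalize its input first, and would then flag (or accept) both spellings, which may surprise a user who typed only one of them.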

behdad commented 3 years ago

+1

tiroj commented 3 years ago

Indeed. The USE spec is explicit in distinguishing cluster validation from orthographic validation:

“The goal of the clustering logic is to enable what is graphically consistent with a given script’s rules, rather than enforcing particular orthographic or linguistic rules. Such considerations should be applied at another layer, such as a spelling checker.”

If other shaping engine implementations observed the same distinction, far fewer problems would result for users, especially people working with the vast quantities of non-standard orthographies and pre-modern texts.
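As a rough illustration of what “another layer” could look like, the sketch below reports unusual mark runs without ever modifying the text or inserting U+25CC. Everything here is an assumption made for illustration: `mark_runs`, `advisory_check`, and the caller-supplied `expected_runs` set are hypothetical names, and the set stands in for whatever a spelling checker for one particular orthography might consider usual.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"


def mark_runs(text: str):
    """Yield (start_index, marks) for each run of nonspacing marks (category Mn)."""
    run, start = [], None
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == "Mn":
            if not run:
                start = i
            run.append(ch)
        elif run:
            yield start, "".join(run)
            run = []
    if run:
        yield start, "".join(run)


def advisory_check(text: str, expected_runs: set) -> list:
    """Return multi-mark runs that are not in expected_runs.

    The text itself is never altered; the result is advice only."""
    return [
        (i, marks)
        for i, marks in mark_runs(text)
        if len(marks) > 1 and marks not in expected_runs
    ]


BEH, FATHA, SHADDA, SUKUN = "\u0628", "\u064E", "\u0651", "\u0652"
text = BEH + SHADDA + SUKUN  # an uncommon combination
report = advisory_check(text, expected_runs={SHADDA + FATHA, FATHA + SHADDA})

print(len(report))            # one advisory entry for the shadda + sukun run
print(DOTTED_CIRCLE in text)  # the text passed on for rendering is untouched
```

The design choice is that the check is opt-in and orthography-specific: a user working with a pre-modern text can ignore or disable it, and rendering is unaffected either way.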

MicrosoftDocs / typography-issues

There is no such thing as “Invalid Combining Marks” in Arabic #625
