The Universal Shaping Engine was originally designed around a generic Brahmic cluster model, and has since been extended to support an (Egyptian) hieroglyph cluster model. Scripts that are neither Brahmic nor hieroglyphs are shoehorned into the Brahmic model by assigning their characters Indic syllabic and positional category values. These values in many cases don’t reflect the characters’ actual semantics – e.g., vowels of several scripts are classified as “Consonant”, while all seven Adlam diacritics are thrown into the “Nukta” bucket.
An additional longstanding problem is that the generic Brahmic cluster model doesn’t fit all Brahmic scripts. There have been long but so far inconclusive discussions on how to extend the model to Tai Tham, while Thai, Lao, Khmer, Myanmar, and the main Indic scripts all still require separate shaping engines.
I think it’s time to acknowledge that different scripts require different cluster models, and that this is best done by separating out several specialized cluster validation subsystems within the Universal Shaping Engine. (The engine still deserves its “Universal” title for its overall architecture and the generalized model of feature application and reordering.) Initial cluster validation subsystems would be generic Brahmic, hieroglyphs, and a “simple” cluster model based on Unicode grapheme clusters. Subsystems could be added as needed for Tai Tham and the Brahmic scripts that are currently still handled by separate shaping engines. The list of scripts in the documentation would then have to indicate which cluster model is used for each script.
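To make the proposal a bit more concrete, here is a minimal Python sketch of how per-script dispatch to cluster validation subsystems might look. Everything in it is an assumption for illustration: the names `ClusterModel`, `SCRIPT_CLUSTER_MODEL`, `simple_clusters`, and `segment` are hypothetical, the script-to-model assignments are examples rather than a proposed final list, and the “simple” subsystem only approximates Unicode grapheme clusters by attaching combining marks to the preceding character. It is not how any existing shaping engine is implemented.

```python
import unicodedata
from enum import Enum
from typing import Callable, Dict, List


class ClusterModel(Enum):
    """The three initial cluster models named in the proposal."""
    GENERIC_BRAHMIC = "generic Brahmic"
    HIEROGLYPH = "hieroglyph"
    SIMPLE = "simple"


# Hypothetical per-script assignments (ISO 15924 codes), for illustration only.
# This is the kind of information the documentation's script list would carry.
SCRIPT_CLUSTER_MODEL: Dict[str, ClusterModel] = {
    "Java": ClusterModel.GENERIC_BRAHMIC,
    "Egyp": ClusterModel.HIEROGLYPH,
    "Adlm": ClusterModel.SIMPLE,
}


def simple_clusters(text: str) -> List[str]:
    """Rough stand-in for grapheme clusters: one character plus any
    following combining marks (general categories Mn, Mc, Me).
    A real subsystem would follow the full Unicode segmentation rules."""
    clusters: List[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # attach the mark to the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


# Only the "simple" subsystem is sketched; the others would register here.
SUBSYSTEMS: Dict[ClusterModel, Callable[[str], List[str]]] = {
    ClusterModel.SIMPLE: simple_clusters,
}


def segment(script: str, text: str) -> List[str]:
    """Look up the script's cluster model and run the matching subsystem."""
    model = SCRIPT_CLUSTER_MODEL[script]
    subsystem = SUBSYSTEMS.get(model)
    if subsystem is None:
        raise NotImplementedError(f"no subsystem sketched for {model.value}")
    return subsystem(text)
```

The point of the table-driven lookup is that a new subsystem, say for Tai Tham, could be added by registering one more entry without touching feature application or reordering.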
Document Details
⚠ Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.
ID: 194a6d3c-4137-46e9-3a4b-44b990200986
Version Independent ID: a0c8e788-5228-aa28-670e-3ba1ac3faecd