Thoughts on supporting multiple versions of unicode? (for grapheme clusters)

In wezterm, to cope with the frankly very messy state of text handling in the terminal ecosystem, we can switch between different versions of Unicode at runtime for different regions of output (https://wezfurlong.org/wezterm/config/lua/config/unicode_version.html).

Currently this affects only width determination and emoji presentation, but earlier today I was looking at a situation with 👩‍🚒 where wezterm was treating it as a single grapheme but the shell considered it to be two separate codepoints.

I'm assuming that using an older version of the Unicode data would produce the same result as the shell. If that assumption is true:

I'd like to be able to choose, say, Unicode 8 segmentation at runtime, perhaps via a constructor parameter to Graphemes, so that wezterm can adjust its segmentation to match the unicode_version setting.

To support that in finl_unicode, I think your codegen could be modified relatively easily to iterate over a range of versions and output a separate set of tables for each. You'd probably want that gated behind a feature flag (perhaps even one per version?), as I doubt most people would want to do this.
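To make the shape of that idea concrete, here is a minimal sketch of per-version table selection behind a feature flag. Everything in it is hypothetical: `BreakTables`, `UnicodeVersion`, `tables_for`, and the `unicode-8` feature name are invented for illustration and are not part of finl_unicode, and the tables carry only a label where the generated data would go.

```rust
/// Hypothetical stand-in for one version's generated grapheme-break tables.
/// Real tables would hold property ranges; this sketch keeps only a label.
pub struct BreakTables {
    pub label: &'static str,
}

/// Hypothetical version selector that a constructor could accept.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum UnicodeVersion {
    V8,
    Latest,
}

// Always compiled in: the tables most users want.
static TABLES_LATEST: BreakTables = BreakTables { label: "latest" };

// Only compiled in when the (assumed) `unicode-8` feature is enabled,
// so users who never ask for old versions pay nothing for them.
#[cfg(feature = "unicode-8")]
static TABLES_V8: BreakTables = BreakTables { label: "8.0.0" };

/// Returns the tables for `version`, or `None` if that version was not
/// compiled in.
pub fn tables_for(version: UnicodeVersion) -> Option<&'static BreakTables> {
    match version {
        UnicodeVersion::Latest => Some(&TABLES_LATEST),
        #[cfg(feature = "unicode-8")]
        UnicodeVersion::V8 => Some(&TABLES_V8),
        #[cfg(not(feature = "unicode-8"))]
        UnicodeVersion::V8 => None,
    }
}
```

Returning `Option` lets a caller such as wezterm fall back to the default tables when the requested version was not built in, rather than failing to compile or panicking.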

Are you open to having something like this in your crate?

I don’t think this is a Unicode version issue, but rather that, as far as I know, no shell is grapheme-aware. If, for example, you type 🇧🇫 into any shell and then hit delete, it will delete only 🇫 and leave 🇧 in its place. You would need to revert to Unicode 5.0.0 to no longer have flags. Since the Unicode standard is rigorously backwards-compatible, I’m not sure that reverting to old versions of Unicode is ever the correct behavior. I would also note that 👩‍🚒 is represented by 👩 + ZWJ + 🚒, and its rendering as 👩‍🚒 is independent of the Unicode version: any pair of emoji separated by ZWJ will be treated as a single grapheme in all recent versions of Unicode.
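The two examples above can be checked with the Rust standard library alone. This is a small sketch, not anything from finl_unicode or a real shell: `code_points` and `delete_last_code_point` are hypothetical helpers, and the second one only imitates a delete that works on code points rather than graphemes.

```rust
/// Lists the Unicode scalar values of `s` as "U+XXXX" strings.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

/// Imitates a delete that is not grapheme-aware: it removes only the
/// last scalar value, regardless of grapheme cluster boundaries.
fn delete_last_code_point(s: &str) -> &str {
    match s.char_indices().last() {
        // Slice up to the byte offset where the last scalar value starts.
        Some((idx, _)) => &s[..idx],
        None => s,
    }
}
```

Calling `code_points("👩‍🚒")` yields U+1F469, U+200D (ZWJ), U+1F692, and `delete_last_code_point("🇧🇫")` leaves the lone regional indicator 🇧 (U+1F1E7), which is the behavior described for the shell.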

dahosek / finl_unicode
