dahosek / finl_unicode

Unicode support for the finl project
Apache License 2.0

Thoughts on supporting multiple versions of unicode? (for grapheme clusters) #10

Closed (wez closed 2 years ago)

wez commented 2 years ago

In wezterm, in order to deal with a frankly very messy state of text in the terminal ecosystem, we have the ability to switch to different versions of unicode at runtime for different regions of output (https://wezfurlong.org/wezterm/config/lua/config/unicode_version.html).

Currently this affects only width determination and emoji presentation, but earlier today I was looking at a situation with 👩‍🚒 where wezterm was treating it as a single grapheme but the shell considered it to be two separate codepoints.
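For reference, a small standard-library-only sketch that lists the code points in that sequence. The helper name `code_points` is made up for illustration and is not part of wezterm or finl_unicode.

```rust
/// Hypothetical helper: formats every code point of `s` as `U+XXXX`.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}
```

Calling `code_points("\u{1F469}\u{200D}\u{1F692}")` yields `U+1F469`, `U+200D`, `U+1F692`: a woman emoji, a zero width joiner, and a fire engine emoji.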

I'm making an assumption that using an older version of unicode data would produce the same result as the shell. If that assumption is true:

I'd like to be able to choose, say, Unicode 8 segmentation at runtime, perhaps via a constructor parameter to Graphemes, so that wezterm can adjust its segmentation to match the unicode version setting.

In order to support that in finl_unicode, I think your codegen could be relatively easily modified to iterate over a range of versions and output different tables for them. I think you'd probably want that gated behind a feature flag (perhaps even one per version?) as I doubt most people would want to do this.
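As a rough illustration of the constructor-parameter idea, here is a minimal sketch. Everything in it is an assumption: `RuleSet`, this `Graphemes` type, and the two crude property checks are hypothetical stand-ins, not finl_unicode's actual API or generated tables. It implements only one rule difference (whether a pictograph, ZWJ, pictograph sequence stays together) and otherwise treats each code point as its own cluster.

```rust
/// Hypothetical rule-set selector; real code would map Unicode versions
/// to generated tables instead.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum RuleSet {
    /// Assumed older behaviour: ZWJ attaches to the preceding character only.
    Legacy,
    /// Assumed newer behaviour: pictograph + ZWJ + pictograph stays together.
    Modern,
}

const ZWJ: char = '\u{200D}';

/// Crude stand-in for a generated pictographic-property table (assumption).
fn is_pictographic(c: char) -> bool {
    matches!(c as u32, 0x1F300..=0x1FAFF)
}

/// Crude stand-in for the Extend and ZWJ classes (assumption).
fn is_extend(c: char) -> bool {
    c == ZWJ || c == '\u{FE0F}'
}

/// Hypothetical iterator taking the rule set as a constructor parameter.
struct Graphemes<'a> {
    rest: &'a str,
    rules: RuleSet,
}

impl<'a> Graphemes<'a> {
    fn new(input: &'a str, rules: RuleSet) -> Self {
        Graphemes { rest: input, rules }
    }
}

impl<'a> Iterator for Graphemes<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        let mut chars = self.rest.char_indices();
        let (_, first) = chars.next()?;
        let mut end = first.len_utf8();
        // True when the cluster started with a pictograph.
        let in_emoji = is_pictographic(first);
        let mut prev = first;
        for (idx, c) in chars {
            let joined_by_zwj = self.rules == RuleSet::Modern
                && prev == ZWJ
                && in_emoji
                && is_pictographic(c);
            if !(is_extend(c) || joined_by_zwj) {
                break; // cluster boundary before `c`
            }
            end = idx + c.len_utf8();
            prev = c;
        }
        let (head, tail) = self.rest.split_at(end);
        self.rest = tail;
        Some(head)
    }
}
```

With the input `"\u{1F469}\u{200D}\u{1F692}"`, `Graphemes::new(s, RuleSet::Modern)` yields one cluster, while `RuleSet::Legacy` yields two: the woman emoji with the ZWJ attached, then the fire engine. Per-version tables behind feature flags could be selected behind such an enum.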

Are you open to having something like this in your crate?

dahosek commented 2 years ago

I don’t think this is a unicode version issue, but rather that as far as I know, no shell is grapheme aware. If, for example, you type 🇧🇫 into any shell, and then hit delete, it will delete only 🇫 and leave 🇧 in its stead. You would need to revert to Unicode 5.0.0 to no longer have flags. Since the Unicode standard is rigorously backwards-compatible, I’m not sure that reverting to old versions of Unicode is ever the correct behavior. I would note also that 👩‍🚒 is represented by 👩 + ZWJ + 🚒 and its rendering as 👩‍🚒 is independent of the Unicode version: Any pair of Emojis separated with ZWJ will be treated as a single grapheme in all recent versions of Unicode.
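The delete behaviour described above can be sketched with the Rust standard library alone. This is a simplified illustration, not how any particular shell or finl_unicode works: both function names are made up, and the "aware" variant handles only regional-indicator pairs (U+1F1E6 to U+1F1FF), pairing them from the start of the trailing run.

```rust
/// True for the regional indicator symbols used to build flag emoji.
fn is_regional_indicator(c: char) -> bool {
    matches!(c as u32, 0x1F1E6..=0x1F1FF)
}

/// What a non-grapheme-aware line editor effectively does:
/// remove the last code point.
fn delete_code_point(s: &mut String) {
    s.pop();
}

/// Hypothetical flag-aware delete: removes a trailing flag pair as a unit.
/// Indicators pair up from the start of the run, so the last two form a
/// pair only when the trailing run has an even, non-zero length.
fn delete_grapheme_flag_aware(s: &mut String) {
    let trailing = s
        .chars()
        .rev()
        .take_while(|&c| is_regional_indicator(c))
        .count();
    let to_remove = if trailing > 0 && trailing % 2 == 0 { 2 } else { 1 };
    for _ in 0..to_remove {
        s.pop();
    }
}
```

Starting from `"\u{1F1E7}\u{1F1EB}"` (🇧🇫), `delete_code_point` leaves the lone `U+1F1E7` behind, whereas `delete_grapheme_flag_aware` leaves an empty string.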