asciifaceman / rslice

Golang rune slice ([]rune) operations
MIT License

RFC 0 - Grapheme Cluster Handling #1

Open asciifaceman opened 10 months ago

asciifaceman commented 10 months ago

User dolanor@hachyderm.io on Mastodon asked:

does it handle grapheme cluster complexity?

Some initial investigation suggests that some functions do. Words(), for example, counts groupings of non-whitespace characters, so it should typically find the boundaries just fine.

However, issues appear in functions such as ShiftLeft and ShiftRight, which will split apart graphemes that span more than one rune (e.g. กำ, which splits into [0] and [1]).
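To make the split concrete, here is a minimal sketch using only the standard library. It assumes the example string is U+0E01 followed by U+0E33, and `rotateLeft` is a hypothetical stand-in for a rune-level shift, not rslice's actual ShiftLeft. It also prints `unicode.IsMark` for each rune: U+0E33 has general category Lo rather than a mark category, which may be why a simple "is this a combining mark" check does not catch it.

```go
package main

import (
	"fmt"
	"unicode"
)

// rotateLeft is a hypothetical rune-level shift used only for illustration.
// It returns a copy of rs with the first rune moved to the end.
func rotateLeft(rs []rune) []rune {
	if len(rs) < 2 {
		return append([]rune(nil), rs...)
	}
	out := make([]rune, 0, len(rs))
	out = append(out, rs[1:]...)
	return append(out, rs[0])
}

func main() {
	// Assumed encoding of the example: THAI CHARACTER KO KAI + SARA AM.
	rs := []rune("\u0e01\u0e33")

	// One rendered cell, but two runes in the slice.
	for i, r := range rs {
		fmt.Printf("[%d] %U mark=%t\n", i, r, unicode.IsMark(r))
	}

	// Shifting rune by rune puts the vowel sign ahead of its base consonant.
	fmt.Printf("%q\n", string(rotateLeft(rs)))
}
```

After the rotation the vowel sign sits in front of the consonant it belonged to, which is the broken grapheme described above.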

I don't know if or how I should handle this situation. Ideally I should shift the entire grapheme, but detecting how many runes it spans may prove troublesome. ZWC can be reasonably detected and handled, but in my quick investigation before writing this I was unable to determine how many runes I should shift for the example character above.

Test cases also need to be written to cover these edge cases.

Should this be planned work to support grapheme complexity? I am far from an expert on handling unicode.

Sidenote: It doesn't help that testing locally is difficult, since the IDE struggles to handle these characters properly when their widths are ambiguous, and they throw off the IDE cursor.

Moreover, this issue may interfere with future work on building multi-line []rune values that process newline characters and conform to a width/height. The width/height is the expected RENDER size, while the []rune is n*y runes longer than the rendered string (where n is the number of complex graphemes and y is the number of extra runes each one expands into). กำ can be drawn in a single cell, yet it spans 2 runes.

This is detectable, with some edge cases, using a library such as https://github.com/mattn/go-runewidth on a whole string, but not at the rune-by-rune level from what I've tested. It may require a form of "lookahead", but I don't know how to approach that just yet.
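One possible shape for that lookahead, as a rough sketch only: group each rune with any following runes that "extend" it, then shift whole groups. Everything here is an assumption for illustration. `extendsCluster`, `clusters`, and `rotateClustersLeft` are hypothetical names, and the rule used (Unicode marks plus a tiny hand-written exception list) is a crude approximation, not the Unicode grapheme cluster algorithm from UAX #29. Emoji ZWJ sequences, Hangul jamo, and regional indicators are not handled.

```go
package main

import (
	"fmt"
	"unicode"
)

// extendsCluster reports whether r should stay attached to the rune before it.
// Approximation: any Unicode mark, plus two spacing vowel signs that have
// general category Lo and are therefore missed by unicode.IsMark.
func extendsCluster(r rune) bool {
	switch r {
	case 0x0E33, // THAI CHARACTER SARA AM
		0x0EB3: // LAO VOWEL SIGN AM
		return true
	}
	return unicode.IsMark(r)
}

// clusters splits rs into groups by looking ahead from each base rune and
// absorbing every following rune for which extendsCluster is true.
func clusters(rs []rune) [][]rune {
	var out [][]rune
	for i := 0; i < len(rs); {
		j := i + 1
		for j < len(rs) && extendsCluster(rs[j]) {
			j++
		}
		out = append(out, rs[i:j])
		i = j
	}
	return out
}

// rotateClustersLeft returns a copy of rs with the first cluster moved to
// the end, so a multi-rune grapheme travels as one unit.
func rotateClustersLeft(rs []rune) []rune {
	cs := clusters(rs)
	if len(cs) < 2 {
		return append([]rune(nil), rs...)
	}
	out := make([]rune, 0, len(rs))
	for _, c := range cs[1:] {
		out = append(out, c...)
	}
	return append(out, cs[0]...)
}

func main() {
	rs := []rune("\u0e01\u0e33ab")
	fmt.Println("runes:", len(rs), "clusters:", len(clusters(rs)))
	fmt.Printf("%q\n", string(rotateClustersLeft(rs)))
}
```

The cluster count could also serve as a first guess at rendered width when each cluster occupies one cell, though that does not hold for wide characters such as CJK. A hand-maintained exception list will not scale, so a table-driven segmenter that follows UAX #29 is likely the more robust route.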

I am leaning towards handling graphemes but am unsure how to do so.

References

asciifaceman commented 10 months ago

https://github.com/rivo/uniseg