dahosek / finl_unicode

Unicode support for the finl project
Apache License 2.0

Legacy segmentation? #9

pickfire closed 1 year ago

pickfire commented 1 year ago

I came across finl_unicode via https://www.finl.xyz/2022/08/29/announcing-finl_unicode-1-0-0/ and found the performance improvements interesting. I am evaluating whether finl_unicode could replace unicode-segmentation and unicode-general-category in https://github.com/helix-editor/helix, but there are some parts of the README I don't quite understand.

I also do not support legacy clustering algorithms which are supported by unicode-segmentation.

What does "legacy clustering algorithms" mean? Does it affect segmentation in some languages? Could the performance improvements in finl_unicode be contributed upstream to the unicode-* crates so that they benefit as well?

dahosek commented 1 year ago

On legacy clustering, per UAX29,

the extended grapheme cluster boundaries are recommended for general processing, while the legacy grapheme cluster boundaries are maintained primarily for backwards compatibility with earlier versions of this specification.

It does appear to affect segmentation in some languages. In particular, with extended segmentation, spacing marks are considered part of the grapheme, as are 26 prepend characters. (Incidentally, it appears that the example given of กำ being a single cluster in Thai is incorrect, which, from my limited knowledge of Thai, I would expect to be the case.) Most of the impact is on South Asian languages, where the equivalent sequence of characters would be considered a single grapheme (the Thai abugida shares a common ancestor with the Indic scripts).
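To make the legacy/extended distinction concrete, here is a minimal sketch in plain Rust. It is not the finl_unicode or unicode-segmentation API, and the `Class` enum, `classify` table, and `clusters` function are hypothetical names with a tiny hand-picked subset of code points. It models only three UAX29 rules: GB9 (no break before Extend), plus GB9a (no break before SpacingMark) and GB9b (no break after Prepend), which apply to extended clusters only. Every other rule (Hangul, emoji, regional indicators, CR/LF, and so on) is ignored.

```rust
/// Simplified grapheme-break classes: a small subset of the UAX29 property values.
#[derive(Clone, Copy)]
enum Class {
    Extend,
    SpacingMark,
    Prepend,
    Other,
}

/// Hypothetical, hand-picked table for illustration only.
/// A real implementation derives this from the Unicode data files.
fn classify(c: char) -> Class {
    match c {
        '\u{0300}'..='\u{036F}' => Class::Extend,      // combining diacritical marks
        '\u{094D}' => Class::Extend,                   // DEVANAGARI SIGN VIRAMA
        '\u{093E}'..='\u{0940}' => Class::SpacingMark, // DEVANAGARI VOWEL SIGNS AA, I, II
        '\u{0600}'..='\u{0605}' => Class::Prepend,     // Arabic number signs
        _ => Class::Other,
    }
}

/// Splits `text` into clusters using only rules GB9, GB9a, and GB9b.
/// With `extended == false`, GB9a and GB9b are skipped (legacy behaviour).
fn clusters(text: &str, extended: bool) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    let mut prev: Option<Class> = None;
    for c in text.chars() {
        let cur = classify(c);
        let join = match (prev, cur) {
            (None, _) => false,                        // start of text: always begin a cluster
            (Some(_), Class::Extend) => true,          // GB9: legacy and extended
            (Some(_), Class::SpacingMark) => extended, // GB9a: extended only
            (Some(Class::Prepend), _) => extended,     // GB9b: extended only
            _ => false,
        };
        if join {
            // `join` is only true when a previous character exists,
            // so `out` already holds at least one cluster.
            if let Some(last) = out.last_mut() {
                last.push(c);
            }
        } else {
            out.push(c.to_string());
        }
        prev = Some(cur);
    }
    out
}
```

Under these assumptions, `clusters("\u{0915}\u{093F}", false)` (the Devanagari syllable कि) yields two clusters, while `clusters("\u{0915}\u{093F}", true)` yields one. A base letter followed by a combining accent, such as `"e\u{0301}"`, stays a single cluster in both modes.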

As for whether the performance benefits can be sent upstream: finl_unicode isn’t really downstream from the other crates. I wrote my implementation from scratch and avoided looking at the other code (other than being occasionally horrified when I looked at the implementation of unicode-categories). I somehow missed unicode-general-category when I was writing my code. From its description, however, I would expect it to be comparable in features and functionality to finl_unicode.

I think there might be some issues with incorporating it into helix, though: only forward iteration is possible, and there is no equivalent to cursor (I didn’t need either functionality).