alvinlindstam / grapheme

A python package for grapheme aware string handling
MIT License

Info regarding future and correctness of the project #13

Closed rsalmei closed 3 years ago

rsalmei commented 3 years ago

Hey man, I'm the author of alive-progress. I'm struggling to correctly support emojis in https://github.com/rsalmei/alive-progress/issues/19, and I think this project could help me.

My brute force validation ensures all chars described on that file are detected, even when concatenated with other chars. You can see in the image that it fails where: 1. two skin tones are used one after the other (I expected two graphemes, not one); 2. an ascii char followed by a skin tone and another ascii (expected three graphemes, not the skin tone of the ascii char); and 3. two ascii followed by a skin tone (same as 2. before). But it is ok, it works in the vast majority (and the regex dependency demonstrated the same results).


So, I'm thinking now about how to continue my wide chars/emoji support:

Thank you man!

alvinlindstam commented 3 years ago

Hi @rsalmei

Sorry about the delay, thanks for the questions.

> Please, do you really intend to keep updating this project? For every new Unicode version?

Yes, I plan to. Upgrading unicode versions is a 15 minute job for me, assuming they don't change anything significant in the grapheme/boundary annex. I've done it for a few years and enjoyed it so far. This is a small open source project, so anything can happen with it, of course. If I disappear, the repo describes how to upgrade the unicode version if one wants to fork it or vendor it.

alvinlindstam commented 3 years ago

> Performance doesn't really matter to me, since I've implemented a spinner compiler just for this, but yours seems to be fast anyway. It does not use any binary extension, does it? I'm asking because there's a cython folder with a few .c files in my site-packages...

No C extensions in this package, and currently no dependencies. I was exploring some cython extension for this to speed it up, but I wouldn't want to make it required.

There is https://pypi.org/project/PyICU/ for those preferring C performance. There might be space for an intermediate solution which uses C extensions but doesn't depend on a non-pip dependency (ICU) being installed, though I'm not sure this project is the place for it. It probably won't happen.

alvinlindstam commented 3 years ago

> I'm only interested in correctness, and this one seems very nice. I've created a brute force test using the emoji-test.txt from unicode.org, and while testing several combinations of emojis, yours has only failed on the Fitzpatrick skin tone modifiers when used alone (but the unicode spec states that they should be used as a normal emoji when used alone). My brute force validation ensures all chars described on that file are detected, even when concatenated with other chars. You can see in the image that it fails where: 1. two skin tones are used one after the other (I expected two graphemes, not one); 2. an ascii char followed by a skin tone and another ascii (expected three graphemes, not the skin tone of the ascii char); and 3. two ascii followed by a skin tone (same as 2. before). But it is ok, it works in the vast majority (and the regex dependency demonstrated the same results).

Interesting. This project runs and passes all tests in https://www.unicode.org/Public/13.0.0/ucd/auxiliary/GraphemeBreakTest.txt, which is Unicode's example test for grapheme boundaries. As per Unicode Standard Annex #29, the breaking here is "correct".
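
For reference, each data line in GraphemeBreakTest.txt lists hex code points separated by ÷ (a boundary) or × (no boundary), followed by a `#` comment. A minimal sketch of turning one such line into its expected clusters might look like the following. The function name `parse_break_test_line` and the sample line are illustrative assumptions, not code or data taken from this package or from the file itself.

```python
def parse_break_test_line(line):
    """Return the expected grapheme clusters for one test line, or None."""
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None  # blank or comment-only line
    clusters, current = [], ""
    for token in data.split():
        if token == "\u00f7":    # ÷ : boundary here, close the open cluster
            if current:
                clusters.append(current)
                current = ""
        elif token == "\u00d7":  # × : no boundary, keep accumulating
            continue
        else:                    # hex code point such as 0061 or 1F3FB
            current += chr(int(token, 16))
    if current:
        clusters.append(current)
    return clusters


# Hypothetical line written in the file's format (not copied from the file):
# "a", a skin tone modifier, then "b".
sample = "\u00f7 0061 \u00d7 1F3FB \u00f7 0062 \u00f7\t# a, skin tone, b"
expected = parse_break_test_line(sample)
```

Assuming the package is installed, each parsed list could then be compared with `list(grapheme.graphemes("".join(expected)))`.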

This library implements the breaking rules for Extended Grapheme Clusters (http://unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table). The skin tone modifiers are classified as "Extend" code points, while the ascii chars are "Other". According to the "3.1.1 Grapheme Cluster Boundary Rules" section (rule GB9), we should never break before "Extend" code points; they always extend the preceding code point.
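
To make the effect of that rule concrete, here is a toy model of GB9 alone. It is a sketch under loud assumptions: it treats only the five skin tone modifiers (U+1F3FB..U+1F3FF) as "Extend", ignores every other rule in the annex, and is not this library's implementation. `toy_clusters` is a made-up name.

```python
# Toy model of rule GB9 only ("do not break before Extend").
# Assumption: only the emoji skin tone modifiers count as Extend here.
FIRST_MODIFIER, LAST_MODIFIER = 0x1F3FB, 0x1F3FF


def is_extend(ch):
    return FIRST_MODIFIER <= ord(ch) <= LAST_MODIFIER


def toy_clusters(text):
    clusters = []
    for ch in text:
        if clusters and is_extend(ch):
            clusters[-1] += ch   # GB9: attach to the preceding cluster
        else:
            clusters.append(ch)  # otherwise start a new cluster
    return clusters


tone_a, tone_b = "\U0001F3FB", "\U0001F3FC"
case_1 = toy_clusters(tone_a + tone_b)     # two skin tones in a row
case_2 = toy_clusters("a" + tone_a + "b")  # ascii, skin tone, ascii
case_3 = toy_clusters("ab" + tone_a)       # two ascii, then a skin tone
```

Under this model the three reported cases come out as one cluster, two clusters (`a` plus modifier, then `b`) and two clusters (`a`, then `b` plus modifier), which matches the behaviour described in the question.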

Generally, broken sequences where codepoints are not combined in expected ways may cause all sorts of weirdness in any text.

I'm not sure how you get to the expectation that they should be separate graphemes based on the emoji-test data, could you elaborate?

rsalmei commented 3 years ago

> Yes, I plan to. Upgrading unicode versions is a 15 minute job for me, assuming they don't change anything significant in the grapheme/boundary annex. I've done it for a few years and enjoyed it so far. This is a small open source project, so anything can happen with it, of course. If I disappear, the repo describes how to upgrade the unicode version if one wants to fork it or vendor it.

That's very nice, I've already committed to your lib! My next major version of alive-progress will include it 👍

rsalmei commented 3 years ago

> I'm not sure how you get to the expectation that they should be separate graphemes based on the emoji-test data, could you elaborate?

Yeah, of course! I've found this info on http://unicode.org/reports/tr51/, item 2.4 Diversity:

> When used alone, the default representation of these modifier characters is a color swatch.

But what exactly is "using a skin modifier alone"? I've assumed it means not being preceded by a human emoji. The text then goes on to explain this in more detail, and ends that section with:

> Any other intervening character causes the emoji modifier to appear as a free-standing character. Thus

[image]

So it does seem that "any other intervening character" should make the skin modifier appear as a free-standing character. What do you think? Am I right to infer that two adjacent skin tones, and an ascii char followed by a skin tone, should both be rendered split?

alvinlindstam commented 3 years ago

Unicode and text representation will probably always be a mystery :)

I'm not sure how one is supposed to represent a standalone skin tone modifier, especially if you first include an emoji and a following modifier and then a sequence of standalone modifiers. In the example of intervening character you mentioned, they seem to be using U+200B (zero width space) to separate an emoji from the skin tone.

This does boil down to a limitation of this library and of grapheme clusters in general though: how a text is rendered is ultimately up to the text rendering entity to decide and implement, including the font designer. Grapheme clusters may be a good approximation of what a human would perceive as a textual entity given a well implemented and up to date font and renderer, but they are not guaranteed to match. The specification in annex #29 is a simplification, and there will be cases where some combination of code points is rendered into different groups than one might expect, even with grapheme awareness.

Off the top of my head:

  1. Emoji sequences not implemented by the vendor.
  2. Non-standard emoji sequences (I think Microsoft, for example, allows different skin tone modifiers on individuals in a family emoji sequence) rendered on other, non-supporting platforms.
  3. Differences in how renderers display zero-width items (like U+200D) in monospaced contexts. Some display them with normal width, some don't.
  4. Regional indicator sequences (national flags) for country codes that don't exist. According to annex 29, one should consider any pair of consecutive regional indicators as a grapheme cluster, but a renderer should only render the pair as a single entity if there is a font implementation of that flag.
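
As an illustration of the last item, the pairing of regional indicators can be sketched as below. This is a toy model of that one behaviour only, with hypothetical helper names (`to_regional_indicators`, `toy_flag_clusters`); it ignores all other boundary rules and is not how this library is implemented.

```python
# Toy model: consecutive regional indicators (U+1F1E6..U+1F1FF) are grouped
# in pairs, whether or not the pair corresponds to a flag any font can draw.
def is_regional_indicator(ch):
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def to_regional_indicators(letters):
    """Map ASCII letters A-Z to the matching regional indicator symbols."""
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in letters.upper())


def toy_flag_clusters(text):
    clusters = []
    for ch in text:
        prev = clusters[-1] if clusters else ""
        # Join only when the open cluster is a single, unpaired indicator.
        if is_regional_indicator(ch) and len(prev) == 1 and is_regional_indicator(prev):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


three = to_regional_indicators("SEX")  # three indicators in a row
grouped = toy_flag_clusters(three)     # first two pair up, third stands alone
```

The segmentation says nothing about what gets drawn: a pair with no matching flag glyph is still one cluster here, while a renderer may show it as two separate symbols.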

alvinlindstam commented 3 years ago

Closing this, feels resolved

rsalmei commented 3 years ago

Yeah, thank you @alvinlindstam