alvinlindstam / grapheme

A python package for grapheme aware string handling
MIT License

Documentation out of sync for `grapheme.graphemes` call #10

Open EmilStenstrom opened 4 years ago

EmilStenstrom commented 4 years ago

The documentation gives this code:

>>> rainbow_flag = "🏳️‍🌈"
>>> [codepoint for codepoint in rainbow_flag]
['🏳', '️', '‍', '🌈']
>>> list(grapheme.graphemes("multi codepoint grapheme: " + rainbow_flag))
['m', 'u', 'l', 't', 'i', ' ', 'c', 'o', 'd', 'e', 'p', 'o', 'i', 'n', 't', ' ', 'g', 'r', 'a', 'p', 'h', 'e', 'm', 'e', ':', ' ', '🏳️‍🌈']

In reality, this is what the same code outputs locally with Python 3.8, in the default macOS Terminal:

>>> rainbow_flag = "🏳️‍🌈"
>>> [codepoint for codepoint in rainbow_flag]
['🏳', '️', '\u200d', '🌈']
>>> list(grapheme.graphemes("multi codepoint grapheme: " + rainbow_flag))
['m', 'u', 'l', 't', 'i', ' ', 'c', 'o', 'd', 'e', 'p', 'o', 'i', 'n', 't', ' ', 'g', 'r', 'a', 'p', 'h', 'e', 'm', 'e', ':', ' ', '🏳️\u200d🌈']

I would expect the flag emoji to be held together as one character, like in the documentation.

alvinlindstam commented 4 years ago

This is interesting, I'm wondering when it changed. I'm quite sure that the documentation code was the actual output when I originally wrote it.

I consider this a documentation bug, in that the example does not show what the function does in a good way. The function does keep the rainbow flag intact as one character/grapheme. The issue is that `repr` (which controls how a value is displayed at the interactive prompt) returns this not very useful string for it:

>>> print(rainbow_flag)
🏳️‍🌈
>>> print(repr(rainbow_flag))
'🏳️\u200d🌈'
>>> rainbow_flag
'🏳️\u200d🌈'
>>> repr(rainbow_flag)
"'🏳️\\u200d🌈'"
>>> rainbow_flag.encode('unicode-escape')
b'\\U0001f3f3\\ufe0f\\u200d\\U0001f308'
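A likely explanation for the escaped output, sketched with only the standard library: `repr` of a `str` escapes every code point for which `str.isprintable()` is false. U+200D ZERO WIDTH JOINER is in general category Cf (format), which counts as non-printable, while the variation selector U+FE0F is category Mn and is left alone. The flag is spelled out with escapes here so it survives copy and paste; nothing below uses the grapheme package.

```python
import unicodedata

# The rainbow flag, written with escapes: white flag, VS16, ZWJ, rainbow.
rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"

# One row per code point: hex value, general category, and isprintable().
rows = [
    (f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isprintable())
    for ch in rainbow_flag
]
for row in rows:
    print(*row)

# repr() escapes the code points for which isprintable() is False.
print(repr(rainbow_flag))
```

Only the ZWJ row should report `False`, which matches the single `\u200d` escape in the output above.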

It should be the case that `list(grapheme.graphemes("multi codepoint grapheme: " + rainbow_flag))[-1] == rainbow_flag` in your snippet.

I'll see if I can understand why repr does this, and if I can find a different multi-scalar grapheme cluster that can be used instead in the demo that does not look weird using repr. Input on that is appreciated.
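As one possible way to screen replacement examples, here is a small sketch that checks whether `repr` shows a string without escapes. The helper name `repr_is_clean` and the candidate strings are illustrative assumptions, not part of the grapheme package. Under UAX #29 each candidate should be a single extended grapheme cluster, but that is not verified against `grapheme.graphemes` here.

```python
def repr_is_clean(text: str) -> bool:
    """Return True if repr() shows text without any backslash escapes."""
    # Hypothetical helper; assumes text contains no quotes or backslashes.
    return repr(text) == "'" + text + "'"


# Candidate multi-code-point clusters, written with escapes.
candidates = {
    "rainbow flag (ZWJ sequence)": "\U0001F3F3\uFE0F\u200D\U0001F308",
    "flag of Sweden (regional indicators)": "\U0001F1F8\U0001F1EA",
    "thumbs up + skin tone modifier": "\U0001F44D\U0001F3FD",
    "e + combining acute accent": "e\u0301",
}

results = {name: repr_is_clean(text) for name, text in candidates.items()}
for name, clean in results.items():
    print(f"{name}: {len(candidates[name])} code points, clean repr: {clean}")
```

Any candidate without a zero width joiner or other format character should come out clean, so a regional-indicator flag or a skin-tone sequence may be a reasonable substitute in the demo.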

EmilStenstrom commented 4 years ago

@alvinlindstam Happy to hear it's only a documentation bug. I'm afraid I have no idea when repr changed, or what a better multi-scalar grapheme would be.

Also: Thanks for this library, it was just what I needed to build my "convert datetimes across time zones with emoji"-library ;) Reference: https://github.com/EmilStenstrom/emojizones/blob/master/emojizones/convert.py#L84