carpedm20 / emoji

emoji terminal output for Python

Should the variation selector be considered an emoji? #262

Closed marmistrz closed 1 year ago

marmistrz commented 1 year ago

Consider the following Unicode sequence, representing a single emoji: 😀︎

import emoji
s = b'\\U0001f600\\ufe0f'.decode('unicode_escape')
for c in s:
    print(emoji.is_emoji(c))

I would expect this code to print True twice. However, the actual behavior is:

True
False

Is this the expected behavior?

cvzi commented 1 year ago

Can you elaborate why you would expect it to be considered an emoji?

Emoji are generally graphical symbols. The emoji selector character on its own is invisible.
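To make the two code points in the original example concrete, they can be inspected with Python's standard `unicodedata` module. This is only an illustrative sketch and does not use the emoji package; the helper name `describe` is made up for this example.

```python
import unicodedata

# The string from the original post: a face emoji followed by a variation selector.
s = "\U0001f600\ufe0f"

def describe(text):
    """Return (code point, Unicode name, general category) for each character."""
    return [
        (f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>"), unicodedata.category(c))
        for c in text
    ]

for row in describe(s):
    print(row)
# ('U+1F600', 'GRINNING FACE', 'So')
# ('U+FE0F', 'VARIATION SELECTOR-16', 'Mn')
```

The second character has category `Mn` (nonspacing mark): it has no glyph of its own and only requests a presentation style for the character before it.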

marmistrz commented 1 year ago

My intent is to detect strings which represent sequences of emojis only (so that I can use a different font size). This means that I want to return YES for strings like:

😀︎
😀︎😀︎😀︎
🤚🏿

but NO for

something😀︎
😀︎dfaffd😀︎s😀︎

Also, I'd expect a YES answer for the example from the OP, '\U0001f600\ufe0f'.

The natural approach to do this is

return all(emoji.is_emoji(c) for c in text)

which returns False for the example from the OP. That this approach fails when a variation selector is present is far from obvious, and I expect a lot of code in the wild uses this buggy approach.

Since emoji.is_emoji only works on the character level, I'll be happy to contribute a helper function that would properly recognize whether a string contains only emojis.

cvzi commented 1 year ago

There is a new function, analyze() [added in the current version]; it can split a string into non-emoji characters and multi-character emoji.

I think your use-case can be achieved with this:

all(isinstance(m.value, emoji.EmojiMatch) for m in emoji.analyze(my_string, non_emoji=True))

To understand how it works look at the output of:

list(m.value for m in emoji.analyze(my_string, non_emoji=True))


cvzi commented 1 year ago

Maybe we should have a warning in the Readme that explains that it is generally a bad idea to look at individual characters when dealing with emoji 🤔


marmistrz commented 1 year ago

Given how error prone it is, how about exposing this functionality in emoji itself?

cvzi commented 1 year ago

I guess we could add it. Do you have a good name for a function in mind?

marmistrz commented 1 year ago

Some ideas:

marmistrz commented 1 year ago

If you have decided upon what the name should be, let me know and I'll submit a PR.

cvzi commented 1 year ago

Sorry, I forgot about it.

Not 'sequence' because 'sequence' has a different meaning in the context of Unicode.org.

For the others, I don't really have a preference. pure_emoji is short, so maybe that one.

marmistrz commented 1 year ago

I used purely_emoji because pure_emoji might mislead the user that there are pure and impure emojis.