Feature: Use muan/emojilib dataset

FredHappyface commented 8 months ago

Background

I've fallen down a bit of a rabbit hole looking for ways to search emoji by plain text, which often calls for the use of aliases. For example, '🤗' isn't called 'hug', but 'hug' is a useful alias (which this lib supports).

I installed Element on my phone a few months ago for discussion on another project and saw that it has really great search functionality, with a ton of aliases that are not present in this lib, for example ':)' for '😊'. So I started digging.

Basically, they use the following Python script during build to fetch the latest emoji and aliases from a few other sources: https://github.com/element-hq/element-android/blob/def2a8a83351c06cb65fdbd4d483ac811329b023/tools/import_emojis.py#L20

One of these is the dataset available from https://github.com/muan/emojilib which seems really good for this

The questions/ feature request

Would you accept a PR to add the aliases from muan/emojilib to this project?

Also, I noticed that the demojize function only exposes the first alias if available, so I've written my own implementation for a lib that returns an underscore-separated string of keywords. Is such a function (maybe called get_aliases) something you'd accept a PR for?

Thanks for your time and for the awesome project

cvzi commented 8 months ago

I think the aliases that are listed in muan/emojilib are too broad for this project. For example, search for ":D" in muan/emojilib: it refers to three different emoji, but this library requires unique aliases.
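
To illustrate the uniqueness problem, here is a minimal sketch that inverts an emoji-to-keywords mapping and reports every keyword claimed by more than one emoji. The sample data and the name find_ambiguous_keywords are made up for illustration; they are not taken from muan/emojilib or from this library.

```python
from collections import defaultdict

# Hypothetical sample in an emoji -> keywords shape; not real emojilib data.
SAMPLE_KEYWORDS = {
    "😀": ["grinning_face", "smile", "happy"],
    "😃": ["grinning_face_with_big_eyes", "smile", "happy"],
    "🤗": ["hugging_face", "hug"],
}


def find_ambiguous_keywords(keywords_by_emoji):
    """Return {keyword: [emoji, ...]} for keywords used by more than one emoji."""
    emoji_by_keyword = defaultdict(list)
    for emj, keywords in keywords_by_emoji.items():
        # dict.fromkeys drops duplicates within one emoji but keeps the order
        for keyword in dict.fromkeys(keywords):
            emoji_by_keyword[keyword].append(emj)
    return {kw: emjs for kw, emjs in emoji_by_keyword.items() if len(emjs) > 1}


ambiguous = find_ambiguous_keywords(SAMPLE_KEYWORDS)
print(ambiguous)
```

Every keyword in the result would need a decision about which single emoji gets it before it could be used as an alias.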

cvzi commented 8 months ago

I don't know if you saw this, but our database is actually just a python-dict. We recently needed to compress the dict into a single line, which makes it unreadable, but you can look at an older version to see how everything is stored: https://raw.githubusercontent.com/carpedm20/emoji/f14ece8475a1f2323326a4b850a209509310e470/emoji/unicode_codes/data_dict.py (Look for the key 'alias')

Extending the dict with custom aliases is possible during runtime. See https://github.com/carpedm20/emoji/issues/268#issuecomment-1620444814 on how to add a single alias. So it would be easy to just load the JSON file from muan/emojilib and add aliases.
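
A rough sketch of that idea, run against a small stand-in dict rather than the real emoji.EMOJI_DATA (the linked comment covers what is needed for the real database). The JSON string, the stand-in data and the helper name add_aliases are assumptions for illustration. Keywords that are already used as a name or alias are skipped so that aliases stay unique; "first one wins" is just one possible policy.

```python
import json

# Stand-in for a few EMOJI_DATA entries; the real dict has many more keys.
data = {
    "🤗": {"en": ":smiling_face_with_open_hands:", "alias": [":hugging_face:", ":hugs:"]},
    "😊": {"en": ":smiling_face_with_smiling_eyes:"},
}

# Stand-in for the contents of an emoji -> keywords JSON file.
EMOJILIB_JSON = '{"🤗": ["hug", "hugs"], "😊": ["blush", "hug"]}'


def add_aliases(data, keywords_by_emoji):
    """Add ':keyword:' aliases in place; return the (emoji, alias) pairs skipped."""
    # Collect every name and alias that is already in use.
    taken = set()
    for entry in data.values():
        taken.add(entry["en"])
        taken.update(entry.get("alias", []))

    skipped = []
    for emj, keywords in keywords_by_emoji.items():
        if emj not in data:
            continue  # unknown emoji, nothing to extend
        for keyword in keywords:
            alias = f":{keyword}:"
            if alias in taken:
                skipped.append((emj, alias))
                continue
            data[emj].setdefault("alias", []).append(alias)
            taken.add(alias)
    return skipped


skipped = add_aliases(data, json.loads(EMOJILIB_JSON))
print(data["🤗"]["alias"])
print(skipped)
```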

Regarding demojize, there is also the function replace_emoji which can be used to do what you want:

import emoji

def repl(emj, emj_data):
    # Start with the English name, then append any aliases
    name_list = [emj_data['en']]
    if 'alias' in emj_data:
        name_list += emj_data['alias']
    # Here you could also add aliases from muan/emojilib
    # just look up `emj` in their json data
    return "_".join(name_list)

print(emoji.replace_emoji('Test 🤗', replace=repl))
# Outputs: Test :smiling_face_with_open_hands:_:hugging_face:_:hugs:

# In the repl function:
# emj = "🤗"
# emj_data = {
#   'match_start': 5,
#   'match_end': 6,
#   'en': ':smiling_face_with_open_hands:',
#   'status': 2,
#   'E': 1,
#   'alias': [':hugging_face:', ':hugs:'],
#   'de': ':gesicht_mit_umarmenden_händen:',
#   'es': ':cara_con_manos_abrazando:',
#   ...}

FredHappyface commented 8 months ago

Oh awesome stuff! Thank you so much for that! I guess the remaining question is would you like me to open a pr to merge any of the other aliases? I see myself using this functionality in a couple of downstream repos and it seems a bit silly to write a wrapper library to include this if it would be useful here too?

Also as a general question, how come the data is directly in python? I'm assuming this has a performance benefit?

Thanks again for your help :)

cvzi commented 8 months ago

I guess the remaining question is would you like me to open a pr to merge any of the other aliases? I see myself using this functionality in a couple of downstream repos and it seems a bit silly to write a wrapper library to include this if it would be useful here too?

Did you already do anything or is it just a plan at this point?

I am not so sure it is feasible. As I said, the aliases need to be unique, one alias can only belong to one emoji. For each alias that has multiple meanings in muan/emojilib you would have to decide to which emoji it should belong. Presumably you would have to do this manually for each emoji.

Also as a general question, how come the data is directly in python? I'm assuming this has a performance benefit?

It was already directly in Python when I started contributing to this project and the original developers are no longer contributing, so I don't know. There is a performance benefit compared to a JSON file, but it is not that big a difference (at least with newer Python versions). I am thinking about moving to JSON and also splitting the file into several smaller files. I recently did a comparison between the python-dict and JSON: https://github.com/carpedm20/emoji/issues/280#issuecomment-1950421563

cvzi commented 8 months ago

FYI there is a proposed major change in keywords in muan/emojilib, see https://github.com/muan/emojilib/issues/194 and https://github.com/muan/emojilib/pull/226

I am thinking it might be better to include muan/emojilib keywords as a separate entry as keywords and not in alias in the EMOJI_DATA. For example for 🤗:

    '\U0001F917': {
        'en': ':smiling_face_with_open_hands:',
        'status': fully_qualified,
        'E': 1,
        'keywords': ['hugging_face', 'face', 'smile', 'hug'],
        'alias': [':hugging_face:', ':hugs:'],
        'de': ':gesicht_mit_umarmenden_händen:',
        'es': ':cara_con_manos_abrazando:',
        ...
    },

And then add a function to retrieve them (get_aliases, as you suggested, or something like that) and a function to search by keyword. This would widen the functionality of this library to searching for emoji, but it wouldn't interfere with the original functionality, i.e. emojize()/demojize().
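
Roughly what I have in mind, as a sketch on a small stand-in dict instead of the real EMOJI_DATA. The function names get_keywords and search_by_keyword are placeholders, and the keywords for 😊 are invented for the example; none of this exists in the library yet.

```python
# Stand-in for EMOJI_DATA with the proposed 'keywords' entry.
data = {
    "🤗": {
        "en": ":smiling_face_with_open_hands:",
        "alias": [":hugging_face:", ":hugs:"],
        "keywords": ["hugging_face", "face", "smile", "hug"],
    },
    "😊": {
        "en": ":smiling_face_with_smiling_eyes:",
        "keywords": ["face", "smile", "blush"],  # invented sample keywords
    },
}


def get_keywords(emj):
    """Return the keywords stored for one emoji (empty list if there are none)."""
    return list(data.get(emj, {}).get("keywords", []))


def search_by_keyword(keyword):
    """Return every emoji whose keyword list contains `keyword`."""
    return [emj for emj, entry in data.items() if keyword in entry.get("keywords", [])]


print(get_keywords("🤗"))
print(search_by_keyword("smile"))
```

Because keywords live in their own entry, they are allowed to be ambiguous: a search can return several emoji, while 'alias' stays unique.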

Btw we also use a script to update the emoji and aliases. The aliases specifically are added here: https://github.com/carpedm20/emoji/blob/ceddc11675be53eb1a8907dc6cb2fc0fd214f548/utils/get_codes_from_unicode_emoji_data_files.py#L586-L601 Adding a new entry keywords to this script would be simple.

FredHappyface commented 3 months ago

Hi thanks for getting back to me on this and apologies for the silence for a few months

Did you already do anything or is it just a plan at this point?

I've not had the opportunity to look at this, unfortunately. I'm happy to help however I can though, as I should have some more free time now, but I appreciate this may be too little too late.

It was already directly in Python when I started contributing to this project and the original developers are no longer contributing, so I don't know. There is a performance benefit compared to a JSON file, but it is not that big a difference

Makes sense tbh, and yeah I wonder if that'll help with some of the maintenance? But yeah, the perf improvements in Python have helped a lot. I guess one option is a CI/CD step which glues together a load of JSON files and wraps them in some Python for the best of both worlds?

I am thinking it might be better to include muan/emojilib keywords as a separate entry as keywords and not in alias in the EMOJI_DATA

Yeah I think that makes a lot of sense actually, that way it's less likely to break existing users too, which is always a bonus!

Thanks again :)

cvzi commented 2 months ago

I forgot about this project. I had started implementing a JSON solution in April: one JSON file for each language, and the possibility to extend it with custom data like this emojilib. As far as I remember my implementation was almost ready. I'll try to find some time in the next few weeks for a pull request.

cvzi commented 2 months ago

I just realized that extending the database with custom data is not as simple as I thought. The problem is that at the moment the database is global: if you add custom data, this will not just affect your own code but also any third-party library that depends on the emoji library.

I guess the solution is to use a class/object to keep separate databases, something like this (pseudo code):

emoji_config = emoji.new_emoji_instance()     # Create a new copy of the database
emoji_config.extend_database(custom_aliases)  # Modify the new database
emoji_config.emojize(':a_custom_alias:')      # Use emojize/demojize with the new database
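
To make the pseudo code a bit more concrete, here is a self-contained sketch of the idea. It does not use the emoji library at all: GLOBAL_DATA is a stand-in for the shared database, and the class name EmojiConfig and its simplified emojize are assumptions for illustration, not an existing or planned API.

```python
import copy
import re

# Stand-in for the shared, module-level database.
GLOBAL_DATA = {
    "🤗": {"en": ":smiling_face_with_open_hands:", "alias": [":hugging_face:", ":hugs:"]},
}


class EmojiConfig:
    """Holds a private copy of the database so that changes stay local."""

    def __init__(self, data=GLOBAL_DATA):
        # Deep copy: nested alias lists must not be shared with the global dict.
        self._data = copy.deepcopy(data)

    def extend_database(self, custom_aliases):
        """custom_aliases maps emoji -> list of ':alias:' strings."""
        for emj, aliases in custom_aliases.items():
            self._data.setdefault(emj, {}).setdefault("alias", []).extend(aliases)

    def emojize(self, text):
        """Replace every known ':name:' or ':alias:' in text with its emoji."""
        lookup = {}
        for emj, entry in self._data.items():
            for name in [entry.get("en"), *entry.get("alias", [])]:
                if name:
                    lookup[name] = emj
        # Unknown ':codes:' are left untouched.
        return re.sub(r":[^:\s]+:", lambda m: lookup.get(m.group(0), m.group(0)), text)


config = EmojiConfig()
config.extend_database({"🤗": [":hug:"]})
print(config.emojize("Test :hug:"))         # the private copy knows the new alias
print(EmojiConfig().emojize("Test :hug:"))  # a fresh copy does not
```

The custom alias only exists inside config, so a third-party library that works with the global database would not see it.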

carpedm20 / emoji

Feature: Use muan/emojilib dataset #286
