FredHappyface opened this issue 8 months ago
I think the aliases listed in muan/emojilib are too broad for this project. For example, search for ":D" in muan/emojilib and you will see it refers to three different emoji, but this library requires unique aliases.
I don't know if you saw this, but our database is actually just a Python dict. We recently had to compress the dict into a single line, which makes it unreadable, but you can look at an older version to see how everything is stored: https://raw.githubusercontent.com/carpedm20/emoji/f14ece8475a1f2323326a4b850a209509310e470/emoji/unicode_codes/data_dict.py (look for the key 'alias').
Extending the dict with custom aliases is possible at runtime. See https://github.com/carpedm20/emoji/issues/268#issuecomment-1620444814 for how to add a single alias. So it would be easy to just load the JSON file from muan/emojilib and add aliases.
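For illustration, a minimal sketch of that runtime approach (the custom alias name here is made up; keep in mind that aliases must stay unique across all emoji):

import emoji

# Extend the entry for 🤗 in the global EMOJI_DATA dict with one extra alias.
# Loading the JSON from muan/emojilib and looping over it would work the same way.
emoji.EMOJI_DATA['\U0001F917'].setdefault('alias', []).append(':my_hug_alias:')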
Regarding demojize, there is also the function replace_emoji
which can be used to do what you want:
import emoji

def repl(emj, emj_data):
    name_list = [emj_data['en']]
    if 'alias' in emj_data:
        name_list += emj_data['alias']
    # Here you could also add aliases from muan/emojilib,
    # just look up `emj` in their JSON data
    return "_".join(name_list)

print(emoji.replace_emoji('Test 🤗', replace=repl))
# Outputs: Test :smiling_face_with_open_hands:_:hugging_face:_:hugs:

# In the repl function:
# emj = "🤗"
# emj_data = {
#     'match_start': 5,
#     'match_end': 6,
#     'en': ':smiling_face_with_open_hands:',
#     'status': 2,
#     'E': 1,
#     'alias': [':hugging_face:', ':hugs:'],
#     'de': ':gesicht_mit_umarmenden_händen:',
#     'es': ':cara_con_manos_abrazando:',
#     ...}
Oh awesome stuff! Thank you so much for that! I guess the remaining question is would you like me to open a pr to merge any of the other aliases? I see myself using this functionality in a couple of downstream repos and it seems a bit silly to write a wrapper library to include this if it would be useful here too?
Also as a general question, how come the data is directly in python? I'm assuming this has a performance benefit?
Thanks again for your help :)
I guess the remaining question is would you like me to open a pr to merge any of the other aliases? I see myself using this functionality in a couple of downstream repos and it seems a bit silly to write a wrapper library to include this if it would be useful here too?
Did you already do anything or is it just a plan at this point?
I am not so sure it is feasible. As I said, the aliases need to be unique: one alias can only belong to one emoji. For each alias that has multiple meanings in muan/emojilib, you would have to decide which emoji it should belong to, presumably manually.
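To get an idea of the scale of that problem, something along these lines could list the ambiguous keywords (assuming muan/emojilib's emoji-en-US.json, which maps each emoji character to a list of keywords; the exact file name depends on the emojilib version):

import json
from collections import defaultdict

# Count how many emoji each emojilib keyword is attached to; any keyword
# owned by more than one emoji needs a manual decision before it could
# become a unique alias here.
with open('emoji-en-US.json', encoding='utf-8') as f:
    emojilib = json.load(f)

owners = defaultdict(set)
for emj, keywords in emojilib.items():
    for keyword in keywords:
        owners[keyword].add(emj)

ambiguous = {kw: emjs for kw, emjs in owners.items() if len(emjs) > 1}
print(len(ambiguous), 'keywords map to more than one emoji')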
Also as a general question, how come the data is directly in python? I'm assuming this has a performance benefit?
It was already directly in Python when I started contributing to this project and the original developers are no longer contributing, so I don't know. There is a performance benefit compared to a JSON file, but it is not that big a difference (at least with newer Python versions). I am thinking about moving to JSON and also splitting the file into several smaller files. I recently did a comparison between the python-dict and JSON: https://github.com/carpedm20/emoji/issues/280#issuecomment-1950421563
FYI there is a proposed major change in keywords in muan/emojilib, see https://github.com/muan/emojilib/issues/194 and https://github.com/muan/emojilib/pull/226
I am thinking it might be better to include muan/emojilib keywords in EMOJI_DATA as a separate entry, keywords, and not in alias. For example for 🤗:
'\U0001F917': {
    'en': ':smiling_face_with_open_hands:',
    'status': fully_qualified,
    'E': 1,
    'keywords': ['hugging_face', 'face', 'smile', 'hug'],
    'alias': [':hugging_face:', ':hugs:'],
    'de': ':gesicht_mit_umarmenden_händen:',
    'es': ':cara_con_manos_abrazando:',
    ...
},
And then add a function to retrieve them - as you suggested, get_aliases or something like that - and a function to search by keyword.
This would widen the functionality of this library to searching for emoji, but it wouldn't interfere with the original functionality, i.e. emojize()/demojize().
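Roughly what I have in mind for those two helpers - just a sketch, assuming the proposed keywords entry above; neither function exists in the library today:

from emoji import EMOJI_DATA

def get_aliases(emj):
    # Return the aliases stored for an emoji, or an empty list
    return EMOJI_DATA.get(emj, {}).get('alias', [])

def search_by_keyword(keyword):
    # Return all emoji whose (proposed) 'keywords' entry contains the keyword
    return [e for e, data in EMOJI_DATA.items() if keyword in data.get('keywords', [])]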
Btw we also use a script to update the emoji and aliases. The aliases specifically are added here:
https://github.com/carpedm20/emoji/blob/ceddc11675be53eb1a8907dc6cb2fc0fd214f548/utils/get_codes_from_unicode_emoji_data_files.py#L586-L601
Adding a new keywords entry to this script would be simple.
Hi, thanks for getting back to me on this, and apologies for the silence over the past few months.
Did you already do anything or is it just a plan at this point?
I've not had the opportunity to look at this unfortunately. Happy to help however I can though - appreciate this may be too little too late - as I should have some more free time now.
It was already directly in Python when I started contributing to this project and the original developers are no longer contributing, so I don't know. There is a performance benefit compared to a JSON file, but it is not that big a difference
Makes sense tbh, and yeah, I wonder if that'll help with some of the maintenance burden? But yeah, the perf improvements in Python have helped a lot. I guess one option is a CI/CD step which glues together a load of JSON files and wraps them in some Python, for the best of both worlds?
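Something like this is what I'm picturing for the glue step (all paths and names made up, just to illustrate the idea):

import json
import pathlib

# Build step: merge per-language JSON files into one generated Python module,
# so the runtime still imports a plain dict.
merged = {}
for path in sorted(pathlib.Path('data').glob('emoji-*.json')):
    merged[path.stem] = json.loads(path.read_text(encoding='utf-8'))

generated = 'GENERATED_EMOJI_DATA = ' + repr(merged) + '\n'
pathlib.Path('emoji/_generated_data.py').write_text(generated, encoding='utf-8')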
I am thinking it might be better to include muan/emojilib keywords in EMOJI_DATA as a separate entry, keywords, and not in alias
Yeah, I think that makes a lot of sense actually - that way it's less likely to break existing users too, which is always a bonus!
Thanks again :)
I had forgotten about this project - I started implementing a JSON solution in April: one JSON file for each language, plus the possibility to extend it with custom data like this emojilib data. As far as I remember my implementation was almost ready, so I'll try to find some time in the next weeks for a pull request.
I just realized that extending the database with custom data is not as simple as I thought. The problem is that at the moment the database is global: if you add custom data, this will not just affect your own code but also any third-party library that depends on the emoji library.
I guess the solution is to use a class/object to keep separate databases, something like this (pseudo code):
emoji_config = emoji.new_emoji_instance() # Create a new copy of the database
emoji_config.extend_database(custom_aliases) # Modify the new database
emoji_config.emojize(':a_custom_alias:') # Use emojize/demojize with the new database
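In slightly more concrete (but still entirely hypothetical) terms, such an instance could simply hold its own copy of the data:

import copy
import emoji

class EmojiInstance:
    # Hypothetical class, nothing like this exists in the library yet
    def __init__(self):
        # Private copy, so third-party code using the global EMOJI_DATA is unaffected
        self.data = copy.deepcopy(emoji.EMOJI_DATA)

    def extend_database(self, custom_aliases):
        # custom_aliases: mapping of emoji character -> list of extra aliases
        for emj, aliases in custom_aliases.items():
            self.data.setdefault(emj, {}).setdefault('alias', []).extend(aliases)

    # emojize()/demojize() would then need variants that read from self.data
    # instead of the module-level data.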
Background
I've fallen down a bit of a rabbit hole looking for ways to search emoji by plaintext, which often calls for the use of aliases. For example, '🤗' isn't called 'hug', but that's a useful alias (which this lib supports).
I installed Element on my phone a few months ago for discussion on another project and saw that it has really great search functionality for a ton of aliases that are not present in this lib, for example ':)' for '😊'. And started digging.
Basically, they use the following Python script during build to fetch the latest emoji and aliases from a few other sources: https://github.com/element-hq/element-android/blob/def2a8a83351c06cb65fdbd4d483ac811329b023/tools/import_emojis.py#L20
One of these is the dataset available from https://github.com/muan/emojilib, which seems really good for this.
The questions / feature request
Would you accept a PR to add the aliases from muan/emojilib to this project?
Also, I noticed that the demojize function only exposes the first alias if available, so I've written my own implementation for a lib that returns an underscore-separated string of keywords. Is such a function (maybe called get_aliases) something you'd accept a PR for?
Thanks for your time and for the awesome project