carpedm20 / emoji

emoji terminal output for Python

Data files should probably be in a separate package. #251

Open · KOLANICH opened 1 year ago

KOLANICH commented 1 year ago

And they should also not be in Python code, but in JSON/CSV/TSV.

cvzi commented 1 year ago

Why?

KOLANICH commented 1 year ago

To update them separately and fully automatically, for example with a CI pipeline run by cron.

cvzi commented 1 year ago

Currently there is only one data file, emoji/unicode_codes/data_dict.py. It is generated with the script in utils/get_codes_from_unicode_emoji_data_files.py, so it could be automated already. However, I like to check the changed entries: the data is created manually by the Unicode contributors, so it is susceptible to errors or unexpected data.

Also the Unicode data is only updated twice per year, so at the moment it's not that much work to do it manually.
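
For reference, a rough sketch of what such a scheduled job could run. The generator script path is the one mentioned above; invoking it like this and reviewing the diff by hand (instead of committing automatically) is an assumption, not an existing workflow:

import subprocess

# Re-run the existing generator script over the latest Unicode data files
subprocess.run(['python', 'utils/get_codes_from_unicode_emoji_data_files.py'], check=True)

# Surface the resulting changes for manual review instead of committing blindly
result = subprocess.run(
    ['git', 'diff', '--stat', 'emoji/unicode_codes/data_dict.py'],
    capture_output=True, text=True, check=True,
)
print(result.stdout or 'No changes in the Unicode data.')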

KOLANICH commented 1 year ago

I have created a tool that merges data from different sources: https://github.com/KOLANICH-tools/emojiSlugMappingGen.py

cvzi commented 1 year ago

In light of pull request #252 adding Japanese & Korean, and the recently added languages Chinese & Indonesian, I think the more important issue with the single data file is memory consumption. The import emoji statement currently loads the whole data file (as a big dictionary) into memory. I haven't measured it recently, but I expect it is about 1-2 MB of memory per language. With every new language the memory consumption will grow. Probably most users only ever use one language, though.
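
A quick way to check that estimate (a measurement sketch, not part of the library):

import tracemalloc

tracemalloc.start()
import emoji  # imported after tracemalloc.start() on purpose, so the data load is traced

current, peak = tracemalloc.get_traced_memory()
print(f'current: {current / 2**20:.1f} MiB, peak: {peak / 2**20:.1f} MiB')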

KOLANICH commented 1 year ago

@cvzi, https://github.com/KOLANICH-tools/emojifilt.cpp uses libintl .mo files generated by https://github.com/KOLANICH-tools/emojiSlugMappingGen.py. Currently it reads the whole files into memory: it relies on a library that parses .mo files with code compiled from a Kaitai Struct spec (the public libintl API is limited, so I found it easier to have my own library), and Kaitai Struct currently cannot generate code that relies on memory mappings (we can use memory mappings via the std::stream interface, but that is not enough for files larger than RAM; for such files the raw structures need to be laid out directly over the mapped memory). In theory, the .mo format is well suited to being used through a memory mapping.

cvzi commented 1 year ago

I think something like that would be overkill for this library. I don't think it needs to be that efficient.

My suggestion would be to ideally keep the current API of the library and still try to reduce memory usage a little bit.

People use the big dictionary directly at the moment (it is in the public API). I think it is also nice that you can just open the file in a text editor and look at the emoji. In that way it is kind of like JSON: a human can easily read or even edit it. It is simple to add custom slugs or even custom emoji.

Maybe the language data could be in separate files and could be loaded on request. Like so:

import emoji

print(emoji.EMOJI_DATA['🐕']['fr']) # would throw an error because fr data has not been loaded

emoji.load_languages(['fr', 'zh', 'ja'])
print(emoji.EMOJI_DATA['🐕']['fr']) # now it would work

If all the languages were in separate files, it would probably reduce memory usage by about 50% for a user who only uses one language. It would still be a breaking change to the API though, since the first access in the example above fails.
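
A minimal sketch of what such a load_languages() helper could look like, assuming hypothetical per-language modules named emoji.unicode_codes.data_<lang> that each expose a {emoji: slug} mapping; this is not the library's actual layout:

import importlib

import emoji

def load_languages(langs):
    for lang in langs:
        # Hypothetical per-language module, e.g. emoji.unicode_codes.data_fr,
        # assumed to expose DATA = {emoji: slug} for that one language.
        module = importlib.import_module(f'emoji.unicode_codes.data_{lang}')
        for emj, slug in module.DATA.items():
            emoji.EMOJI_DATA.setdefault(emj, {})[lang] = slug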

cvzi commented 1 year ago

Maybe we could use a class to emulate a dictionary with __getitem__(self, key). The __getitem__ could load the necessary language data from different files. That way there would be no breaking changes to the API.

Currently it looks like this:

EMOJI_DATA = {
    u'\U0001F415': { # 🐕
        'en' : ':dog:',
        'status' : fully_qualified,
        'E' : 0.7,
        'alias' : [':dog2:'],
        'variant': True,
        'de': ':hund:',
        'es': ':perro:',
        'fr': ':chien:',
        'ja': u':イヌ:',
        'ko': u':개:',
        'pt': ':cachorro:',
        'it': ':cane:',
        'fa': u':سگ:',
        'id': ':anjing:',
        'zh': u':狗:'
    },
    ...
}

Maybe the inner dictionaries could be objects instead and the language data could be in separate files:


EMOJI_DATA = {
    u'\U0001F415': ClassLikeADictionary({ # 🐕
        'en' : ':dog:',
        'status' : fully_qualified,
        'E' : 0.7,
        'alias' : [':dog2:'],
        'variant': True
    }),
    ...
}

class ClassLikeADictionary:
    def __getitem__(self, key):
        # Load language data if it is not loaded yet
        if languageIsNotLoaded(key):
            loadLanguageFromDataFile(key)
        return valueFor(key)
    ...

So you could still access it with EMOJI_DATA['🐕']['fr']
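
A more complete, self-contained sketch of that lazy-loading wrapper, assuming the extra languages sit in hypothetical JSON files with a {emoji: slug} layout; none of these file names exist in the repository:

import json

_LANGUAGE_FILES = {'fr': 'data_fr.json', 'zh': 'data_zh.json'}  # hypothetical per-language files
_loaded_languages = {'en'}  # English ships with the base data in this sketch

class LazyEmojiEntry(dict):
    """Behaves like the current inner dict, but loads a language on first access."""
    def __missing__(self, key):
        if key in _LANGUAGE_FILES and key not in _loaded_languages:
            with open(_LANGUAGE_FILES[key], encoding='utf-8') as f:
                data = json.load(f)  # assumed layout: {emoji: slug} for this language
            for emj, slug in data.items():
                if emj in EMOJI_DATA:
                    EMOJI_DATA[emj][key] = slug  # fill every entry, not just this one
            _loaded_languages.add(key)
            return self[key]
        raise KeyError(key)

EMOJI_DATA = {
    '\U0001F415': LazyEmojiEntry({'en': ':dog:'}),  # 🐕
}

# EMOJI_DATA['🐕']['fr'] would now trigger loading data_fr.json on first use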

KOLANICH commented 1 year ago

If you don't want to use binary gettext .mo (it was chosen mostly because it contains a precomputed on-disk hash table, though most implementations don't bother to actually use the hash table), you have another option: a plain-text TSV file sorted lexicographically by its first column, plus some statistics to optimize lookup. It can be memory-mapped, and the mapping can then be navigated with a binary search over the file.

emoji.EMOJI_DATA['🐕']['fr']

I guess that since your implementation is going to spend memory on each opened file for each language,

emoji.EMOJI_DATA['fr']['🐕']

would make more sense, because it makes opening a new file more explicit.
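
A rough sketch of the memory-mapped TSV lookup described above, assuming a file with one "<emoji><TAB><slug>" line per entry, sorted lexicographically by the first column; the file name and layout are assumptions for illustration:

import mmap

def lookup(path, emoji_char):
    """Binary-search a sorted TSV (emoji<TAB>slug per line) through a memory mapping."""
    key = emoji_char.encode('utf-8')
    with open(path, 'rb') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        lo, hi = 0, len(mm)
        while lo < hi:
            mid = (lo + hi) // 2
            nl = mm.rfind(b'\n', lo, mid)
            start = lo if nl == -1 else nl + 1  # start of the line containing `mid`
            end = mm.find(b'\n', start, hi)
            if end == -1:
                end = hi
            line_key, _, value = mm[start:end].partition(b'\t')
            if line_key == key:
                return value.decode('utf-8')
            if line_key < key:
                lo = end + 1
            else:
                hi = start
    return None

# lookup('fr.tsv', '🐕') would return ':chien:' with the data from this thread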

cvzi commented 1 year ago

Thinking about the idea of updating the data files in CI, I had another idea that could make the memory usage smaller for the average user:

We could release several flavors of this package instead of just one: for example, the full package with all languages, plus smaller flavors that each contain only a single language.

I think this could be easy with CI/GitHub Actions. The main thing to do would be to remove all languages from EMOJI_DATA except the one language and then publish on PyPI.
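
A rough sketch of that "strip other languages" step, assuming a build script that rewrites EMOJI_DATA before packaging; the kept keys below are taken from the dictionary shown earlier in this thread:

# Keys that are not language codes, plus the one language to keep (here: English).
KEEP = {'en', 'status', 'E', 'alias', 'variant'}

def strip_languages(emoji_data, keep=KEEP):
    """Return a copy of EMOJI_DATA with all other languages removed."""
    return {
        emj: {k: v for k, v in entry.items() if k in keep}
        for emj, entry in emoji_data.items()
    }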

TahirJalilov commented 1 year ago

I think keeping the languages in separate files (like it was before), but within the one "emoji" project, is much better than creating separate projects on PyPI.