carpedm20 / emoji

emoji terminal output for Python

import takes ~30 seconds #280

Open Snawe opened 7 months ago

Snawe commented 7 months ago

Hi! I just upgraded my application to python 3.12.

Doing import emoji there takes around 30 seconds. Doing the same on Python 3.11 takes less than a second.

Any clue?

It is reproducible with a simple two-line script like:

import emoji
print(emoji.is_emoji("done"))

Using Windows at the moment, with emoji 2.9.0.

Something similar was already reported here: https://github.com/carpedm20/emoji/issues/274

cvzi commented 7 months ago

I have no idea at the moment.

If you have the time, could you check two things:

Does it also happen with just the single import line?

import emoji

And maybe test some older versions of the module, to see whether a recent change introduced this problem. For example these versions:

2.6.0
2.0.0
1.5.0

Do pip install emoji==2.6.0 to install a specific version.


If you are interested in trying it, I could create a test version for you that only has English language (or whatever languages you need) as suggested in the other issue. That would presumably reduce memory usage overall and reduce start-up time.

Snawe commented 7 months ago

Does it also happen with just the single import line?

yes

regarding emoji versions:

2.9.0 => 34 sec
2.6.0 => 24 sec
2.0.0 => 13 sec
1.5.0 => 10 sec

But I just found out something really strange... As I already mentioned in the beginning, this only happens on Python 3.12; with Python 3.11 it does not. What I found out now is that it only happens if I run it via VS Code. So if I hit F5 in VS Code, I get the times above. If I run it directly on the command line, I still see that the time has doubled (more emojis, I think), but the times are:

2.9.0 => ~0.06 sec
2.0.0 => ~0.03 sec

I still think it has something to do with the emoji library, since no other import takes ~500 times longer, but yeah...

Just to complete the list, with python 3.11 via vs code:

2.9.0 => 0.03 sec
2.6.0 => 0.03 sec
2.0.0 => 0.03 sec
1.5.0 => 0.03 sec

From the command line:

2.9.0 => 0.03 sec
2.6.0 => 0.03 sec
2.0.0 => 0.03 sec
1.5.0 => 0.03 sec

cvzi commented 7 months ago

Wow, thanks for the details! My first guess would be that VS Code attaches a debugger or something similar, and that somehow changed between Python 3.11 and 3.12.

I will look into it.

cvzi commented 7 months ago

I can reproduce it with Python 3.12 on Windows 10 when running with F5 in VS Code. If you run it without the debugger (Ctrl+F5), it doesn't happen. (Also, I see no problems with Python 3.11.)

VS Code runs a file like this, when you press F5: cmd /C "C:\Python312\python.exe c:\Users\cuzi\.vscode\extensions\ms-python.python-2023.22.1\pythonFiles\lib\python\debugpy\adapter/../..\debugpy\launcher 64938 -- C:\Users\cuzi\Desktop\mytest.py "

I have created an issue at debugpy; maybe they know why this happens: https://github.com/microsoft/debugpy/issues/1496

cvzi commented 7 months ago

The problem seems to be the large number of lines in the file data_dict.py (currently ~87k lines).

The dictionary in the file can be compressed into a single line:

EMOJI_DATA = {
        '\U0001F947': {
        'en': ':1st_place_medal:',
        'status':
        ...

-->

EMOJI_DATA={'\U0001F947':{'en':':1st_place_medal:','status': ...

resulting in a file with just 46 lines. With the compressed file, debugging runs as fast as in Python 3.11.
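
The compression step can be sketched roughly like this (this is not the actual script from the commit, and it is simplified: the real data_dict.py references constants such as fully_qualified, whereas ast.literal_eval only handles pure literals):

```python
# Sketch: compress a pretty-printed Python dict literal into a single
# line by parsing it safely and re-serializing it with repr().
import ast

pretty = """{
    '\\U0001F947': {
        'en': ':1st_place_medal:',
        'status': 2,
    },
}"""

# Parse the literal, then emit it back as one line of source code.
data = ast.literal_eval(pretty)
one_line = "EMOJI_DATA = " + repr(data)

print(one_line)
```

Since repr() of a dict never contains newlines, the resulting assignment fits on one source line while evaluating to the same dictionary.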

TahirJalilov commented 7 months ago

@cvzi maybe it's time to think about separating the languages into different files, like it was before. What do you think?

Snawe commented 7 months ago

oh wow! Thank you very much for looking into it! :)

cvzi commented 7 months ago

@cvzi maybe it's time to think about separating the languages into different files, like it was before. What do you think?

I agree.

Not sure it will help enough with this problem though, because the dictionary would still be huge. It takes 4 minutes on my computer at the moment. Even if it cut the time to 10%, it would still take about 25 seconds, far too long.

Putting the dictionary into a single line is obviously really ugly. But it would be a quick fix.

I guess using a different file format, not Python code, could solve this debugging problem. For example, storing the dictionary in a JSON file and then loading the JSON file when the module is imported.
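
A minimal sketch of that idea (the file name and data layout here are illustrative, not the library's actual structure):

```python
# "Build" step writes EMOJI_DATA as compact JSON; the "import" step
# parses the JSON file instead of executing a huge Python literal, so
# the debugger never has to step through thousands of source lines.
import json
import os
import tempfile

sample = {"\U0001F947": {"en": ":1st_place_medal:", "status": 2}}

# Build step: dump the data as compact JSON (no indentation, no spaces).
path = os.path.join(tempfile.mkdtemp(), "emoji_data.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False, separators=(",", ":"))

# Import step: parse the JSON file back into a dict.
with open(path, encoding="utf-8") as f:
    EMOJI_DATA = json.load(f)

print(EMOJI_DATA["\U0001F947"]["en"])
```

The trade-off is that json.load runs at every import instead of the dict being baked into the compiled .pyc, which is why the timings below compare the two approaches.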

lsmith77 commented 7 months ago

I guess right now there is no work-around when using Python 3.12?

lsmith77 commented 7 months ago

@cvzi based on 6fb1321323b046992e2317786235230cd2db8faf, are you releasing a workaround?

cvzi commented 7 months ago

I guess so. I am not really happy with putting the dict into a single line, but there seems to be no other quick workaround. And VS Code is one of the most used editors at the moment, and there are already about 2000 downloads/day of this library from Python 3.12 (according to PyPI stats).

I have deployed that commit in my own apps, and it seems to work, i.e. in a release environment, without debugging.

@lsmith77 any chance you could test whether it actually solves the problem with VS Code for you? It does solve it on my computer. You can install from my branch cvzi/one_line_dict like this: pip install https://github.com/cvzi/emoji/archive/one_line_dict.zip and then just create a Python file with import emoji and run it in VS Code with debugging, i.e. F5.

cvzi commented 7 months ago

BTW, for reference: I also tried putting each "sub-dict" (each emoji) on its own line instead of everything on one line:

EMOJI_DATA = {
    '\U0001F947': {'en': ':1st_place_medal:','status': fully_qualified,'E': 3,'de': ':goldmedaille:','es': ':medalla_de_oro:','fr': ':médaille_d’or:','ja': ':金メダル:','ko': ':금메달:','pt': ':medalha_de_ouro:','it': ':medaglia_d’oro:','fa': ':مدال_طلا:','id': ':medali_emas:','zh': ':金牌:','ru': ':золотая_медаль:','tr': ':birincilik_madalyası:','ar': ':ميدالية_مركز_أول:'},
    '\U0001F948': {'en': ':2nd_place_medal:','status': fully_qualified,'E': 3,'de': ':silbermedaille:','es': ':medalla_de_plata:','fr': ':médaille_d’argent:','ja': ':銀メダル:','ko': ':은메달:','pt': ':medalha_de_prata:','it': ':medaglia_d’argento:','fa': ':مدال_نقره:','id': ':medali_perak:','zh': ':银牌:','ru': ':серебряная_медаль:','tr': ':ikincilik_madalyası:','ar': ':ميدالية_مركز_ثان:'},
    '\U0001F949': {'en': ':3rd_place_medal:','status': fully_qualified,'E': 3,'de': ':bronzemedaille:','es': ':medalla_de_bronce:','fr': ':médaille_de_bronze:','ja': ':銅メダル:','ko': ':동메달:','pt': ':medalha_de_bronze:','it': ':medaglia_di_bronzo:','fa': ':مدال_برنز:','id': ':medali_perunggu:','zh': ':铜牌:','ru': ':бронзовая_медаль:','tr': ':üçüncülük_madalyası:','ar': ':ميدالية_مركز_ثالث:'},
    ...

That reduces the import time (as expected) but it still takes too long, about 15 seconds on my computer.

lsmith77 commented 7 months ago

Sorry, I didn't get to it today; I will try tomorrow morning.

lsmith77 commented 7 months ago

I tried my best, but I am stuck in virtualenv hell here. pip needs to be updated to even install the package, and I am somehow unable to figure out how to get pip to both upgrade and then actually use 3.12.

Anyway, using pdm I got it to work: nice and fast (first execution) and then slow using the official package (second execution):

[screenshot: timing of the two executions]

So overall I can confirm your workaround does what it is supposed to.

cvzi commented 7 months ago

@lsmith77 Thanks for checking!

lsmith77 commented 7 months ago

thank you for this package and caring about reports such as this one!

cvzi commented 6 months ago

I did some performance tests to check the feasibility of JSON compared to the Python dictionary literal. Below are the import times for different methods of loading the dictionary. My conclusion is that JSON could be used, and it would be viable to split the languages into separate JSON files.

Method => import time:

Python dict, pretty-printed, human-readable (before this bugfix) => 0.16004 s
One-line Python dict (current master branch) => 0.15565 s
JSON file, pretty-printed, human-readable => 0.22966 s
Compressed JSON file, one line, no spaces => 0.19430 s
Split JSON files, pretty-printed, English and metadata loaded from one file, all other languages removed => 0.15470 s
Split JSON files, first load English and metadata (as above), then ONE other language from another JSON file => 0.19083 s

Command to test this:

perf stat -r 10 -B python -c "import emoji; emoji.emojize(':lion:')"

where 10 is the number of repeats (it should be much higher for good average results).

cvzi commented 5 months ago

I am going to continue in this thread with this JSON idea, please unsubscribe if you're not interested. Any feedback or suggestions are appreciated though :)

I am thinking about making a main JSON file that has the metadata and English/aliases and a file for each language.

Main file:

{
  "🗺️": {
    "E": 0.7,
    "en": ":world_map:",
    "status": 2,
    "variant": true
  },
  "🗻": {
    "E": 0.6,
    "en": ":mount_fuji:",
    "status": 2
  },
  "🗼": {
    "E": 0.6,
    "alias": [
      ":tokyo_tower:"
    ],
    "en": ":Tokyo_tower:",
    "status": 2
  },
  ...
}

A language file would look like this, e.g. Spanish:

{
  "🗺️": ":mapa_mundial:",
  "🗻": ":monte_fuji:",
  "🗼": ":torre_de_tokio:",
  ...
}

The main file would be loaded when importing the module. The language file would only be loaded when the language is used with d/emojize(str, language='es'). It would be loaded into EMOJI_DATA and the EMOJI_DATA dict would have the same structure as before.

It does mean that the EMOJI_DATA dict is incomplete after importing the module, because all the languages are missing.

This reduces memory usage by roughly half if only one language is used. Import time with only English is slightly faster (about 10%); import time with one other-than-English language is slightly slower (about 10%).
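
The lazy loading could be sketched like this (function name and in-memory stand-ins for the JSON files are hypothetical; a real version would open the per-language files from disk):

```python
# Sketch of lazy language loading: the main data holds metadata and
# English, and a language is merged into EMOJI_DATA on first use.
import json

# Stand-ins for the main file and the Spanish language file, inlined
# here so the example is self-contained ("\ud83d\uddfb" is the JSON
# surrogate-pair escape for the mount fuji emoji U+1F5FB).
MAIN_JSON = '{"\\ud83d\\uddfb": {"E": 0.6, "en": ":mount_fuji:", "status": 2}}'
ES_JSON = '{"\\ud83d\\uddfb": ":monte_fuji:"}'

EMOJI_DATA = json.loads(MAIN_JSON)  # loaded at import time
_loaded_languages = {"en"}

def load_language(lang: str) -> None:
    """Merge a language file into EMOJI_DATA on first use."""
    if lang in _loaded_languages:
        return
    # A real implementation would read the file for `lang` here;
    # this sketch only has the Spanish stand-in.
    translations = json.loads(ES_JSON)
    for emj, name in translations.items():
        EMOJI_DATA[emj][lang] = name
    _loaded_languages.add(lang)

load_language("es")
print(EMOJI_DATA["\U0001F5FB"]["es"])
```

After the merge, EMOJI_DATA has the same per-emoji structure as before, which is what keeps the change mostly backwards compatible.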

Advantages:

Disadvantages:

So this would be a breaking change, but I don't think it would affect many people. I searched on GitHub and couldn't find a public repository that directly uses something like EMOJI_DATA['🗺️']['fr']. (There are a few repositories that use English, i.e. EMOJI_DATA['🗺️']['en'], but that would still work.)

lovetox commented 4 months ago

I think this makes sense, and I see no other option: there are simply a lot of languages, and most applications need exactly one. Loading them on demand seems the right decision.

EDIT: Question about your performance test methodology: does your command not also include starting the whole Python interpreter? That would only be relevant for someone who uses this lib standalone.

For most projects, this will be just one of many dependencies.

I did a quick test with 2.11.0:

def load():
    import emoji
    emoji.emojize(':lion:')

if __name__ == '__main__':
    import timeit
    res = timeit.timeit("load()",
                        setup="from __main__ import load",
                        number=1)
    print(res)

and this gives me a load time for just the lib of around 0.030 s on my machine.

As the import statement is only executed once, even on repeats, raising the number of repetitions does not yield interesting data.

cvzi commented 4 months ago

Yes, my times include loading the Python interpreter. It doesn't really matter, because I am only interested in the relative changes. Measuring a single import is not really robust, for example because some other process could be using the CPU at the same time as the test.

It is possible to load the module multiple times in Python, but it is a bit hacky:

import sys

def load():
    import emoji
    emoji.emojize(':lion:')

    # remove the emoji modules from the loaded modules
    for name in [name for name in sys.modules if "emoji" in name]:
        del sys.modules[name]

if __name__ == '__main__':
    import timeit
    res = timeit.timeit("load()",
                        setup="from __main__ import load",
                        number=100)
    print(res)

cvzi commented 3 months ago

FYI: compressing the dict into a single line has caused coverage to break on Python 3.12.3: https://github.com/nedbat/coveragepy/issues/1785

Edit: coverage should be fixed with Python 3.12.4.