adbar / simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
https://adrien.barbaresi.eu/blog/simple-multilingual-lemmatizer-python.html
MIT License

Investigate other data structures to store language data #72

Open adbar opened 1 year ago

adbar commented 1 year ago

So far language data are stored as dictionaries, i.e. key/value pairs of words and ground forms. Python dictionaries are quite efficient in terms of speed and rather efficient space-wise.

However, there might be other data structures around that allow for better data compression without impeding speed. Slightly slower code execution would even be acceptable if the amount of necessary RAM decreases.

I thought about using a trie or a graph-like structure (e.g. a DAG), but I haven't found anything that would be directly usable and beneficial. The Pytries group offers interesting options, but they would require using a C library, which is a maintenance issue.

At best the data structure could be integrated directly into Simplemma, e.g. a vanilla implementation of a trie in pure Python.
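As a rough illustration of what such a vanilla pure-Python trie could look like (a minimal, unoptimized sketch; real memory savings would need further tricks such as node interning):

```python
# Minimal sketch of a pure-Python trie mapping word forms to lemmas.
class TrieNode:
    __slots__ = ("children", "lemma")

    def __init__(self):
        self.children = {}
        self.lemma = None


class LemmaTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word, lemma):
        node = self.root
        for char in word:
            node = node.children.setdefault(char, TrieNode())
        node.lemma = lemma

    def get(self, word):
        node = self.root
        for char in word:
            node = node.children.get(char)
            if node is None:
                return None
        return node.lemma


# trie = LemmaTrie(); trie.add("geht", "gehen"); trie.get("geht")  # -> "gehen"
```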

adbar commented 8 months ago

Additional note: dictionaries can be more compact if keys and values are bytes instead of str. This would be a first step to decrease memory footprint.
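A minimal sketch of that first step, assuming the data currently is a plain str-to-str dict:

```python
def to_bytes_dict(word_to_lemma: dict) -> dict:
    """Re-encode a str -> str lemmatization dict as bytes -> bytes (UTF-8)."""
    return {
        word.encode("utf-8"): lemma.encode("utf-8")
        for word, lemma in word_to_lemma.items()
    }

# Lookups then work on encoded tokens:
# bytes_data[token.encode("utf-8")].decode("utf-8")
```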

Dunedan commented 6 months ago

I have the use case that I want to detect the language of chat messages in real time and do lemmatization on them. To do so I obviously have to keep the dictionaries for multiple languages loaded, but even when I limit it to the most common ones, the memory consumption of the dictionaries makes it unfeasible.

Therefore, I started looking into more efficient ways to keep the data in memory and ended up looking into tries as well. To get a feeling for what is possible, I looked for a mature trie implementation and ended up at marisa-trie, as that library checks most boxes and is one of the very few Python trie libraries that is actively maintained.

I then replaced the use of dicts for the language data with a BytesTrie, while keeping pickle and the LZMA compression, and looked at how that affected sizes and loading times. The changes I made to simplemma are available here: https://github.com/Dunedan/simplemma/commit/2bdc2d2e50e38ddf3acfe5a0764faf82bda2817c
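The core of that replacement boils down to building a marisa_trie.BytesTrie from (str key, bytes value) pairs; roughly like this (a simplified sketch, not the exact code from the commit):

```python
import marisa_trie

def dict_to_bytestrie(word_to_lemma: dict) -> marisa_trie.BytesTrie:
    """Build a BytesTrie from a str -> str dict; the values have to be bytes."""
    return marisa_trie.BytesTrie(
        (word, lemma.encode("utf-8")) for word, lemma in word_to_lemma.items()
    )

# A lookup returns a list of byte values for the key:
# dict_to_bytestrie({"geht": "gehen"})["geht"][0]  # -> b"gehen"
```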

| | dict | MARISA-tries with DEFAULT_NUM_TRIES | MARISA-tries with MIN_NUM_TRIES |
|---|---|---|---|
| Sum of file sizes (LZMA compressed) | 69.8MB | 75.0MB | 77.6MB |
| Sum of file sizes (uncompressed) | 687.7MB | 191.0MB | 221.7MB |
| Memory consumption (all data) | ~4200MB | ~190MB | ~285MB |
| Loading time from compressed files (all data) | ~~168s~~ ~16.8s | ~~40s~~ ~4.0s | ~~43s~~ ~4.3s |

For the memory consumption I made sure that all dictionaries and not only the 8 most recent ones are kept in memory.

While the compressed files are slightly larger than the pickled dicts, I find the improvements in memory consumption and loading time very impressive.

Next I took all treebank texts from Universal Dependencies and benchmarked the lookup performance with them. When doing that, I ensured that the language dictionaries were kept in memory, disabled the LRU-caching of already looked-up tokens, and did multiple runs to get a representative average for the runtime. The results are of course still highly dependent on my machine and the test data and might look different for other data.
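Roughly, each timing run looked like the following (a simplified sketch assuming the text_lemmatize(text, lang=...) interface referenced in the table below; disabling simplemma's internal LRU cache is not shown here):

```python
import timeit

from simplemma import text_lemmatize

# Hypothetical sample file; the benchmark used all Universal Dependencies treebank texts.
with open("ud_sample_de.txt", encoding="utf-8") as f:
    sample_text = f.read()

runs = 5
total = timeit.timeit(lambda: text_lemmatize(sample_text, lang="de"), number=runs)
print(f"text_lemmatize: {total / runs:.2f}s per run on average")
```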

| | dict | MARISA-tries with DEFAULT_NUM_TRIES | MARISA-tries with MIN_NUM_TRIES |
|---|---|---|---|
| text_lemmatize() | 46.3s | 84.3s | 75.4s |
| langdetect() | 41.0s | 66.5s | 60.1s |

As shown above, performance, both for lemmatization and language detection, suffers quite a bit. Mind that this is still the performance achieved with a mature C++ library under the hood, so I expect a native Python trie implementation to do worse, both in terms of trie size and lookup performance. As marisa-trie offers a pretty exhaustive list of wheels for all kinds of different platforms, I believe using it would be nothing to worry about from a maintenance perspective.

There is a very big caveat though: the benchmarks above were done with pre-computed MARISA-tries, which were persisted using pickle, the same way as it's currently done with the dicts. However, serialized MARISA-tries aren't architecture-independent and, as far as I can see, pickling a trie just uses the native C++ serialization. Storing the data in another format is not feasible in my opinion, as that'd increase the loading time immensely. Using serialized MARISA-tries would still be doable, but would require simplemma to generate wheels for the different platforms and architectures during the release of a new version and to publish a wheel for each platform and architecture. I guess that's probably too much headache.

Long story short: I'm quite impressed with the results, and performance-wise it'd be great for my use case, as I'd happily trade lookup performance for memory utilization. However, I'm not sure whether the impact on lookup performance and the issue with persisting the tries would be blockers for using this specific library in simplemma.

osma commented 6 months ago

@Dunedan this is great work!

I noticed that you had disabled LRU caching of already looked up tokens in your benchmarks. While this is a good way to measure the raw speed of the data structure implementation, I don't think it gives a realistic view of performance from a real-world user perspective, right?

With LRU caching enabled, you could see the actual performance impact on users if the data structure were to be changed. In your benchmarks the tries were about 40-50% slower than the current dict implementation, but I would expect this gap to be smaller with LRU caching enabled. Perhaps the LRU cache size could be increased as well, because caching is a time vs. memory trade-off and if memory usage decreases significantly due to the trie data structures, then some of that saved memory could be spent on the LRU cache if that makes overall performance better.
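For illustration, the knob in question is essentially the maxsize of a functools.lru_cache wrapped around the per-token lookup; a generic sketch of the trade-off (not simplemma's actual internals):

```python
from functools import lru_cache

def trie_lookup(token: str) -> str:
    # Stand-in for the real dict/trie lookup whose cost the cache hides.
    return token.lower()

# A larger maxsize keeps more token/lemma pairs alive in memory
# in exchange for a higher hit rate on repeated tokens.
@lru_cache(maxsize=2**20)
def cached_lookup(token: str) -> str:
    return trie_lookup(token)
```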

Dunedan commented 6 months ago

While this is a good way to measure the raw speed of the data structure implementation, I don't think it gives a realistic view of performance from a real-world user perspective, right?

Yes, that's true. I did that because I used timeit to do multiple iterations of the benchmarking, and every run after the first one would've just hit the cache instead of the data structure used to store the data, skewing the results. I recognize that I could've avoided that by just initializing new instances of Lemmatizer per run, but I wanted quick results to get a picture of what to expect, and disabling the cache was the first thing that crossed my mind in that situation. Also, when using the cache, the results depend even more on the actual input texts used.

Perhaps the LRU cache size could be increased as well, because caching is a time vs. memory trade-off and if memory usage decreases significantly due to the trie data structures, then some of that saved memory could be spent on the LRU cache if that makes overall performance better.

I'd argue that the default cache size might already be too large, as it'll on its own already consume quite a lot of memory when filled.

adbar commented 6 months ago

@Dunedan First I'd like to say I play 0 A.D. from time to time so it is really nice to see you find Simplemma useful in this context.

Thanks for sharing your results, the marisa-trie library is indeed what I would go for. It would not be as portable, but we could let the users re-compile the dictionary data locally; your code does not need a lot of changes and should work nicely.

I'll focus on working on bytestring dictionaries first as they improve things a bit but if you want to write a PR here is what I suggest:

  1. Make marisa-trie available as optional dependency
  2. Add an argument to the existing function in dictionary_pickler.py to re-compile the dictionaries in place on user's request
juanjoDiaz commented 6 months ago

A good alternative could also be to publish the simplemma dictionaries (including a dictionary factory) as a separate module. That way, users could decide which dictionaries to use based on their own needs.

Dunedan commented 6 months ago

@Dunedan First I'd like to say I play 0 A.D. from time to time so it is really nice to see you find Simplemma useful in this context.

Great to hear. Our hope is that we can use Simplemma to more easily detect profanity used by people in the multiplayer lobby. Whether that works out remains to be seen.

Thanks for sharing your results, the marisa-trie library is indeed what I would go for. It would not be as portable, but we could let the users re-compile the dictionary data locally; your code does not need a lot of changes and should work nicely.

I'll focus on working on bytestring dictionaries first as they improve things a bit but if you want to write a PR here is what I suggest:

  1. Make marisa-trie available as optional dependency
  2. Add an argument to the existing function in dictionary_pickler.py to re-compile the dictionaries in place on user's request

I'm not sure if that'd be a good user experience, as it'd require manual steps for each installation.

While that could be streamlined by doing it automatically during installation, my understanding is that the Python ecosystem is moving away from supporting the execution of custom code at installation time, so that's probably not a good way forward. An alternative would be to generate the trie-based dictionaries on the fly when a user calls get_dictionary() of a trie-backed DictionaryFactory for the first time. That'd slow down the first load by a second or two per language, but IMO, as long as that's properly documented, it'd be fine.
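A stripped-down sketch of that idea (hypothetical class name, file names and cache location; it only illustrates generate-on-first-use plus on-disk caching, not a finished DictionaryFactory):

```python
import lzma
import pickle
from pathlib import Path

import marisa_trie

CACHE_DIR = Path.home() / ".cache" / "simplemma-tries"  # hypothetical location


class TrieDictionaryFactory:
    """Builds a BytesTrie per language on first use and caches it on disk."""

    def __init__(self, data_dir: str = "."):
        self._data_dir = Path(data_dir)
        self._tries = {}
        CACHE_DIR.mkdir(parents=True, exist_ok=True)

    def get_dictionary(self, lang: str) -> marisa_trie.BytesTrie:
        if lang not in self._tries:
            cache_file = CACHE_DIR / f"{lang}.marisa"
            trie = marisa_trie.BytesTrie()
            if cache_file.exists():
                trie.load(str(cache_file))
            else:
                # Load the shipped pickled dict (assumed here to be str -> str)
                # and convert it once; subsequent calls reuse the cached trie.
                with lzma.open(self._data_dir / f"{lang}.plzma", "rb") as f:
                    data = pickle.load(f)
                trie = marisa_trie.BytesTrie(
                    (word, lemma.encode("utf-8")) for word, lemma in data.items()
                )
                trie.save(str(cache_file))
            self._tries[lang] = trie
        return self._tries[lang]
```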

I could prepare a draft PR for that if there is interest for this.

A good alternative could also be to publish the simplemma dictionaries (including a dictionary factory) as a separate module. That way, users could decide which dictionaries to use based on their own needs.

Unrelated to the discussion about storing the dictionaries as tries or not, I believe that's an excellent idea, as it can be used to cut down the installation size of Simplemma significantly. What I'm thinking of is that only the English dictionary gets installed by default and users have to specify the additional dictionaries they need as extra dependencies, for example along these lines:
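```
pip install simplemma[de]
```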

The additional packages could then be published under names like simplemma-dict-de on PyPI. With that, users could decide which dictionaries to install and reduce the amount of required disk space.

adbar commented 6 months ago

Concerning the first point and before you draft a PR: how about using pure-Python tries? Maybe the slowdown is acceptable considering the portability of this solution?

Breaking down language data into several packages is nice but I don't want to end up with 35 different language packages to maintain... How about simplemma-base and simplemma-extra? Then we would need to discuss how to split the data. Even so I wonder if it's worth the trouble considering that downloading about 60-70 MB is not that much by today's standards. In any case, it would be nice to document how to build an extension package in case someone wants to publish or maintain data for particular languages.

juanjoDiaz commented 6 months ago

I was not thinking of having a package per language dictionary.

I was more thinking of having a package with all dictionaries as bytestring dicts, another that has them as tries, etc. I could create a custom one adding some languages, or another one with a different data structure, and so on.

Dunedan commented 6 months ago

Concerning the first point and before you draft a PR: how about using pure-Python tries? Maybe the slowdown is acceptable considering the portability of this solution?

The thing is: it doesn't have to be just any trie implementation, but one specifically designed for optimized memory usage. The README of MARISA-trie includes a table comparing the storage needed with different trie implementations to highlight its efficiency: https://github.com/s-yata/marisa-trie#description

However, I also did some experiments with my own naive, pure-Python implementation of a trie. It performed worse in every regard, even memory consumption, compared to the current approach with pickled dicts. While it's possible that I did something fundamentally wrong with my trie implementation, what I want to get at is that implementing a trie with the desired characteristics isn't trivial and should therefore be left to specialized libraries in my opinion.

Breaking down language data into several packages is nice but I don't want to end up with 35 different language packages to maintain

I believe that wouldn't necessarily be more effort than keeping all language data in a single package: the data could still live in a single repository, and the creation and upload of one wheel per language could be automated. Other projects do that as well. One project which comes to my mind is typeshed; however, their solution for automating wheel creation and upload is probably a bit overengineered for other projects.

Even so I wonder if it's worth the trouble considering that downloading about 60-70 MB is not that much by today's standards.

While the current package size is no deal-breaker for me, it still feels wasteful. I'd also be fine with keeping Simplemma a single package for now. A split could be reconsidered if its size grows in future.

Sorry for having side-tracked the discussion with the per-language package split idea, while it's actually about memory consumption.

Dunedan commented 6 months ago

So, I managed to put together something I believe does work quite well.

It's still using a MARISA-trie per language, but these tries are now generated from the pickled Python dicts shipped by Simplemma on first usage and cached afterwards. Generating the tries for all languages takes ~40s on my machine, compared to the ~13s it takes to just load the pickled Python dicts with bytes. As generating the tries requires loading the pickled dicts, memory usage is higher while generating the tries than afterwards, when the cached ones can be used. That's especially bad for Swahili, Finnish and Polish, which require more than 900MB of memory during generation (every other language takes less than 500MB). However, that only has to be done once and can also be done manually, even on another computer.

I also managed to cut down loading the tries for all languages from ~43s to <1s, by simply not storing them compressed. As the tries are already a dense and optimized data structure, they had a compression ratio of only 1.7:1 anyway. The cached tries now take up ~135MB on disk.
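For reference, skipping the extra compression layer just means relying on the library's native serialization (assuming marisa-trie's save()/load() methods):

```python
import marisa_trie

trie = marisa_trie.BytesTrie([("geht", b"gehen")])
trie.save("de.marisa")  # dense native format, no additional LZMA layer

loaded = marisa_trie.BytesTrie()
loaded.load("de.marisa")  # avoids the decompression cost entirely
print(loaded["geht"][0])  # b'gehen'
```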

Included is also logic to automatically regenerate the tries, when the content of the pickled dicts changes, which will certainly be the case from time to time when new versions of Simplemma get released.

All of this is implemented as an additional DictionaryFactory and doesn't require changes elsewhere in Simplemma. For that I had to wrap the BytesTrie with a class making it behave like a dict with bytestring keys and values.
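Roughly, such a wrapper could look like this (a minimal sketch of the idea, not the actual code from the PR):

```python
import marisa_trie

class BytesTrieDict:
    """Expose a marisa_trie.BytesTrie through the dict subset simplemma uses."""

    def __init__(self, trie: marisa_trie.BytesTrie):
        self._trie = trie

    def __contains__(self, key: bytes) -> bool:
        return key.decode("utf-8") in self._trie

    def __getitem__(self, key: bytes) -> bytes:
        # BytesTrie returns a list of values per key; the dictionaries map 1:1.
        return self._trie[key.decode("utf-8")][0]

    def get(self, key: bytes, default=None):
        try:
            return self[key]
        except KeyError:
            return default
```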

Performance-wise it's perceptibly slower than using the pickled dicts, but still plenty fast. I struggled a bit to produce useful benchmark numbers, as the performance varies so much depending on the use case (cache hit rate, number of languages, lifetime of DictionaryFactory objects, ...), so I'm not going to present numbers for that.

Here is a PR with the implemented changes: https://github.com/adbar/simplemma/pull/133

Please let me know what you think about that and how the performance impact is for your typical use cases.