Diaoul / babelfish

BabelFish is a Python library to work with countries and languages
BSD 3-Clause "New" or "Revised" License

Memory optimization #14

Closed Toilal closed 9 years ago

Toilal commented 10 years ago

Trying to optimize memory usage in babelfish.

Related to https://github.com/wackou/guessit/issues/97

Toilal commented 10 years ago

I still need to add some methods to the DataTable class (like named column support), and then use it in the various converters.

The global data structures should be removed too (LANGUAGES, LANGUAGE_MATRIX, ...)
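
As a rough sketch of what such a table could look like (this DataTable body is illustrative, not the branch's actual class):

```python
# Array-backed table: rows stay as compact tuples and column names
# are resolved through a single shared name-to-index mapping.
class DataTable(object):
    def __init__(self, columns, rows):
        self._index = {name: i for i, name in enumerate(columns)}
        self._rows = rows  # list of tuples, one per record

    def get(self, row, column):
        return self._rows[row][self._index[column]]

table = DataTable(('alpha3', 'name'), [('fra', 'French'), ('eng', 'English')])
print(table.get(1, 'name'))  # 'English'
```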

Diaoul commented 10 years ago

How much memory would be saved using this? 50% less?

Toilal commented 10 years ago

For LANGUAGE_MATRIX, it was 2.8MB with the old data structure (in the current language module), and is now 0.8MB with the new one (in the memory branch test case).
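
For anyone wanting to reproduce such numbers, here is a minimal sketch using the third-party pympler package (the measurement tool is my assumption, not necessarily what was used here):

```python
# Deep (recursive) size of a structure with pympler; sys.getsizeof
# alone only reports the shallow size of the outer container.
from pympler import asizeof

from babelfish import LANGUAGE_MATRIX  # the pre-branch global matrix

print(asizeof.asizeof(LANGUAGE_MATRIX))  # total bytes; varies by interpreter
```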

Diaoul commented 10 years ago

I'm curious how performance is affected by this.

Toilal commented 10 years ago

I'll try to make another implementation based on mmap, for even less memory consumption.

Toilal commented 10 years ago

MmapDataTable is almost memory-free, but will be less performant as each call results in a read on the filesystem. For ArrayDataTable, I think it will be as performant as the previous namedtuple-based implementation.

Still, I think MmapDataTable is the best implementation for almost every use of babelfish.
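
A rough illustration of the mmap idea, assuming fixed-width ASCII records (the record layout and RECORD_SIZE are hypothetical; the branch's MmapDataTable may work differently):

```python
# Look a record up directly in the on-disk file through mmap; nothing
# stays resident except the small mmap handle, at the cost of a
# filesystem/page-cache read per lookup.
import mmap

RECORD_SIZE = 64  # hypothetical fixed record width in bytes

def read_record(path, index):
    with open(path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            offset = index * RECORD_SIZE
            return mm[offset:offset + RECORD_SIZE].rstrip(b' ')
        finally:
            mm.close()
```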

coveralls commented 10 years ago


Coverage increased (+2.29%) when pulling 884950f5145f6ed7099e040aea496b6703cb1ec6 on memory into b3cfc142c1ff981d60fccf00d595f2b1efc9ac41 on master.

Toilal commented 10 years ago

The next step is to modify the data structures in the base converters to use MmapDataTable (or ArrayDataTable) directly, and remove all preloaded data like LANGUAGES, LANGUAGE_MATRIX, ...

The current base classes should be kept for quick custom implementations though.

Diaoul commented 10 years ago

I've done some profiling too. It seems import babelfish takes 4.8MB, of which 1.4MB is not due to loading the language file. Loading converters is quite cheap; the maximum I saw was 0.2MB per converter. So that's 3.4MB to optimize. I don't like the idea of hitting the filesystem more than once (at load time); if anything should be optimized, I'd say we simply optimize how the data is stored. This is plain ASCII, so we can store raw bytes rather than unicode.

Honestly, I don't think this is worth spending time on for now.

Toilal commented 10 years ago

ArrayDataTable already stores raw bytes and converts to unicode on the fly when a value is requested.

Your option is the simple way to optimize, but keep in mind that we would load everything from the ISO file even if we only need one converter (alpha2, for example), losing the benefits of lazily loaded converters ...
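
A tiny sketch of the decode-on-access idea (the pipe-separated record layout here is made up for illustration):

```python
# Keep rows as raw ASCII bytes; decode a single field only when read.
class BytesRow(object):
    def __init__(self, raw):
        self.raw = raw  # one undecoded record

    def field(self, i):
        return self.raw.split(b'|')[i].decode('ascii')

row = BytesRow(b'fra|fre|fr|French')
print(row.field(3))  # 'French'
```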

But that's OK for me too for the moment, so now we have to:

Diaoul commented 10 years ago

That's the second point that bothers me: I'd like the API to be easy to use and straightforward for converters. I get your point, but optimizing memory usage for a few MB isn't worth it. Not in Python anyway.

Currently guessit uses 43.5MB (without babelfish). I think it's OK for guessit to use 15MB or so as it is doing some heavy stuff, but 43.5MB is just too much. With guessit fixed we can close this issue and move on. Maybe we'll get back to it later, but I doubt it, as even embedded devices now have a lot of RAM available: 512MB on the RPi and most common NASes.

Toilal commented 10 years ago

You're right :)

I'll finish this branch in the following days, and merge it into master when everything is working properly.

By the way, I'll try to fix the build/test errors on Python 3.

Diaoul commented 10 years ago

Also, LANGUAGES and LANGUAGE_MATRIX might be useful for end users. They are easy to work with: everything is pre-loaded in well-known Python structures.

Let's keep this memory stuff in a separate branch and get back to it if somebody complains about it. We need to be careful about the side effects and maintainability...

Toilal commented 10 years ago

I would agree if those structures weren't loaded at babelfish import time. How about moving them from __init__.py to a new, dedicated matrices module? The user would then have the choice to import this module or not.

I'll try to work more on this branch and propose the best of both approaches, with minimal API changes.

With some more work and ideas, I think we can keep it as Pythonic as possible but lighter in memory.

Toilal commented 10 years ago

LANGUAGES, SCRIPTS and COUNTRIES are still loaded at import time and available to the user (as in master). They are used by the Language, Script and Country constructors to check validity. Those structures are quite light compared to their *_MATRIX equivalents.

LANGUAGE_MATRIX, SCRIPT_MATRIX and COUNTRY_MATRIX were removed. To retrieve the data that was contained in those constants, the user should use one of the integrated converters. Each integrated converter only loads its required data from the ISO file.

If the user really needs those matrices, they can still be loaded using the methods iso.get_countries_data(matrix=True), iso.get_languages_data(matrix=True), and iso.get_scripts_data(matrix=True).

import babelfish now consumes 1.8MB (the size of the remaining constants). I consider this OK for guessit. Loading a converter will increase memory usage, but only with converter-related data. Unloading a converter will also release this memory.
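
In usage terms, that gives something like this sketch (the exact import path of the iso module is my reading of the branch and may differ):

```python
# Light constants are still available right after import:
from babelfish import LANGUAGES, SCRIPTS, COUNTRIES

# Heavy matrices are now opt-in, loaded on demand from the ISO files:
from babelfish import iso

language_matrix = iso.get_languages_data(matrix=True)
country_matrix = iso.get_countries_data(matrix=True)
script_matrix = iso.get_scripts_data(matrix=True)
```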

Diaoul commented 10 years ago

I like this approach. I see that get_languages_data is mostly used in loops. Why not make it a generator? That done, I don't see why we'd need the matrix parameter: just behave as if it were always True and use a generator comprehension to fill the sets LANGUAGES, COUNTRIES and SCRIPTS.

The only issue (is that really an issue?) is that we read the whole file multiple times when loading converters. How does this impact the loading time of a converter compared to the in-memory LANGUAGE_MATRIX?
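
Something like this minimal sketch (the file name and record parsing are illustrative):

```python
# Yield one parsed record at a time instead of materializing a matrix.
def get_languages_data():
    with open('iso-639-3.tab', 'rb') as f:
        f.readline()  # skip the header line
        for line in f:
            yield line.decode('ascii').rstrip('\n').split('\t')

# Fill the light constant with a set comprehension over the generator:
LANGUAGES = {record[0] for record in get_languages_data()}
```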

Toilal commented 10 years ago

:+1: for the generator, I'll implement it.

For performance, I don't know. The speed of a loaded converter is important (the performance of the convert and reverse methods), but loading a converter type happens only once and is not really important, I think. You can run some tests if you want.

And one last good point: we could optimize memory usage even more by calling the intern() function. Equal strings loaded from different converters will then share the same string instance, and global memory consumption should not grow beyond the current master matrices when using multiple converters.
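
For illustration, interning in Python 3 (the string is built at runtime to dodge the interpreter's automatic interning of literals):

```python
import sys

a = sys.intern('fra')
b = sys.intern(''.join(['f', 'ra']))  # equal string built at runtime

print(a == b)  # True: equal values
print(a is b)  # True: one shared instance, stored once in memory
```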

Having our cake and eating it too :)

Toilal commented 10 years ago

Sadly, Python 2 can't intern() unicode strings, so I added the intern calls only for Python 3.
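
The sort of guard this implies (maybe_intern is a hypothetical name for illustration):

```python
import sys

if sys.version_info[0] >= 3:
    from sys import intern as maybe_intern  # str is unicode in Python 3
else:
    def maybe_intern(s):
        return s  # Python 2 intern() rejects unicode objects; no-op
```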

I also added performance tests.

I think we are ready to merge, and then maybe release a 0.6 version?

Toilal commented 10 years ago

Can we perform the merge, to prepare the guessit release?

And which release, 0.5.2 or 0.6?

Toilal commented 10 years ago

@Diaoul Are you OK with merging this into master? And then doing a 0.6 release on PyPI?

If you give me access on PyPI, I can do the work ...

Diaoul commented 10 years ago

I'd like to do some cleanup first and maybe remove the compat module, as it is only used to test for Python 3. I'll do that this evening.

wackou commented 10 years ago

I just released 0.5.2 as an interim release in order to support Python 3.4, but it doesn't include this branch yet. I haven't had time to read everything in depth, so I'd want Diaoul's opinion about it before merging, but there's one small thing that I noticed for now:

For example here: https://github.com/Diaoul/babelfish/blob/memory/babelfish/converters/opensubtitles.py#L22

you do lazy loading on the codes property and hence save the memory if it doesn't get called, but you have a huge set of languages that needs to be recreated each time you call it, at which point you'll need all the memory anyway.

So if you import babelfish and then never call this function, then yes, you save memory; but if you call it at least once, you will use that memory (only temporarily, yes), and then you might as well cache it from the beginning when you import the module. Pushing it further, if you use babelfish in a multithreaded context, each thread will need the memory to instantiate codes, whereas having it as a class variable only requires the memory to store it once.
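
One possible middle ground, sketched here as an assumption rather than the branch's actual code (_load_codes is a hypothetical helper): build codes on first access, then cache it on the class so all instances and threads share one copy.

```python
class OpenSubtitlesConverter(object):
    _codes = None  # shared, class-level cache

    @property
    def codes(self):
        cls = OpenSubtitlesConverter
        if cls._codes is None:               # built once, on first access
            cls._codes = self._load_codes()  # hypothetical expensive load
        return cls._codes
```

Without a lock, two threads could race and build the set twice, but the end state is the same either way.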

I can still see why this would make sense on very memory-constrained devices, but I'm not sure this should be the default. Maybe make the behavior configurable at runtime.

Diaoul commented 10 years ago

Exactly my point of view. babelfish's memory consumption isn't that bad IMO, given what it does. I remember that I investigated optimizing the in-memory storage (using bytes rather than strings).

Let's keep it simple for the moment and get back to it eventually, but I don't see babelfish's memory consumption being a problem in the future, given that even embedded devices now have a lot of RAM.