WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License
927 stars 122 forks source link

High memory usage during pickling #102

Closed Dobatymo closed 5 years ago

Dobatymo commented 5 years ago

First: Thank you so much for fixing the pickling bug. Now let me be so impertinent to ask you if it is possible to reduce the memory usage during pickle.dump.

Using the same wikititles dataset as before, loading 7000000 lines requires 2.87 GB of memory. However, during pickle, the Python process reaches peak memory of 10.45 GB. That means about 7.5 GB of memory are used to pickle the object, which is almost 3 times the size of the original object in memory. The resulting file is 2.17 GB. For larger automatons I will run out of memory quickly and python raises a MemoryError.

WojciechMula commented 5 years ago

Thanks for this report, I'm aware of the issue. The problem is that we need to map physical addresses of objects into some abstract numbers and now it's done in an extremely naive way.

WojciechMula commented 5 years ago

@Dobatymo I tried to solve it, but seems pickle module does not support chunking of input. I mean, pickle collects all data and then write everything using single call to write() on a file object.

I'm thinking about adding custom save/load methods, that wouldn't be that memory consuming.

Dobatymo commented 5 years ago

Thank you for your efforts! It would be great if your custom save method could save it in a pickle compatible format. So that normal pickle.load would still work. But of course I would be happy also with any custom save/load methods.

WojciechMula commented 5 years ago

I know what you mean, but I am afraid it would be a burden for me. Pickle supports several formats, and even if we stick to the "highest", we need to catch up with changes, take into account Py2 & Py3, i.e. we'd have a heavy dependency. But this is my impression at the very moment, and I feel that it;d be easier to fix the official pickle module to support progressive saving. :)

BTW I did some hacking in the meantime. The full save/load sequence in Python code would be like this:

import ahocorasick
import pickle

A = ahocorasick.Automaton()
path = "automaton.dat"

A.save(path, pickle.dumps)
B = ahocorasick.load(path, pickle.loads)

Here is time & memory usage comparison of the WIP code:

+----------+---------------+------------+
| action   | memory [B]    |  time [s]  |
+==========+===============+============+
| save     |   597 880 832 |      2.58  |
+----------+---------------+------------+
| load     |   986 472 448 |      9.68  |
+----------+---------------+------------+
| pickle   | 1 757 429 760 |      3.89  |
+----------+---------------+------------+
| unpickle | 1 197 559 808 |      1.46  |
+----------+---------------+------------+

Automaton consumes 597MB. It means that save takes (almost) no extra memory; load requires extra memory. The current implementation of load is slow, I think it might be faster, perhaps at some memory cost.

Dobatymo commented 5 years ago

That looks great. I understand that the Pickle support would be too hard to maintain.

For loading, I would be happy to sacrifice speed for less memory. I mostly use this library in web apps, so loading is usually a one-time operation which just increases the time to start the app-server. Whereas if the memory usage is too high, it would prohibit loading the automaton at all. But that's just my preference.

Thanks again for your work!

WojciechMula commented 5 years ago

@Dobatymo I guess it's possible to shrink memory requirement to really small values, but at cost of multiple re-reading input file and re-scanning memory. Say, your machine has 16GB, and automaton would occupy 15,5GB. Then you might request that 50MB memory during loading is the max.

BTW, have you ever specified the store attribute of Automaton class. I mean one of the constants STORE_ANY, STORE_INTS?

Dobatymo commented 5 years ago

Of course it would be great to have the loading method be adaptable to the available memory. But I think at the moment my limiting factor is pickle.dump, the new save/load methods would already improve things a lot.

I have not specified the STORE attribute, as I use the automaton to store Python bytes objects, and there is no constant for that, is there?

WojciechMula commented 5 years ago

@Dobatymo I released 1.3.0 which supports save/load methods. Usage is exactly as I posted above, but since you store only bytes objects, I believe callbacks passed to save/load can be simple lambda x: x

If loading time will be still too slow for you, please open a new issue. (I'll try to improve it, but without an issue it's not a priority.)

Speaking of STORE constants, bytes object are just python native objects, so no special flag is needed. I'm asking, because I'd like to get rid off this flag and then store only python objects.