G-Research / ahocorasick_rs

Check for multiple patterns in a single string at the same time: a fast Aho-Corasick algorithm for Python
Apache License 2.0

Add support for pickle or some serialization/deserialization #41

pombredanne commented 1 year ago

Hi: is there any plan to support pickling or some other serialization/deserialization mechanism?

>>> import ahocorasick_rs
>>> patterns = ["hello", "world", "fish"]
>>> ac = ahocorasick_rs.AhoCorasick(patterns)
>>> import pickle
>>> q = pickle.dumps(ac)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot pickle 'builtins.AhoCorasick' object
itamarst commented 1 year ago

What's the motivation for this? Understanding that would help inform the design approach (or whether it's even necessary at all; maybe I can suggest an alternative).

As far as whether it's possible, there are two approaches:

  1. Rely on the underlying Rust library and serialize the state machine it creates. At the moment it does not support serialization (i.e. it doesn't integrate serde support). This would be a pretty small patch for them, so plausibly they'd be happy to do it. We'd probably want some sort of version tag, since the underlying object can change.
  2. Serialize just the underlying needles (search terms), and recreate the actual underlying search object during unpickling. This means more expensive unpickling, but also means pickles would usually be backwards- and forwards-compatible (see the sketch below).

Either way there would be some code added to this repo as well.
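
A minimal sketch of what approach 2 could look like on the Python side, assuming a hypothetical PickleableAhoCorasick wrapper (the wrapper and its names are illustrative, not part of the library; only the patterns go into the pickle, and the automaton is rebuilt on load):

import pickle

import ahocorasick_rs

class PickleableAhoCorasick:
    """Hypothetical wrapper: pickles only the needles, not the automaton."""

    def __init__(self, patterns):
        self._patterns = list(patterns)
        self._ac = ahocorasick_rs.AhoCorasick(self._patterns)

    def find_matches_as_indexes(self, haystack):
        return self._ac.find_matches_as_indexes(haystack)

    def __getstate__(self):
        # Only the needles are serialized, so the pickle format stays
        # independent of the automaton's internal layout.
        return {"patterns": self._patterns}

    def __setstate__(self, state):
        # Rebuilding here makes unpickling more expensive, but keeps
        # pickles backwards- and forwards-compatible.
        self._patterns = state["patterns"]
        self._ac = ahocorasick_rs.AhoCorasick(self._patterns)

ac = PickleableAhoCorasick(["hello", "world", "fish"])
restored = pickle.loads(pickle.dumps(ac))
print(restored.find_matches_as_indexes("hello world"))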

BurntSushi commented 1 year ago
  1. This would be a pretty small patch for them, so plausibly they'd be happy to do it.

Sadly this is not a small patch if we do it the way I want to do it. And as you point out, it's a compatibility hazard. (I do agree that a naive patch adding Serde support would probably be smallish in isolation.)

I do want some kind of serialization support some day, like what is in regex-automata for DFAs, but not with Serde. With Serde, you have to go through a costly (de)serialization roundtrip. That might work for your pickle use case, but what I want is for the actual raw underlying byte representation in memory to also be the serialization format. (Again, like DFAs are in regex-automata.) That way, deserialization is done in constant time by simply re-interpreting the bytes. This unlocks lots of cool use cases like, "I have a huge Aho-Corasick automaton that I want to generate offline and embed into my binary so I don't have to generate it at runtime." But... it is a lot of work and tends to introduce a lot of unsafe where there is exactly none today.
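
Here's a toy Python illustration of the zero-copy idea (nothing to do with the crate's actual internals): the serialized bytes are the in-memory representation, so "deserialization" is just reinterpreting the buffer rather than parsing it:

import array

# Stand-in for a transition table; the real thing would be far larger.
table = array.array("I", [0, 3, 7, 9])
raw = table.tobytes()              # "serialization" is the raw memory

view = memoryview(raw).cast("I")   # constant-time "deserialization"
print(view[2])                     # 7, read straight out of the bytes

# Note the bytes are in native endianness, which is where the
# little-endian/big-endian issue below comes from.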

itamarst commented 1 year ago

*twitches in big-endian* Not that anyone runs on big-endian platforms these days, right? I guess you could use some zero-copy serialization format like capnproto or something to represent the data structures? Regardless, yeah, that'd be a lot of work.

And I guess it's possible that naive serde deserialization wouldn't even be any better than reparsing the needles, especially if the Python wrapper switches to the faster new NFA implementation by default.

BurntSushi commented 1 year ago

Naive serde deserialization will probably be noticeably faster than actually building the automaton. That kind of deserialization does a lot of bulk copying, which is going to be much faster than the byte-by-byte construction process of Aho-Corasick (followed by the breadth-first search of the entire trie to fill in the failure transitions).
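
For a sense of where that construction time goes, here's a rough Python sketch of the classic algorithm (not the actual crate's code): every byte of every pattern extends the trie, and then a full breadth-first pass computes the failure links:

from collections import deque

def build_automaton(patterns):
    goto = [{}]     # goto[state][ch] -> next state
    fail = [0]      # failure link per state
    out = [set()]   # indexes of patterns ending at each state

    for i, pattern in enumerate(patterns):
        state = 0
        for ch in pattern:                   # byte-by-byte insertion
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(i)

    queue = deque(goto[0].values())          # depth-1 states fail to root
    while queue:                             # BFS over the entire trie
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:   # walk suffix links
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]       # inherit matches from suffix
            queue.append(nxt)
    return goto, fail, out

goto, fail, out = build_automaton(["he", "she", "hers"])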

As far as big-endian goes: indeed, for regex-automata you have to build both little-endian and big-endian serialization files for each DFA. Assuming you're embedding them into your binary, only one gets compiled in. But yeah, that is a downside of the zero-copy approach.

I'd be surprised if something like capnproto could work for this, but I don't know of anyone who has tried it.

itamarst commented 1 year ago

Anyway, we'll see what the use case is. Pickling is often used for multiprocessing support in Python, for example, and for that use case aggressively invalidating old pickles with versioning isn't a problem, since the parent and child processes always run the same version of the library.

nijel commented 1 year ago

For me, the use case would be caching the automaton. Currently we build it from the (cached) terms in each request where it is needed, but that takes considerable time, and I think deserializing would be much faster. For compatibility, it would be enough if it refused to load incompatible serialized data; we can fall back to rebuilding it from the terms.

prorealize commented 11 months ago

Another possible use case would be sharing the automaton between processes.

Azzonith commented 3 months ago

+1 to this feature. Same use case: e.g. 100 items to process with the automaton, a ProcessPoolExecutor created with 10 processes, and each process requires its own automaton instance. Creating a new automaton each time is rather inefficient.
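
For what it's worth, one workaround that works today is to build the automaton once per worker process via the pool's initializer, rather than once per item. A sketch (the module-level names here are illustrative):

from concurrent.futures import ProcessPoolExecutor

import ahocorasick_rs

PATTERNS = ["hello", "world", "fish"]
_AC = None  # per-process automaton, built once by the initializer

def init_worker(patterns):
    # The automaton can't be pickled and sent to workers, so each
    # worker builds its own copy exactly once.
    global _AC
    _AC = ahocorasick_rs.AhoCorasick(patterns)

def search(haystack):
    return _AC.find_matches_as_indexes(haystack)

if __name__ == "__main__":
    items = ["hello world"] * 100
    with ProcessPoolExecutor(
        max_workers=10, initializer=init_worker, initargs=(PATTERNS,)
    ) as pool:
        results = list(pool.map(search, items))
    print(results[0])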