Open · pombredanne opened this issue 1 year ago
What's the motivation for this? It would be useful to understand that, since it helps inform the design approach (or whether it's even necessary at all; maybe I can suggest an alternative).
As far as whether it's possible... two approaches:

- Get the underlying aho-corasick crate to add optional serde support. This would be a pretty small patch for them, so plausibly they'd be happy to do it. Probably would want some sort of version tag, since the underlying object can change.
- Have the Python wrapper pickle the original patterns and rebuild the automaton when unpickling.

Either way there would be some code added to this repo as well.
> This would be a pretty small patch for them, so plausibly they'd be happy to do it.
Sadly this is not a small patch if we do it the way I want to do it. And as you point out, it's a compatibility hazard. (I do agree that a naive patch adding Serde support would probably be smallish in isolation.)
I do want some kind of serialization support some day, like what is in `regex-automata` for DFAs, but not with Serde. With Serde, you have to go through a costly (de)serialization roundtrip. That might work for your pickle use case, but what I want is for the actual raw underlying byte representation in memory to also be the serialization format. (Again, like DFAs are in `regex-automata`.) That way, deserialization is done in constant time by simply re-interpreting the bytes. This unlocks lots of cool use cases like, "I have a huge Aho-Corasick automaton that I want to generate offline and embed into my binary so I don't have to generate it at runtime." But... it is a lot of work and tends to introduce a lot of `unsafe` where there is exactly none today.
*twitches in big-endian* Not that anyone runs on big-endian platforms these days? I guess you could use some zero-copy serialization format like capnproto or something to represent the data structures? Regardless, yeah, that'd be a lot of work.
And I guess it's possible naive serde deserialization wouldn't even be any better than reparsing the needles, especially if the Python wrapper switches to the faster new NFA thing by default.
Probably naive serde deserialization will be noticeably faster than actually building the automaton itself. That kind of deserialization would do a lot of bulk copying, which is going to be much faster than the byte-by-byte construction process of Aho-Corasick (and then the breadth-first search of the entire trie to fill in the failure transitions).
As far as big-endian goes, indeed, for `regex-automata` you have to build both little-endian and big-endian serialization files for each DFA. Assuming you're embedding them into your binary, only one gets compiled in. But yeah, that is a downside of the zerocopy approach.
I'd be surprised if something like capnproto could work for this, but I don't know of anyone who has tried it.
Anyway, we'll see what the use case is. Pickling is often used for multiprocessing support in Python, for example, and for that use case aggressively invalidating old pickles with versioning isn't a problem, since it's always the same version of the library in the parent and child processes.
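For what it's worth, the rebuild-on-unpickle flavor of this (pickling just the patterns and reparsing the needles) could look roughly like the sketch below on the Python side. This is only an illustration: it assumes ahocorasick_rs's `AhoCorasick(patterns)` constructor and `find_matches_as_strings()` method, and the wrapper class, version tag, and helper names are made up.

```python
import pickle

from ahocorasick_rs import AhoCorasick  # assumes the ahocorasick_rs package

# Illustrative version tag; bump whenever the pickled layout changes.
FORMAT_VERSION = 1


class PicklableAhoCorasick:
    """Hypothetical wrapper: remembers the patterns and rebuilds on unpickle."""

    def __init__(self, patterns):
        self._patterns = list(patterns)
        self._automaton = AhoCorasick(self._patterns)

    def find_matches_as_strings(self, haystack):
        # Delegate to the underlying automaton.
        return self._automaton.find_matches_as_strings(haystack)

    def __reduce__(self):
        # Pickle only the version tag and the patterns, not the automaton.
        return (_rebuild, (FORMAT_VERSION, self._patterns))


def _rebuild(version, patterns):
    if version != FORMAT_VERSION:
        raise pickle.UnpicklingError(f"unsupported format version: {version}")
    return PicklableAhoCorasick(patterns)
```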
For me, the use case would be caching the automaton. Currently we build it from the (cached) terms in each request where it is needed, but that takes considerable time, and I think deserializing would be much faster. For compatibility, it would be enough if it refused to load incompatible serialized data; we can fall back to rebuilding it from the terms (roughly as sketched below).
Another possible approach would be to share the automaton between processes.
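For the caching use case above, the refuse-and-rebuild fallback could be structured roughly like this, assuming some form of pickling/serialization support eventually lands in the library; the cache tag, file layout, and function name here are purely illustrative.

```python
import pickle
from pathlib import Path

from ahocorasick_rs import AhoCorasick  # assumes the ahocorasick_rs package

# Illustrative compatibility tag, e.g. library version plus a format version.
CACHE_TAG = "ahocorasick_rs-x.y/format-1"


def load_automaton(terms, cache_path: Path) -> AhoCorasick:
    """Load a cached automaton if compatible, otherwise rebuild from the terms."""
    try:
        with cache_path.open("rb") as f:
            tag, automaton = pickle.load(f)
        if tag == CACHE_TAG:
            return automaton
    except (OSError, pickle.UnpicklingError, EOFError, ValueError):
        pass  # treat any failure as a missing or incompatible cache

    # Fall back to rebuilding from the (cached) terms, then refresh the cache.
    # Note: dumping the automaton itself assumes pickling support exists.
    automaton = AhoCorasick(terms)
    with cache_path.open("wb") as f:
        pickle.dump((CACHE_TAG, automaton), f)
    return automaton
```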
+1 to this feature. Same use case: e.g. 100 items to process with the automaton, a ProcessPoolExecutor created with 10 processes, and each process requires its own automaton instance. Creating a new automaton instance each time is rather inefficient.
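As a stdlib-only workaround today, a `ProcessPoolExecutor` initializer can build the automaton once per worker process (10 builds for 10 workers) instead of once per item (100 builds). A rough sketch, where the patterns, items, and work function are placeholders:

```python
from concurrent.futures import ProcessPoolExecutor

from ahocorasick_rs import AhoCorasick  # assumes the ahocorasick_rs package

_automaton = None  # per-process global, set by the pool initializer


def _init_worker(patterns):
    # Runs once in each worker process: build the automaton once per process,
    # not once per submitted item.
    global _automaton
    _automaton = AhoCorasick(patterns)


def _process_item(haystack):
    # Placeholder work function; uses the per-process automaton.
    return _automaton.find_matches_as_strings(haystack)


if __name__ == "__main__":
    patterns = ["foo", "bar", "baz"]    # illustrative needles
    items = ["foo bar", "bazaar"] * 50  # illustrative 100 items
    with ProcessPoolExecutor(
        max_workers=10, initializer=_init_worker, initargs=(patterns,)
    ) as pool:
        results = list(pool.map(_process_item, items))
```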
Hi: any plan to support pickling or some serialization/deserialization?