Closed talevy closed 5 years ago
Sorry for the very late reply. I suppose I'm not opposed to changes in this direction, although I don't like the idea of exposing all of the internal state since that's something that could feasibly change. You might consider building out your own automaton and implementing the Automaton
trait instead.
Wouldn't it be easiest to have serialize/deserialize available if the pattern type is also serialize/deserialize ? One could include the json representation into the code and have it deserialize on startup. Breakage due to state representation changes would end up as well-handled serde errors.
I've added feature-gated serde-serialization to AcAutomaton
and did some benchmarking. The following table compares nanoseconds per item for constructing a AcAutomaton
either via ::new()
or a bincode
-serialized version of it.
Items | ns/item (::new()) | ns/item (::deserialize()) | bytes/item |
---|---|---|---|
5000000 items | 1479 | 236 | 61 |
1000000 items | 1433 | 200 | 60 |
100000 items | 1401 | 203 | 59 |
1000 items | 1527 | 289 | 61 |
100 items | 1628 | 330 | 90 |
10 items | 2536 | 1744 | 390 |
It seems beneficial to construct the automaton from its serialization for just a few hundred items upwards. On a modestly fast CPU like this mine, you need hundreds of thousands of items before constructing the automaton actually makes a difference in total runtime, though.
Five million items of ~20 characters each amount to 280mb in bincode
. I tested compression with LZ4, which brings it down to 12 bytes per item but totally kills the performance advantage.
In the end, the use case is something like:
AcAutomaton
for.build.rs
, spend your six seconds there. Put the result into a buffer that is compiled in.lazy_static
, spend one instead of six seconds for first use.Another use case is obviously storing automata on disk for your crazy world domination plans.
@lukaslueg I'm interested in doing something similar with serde. Do you have this version posted anywhere?
@stockholmux I'm sorry, I removed that branch a long time ago.
I'm going to close this issue out for mostly the same reasons for closing #22. See my comment here for an explanation. TL;DR - You can use regex-automata
.
Hi there!
I am interested in creating something like the equivalent of the
phf_codegen
, but for AcAutomaton.use case: search across CommonCrawl's dataset in search of emojis quickly!
To construct the AcAutomaton struct statically, I need to create
State
objects, but this is not possible. I guess the fields would have to be public for my current strategy to work as well... What do you think about potential changes for making this possible?here is the idea what I'd like to achieve: http://play.integer32.com/?gist=a218ca3476bfc40fb39ff6ac79da8572
thanks!