BurntSushi / aho-corasick

A fast implementation of Aho-Corasick in Rust.
The Unlicense
1.01k stars 93 forks source link

Feat Req: Making it easier to statically construct an AcAutomaton #9

Closed talevy closed 5 years ago

talevy commented 8 years ago

Hi there!

I am interested in creating something like the equivalent of the phf_codegen, but for AcAutomaton.

use case: search across CommonCrawl's dataset in search of emojis quickly!

To construct the AcAutomaton struct statically, I need to create State objects, but this is not possible. I guess the fields would have to be public for my current strategy to work as well... What do you think about potential changes for making this possible?

here is the idea what I'd like to achieve: http://play.integer32.com/?gist=a218ca3476bfc40fb39ff6ac79da8572

thanks!

BurntSushi commented 7 years ago

Sorry for the very late reply. I suppose I'm not opposed to changes in this direction, although I don't like the idea of exposing all of the internal state since that's something that could feasibly change. You might consider building out your own automaton and implementing the Automaton trait instead.

lukaslueg commented 7 years ago

Wouldn't it be easiest to have serialize/deserialize available if the pattern type is also serialize/deserialize ? One could include the json representation into the code and have it deserialize on startup. Breakage due to state representation changes would end up as well-handled serde errors.

lukaslueg commented 7 years ago

I've added feature-gated serde-serialization to AcAutomaton and did some benchmarking. The following table compares nanoseconds per item for constructing a AcAutomaton either via ::new() or a bincode-serialized version of it.

Items ns/item (::new()) ns/item (::deserialize()) bytes/item
5000000 items 1479 236 61
1000000 items 1433 200 60
100000 items 1401 203 59
1000 items 1527 289 61
100 items 1628 330 90
10 items 2536 1744 390

It seems beneficial to construct the automaton from its serialization for just a few hundred items upwards. On a modestly fast CPU like this mine, you need hundreds of thousands of items before constructing the automaton actually makes a difference in total runtime, though. Five million items of ~20 characters each amount to 280mb in bincode. I tested compression with LZ4, which brings it down to 12 bytes per item but totally kills the performance advantage.

In the end, the use case is something like:

Another use case is obviously storing automata on disk for your crazy world domination plans.

stockholmux commented 5 years ago

@lukaslueg I'm interested in doing something similar with serde. Do you have this version posted anywhere?

lukaslueg commented 5 years ago

@stockholmux I'm sorry, I removed that branch a long time ago.

BurntSushi commented 5 years ago

I'm going to close this issue out for mostly the same reasons for closing #22. See my comment here for an explanation. TL;DR - You can use regex-automata.