BurntSushi / aho-corasick

A fast implementation of Aho-Corasick in Rust.

Serializing pre-built automaton #105

Closed plusvic closed 1 year ago

plusvic commented 1 year ago

I'm evaluating the possibility of using this crate for a project (https://github.com/VirusTotal/yara-x) as this implementation of Aho-Corasick seems to have all the features I need, and it's very fast.

The only feature I'm missing so far is the possibility of serializing an automaton, allowing me to load it again from a file without incurring the cost of building the automaton again. The automatons I plan to build are very large, with hundreds of thousands of short patterns, even millions. As I don't know much about the internal design of this crate, I'm not sure whether a feature like this makes sense; maybe the cost of deserializing the automaton would be similar to the cost of building it from scratch. Intuitively, I would say that for small automatons this won't make any difference, but for very large ones deserializing should be faster than building again.

I would volunteer to implement this feature if you think it's something worth having, but I would like to hear your advice on this. Thanks for your great work!

BurntSushi commented 1 year ago

I'll give a more detailed response later, but:

  1. Could you please measure automaton construction for your use case? It might be fast enough. If it is, then this particular rabbit hole can be avoided. (See the timing sketch after this list.)
  2. Have you seen regex-automata? It supports serializing DFAs.
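
A rough timing sketch for the measurement in item 1, assuming the revamped builder API on the master branch (what became the 1.0-style API). The pattern file path and the forced NoncontiguousNFA kind are just placeholders for your own setup:

```rust
use std::time::Instant;

use aho_corasick::{AhoCorasick, AhoCorasickKind};

fn main() {
    // Hypothetical input: one pattern per line. Substitute your real set.
    let data = std::fs::read_to_string("patterns.txt").expect("pattern file");
    let patterns: Vec<&str> = data.lines().collect();

    let start = Instant::now();
    let ac = AhoCorasick::builder()
        // Force a particular implementation; omit this to let the crate pick.
        .kind(Some(AhoCorasickKind::NoncontiguousNFA))
        .build(&patterns)
        .expect("failed to build automaton");
    println!("automaton build time: {:?}", start.elapsed());
    println!("automaton heap usage: {} bytes", ac.memory_usage());
}
```
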
BurntSushi commented 1 year ago

I want to start off by saying that this is not something that can just be added in a simple PR. A change like this is incredibly invasive. It impacts everything from the implementation all the way up to the API. I know this because I've actually done it for regex-automata. Moreover, doing this right involves some really critical trade-offs:

  1. In order to make deserialization have minimal cost, unsafe needs to be used to cast things like a &[u8] into a transition table. This means dealing with alignment and just generally making sure you aren't mucking things up. This kind of unsafe isn't as easy as, say, explicitly eliding bounds checks in a few places, but it also isn't as difficult as correct use of unsafe inside of a generic data structure. Still. It's a thing to consider. Right now this library uses zero unsafe outside of its SIMD optimizations. (A sketch of this kind of cast follows after this list.)
  2. The serialization format itself becomes an API guarantee. Since making deserialization be zero-copy is really what you want in order to make it cheap, it turns out that it's not just the serialization format that is an API guarantee, but the in-memory representation itself. This becomes a maintenance issue whenever one wants to evolve the implementation, because now you need to make sure you continue to support whatever the serialization format was. To be very clear, this isn't an "I don't know how to solve this problem." This is an "it makes internal evolution far more annoying than it is today."
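
A minimal sketch of the kind of cast mentioned in item 1, not this crate's code: it assumes a table of u32 entries serialized in native endianness, and a real format would also need an endianness and version check on top of this.

```rust
/// Reinterpret a serialized byte buffer as a `u32` transition table
/// without copying it. Returns `None` unless the buffer is correctly
/// aligned for `u32` and its length is an exact multiple of 4.
fn table_from_bytes(bytes: &[u8]) -> Option<&[u32]> {
    // SAFETY: `align_to` only places data in the middle slice when the
    // pointer is suitably aligned for `u32`; any misaligned head or
    // leftover tail bytes show up in `prefix`/`suffix` instead, and any
    // 4 bytes are a valid `u32` bit pattern.
    let (prefix, table, suffix) = unsafe { bytes.align_to::<u32>() };
    if prefix.is_empty() && suffix.is_empty() {
        Some(table)
    } else {
        None
    }
}
```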

Moreover, there are some practical issues. If you take a look at the code on the master branch, which is unreleased, there are now three different flavors of aho-corasick implementation (the current latest release only has two):

  1. A non-contiguous NFA, which, as you might guess from its name, cannot be deserialized in a zero-cost way because it has an unavoidable need for some kind of indirection. That is, its in-memory representation uses pointers everywhere.
  2. A contiguous NFA, which gets rid of the pointers and (roughly) uses one single Vec<u32>.
  3. A DFA, which does roughly the same.
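
To picture what "one single Vec<u32>" buys, here is a rough sketch of a flattened transition table. This illustrates the general technique, not this crate's actual layout; the stride of 256 (or a smaller byte-class count) is an assumption for the example.

```rust
// A rough sketch of a flattened transition table: each state owns a
// contiguous row of `stride` entries, so following a byte is a single
// indexed load with no pointer chasing, and the whole table lives in
// one allocation.
struct DenseTable {
    stride: usize,   // entries per state: 256, or fewer with byte classes
    table: Vec<u32>, // length == state_count * stride
}

impl DenseTable {
    fn next_state(&self, current: u32, class: usize) -> u32 {
        // `class` is either the raw byte or its equivalence class id.
        self.table[current as usize * self.stride + class]
    }
}
```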

So okay, you can serialize either a contiguous NFA or a DFA. That seems fine, right? The problem is that the contiguous NFA, because of its various compression tricks, (currently) has a more limited state ID space. So it will crap out with lots of patterns. And the DFA, for a million patterns, becomes impractically large. Here are some data points.

Building a noncontiguous NFA for 100,000 and then 1,000,000 patterns (titles randomly drawn from Wikipedia):

$ RUST_LOG=debug aho-corasick-debug ~/data/benchsuite/texts/wikipedia/titles.100000 ./LICENSE-MIT --no-search --kind noncontiguous 
pattern read time: 4.316483ms
[2023-03-22T12:08:07Z DEBUG aho_corasick::nfa::noncontiguous] building non-contiguous NFA
[2023-03-22T12:08:08Z DEBUG aho_corasick::nfa::noncontiguous] non-contiguous NFA built, <states: 1533980, size: 98592884>
[2023-03-22T12:08:08Z DEBUG aho_corasick::ahocorasick] forcefully chose noncontiguous NFA
automaton build time: 240.028578ms
automaton heap usage: 98592884 bytes

$ RUST_LOG=debug aho-corasick-debug ~/data/benchsuite/texts/wikipedia/titles.1000000 ./LICENSE-MIT --no-search --kind noncontiguous
pattern read time: 16.474072ms
[2023-03-22T12:08:10Z DEBUG aho_corasick::nfa::noncontiguous] building non-contiguous NFA
[2023-03-22T12:08:13Z DEBUG aho_corasick::nfa::noncontiguous] non-contiguous NFA built, <states: 12861890, size: 828984632>
[2023-03-22T12:08:13Z DEBUG aho_corasick::ahocorasick] forcefully chose noncontiguous NFA
automaton build time: 3.151933665s
automaton heap usage: 828984632 bytes

This kind of looks promising, right? A little over 3 seconds (on my machine) to build an automaton with 1,000,000 patterns. You might stop and ask yourself whether that's good enough. Remember though, this is the non-contiguous NFA. It's the slowest one for search times. But maybe it's good enough for you.

OK, let's try the contiguous NFA:

$ RUST_LOG=debug aho-corasick-debug ~/data/benchsuite/texts/wikipedia/titles.100000 ./LICENSE-MIT --no-search --kind contiguous 
pattern read time: 3.679582ms
[2023-03-22T12:09:38Z DEBUG aho_corasick::nfa::noncontiguous] building non-contiguous NFA
[2023-03-22T12:09:39Z DEBUG aho_corasick::nfa::noncontiguous] non-contiguous NFA built, <states: 1533980, size: 98592884>
[2023-03-22T12:09:39Z DEBUG aho_corasick::ahocorasick] forcefully chose contiguous NFA
[2023-03-22T12:09:39Z DEBUG aho_corasick::nfa::contiguous] building contiguous NFA
[2023-03-22T12:09:39Z DEBUG aho_corasick::nfa::contiguous] contiguous NFA built, <states: 1533980, size: 14540940, alphabet len: 203>
automaton build time: 277.501161ms
automaton heap usage: 14540940 bytes

$ RUST_LOG=debug aho-corasick-debug ~/data/benchsuite/texts/wikipedia/titles.1000000 ./LICENSE-MIT --no-search --kind contiguous
pattern read time: 15.391136ms
[2023-03-22T12:09:41Z DEBUG aho_corasick::nfa::noncontiguous] building non-contiguous NFA
[2023-03-22T12:09:44Z DEBUG aho_corasick::nfa::noncontiguous] non-contiguous NFA built, <states: 12861890, size: 828984632>
[2023-03-22T12:09:44Z DEBUG aho_corasick::ahocorasick] forcefully chose contiguous NFA
[2023-03-22T12:09:44Z DEBUG aho_corasick::nfa::contiguous] building contiguous NFA
Error: state identifier overflow: failed to create state ID from 27958724, which exceeds the max of 16777215

Ah drats, so it can't be used for the 1,000,000 patterns automaton here. (There isn't a fixed limit on how many patterns it supports. The relevant metric is how big the automaton itself is. If it gets too big, it exhausts its state ID limit. This is a particularly brutal set of patterns because they contain Unicode, which inhibits alphabet compression optimizations. That in turn has a huge impact on the total size of the automaton.) Maybe I can work on increasing the limit here, but it will result in worse memory usage.
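
For intuition on the alphabet compression point: bytes the automaton never needs to distinguish can share a single equivalence class, so ASCII-heavy pattern sets end up with small alphabets, while UTF-8 encoded Unicode touches many more distinct byte values. Below is a crude approximation of the class count (real byte classes also merge distinct bytes that happen to behave identically), offered only to illustrate the effect.

```rust
// A crude upper bound on the compressed alphabet size: one class per
// distinct byte value that occurs in any pattern, plus one catch-all
// class for every byte that never occurs. Unicode patterns (as UTF-8)
// touch many more byte values, so the alphabet, and with it each
// state's row in the transition table, grows.
fn approx_alphabet_len(patterns: &[&str]) -> usize {
    let mut seen = [false; 256];
    for pat in patterns {
        for &byte in pat.as_bytes() {
            seen[byte as usize] = true;
        }
    }
    let distinct = seen.iter().filter(|&&s| s).count();
    distinct + 1
}
```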

OK fine you say, "I don't mind huge automatons and I don't mind waiting for them!" So let's try the DFA:

$ RUST_LOG=debug aho-corasick-debug ~/data/benchsuite/texts/wikipedia/titles.100000 ./LICENSE-MIT --no-search --kind dfa                        
pattern read time: 3.962522ms
[2023-03-22T12:12:07Z DEBUG aho_corasick::nfa::noncontiguous] building non-contiguous NFA
[2023-03-22T12:12:08Z DEBUG aho_corasick::nfa::noncontiguous] non-contiguous NFA built, <states: 1533980, size: 98592884>
[2023-03-22T12:12:08Z DEBUG aho_corasick::ahocorasick] forcefully chose DFA
[2023-03-22T12:12:08Z DEBUG aho_corasick::dfa] building DFA
[2023-03-22T12:12:09Z DEBUG aho_corasick::dfa] DFA built, <states: 1533980, size: 1574076076, alphabet len: 203, stride: 256>
automaton build time: 2.006162519s
automaton heap usage: 1574076076 bytes

$ RUST_LOG=debug aho-corasick-debug ~/data/benchsuite/texts/wikipedia/titles.1000000 ./LICENSE-MIT --no-search --kind dfa                        
pattern read time: 8.923186ms
[2023-03-22T12:12:16Z DEBUG aho_corasick::nfa::noncontiguous] building non-contiguous NFA
[2023-03-22T12:12:19Z DEBUG aho_corasick::nfa::noncontiguous] non-contiguous NFA built, <states: 12861890, size: 828984632>
[2023-03-22T12:12:19Z DEBUG aho_corasick::ahocorasick] forcefully chose DFA
[2023-03-22T12:12:19Z DEBUG aho_corasick::dfa] building DFA
Error: state identifier overflow: failed to create state ID from 3292643584, which exceeds the max of 2147483646

Things start off okay with 100,000 (but look at that memory usage, 1.5GB). But then once you hit 1,000,000 patterns, things fall over because it's just too damn big. In effect, it needs state identifiers bigger than the 2^31 limit. Oops.

So even if this library did support serialization, you'd be kind of stuck. The only automaton that will even build with 1,000,000 patterns is also the one that wouldn't be able to be serialized anyway. And even if you could serialize it, waiting 3 seconds to build an automaton on the fly for 1,000,000 patterns doesn't sound so bad to me. So why even bother?

You might think regex-automata will save you here, but it sadly probably won't, unless you can constrain your problem in some way. Here's an example to give you an idea, where I compile a DFA using regex-automata, but only using 1,000 patterns:

$ regex-cli debug dense dfa -q -F -f ~/data/benchsuite/texts/wikipedia/titles.1000 -F                
      parse time:  3.750679ms
  translate time:  471.953µs
compile nfa time:  1.035983ms
compile dfa time:  20.623027088s
          memory:  40785684
     pattern len:  1000
      start kind:  Both
    alphabet len:  148
          stride:  256
      has empty?:  false
        is utf8?:  true

(regex-cli is not yet public.)

Look at that. 20 whole seconds to build a DFA with orders of magnitude fewer patterns than what we were able to achieve with Aho-Corasick above. What gives? The problem is Thompson's construction, which generates an absolutely enormous epsilon closure at the start of the automaton. This epsilon closure gets revisited over and over again and slows determinization way down. So in this context, it doesn't even make sense to build a DFA with more patterns unless you're really okay waiting a very very long time.

But, it turns out you can do a little better here if you're willing to smush your patterns into a single regex pattern. So that's pattern1|pattern2|pattern3|...|patternN. This will prevent you from knowing which pattern matched, but it can still tell you where a match occurred, and then since these are just literal strings, you could in theory then do an extra check with a hash table or something to determine which pattern it is:

$ regex-cli debug dense dfa -q -F -f ~/data/benchsuite/texts/wikipedia/titles.1000 -F --combine-patterns
      parse time:  4.201964ms
  translate time:  487.064µs
compile nfa time:  2.601066ms
compile dfa time:  262.823761ms
          memory:  37770304
     pattern len:  1
      start kind:  Both
    alphabet len:  148
          stride:  256
      has empty?:  false
        is utf8?:  true

$ regex-cli debug dense dfa -q -F -f ~/data/benchsuite/texts/wikipedia/titles.10000 -F --combine-patterns
      parse time:  22.872065ms
  translate time:  6.797133ms
compile nfa time:  26.163694ms
compile dfa time:  3.224680719s
          memory:  338803120
     pattern len:  1
      start kind:  Both
    alphabet len:  184
          stride:  256
      has empty?:  false
        is utf8?:  true

$ regex-cli debug dense dfa -q -F -f ~/data/benchsuite/texts/wikipedia/titles.100000 -F --combine-patterns
      parse time:  219.561571ms
  translate time:  67.559931ms
compile nfa time:  262.167931ms
compile dfa time:  32.221479207s
          memory:  2708891732
     pattern len:  1
      start kind:  Both
    alphabet len:  204
          stride:  256
      has empty?:  false
        is utf8?:  true

$ regex-cli debug dense dfa -q -F -f ~/data/benchsuite/texts/wikipedia/titles.1000000 -F --combine-patterns
failed to compile dense DFA: number of DFA states exceeds limit of 2147483647

Welp, so even if you're willing to wait, regex-automata can't help you with 1,000,000 patterns. It's just too big.
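
The combine-patterns trick above is independent of the regex engine. Here is a minimal sketch of the idea using the higher-level regex crate rather than a pre-built regex-automata DFA, with the hash-table step recovering which literal matched; the pattern list and haystack are made up for the example.

```rust
use std::collections::HashSet;

use regex::Regex;

fn main() {
    // Hypothetical literal patterns; in practice they'd be read from a file.
    let patterns = vec!["foo", "barbaz", "quux"];

    // Smush everything into one alternation. `regex::escape` matters
    // because the patterns are literal strings, not regex syntax.
    let combined = patterns
        .iter()
        .map(|p| regex::escape(p))
        .collect::<Vec<_>>()
        .join("|");
    let re = Regex::new(&combined).unwrap();

    // A single combined pattern reports *where* a match is, not *which*
    // pattern it was; since the matched text is exactly one of the
    // literals, a set lookup recovers its identity afterwards.
    let known: HashSet<&str> = patterns.iter().copied().collect();
    let haystack = "some text with barbaz in it";
    if let Some(m) = re.find(haystack) {
        assert!(known.contains(m.as_str()));
        println!("matched {:?} at {}..{}", m.as_str(), m.start(), m.end());
    }
}
```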

plusvic commented 1 year ago

Thank you for the detailed response. Now I have a better picture of how this crate works internally, probably re-building the automaton is good enough for my use case. I'll try to experiment a little bit with real data and figure out how big my automatons are going to be.

To be honest, I wasn't thinking about zero-copy deserialization, as I was already guessing that it would be hard to implement and maintain. But given how fast the automatons are built, I now doubt that deserialization could be faster than re-building the automaton, unless we aim for zero-copy deserialization.

BurntSushi commented 1 year ago

Aye. I will say though, that if one didn't aim for zero-copy but something else, it would still likely be faster than building the automaton from scratch. As long as it's done thoughtfully and you're not parsing each transition into a JSON object or something. But yeah, not totally clear it's worth it. It might be worth it for the DFA, which can take quite some time to build. But when you get long DFA compilation times, you're usually in the space of "the DFA is gigabytes big." It's not clear how useful that particular scenario is.
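
As an illustration of "thoughtful but not zero-copy": a hand-rolled format for a flat u32 table can be as simple as a length prefix plus raw little-endian entries. This is a hypothetical sketch, not a proposal for this crate's format; in practice you would wrap the reader and writer in buffered I/O and version the header.

```rust
use std::io::{self, Read, Write};

// A hypothetical wire format for a flat `Vec<u32>` transition table: a
// little-endian length prefix followed by the raw entries. Loading is
// not zero-copy (the table is re-allocated), but there is no
// per-transition parsing beyond a fixed-width integer decode.
fn write_table<W: Write>(mut wtr: W, table: &[u32]) -> io::Result<()> {
    wtr.write_all(&(table.len() as u64).to_le_bytes())?;
    for &entry in table {
        wtr.write_all(&entry.to_le_bytes())?;
    }
    Ok(())
}

fn read_table<R: Read>(mut rdr: R) -> io::Result<Vec<u32>> {
    let mut len = [0u8; 8];
    rdr.read_exact(&mut len)?;
    let len = u64::from_le_bytes(len) as usize;

    let mut table = Vec::with_capacity(len);
    let mut buf = [0u8; 4];
    for _ in 0..len {
        rdr.read_exact(&mut buf)?;
        table.push(u32::from_le_bytes(buf));
    }
    Ok(table)
}
```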

BurntSushi commented 1 year ago

Note that in #106, I've changed the contiguous NFA to increase the somewhat restricted limit on its total size. Its memory efficiency got a little worse, but now I can build a contiguous NFA from ~15,700,000 Wikipedia titles in about 20 seconds (using 2GB of heap). Previously, I wasn't even able to build one from 1,000,000 Wikipedia titles.

I realize this isn't directly related to serialization, but I just wanted to correct the record above.

plusvic commented 1 year ago

This is a bit off topic, and you may already know about representing the Aho-Corasick automaton as an interleaved state-transition matrix, but I think you may find this article interesting: http://goo.gl/lE6zG. That's the technique I used in the C implementation of YARA, the project that I'm now moving to Rust.

This comment also explains the technique briefly: https://github.com/VirusTotal/yara/blob/fa5682a98d9421dfd7a1c35370fb7ff2f89ee7ed/libyara/ahocorasick.c#L469-L524

BurntSushi commented 1 year ago

@plusvic Yup! I remember reading that a long time ago. I don't remember why I didn't try it earlier, but the contiguous NFA uses more "classic" finite state machine compression tricks, a lot of which I learned as part of my work on the fst crate. The trick is balancing compression with the extra overhead each thing requires at search time. One thing that sticks out to me in your link is this:

// Each 32-bit slot in the transition table has 23 bits for storing the index
// of the target state and 9 bits for storing the offset of the slot relative
// to its own state.

Which is kind of the thing I just fixed in the contiguous NFA.
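
For reference, packing and unpacking a slot in that scheme might look like the rough sketch below. The 23/9 split comes straight from the quoted comment; it is not how this crate lays things out.

```rust
// A sketch of the interleaved-slot encoding quoted above: each 32-bit
// slot packs a 23-bit target state index and a 9-bit offset of the slot
// relative to its own state.
const STATE_BITS: u32 = 23;
const OFFSET_BITS: u32 = 9;

fn pack(target_state: u32, offset: u32) -> u32 {
    debug_assert!(target_state < (1 << STATE_BITS));
    debug_assert!(offset < (1 << OFFSET_BITS));
    (offset << STATE_BITS) | target_state
}

fn unpack(slot: u32) -> (u32, u32) {
    let target_state = slot & ((1 << STATE_BITS) - 1);
    let offset = slot >> STATE_BITS;
    (target_state, offset)
}
```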

It would be interesting to bake off that approach against my approach in the contiguous NFA. And I think the new crate structure (on master) makes it much easier to add new Aho-Corasick implementations.