BurntSushi / aho-corasick

A fast implementation of Aho-Corasick in Rust.
The Unlicense
1.03k stars 93 forks source link

Add interleaved NFA implementation #110

Closed vthib closed 1 year ago

vthib commented 1 year ago

This MR brings the implementation of another "NFA but optimized" automaton. This ones is greatly inspired from https://github.com/mischasan/aho-corasick (http://goo.gl/lE6zG), which i've seen being discussed not too long ago.

My use-case comes from https://github.com/vthib/boreal, and is mostly the same as the one from @plusvic (who might be interested in this MR for Yara-X as well): extract a bunch of literals from strings that are searched, and use a Aho-Corasick to match on those literals on a single pass against files that can tend to be quite big.

Using this awesome crate got me quite far, but investigating performances issues that I had when the number of literals gets big, I tried to implement this new NFA version to see if it brings good improvements.

Turns out, it does! When the number of literals gets quite big (around 5k on my computer, but I suppose this will depend on the processor's caches), this interleaved implementation is:

A few examples that I have in my own benches are as follows:

13155 non-unique literals between 2 to 4 bytes long, against a 56MB file:

31831 non-unique literals between 2 to 4 bytes long, against a 56MB file:

This however can vary quite a bit. When comparing performances on a 150MB file with the same literals with aho-corasick-debug I have the DFA faster that the interleaved implementation, but when doing the exact same scanning, but inside my own project, performances are relatively similar. I haven't profiled much to understand what is going on exactly.

When the number of literals is not that high however, it tends to do much worst than the DFA, and be very similar to the contiguous NFA implementation, with a smaller memory footprint, but generally longer build times.

The other drawback is the same as the one for the contiguous implementation, limitation on the number of states that can be used. The max state id is 2**23-1, but given how the automaton size manages to stay quite small, this may still work for a quite large number of patterns.

Let me know if you are interested in this, I do not know if you are interested in adding more implementations. I think the performances are very interesting and would be very useful in some usecases such as mine, where i'll probably go for a DFA when the number of patterns is small, then use an interleaved NFA when it gets bigger. The very small memory footprint could also be useful for some people that may care more about a limited memory footprint than a as fast as possible scan time.

There may be adjustments to make. I've used a HashSet in the implementation which i'm not sure are desirable or not, and there may be a few other things to improve. I don't think I perfectly understood how the "NFA::FAIL" state works, and there should be improvements to make related to it.

I've tried to document as much as possible how it works, implementation is definitely not trivial and i've got quite a few difficulties when trying to understand how the original C implementation works. Making it fit the ordering of states for fast "is_special" or "is_match" calls was also non trivial, as it didn't really fit the design of the original paper.

I've also added a bench to show a bit how this implementation shines. All of the other benches tend to work on rather small numbers of patterns, where this implementation is often much worse than others, apart from the automaton size.

Also, i'd like to thank you for your work on this. I've been able to use this crate for quite some time and benefit from its nice design and performances without any work from my part. But getting into its codebase to see if I could add the interleaved design, I was amazed by the quality of the code and the documentation throughout the whole codebase. I don't think it's an exageration to say this is probably the most well documented and cleanest code i've ever read! Understanding how the crate works, adding a new implementation, testing and debugging it was made very easy. So thank you :)

BurntSushi commented 1 year ago

Haven't looked at the code yet, but could you provide instructions for reproducing the benchmarks in your PR comment? Thanks!

And this looks like very interesting work. I'll need to take a closer look myself. The automaton sizes in particular look very impressive, so I would like to better understand what's going on there.

vthib commented 1 year ago

Here are the literals I used for the benchmarks i quoted:

literals1.txt literals2.txt

I used aho-corasick-debug with standard match kind to get the build time, scan times and automaton size, but it needs a patch to be able to use patterns with non printable bytes:

0001-use-hex-dictionary-in-aho-corasick-debug.patch

Finally, the input file i used is a Firefox Installer: https://download-installer.cdn.mozilla.net/pub/firefox/releases/107.0/win64/en-US/Firefox Setup 107.0.msi. A bigger file I also used that i mention has DFA be quicker than the interleaved implementation is the behemoth 157M interactive_ui_tests.exe from chrome.

Those all come from boreal benchmarks. literals1.txt are literals extracted from the rules of the icewater repository, and literals2.txt are literals extracted from the rules of the signature-base repository.

BurntSushi commented 1 year ago

Okay, so I finally had a chance to look at this. I ran my own tests. The patterns are a random subset of Wikipedia article titles and the haystack is a huge collection of various English subtitles. I can provide both if necessary. My first test is on 100,000 patterns:

$ aho-corasick-debug ~/data/benchsuite/texts/wikipedia/titles.100000 $medium --match-kind leftmost-longest --kind interleaved 
pattern read time: 4.2466ms
automaton build time: 443.273059ms
automaton heap usage: 16309232 bytes
match count: 111401
count time: 721.26426ms

$ aho-corasick-debug ~/data/benchsuite/texts/wikipedia/titles.100000 $medium --match-kind leftmost-longest --kind contiguous 
pattern read time: 4.132108ms
automaton build time: 263.59873ms
automaton heap usage: 20886480 bytes
match count: 111401
count time: 869.521711ms

So the interleaved NFA is a little better in terms of search a decent chunk better in terms of memory usage. So the question for me, should we just drop the contiguous NFA (since 1.0 isn't out yet) and use the interleaved NFA instead? Well, part of my recent work on the contiguous NFA is making it can handle a very large number of patterns. So let's bump it up to 1,000,000:

$ aho-corasick-debug ~/data/benchsuite/texts/wikipedia/titles.1000000 $medium --match-kind leftmost-longest --kind interleaved                                   
pattern read time: 15.803934ms
Error: state identifier overflow: failed to create state ID from 26133178, which exceeds the max of 8388607

$ aho-corasick-debug ~/data/benchsuite/texts/wikipedia/titles.1000000 $medium --match-kind leftmost-longest --kind contiguous 
pattern read time: 11.211775ms
automaton build time: 2.917416497s
automaton heap usage: 168274740 bytes
match count: 2344317
count time: 1.147023733s

This to me is the unfortunate part. Prior to a few weeks ago, the contiguous NFA also couldn't build an automaton with this particular set of patterns. Indeed, if we use that older implementation to build the original 100,000 patterns:

$ aho-corasick-debug_2023-03-22-before-contiguous-changes ~/data/benchsuite/texts/wikipedia/titles.100000 $medium --match-kind leftmost-longest --kind contiguous 
pattern read time: 4.404141ms
automaton build time: 258.515821ms
automaton heap usage: 14535888 bytes
match count: 111401
count time: 897.321019ms

Then we see that its memory usage is actually even better than the interleaved NFA. There is actually a relationship between memory usage, the total automaton size supported and even search time. All three have to be balanced. I ended up making the choice to increase memory usage in exchange for supporting a larger automaton size. Otherwise, if we limit ourselves to a smaller capacity, the contiguous NFA can be even more memory efficient than the interleaved NFA.

So that at least answers, for me, the question of whether to replace the contiguous NFA with the interleaved NFA. That leaves us with the question of whether to add the interleaved NFA, which would bring us to 3 total. I actually think the answer to that is no, because the interleaved NFA doesn't represent enough of an improvement IMO to justify its existence. There just is not really a clear story I can put in the docs (or even in the code heuristics) for saying when one should choose the interleaved NFA versus the contiguous NFA. And even if we could, the benefit is not really that big. It's big enough to potentially justify replacing the contiguous NFA, but only if it didn't have other costs.

One thing that would be really nice is if the contiguous NFA's search routine could be better optimized. I suspect the reason why the interleaved NFA does better here is because there is just less to do in order to compute the next state. The contiguous NFA has to deal with more stuff, and I suspect the code generator is not handling it gracefully. But I haven't been able to really get to the bottom of that yet.

I'm going to close this out for now. Thank you for working on this! It really is an interesting point in the Aho-Corasick design space!

vthib commented 1 year ago

Thanks for the response. Would it be possible to get your benchmarks data? It would be interested in reproducing this and trying to understand how it differs from my benchmarks cases. I thought that even though the max sid is lower for the interleaved version, the interleaving would have made this limit quite hard to reach in practice.

I also wonder how much the differences between your tests and mine impact on the performances of both nfa implementations. From what I can tell:

BurntSushi commented 1 year ago

For the patterns:

The last one contains "all" of the titles. But it was extract some number of years ago from Wikipedia's XML dump IIRC. It contains almost 16,000,000 titles. The contiguous NFA has enough capacity to deal with even that (although it takes quite some time to build):

$ aho-corasick-debug ~/data/benchsuite/texts/wikipedia/titles $medium --match-kind leftmost-longest --kind contiguous
pattern read time: 121.585931ms
automaton build time: 20.484044217s
automaton heap usage: 1954263968 bytes
match count: 24383713
count time: 1.96672193s

The haystack I used is OpenSubtitles2018.raw.sample.medium.en.

The original paper for the interleaved implementation mention optimizing for tails of patterns, basically extracting them out of the automaton. My initial intuition about this is that there are diminishing returns in putting the whole patterns in the automaton, regardless of the final size. While starting states are extremely dense, some tails can be quite long and unique, and matching those unique tails directly seems (again, intuitively) much more efficient. I haven't tested it thoroughly however, this seems like an interesting question to investigate.

Yes, the contiguous NFA specifically optimizes tails:

https://github.com/BurntSushi/aho-corasick/blob/beb163a62eaa5a4d1cf71054451371668e1e7654/src/nfa/contiguous.rs#L420-L431

Changing how the tails are matched is an interesting idea, but I suspect it would probably require changing the search routine itself. That would be best experimenting with outside of the crate.

I want to be clear that I took your measurements into account when making a decision too. Remember, there are two critical points here:

Note that I am going to have very limited bandwidth for digging into the details here. I am trying to get aho-corasick 1.0 out the door, and then get a very large release of the regex crate out that I've been working on for years. It will probably be quite some time before I can realistically swing back around to this crate to do any meaningful work.

vthib commented 1 year ago

Thanks for the data, i'll try to play with those a bit.

I completely understand your decision. I do think it's the right call, even in my tests the interleave implementation is not entirely convincing, I do have some great performance improvements with it in some cases, but not at all in other cases and deciding when to use it is not clear. I definitely think more work and investigation is needed to understand where its strength lies, and replacing it or adding it would not be a right call just before a 1.0.

Thanks a lot for all your work, i'm looking forward to the 1.0 and the next regex release. I've also been using the regex-automata 0.2 release recently while trying to find the right solution to some of my performance issues. I'll hope to be able to discuss those with you at some point, but I still have a lot of profiling to do.

BurntSushi commented 1 year ago

Yeah I'd be happy to help when I can. I'm going to be publishing a regex benchmark in the coming months (maybe within a month? I dunno, I still have a lot of writing to do), and I'd encourage you to startup a Discussion there (which will be enabled). In that context, we can look at your problem more holistically.

You could also open a Discussion on the regex repo any time if you want to chat sooner.

BurntSushi commented 1 year ago

Also, if you really want to go off the deep end, I might suggest looking into Hyperscan, and specifically, its "FDR" algorithm. It's described a little bit here: https://www.usenix.org/system/files/nsdi19-wang-xiang.pdf

The implementation is also open source: https://github.com/intel/hyperscan

... but good luck penetrating the code. I did it once to extract Teddy and the fruits gained were very much worth it. In some ad hoc testing, I very much suspect that FDR is worth quite a bit more fruit too. Teddy works well for a small number of literals where as FDR does well with a lot. And it scales better than Aho-Corasick. Whether it works better for your specific use cases or not, I'm not sure.

I actually do want to bring FDR into this crate some day, assuming it can be made to work within this crate's context. For example, by providing leftmost-first match semantics.