BurntSushi / aho-corasick

A fast implementation of Aho-Corasick in Rust.
The Unlicense
1.03k stars 93 forks source link

big simplification to the leftmost match NFA construction #92

Closed teo-tsirpanis closed 1 year ago

teo-tsirpanis commented 2 years ago

Hello @BurntSushi, I am trying to understand how the NFA states are constructed for leftmost match searching, and I cannot understand these lines from QueuedState<S>::next_match_at_depth.

https://github.com/BurntSushi/aho-corasick/blob/f8197afced3df41c47d05c9bbf8f84ddd197efdb/src/nfa.rs#L979-L981

We are in a trie, so isn't the length of the longest match of a state equal to the state's depth? Won't depth be always equal to 1? Thanks.

BurntSushi commented 2 years ago

Empirically, I believe you're right. And if so, it leads to a pretty huge simplification: #93.

I haven't had time to think through why this is okay and why I made (what appears to be) this conceptual mistake.

BurntSushi commented 1 year ago

So I finally had a chance to look into this and I've convinced myself that not only are you right, but this unraveled into quite a large simplification. I've convinced myself that the only thing we need to care about for leftmost semantics is that we don't insert all failure transitions that appear after a match state. I had previously thought that we needed to keep some failure transitions after a match state. But if you think about it, that doesn't make any sense because any time you follow a failure transition, you end up at a proper suffix of what you've seen so far. And that would in turn imply looking for a match that appear after a match you had already seen, and thus it would not be the leftmost match.

Previously I was doing depth calculations and what not and trying to keep some failure transitions. But as you point out, the depths always cancelled out and this effectively constant folded the code to being equivalent to the simplification above: when we're visiting a match state during failure transition insertion, we simply set its failure transition to the dead state and then move on. This in turn results in all subsequent states getting their failure transitions automatically set to the dead state as well. (Because the failure transition logic always starts by considering the failure transition of the previous state. When that previous state is a match state, since its failure transition is a dead state, this automatically and immediately propagates.)

I spent quite a bit of time trying hard to come up with a test case that would fail here, on the off chance my original idea of keeping some subset of failure transitions after a match state was indeed correct. (Of course, the code didn't implement that because of the issue you pointed out. But perhaps the code should implement that concept.) Alas, I could not. And I could not find any conceptual holes in the "drop all failure transitions after a match state" idea.

So anyway, thank you so much. This is a great simplification in what was one of the more complex areas of this crate!