Explore using Aho-Corasick in Regex's FindFirstChar

stephentoub commented 2 years ago

Given an alternation like "abc|def|ghi", we will currently do an IndexOfAny to find the next position for a possible match. At that point, we try to avoid false positives by doing a few set lookups, e.g. we IndexOfAny('a', 'd', 'g'), and then we test whether the next character is in the set [beh]. But specifically for expressions that are small alternations, or that begin with small alternations, or that begin with constructs we can convert to small alternations, we could instead use Aho-Corasick to actually validate whether any of the strings in the alternation are a prefix of the text at the current location. Rust uses a vectorized implementation named Teddy.

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions See info in area-owners.md if you want to be subscribed.

Issue Details

Given an alternation like "abc|def|ghi", we will currently do an IndexOfAny to find the next position for a possible match. At that point, we try to avoid false positives by doing a few set lookups, e.g. we IndexOfAny('a', 'd', 'g'), and then we test whether the next character is in the set [beh]. But specifically for expressions that are small alternations, or that begin with small alternations, or that begin with constructs we can convert to small alternations, we could instead use Aho-Corasick to actually validate whether any of the strings in the alternation are a prefix of the text at the current location. Rust uses a vectorized implementation named [Teddy](https://lib.rs/crates/teddy).

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`, `tenet-performance`
Milestone:	7.0.0

danmoseley commented 2 years ago

@stephentoub I remember we discussed whether a string.IndexOfAny(string[] ...) would make sense, rather than being regex internal. If I recall correctly, a concern is that the interesting algorithms have some setup cost, such as a table or fingerprint, which might show up as a perf trap if the text is quite short. (They may also benefit from 'policy' such as a table of probable character frequencies which I suppose could possibly show up as a perf deoptimization depending on text.). Possible ways to address this could include (1) different paths depending on pattern/text size (2) accepting multiple texts to amortize the setup cost, or (3) possibly even an API that generates a reusable blob representing the setup (likely not on string). (1) seems appealing to me. One could imagine cases where such an API could be faster even over a single text, than the simple scan that string.IndexOf(string, ..) does. I'm guessing this API could have fairly wide use, beyond ourselves, and could be a great place where we could add value (eg, vectorization) to speed up a common pattern. Do you have any other thoughts? I'm inclined to create an issue to start a discussion as I don't see an existing one.

stephentoub commented 2 years ago

If I recall correctly, a concern is that the interesting algorithms have some setup cost, such as a table or fingerprint, which might show up as a perf trap if the text is quite short.

Yes.

Do you have any other thoughts? I'm inclined to create an issue to start a discussion as I don't see an existing one.

I'd rather focus on getting something working and validated inside of Regex. Regex can do all the precomputation and optimize the provided expression with whatever mechanism it can throw at it. Creating a custom API for a specific pattern of pattern would necessitate data showing it's so common on such hot paths that any additional overhead associated with Regex is prohibitory.

danmoseley commented 2 years ago

@teo-tsirpanis if you're looking for a meaty project that could give us significant regex perf improvements in 7.0 -- perhaps you're interested in creating a vectorized implementation of Aho-Corasick. Teddy is apparently permissively licensed, so it could be a starting point. We would like to do this, but it doesn't fit on the schedule right now.

teo-tsirpanis commented 2 years ago

Thanks for letting me know @danmoseley. It is indeed a very interesting project, but I am pretty busy these days with pressing obligations, included my university classes that started this week and the work on my thesis.

Speaking of this, I had the thought to make this my thesis' subject. I talked with some professors (that was the reason for my delay) and were willing to accept it, but I was warned that "this is much more than an undergraduate thesis, maybe even a postgraduate thesis" and that "it's not something that will finish in 1-2 months" (I linked your message but forgot to explicitly tell them about Teddy though).

To better understand the magnitude of this undertaking, a more detailed list of what needs to be done, as well as a more informed guess of the required time by adding a Cost: label would be very helpful for me. If it turns out to be too much work, I will start studying the algorithm and the Regex codebase on a casual basis until my schedule gets unburdened (perhaps after the 7.0 snap) and when that time comes, I would be more than glad to take a more serious look into this.

danmoseley commented 2 years ago

Yes to be clear, no pressure whatsoever, just flagging as maybe interesting to you or someone. Hmm, it's a significant project but I do not think this should be huge work. For the most part it would be isolated from Regex: it would be implementing an (internal) API essentially like String.IndexOf(string, string, string...) using an already understood algorithm, with a reference implementation. That can be worked on and tested in isolation. The tricky bit will be finding the best way to express it in terms of .NET's vectorization API's as well as performance measurements and tuning. Such changes often need iteration to try different approaches. Once all that is done, it will need wiring up for Regex to call it but that should be relatively localized work and we would give pointers.

@stephentoub thoughts?

stephentoub commented 2 years ago

Once all that is done, it will need wiring up for Regex to call it but that should be relatively localized work and we would give pointers.

Yes...ish. Source generation is a wrinkle. For that, we'd either need to emit the code for the algorithm, which I don't think we want to do, or we would need to expose some public API the generated code could use. And while we could leave such a thing to just be for the non-source generated regexes, that would complicate the story for when to pick source generation over not.

I think the actual story would need to be some sort of API for it, which we'd then use from all the regex engines.

But, just getting to the point where we could test and demonstrate the potential would be valuable. Figuring out the API from there shouldn't be a big deal. It would likely be something where you pass to a factory all the target strings, and then the resulting object would provide a search method that took a span, returned an index, and potentially also provided the target string (or its id) that was found.

danmoseley commented 2 years ago

That makes sense -- thinking of this as an isolated experiment to implement a high performance String.IndexOf(string, string, string...). Once that is proven to work well, we can stop and choose next steps. It may be that we simply add such an API for the benefit of all.

danmoseley commented 2 years ago

(obviously not that signature)

danmoseley commented 2 years ago

Making this concrete in discussion with @stephentoub we would this shape:

internal class MultiStringSearcher
{
    public MultiStringSearcher(string[] strings);
    public (int Index, int StringNumber) Find(ReadOnlySpan<char> text);
}

Our first implementation would be to use Aho-Corasick with the whole DFA precomputed (looks like #45697 was a start). There would be no grow-up NFA mode: we would presumably fall back to naive search if there are "too many" search strings. We assume that this would be beneficial enough to Regex to stand alone.

Teddy is a natural next step cutting over from Aho-Corasick when hardware supports and there are relatively few search strings (Rust regex seems to cap at 32). As a detail, the Rust Teddy crate itself uses Rabin-Karp for very short strings or (I guess) leftovers.

Note, great credit is due to @burntsushi and others for their work on the impressive Rust crates which help us chart our basic path here.

BurntSushi commented 2 years ago

Since I was pinged, I'll share some very high level thoughts. Please take these with a heap of salt because I have approximately zero familiarity with .NET's regex engine or constraints. :-)

Rust's regex crate Teddy implementation is part of the aho-corasick crate. The teddy crate itself is not used.
Remember that I am not the author of Teddy. I pulled it out of the Hyperscan project (which is where it was born). Its original author is Geoffrey Langdale.
In terms of various constants I used (like when exactly to switch over from Teddy to Aho-Corasick or when to use Rabin-Karp), I'd definitely recommend doing your own litigation there and finding the thresholds that work for you. I do recommend having a diverse benchmark set. It is really easy to get stuck in local optima with things like Teddy for specific benchmarks only to see that speed has regressed quite significantly on some other benchmark.
It seems quite plausible to me that Aho-Corasick's NFA formulation would be useful to you as well. The DFA formulation will be faster, but also eat up quite a bit of memory. The NFA is still likely to be faster than any non-DFA regex engine though.
Beware that an implementation based on the textbook description of Aho-Corasick will report all possible matches, and those matches will not necessarily be consistent with what would be reported by a backtracking regex engine. For example, consider the regex samwise|sam against the haystack samwise. A backtracking regex engine will report samwise as the match, but Aho-Corasick's standard formulation will report sam (or even sam in addition to samwise if you permit overlapping matches). You can actually change the construction of the Aho-Corasick automaton itself to report matches that are consistent with how a backtracking regex engine will report them. See aho-corasick::MatchKind for more details. The source has more details in terms of implementation. (Perl also uses Aho-Corasick, although I've been unable to make heads or tails of the code.)
The above commentary about how Aho-Corasick reports matches is also relevant for Teddy.

That's all I've got for now. Happy to answer questions!

teo-tsirpanis commented 2 years ago

I've started working on this. I will first create a separate project that implements the Teddy algorithm using the cross-platform vector intrinsics and fallbacks to Aho-Corasick. Once benchmarks show an improvement over Regexes, we will look at incorporating the changes.

danmoseley commented 2 years ago

@teo-tsirpanis sounds great. For the sake of maximizing the chance of success and making digestible PR’s I wonder whether it would make sense to put up a change with just A-C first, which I expect would be fast enough to stand alone.

A general purpose API is proposed above. If it’s to satisfy the needs of regex it will presumably need to return leftmost match (as BurntSushi alludes to above; you’ll need to figure out from the Rust create how he achieves this leftmost match behavior since it’s not the textbook A-C behavior.) Presumably leftmost match is a reasonable behavior for a general purpose API.

Also, regex (at least as currently exists) only needs the start index of the leftmost match. It doesn’t need the matched string, and it certainly doesn’t need any other strings (overlapping) matched at that position. Given that, a general purpose API might best only return the index, (the sketch API above suggested both index and match). If I understand A-C correctly, this means building it can be quicker/ simpler - the automaton will not need “suffix links” (aka “dictionary links”) only the “error links”.

So the ordering I’m imagining would be something like

Implement API like above, but only returning index (leftmost index).
demonstrate it is fast enough for some interesting subset of inputs (where the construction time is amortized such that it is significantly better than looping over IndexOf(string) as one must do today) to be worth adding.
Propose the API and get it approved. Merge the new API and implementation.
if time allows, now update regex to take advantage. Demonstrate it’s an improvement. Merge that.
If time and interest allows, continue to improve the implementation behind this API, either by experimenting with DFA vs NFA, with Teddy, or whatever.

Getting the public API in first is a nice sub step but it also avoids the need to emit the implementation from the regex source generator, which can only call public API.

danmoseley commented 2 years ago

(Of course feel free to disagree with suggestion above if you think there’s a better ordering that could still avoid one big PR)

teo-tsirpanis commented 2 years ago

Yes, starting with AC and leaving Teddy for later is a good idea.

If it’s to satisfy the needs of regex it will presumably need to return leftmost match (as BurntSushi alludes to above; you’ll need to figure out from the Rust create how he achieves this leftmost match behavior since it’s not the textbook A-C behavior.)

My implementation remembers the match's location upon encountering one and keeps looping over the characters, accepting subsequent matches only if they occur before the last we accepted. If we return to the root node and have seen a match, we stop the loop and return it. The Rust implementation is doing this by modifying the trie construction; I will investigate if it's worth it.

Given that, a general purpose API might best only return the index, (the sketch API above suggested both index and match).

I think that in a general purpose API returning the matched string is useful; it will be included in my proposal which will be filed soon.

I also ran some benchmarks. I took a random Project Gutenberg book and counted the first $N \in {10, 100, 500, 1000}$ most common English words using interpreted and compiled regexes, and Aho-Corasick. The results are promising:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1706 (21H2)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100-preview.5.22267.11
  [Host]     : .NET 7.0.0 (7.0.22.26611), X64 RyuJIT
  DefaultJob : .NET 7.0.0 (7.0.22.26611), X64 RyuJIT

Method	WordCount	Mean	Error	StdDev	Median	Ratio	RatioSD
CountWordsRegex	10	4,043.2 us	42.82 us	40.05 us	4,048.1 us	1.00	0.00
CountWordsCompiledRegex	10	843.0 us	15.47 us	14.47 us	838.6 us	0.21	0.00
CountWordsRegal	10	2,158.4 us	42.78 us	109.67 us	2,109.3 us	0.53	0.03

CountWordsRegex	100	22,434.8 us	443.59 us	865.19 us	22,075.0 us	1.00	0.00
CountWordsCompiledRegex	100	4,408.6 us	86.45 us	146.80 us	4,350.1 us	0.20	0.01
CountWordsRegal	100	3,042.6 us	57.39 us	110.57 us	3,044.9 us	0.14	0.01

CountWordsRegex	500	25,349.7 us	460.47 us	384.51 us	25,325.1 us	1.00	0.00
CountWordsCompiledRegex	500	8,939.9 us	108.08 us	101.10 us	8,928.6 us	0.35	0.01
CountWordsRegal	500	3,526.6 us	70.30 us	65.75 us	3,508.4 us	0.14	0.00

CountWordsRegex	1000	24,369.6 us	167.28 us	130.60 us	24,360.5 us	1.00	0.00
CountWordsCompiledRegex	1000	53,773.9 us	768.32 us	681.10 us	53,771.9 us	2.21	0.03
CountWordsRegal	1000	3,598.2 us	67.88 us	63.49 us	3,587.1 us	0.15	0.00

BurntSushi commented 2 years ago

My implementation remembers the match's location upon encountering one and keeps looping over the characters, accepting subsequent matches only if they occur before the last we accepted. If we return to the root node and have seen a match, we stop the loop and return it. The Rust implementation is doing this by modifying the trie construction; I will investigate if it's worth it.

How does it handle the case when you have the patterns [oob, foobar] and search against foobar? In that case, oob is found as a match before foobar is, but foobar is the correct "leftmost longest" match. So you have to know to keep looking in cases like that. Maybe it works because you aren't returning to the root node in this case?

Another thing to consider is whether you plan on building the DFA variant of AC. In that case, you can't really reason about root nodes and failure transitions at search time.

teo-tsirpanis commented 2 years ago

Maybe it works because you aren't returning to the root node in this case?

Yes, exactly, it returns if it reaches the root or input ends. We can reconsider this once we come at implementing the DFA variant.

joperezr commented 2 years ago

Even though this is being worked on now, it is not tracked as a must-have for 7.0, so changing the milestone accordingly.

steveharter commented 1 year ago

Moving to 9.0

stephentoub commented 10 months ago

superceded by https://github.com/dotnet/runtime/issues/85693

dotnet / runtime

Explore using Aho-Corasick in Regex's FindFirstChar #62447