Add Subset Regex to Benchmarks

ibraheemdev commented 9 months ago

From Microsoft's paper on their new derivative-based regex engine (RegexOptions.NonBacktracking):

Incidentally, the result for Rust further highlights the importance of avoiding outliers, as it would be a clear winner if not for the pattern [a-q][ˆu-z]{13}x. The crux of the pattern is that [a-q] denotes a subset of [ˆu-z] much like in the classical example a.{13}. This pattern is challenging for DFA based engines, as the loop can be entered multiple times, leading to a 2^13 factor in number of states. Optimizations in NonBacktracking have made this pattern no longer visible as an outlier.

It might be worthwhile including this case in the benchmarks.

BurntSushi commented 9 months ago

Please link the paper.

BurntSushi commented 9 months ago

That benchmark is in this repo. But it's not part of the curated set. Please see contributing guidelines for proposing a change to the curated set.

ibraheemdev commented 9 months ago

The paper is here. If it's already part of the repo, I'm not personally sure of whether it should be part of the curated set, I just thought it was an interesting case that might be worth testing.

BurntSushi commented 9 months ago

All righty, now that I have hands on a keyboard, I took a look at this. I defined a new set of benchmarks based on this pattern and varied it a bit. I was in particular interested to see how perf change depending on two factors: whether Unicode mode was enabled and the size of the bounded repeat. Unicode matters here because it changes the meaning of [^u-z].

The first thing I did was compare the current release of the regex crate (1.10.2) with an older version (1.7.3). The older version is likely what was benchmarked in the paper, and it reflect the status quo of the regex crate before this year and before I effectively rewrote the entire crate.

$ rebar cmp tmp/results.csv -e rust/regex
benchmark                                   rust/regex          rust/regexold
---------                                   ----------          -------------
reported/i13-subset-regex/original-ascii    5.8 GB/s (1.00x)    442.4 MB/s (13.33x)
reported/i13-subset-regex/original-unicode  331.4 MB/s (1.00x)  15.9 MB/s (20.83x)
reported/i13-subset-regex/big-ascii         91.1 MB/s (1.00x)   9.5 MB/s (9.60x)
reported/i13-subset-regex/big-unicode       34.9 MB/s (1.00x)   13.9 MB/s (2.52x)
reported/i13-subset-regex/huge-ascii        15.6 MB/s (1.00x)   8.4 MB/s (1.86x)
reported/i13-subset-regex/huge-unicode      15.6 MB/s (1.00x)   12.4 MB/s (1.26x)

The patterns used are as follows:

*/original-*: [a-q][^u-z]{13}x
*/big-*: [a-q][^u-z]{23}x
*/huge-*: [a-q][^u-z]{80}x

Here, we can see that the latest version of the regex crate is universally faster. In some cases by a fair margin. But once you get up to the higher bounded repeat, perf levels out a bit and there isn't much difference. The reason for the difference, in very broad strokes, is:

It looks like the regex crate is now a bit smarter about the reverse suffix optimization, which gets triggered in the original-ascii case. That is, it looks for x and then does a reverse scan to look for a full match. Do note that sometimes the optimization fails. It has guard-rails built into it to avoid quadratic behavior. When this happens, it falls back to a standard forward scan of the lazy DFA. The lazy DFA is pretty fast, so this overall works well and stays fast.
The original-unicode case is like the original-ascii case. Except, when it falls back to the lazy DFA, the lazy DFA eventually gives up entirely and that in turn falls back to the PikeVM.
The big-ascii is like the original-unicode case. It attempts the reverse suffix optimization but that fails occasionally, and since the lazy DFA's cache is filled up quickly, it falls back to the PikeVM. Presumably this happens more quickly because of the bigger bounded repeat (despite the fact that ASCII is used).
The big-unicode, huge-ascii and huge-unicode benchmarks are more of the same. As the regex grows in size, the lazy DFA quits earlier and earlier. Until ultimately, a ceiling is hit and performance is dominated by the PikeVM.

Because of the delicate interaction with the suffix literal optimization, I added another benchmark using the pattern [a-q][^u-z]{80}[x\xE0-\xFF] and ran it on the regex crate:

$ rebar cmp tmp/results2.csv
benchmark                                           rust/regex         rust/regexold
---------                                           ----------         -------------
reported/i13-subset-regex/huge-ascii-nosuffixlit    15.4 MB/s (1.00x)  8.3 MB/s (1.85x)
reported/i13-subset-regex/huge-unicode-nosuffixlit  15.4 MB/s (1.00x)  12.7 MB/s (1.21x)

In this case, there are no literal optimizations. Performance is unchanged with the huge-ascii and huge-unicode benchmarks. This is actually great news, because it means that the literal optimization failing doesn't impact things too much here.

For the regex crate, one can largely "fix" this problem by giving the lazy DFA more cache. Potentially a lot more cache. It's a legitimate work-around, but the cost is more memory.

Now let's compare with some other engines. Given the .NET folks were the ones who published the paper, let's start there (this is .NET 8):

$ rebar cmp tmp/results.csv -e '^rust/regex' -e dotnet
benchmark                                   dotnet             dotnet/compiled     dotnet/nobacktrack  rust/regex           rust/regexold
---------                                   ------             ---------------     ------------------  ----------           -------------
reported/i13-subset-regex/original-ascii    12.9 GB/s (1.33x)  17.1 GB/s (1.00x)   11.8 GB/s (1.44x)   5.8 GB/s (2.96x)     442.4 MB/s (39.47x)
reported/i13-subset-regex/original-unicode  13.3 GB/s (1.39x)  18.5 GB/s (1.00x)   11.7 GB/s (1.58x)   331.4 MB/s (57.20x)  15.9 MB/s (1191.66x)
reported/i13-subset-regex/big-ascii         43.5 MB/s (4.22x)  183.4 MB/s (1.00x)  23.7 MB/s (7.75x)   91.1 MB/s (2.01x)    9.5 MB/s (19.34x)
reported/i13-subset-regex/big-unicode       43.5 MB/s (4.24x)  184.6 MB/s (1.00x)  22.0 MB/s (8.39x)   34.9 MB/s (5.28x)    13.9 MB/s (13.29x)
reported/i13-subset-regex/huge-ascii        39.1 MB/s (4.92x)  192.3 MB/s (1.00x)  24.5 MB/s (7.84x)   15.6 MB/s (12.34x)   8.4 MB/s (22.92x)
reported/i13-subset-regex/huge-unicode      39.1 MB/s (4.94x)  192.9 MB/s (1.00x)  24.6 MB/s (7.85x)   15.6 MB/s (12.38x)   12.4 MB/s (15.54x)

This is a rather good showing for .NET. The gap has been closed between regex 1.7 and regex 1.10, but still, the dotnet/compiled engine is doing quite well here. THe dotnet/nobacktrack engine does worse (is that what is being measured in the paper? it's a DFA right?), but still better than the regex crate when the regex crate cannot benefit from literal optimizations as much. I would be curious to understand a bit more what the dotnet engine is doing here. I don't know how to profile .NET programs on Linux. It at least seems clear that it's using literal optimizations in the original-{ascii,unicode} case, but not in the {big,huge}-{ascii,unicode} cases. I would be curious to find out why.

OK, let's zoom out a bit and get a broader picture of things:

$ rebar rank tmp/results.csv
Engine              Version           Geometric mean of speed ratios  Benchmark count
------              -------           ------------------------------  ---------------
perl                5.38.1            1.38                            6
python/regex        2023.10.3         8.89                            6
dotnet/compiled     8.0.0             11.54                           6
dotnet              8.0.0             35.16                           6
dotnet/nobacktrack  8.0.0             52.70                           6
hyperscan           5.4.2 2023-04-22  56.88                           5
pcre2/jit           10.42 2022-12-11  61.32                           6
rust/regex          1.10.0            93.04                           6
regress             0.7.1             145.23                          6
javascript/v8       21.4.0            219.38                          6
java/hotspot        21.0.1+12-LTS-29  258.92                          6
python/re           3.11.6            266.10                          6
icu                 72.1.0            385.06                          6
re2                 2023-11-01        409.14                          6
d/ldc/std-regex     2.105             438.12                          6
pcre2               10.42 2022-12-11  459.20                          6
rust/regexold       1.7.3             465.28                          6
d/dmd/std-regex     2.106             536.71                          6
go/regexp           1.21.5            676.16                          6

I was not expecting that! perl is usually quite slow compared to other engines, but it is doing amazingly well here. So is python/regex (which is also usually pretty slow). Let's drill down a bit:

$ rebar cmp tmp/results.csv -e '^rust/regex$' -e '^dotnet/compiled$' -e perl -e python/regex -e hyperscan -e pcre2/jit
benchmark                                   dotnet/compiled      hyperscan            pcre2/jit             perl               python/regex         rust/regex
---------                                   ---------------      ---------            ---------             ----               ------------         ----------
reported/i13-subset-regex/original-ascii    17.1 GB/s (1.00x)    594.9 MB/s (29.35x)  209.4 MB/s (83.38x)   9.9 GB/s (1.73x)   2.2 GB/s (7.79x)     5.8 GB/s (2.96x)
reported/i13-subset-regex/original-unicode  18.5 GB/s (1.00x)    36.3 MB/s (522.01x)  179.3 MB/s (105.76x)  4.7 GB/s (3.92x)   2.1 GB/s (8.66x)     331.4 MB/s (57.20x)
reported/i13-subset-regex/big-ascii         183.4 MB/s (70.56x)  594.9 MB/s (21.75x)  183.8 MB/s (70.42x)   12.6 GB/s (1.00x)  1462.8 MB/s (8.85x)  91.1 MB/s (142.12x)
reported/i13-subset-regex/big-unicode       184.6 MB/s (26.60x)  106.9 MB/s (45.92x)  139.7 MB/s (35.15x)   4.8 GB/s (1.00x)   1447.6 MB/s (3.39x)  34.9 MB/s (140.55x)
reported/i13-subset-regex/huge-ascii        192.3 MB/s (92.10x)  455.5 MB/s (38.89x)  155.0 MB/s (114.27x)  17.3 GB/s (1.00x)  436.1 MB/s (40.62x)  15.6 MB/s (1136.77x)
reported/i13-subset-regex/huge-unicode      192.9 MB/s (13.62x)  -                    123.4 MB/s (21.31x)   2.6 GB/s (1.00x)   437.5 MB/s (6.01x)   15.6 MB/s (168.67x)

So yeah, perl is kicking the pants off of everyone here. My guess is that it has a specific optimization pass that is kicking in for this regex.

I had hoped that the bounded-repeat benchmark in the curated set would be enough to cover cases like this, but I think I'm wrong there. This particular regex exposes some very interesting behavior from the regex engines, and I think it might actually be a good candidate to move into the curated set because I don't think this is really covered by anything else in the curated set. Thank you for bringing this up.

BurntSushi commented 9 months ago

Here are the benchmark definitions: https://github.com/BurntSushi/rebar/blob/f9a4f5c9efda069e7986a262efe8649aa78c0933/benchmarks/definitions/reported/i13-subset-regex.toml

BurntSushi / rebar

Add Subset Regex to Benchmarks #13