Open ibraheemdev opened 9 months ago
Please link the paper.
That benchmark is in this repo. But it's not part of the curated set. Please see contributing guidelines for proposing a change to the curated set.
The paper is here. If it's already part of the repo, I'm not personally sure of whether it should be part of the curated set, I just thought it was an interesting case that might be worth testing.
All righty, now that I have hands on a keyboard, I took a look at this. I defined a new set of benchmarks based on this pattern and varied it a bit. I was in particular interested to see how perf change depending on two factors: whether Unicode mode was enabled and the size of the bounded repeat. Unicode matters here because it changes the meaning of [^u-z]
.
The first thing I did was compare the current release of the regex
crate (1.10.2) with an older version (1.7.3). The older version is likely what was benchmarked in the paper, and it reflect the status quo of the regex
crate before this year and before I effectively rewrote the entire crate.
$ rebar cmp tmp/results.csv -e rust/regex
benchmark rust/regex rust/regexold
--------- ---------- -------------
reported/i13-subset-regex/original-ascii 5.8 GB/s (1.00x) 442.4 MB/s (13.33x)
reported/i13-subset-regex/original-unicode 331.4 MB/s (1.00x) 15.9 MB/s (20.83x)
reported/i13-subset-regex/big-ascii 91.1 MB/s (1.00x) 9.5 MB/s (9.60x)
reported/i13-subset-regex/big-unicode 34.9 MB/s (1.00x) 13.9 MB/s (2.52x)
reported/i13-subset-regex/huge-ascii 15.6 MB/s (1.00x) 8.4 MB/s (1.86x)
reported/i13-subset-regex/huge-unicode 15.6 MB/s (1.00x) 12.4 MB/s (1.26x)
The patterns used are as follows:
*/original-*
: [a-q][^u-z]{13}x
*/big-*
: [a-q][^u-z]{23}x
*/huge-*
: [a-q][^u-z]{80}x
Here, we can see that the latest version of the regex
crate is universally faster. In some cases by a fair margin. But once you get up to the higher bounded repeat, perf levels out a bit and there isn't much difference. The reason for the difference, in very broad strokes, is:
original-ascii
case. That is, it looks for x
and then does a reverse scan to look for a full match. Do note that sometimes the optimization fails. It has guard-rails built into it to avoid quadratic behavior. When this happens, it falls back to a standard forward scan of the lazy DFA. The lazy DFA is pretty fast, so this overall works well and stays fast.original-unicode
case is like the original-ascii
case. Except, when it falls back to the lazy DFA, the lazy DFA eventually gives up entirely and that in turn falls back to the PikeVM.big-ascii
is like the original-unicode
case. It attempts the reverse suffix optimization but that fails occasionally, and since the lazy DFA's cache is filled up quickly, it falls back to the PikeVM. Presumably this happens more quickly because of the bigger bounded repeat (despite the fact that ASCII is used).big-unicode
, huge-ascii
and huge-unicode
benchmarks are more of the same. As the regex grows in size, the lazy DFA quits earlier and earlier. Until ultimately, a ceiling is hit and performance is dominated by the PikeVM.Because of the delicate interaction with the suffix literal optimization, I added another benchmark using the pattern [a-q][^u-z]{80}[x\xE0-\xFF]
and ran it on the regex crate:
$ rebar cmp tmp/results2.csv
benchmark rust/regex rust/regexold
--------- ---------- -------------
reported/i13-subset-regex/huge-ascii-nosuffixlit 15.4 MB/s (1.00x) 8.3 MB/s (1.85x)
reported/i13-subset-regex/huge-unicode-nosuffixlit 15.4 MB/s (1.00x) 12.7 MB/s (1.21x)
In this case, there are no literal optimizations. Performance is unchanged with the huge-ascii
and huge-unicode
benchmarks. This is actually great news, because it means that the literal optimization failing doesn't impact things too much here.
For the regex crate, one can largely "fix" this problem by giving the lazy DFA more cache. Potentially a lot more cache. It's a legitimate work-around, but the cost is more memory.
Now let's compare with some other engines. Given the .NET folks were the ones who published the paper, let's start there (this is .NET 8):
$ rebar cmp tmp/results.csv -e '^rust/regex' -e dotnet
benchmark dotnet dotnet/compiled dotnet/nobacktrack rust/regex rust/regexold
--------- ------ --------------- ------------------ ---------- -------------
reported/i13-subset-regex/original-ascii 12.9 GB/s (1.33x) 17.1 GB/s (1.00x) 11.8 GB/s (1.44x) 5.8 GB/s (2.96x) 442.4 MB/s (39.47x)
reported/i13-subset-regex/original-unicode 13.3 GB/s (1.39x) 18.5 GB/s (1.00x) 11.7 GB/s (1.58x) 331.4 MB/s (57.20x) 15.9 MB/s (1191.66x)
reported/i13-subset-regex/big-ascii 43.5 MB/s (4.22x) 183.4 MB/s (1.00x) 23.7 MB/s (7.75x) 91.1 MB/s (2.01x) 9.5 MB/s (19.34x)
reported/i13-subset-regex/big-unicode 43.5 MB/s (4.24x) 184.6 MB/s (1.00x) 22.0 MB/s (8.39x) 34.9 MB/s (5.28x) 13.9 MB/s (13.29x)
reported/i13-subset-regex/huge-ascii 39.1 MB/s (4.92x) 192.3 MB/s (1.00x) 24.5 MB/s (7.84x) 15.6 MB/s (12.34x) 8.4 MB/s (22.92x)
reported/i13-subset-regex/huge-unicode 39.1 MB/s (4.94x) 192.9 MB/s (1.00x) 24.6 MB/s (7.85x) 15.6 MB/s (12.38x) 12.4 MB/s (15.54x)
This is a rather good showing for .NET. The gap has been closed between regex 1.7 and regex 1.10, but still, the dotnet/compiled
engine is doing quite well here. THe dotnet/nobacktrack
engine does worse (is that what is being measured in the paper? it's a DFA right?), but still better than the regex
crate when the regex
crate cannot benefit from literal optimizations as much. I would be curious to understand a bit more what the dotnet
engine is doing here. I don't know how to profile .NET programs on Linux. It at least seems clear that it's using literal optimizations in the original-{ascii,unicode}
case, but not in the {big,huge}-{ascii,unicode}
cases. I would be curious to find out why.
OK, let's zoom out a bit and get a broader picture of things:
$ rebar rank tmp/results.csv
Engine Version Geometric mean of speed ratios Benchmark count
------ ------- ------------------------------ ---------------
perl 5.38.1 1.38 6
python/regex 2023.10.3 8.89 6
dotnet/compiled 8.0.0 11.54 6
dotnet 8.0.0 35.16 6
dotnet/nobacktrack 8.0.0 52.70 6
hyperscan 5.4.2 2023-04-22 56.88 5
pcre2/jit 10.42 2022-12-11 61.32 6
rust/regex 1.10.0 93.04 6
regress 0.7.1 145.23 6
javascript/v8 21.4.0 219.38 6
java/hotspot 21.0.1+12-LTS-29 258.92 6
python/re 3.11.6 266.10 6
icu 72.1.0 385.06 6
re2 2023-11-01 409.14 6
d/ldc/std-regex 2.105 438.12 6
pcre2 10.42 2022-12-11 459.20 6
rust/regexold 1.7.3 465.28 6
d/dmd/std-regex 2.106 536.71 6
go/regexp 1.21.5 676.16 6
I was not expecting that! perl
is usually quite slow compared to other engines, but it is doing amazingly well here. So is python/regex
(which is also usually pretty slow). Let's drill down a bit:
$ rebar cmp tmp/results.csv -e '^rust/regex$' -e '^dotnet/compiled$' -e perl -e python/regex -e hyperscan -e pcre2/jit
benchmark dotnet/compiled hyperscan pcre2/jit perl python/regex rust/regex
--------- --------------- --------- --------- ---- ------------ ----------
reported/i13-subset-regex/original-ascii 17.1 GB/s (1.00x) 594.9 MB/s (29.35x) 209.4 MB/s (83.38x) 9.9 GB/s (1.73x) 2.2 GB/s (7.79x) 5.8 GB/s (2.96x)
reported/i13-subset-regex/original-unicode 18.5 GB/s (1.00x) 36.3 MB/s (522.01x) 179.3 MB/s (105.76x) 4.7 GB/s (3.92x) 2.1 GB/s (8.66x) 331.4 MB/s (57.20x)
reported/i13-subset-regex/big-ascii 183.4 MB/s (70.56x) 594.9 MB/s (21.75x) 183.8 MB/s (70.42x) 12.6 GB/s (1.00x) 1462.8 MB/s (8.85x) 91.1 MB/s (142.12x)
reported/i13-subset-regex/big-unicode 184.6 MB/s (26.60x) 106.9 MB/s (45.92x) 139.7 MB/s (35.15x) 4.8 GB/s (1.00x) 1447.6 MB/s (3.39x) 34.9 MB/s (140.55x)
reported/i13-subset-regex/huge-ascii 192.3 MB/s (92.10x) 455.5 MB/s (38.89x) 155.0 MB/s (114.27x) 17.3 GB/s (1.00x) 436.1 MB/s (40.62x) 15.6 MB/s (1136.77x)
reported/i13-subset-regex/huge-unicode 192.9 MB/s (13.62x) - 123.4 MB/s (21.31x) 2.6 GB/s (1.00x) 437.5 MB/s (6.01x) 15.6 MB/s (168.67x)
So yeah, perl
is kicking the pants off of everyone here. My guess is that it has a specific optimization pass that is kicking in for this regex.
I had hoped that the bounded-repeat
benchmark in the curated set would be enough to cover cases like this, but I think I'm wrong there. This particular regex
exposes some very interesting behavior from the regex engines, and I think it might actually be a good candidate to move into the curated set because I don't think this is really covered by anything else in the curated set. Thank you for bringing this up.
Here are the benchmark definitions: https://github.com/BurntSushi/rebar/blob/f9a4f5c9efda069e7986a262efe8649aa78c0933/benchmarks/definitions/reported/i13-subset-regex.toml
From Microsoft's paper on their new derivative-based regex engine (
RegexOptions.NonBacktracking
):It might be worthwhile including this case in the benchmarks.