Note: I would have expected to achieve about what Java is able to get, at least.
.... The fact that the IP-address numbers are so wildly different across the board suggests some engine differences, or something. Certainly, specifying RegexOptions.ECMAScript cuts the non-IP numbers by about 1/3, which would be telling (still a bad showing, though).
Thank you for pointing us to this benchmark. @ViktorHofer made a change https://github.com/dotnet/corefx/pull/24158 post-2.0 which should help substantially. They are already passing in the compiled flag; it is being ignored.
This should give us performance similar to desktop .NET Framework. We would like to do much better though. @vancem has suggested we could create a "lightweight" regex codepath used for the 80% case where the regexes are "simple". The ones in this benchmark look like they might not be "simple enough" however.
We would welcome collaboration on regex performance. Are either of you interested?
The ones in this benchmark look like they might not be "simple enough" however.
The benchmark lists three expressions:
Email: [\w\.+-]+@[\w\.-]+\.[\w\.-]+
URI: [\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?
IPv4: (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])
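For context, driving these three patterns from C# looks roughly like the sketch below. This is a minimal harness; the file-path handling and timing details are illustrative, not the benchmark's actual code:

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        // The benchmark reads one large input file and times each pattern.
        string input = File.ReadAllText(args[0]);

        Measure(input, @"[\w\.+-]+@[\w\.-]+\.[\w\.-]+");                        // Email
        Measure(input, @"[\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?"); // URI
        Measure(input, @"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])"); // IPv4
    }

    static void Measure(string input, string pattern)
    {
        // RegexOptions.Compiled pays a one-time IL-generation cost up front
        // in exchange for faster matching afterwards.
        var regex = new Regex(pattern, RegexOptions.Compiled);

        var sw = Stopwatch.StartNew();
        int count = regex.Matches(input).Count;
        sw.Stop();

        Console.WriteLine($"{pattern} - {count} matches in {sw.ElapsedMilliseconds} ms");
    }
}
```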
The only one I would not expect the 'simplified' parser to handle is the IPv4 example, because it uses the {3} expression to repeat something 3 times (if you had expanded it out three times, it would fit my conception of 'simplified'). Of course, the whole strategy I laid out was that we would add more complex features as usage dictates.
Personally I don't think the IPv4 regex is a great example from a software engineering perspective. It is trying to keep the numbers < 256, and that is painful to do in a regex. Instead I would expect people to use a simpler pattern like (\d+)\.(\d+)\.(\d+)\.(\d+), convert the strings to numbers, and do the numeric validation as a semantic check.
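A minimal sketch of that approach (the class and method names here are illustrative, not from any proposal):

```csharp
using System.Text.RegularExpressions;

static class Ipv4
{
    // A much simpler pattern: grab four runs of digits separated by
    // (escaped) dots, and leave the 0-255 range check to ordinary code.
    private static readonly Regex Pattern =
        new Regex(@"^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$");

    public static bool IsValid(string candidate)
    {
        Match m = Pattern.Match(candidate);
        if (!m.Success)
            return false;

        // Semantic check: each octet must fit in a byte.
        for (int i = 1; i <= 4; i++)
        {
            if (!int.TryParse(m.Groups[i].Value, out int octet) || octet > 255)
                return false;
        }
        return true;
    }
}
```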
@danmosemsft said:
Thank you for pointing us to this benchmark. @ViktorHofer made a change dotnet/corefx#24158 post 2.0 which should help substantially. They are already passing in the compiled flag, it is being ignored.
This should give us performance similar to desktop .NET Framework. We would like to do much better though. @vancem has suggested we could create a "lightweight" regex codepath used for the 80% case where the regexes are "simple". The ones in this benchmark look like they might not be "simple enough" however.
It lists it as .NET Core, but desktop (which is what I'm on at the moment) doesn't do any better.
@Clockwork-Muse that is interesting. Usually compiled helps. This will need some profiling.
btw, on Mono RegexOptions.Compiled is currently ignored due to a bug (which explains why it's so slow in the benchmark).
One final thing on RegEx. Today when we say 'compiled' we mean generating IL at RUNTIME (using ReflectionEmit). Better than that would be to generate the IL (frankly, C#) at COMPILE TIME. This was always a stumbling block before because we did not have ways of injecting tools at build time without users having to do something painful, but NuGet packages support this and we should take advantage of it (frankly, my expectation is that > 90% of all RegEx patterns are known at COMPILE TIME). It would be straightforward to generate C# code (it would even be readable) that would recognize the pattern and would be SUPER FAST (and would have a nice comment that shows the Regex it recognizes).
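To illustrate what "readable generated C#" might mean, a hand-written (purely hypothetical) matcher for a trivial pattern like [0-9]+ could look like:

```csharp
// Hypothetical output of a compile-time regex generator.
// Recognizes the pattern: [0-9]+ (anchored at 'start').
static bool IsMatchDigits(string input, int start)
{
    int i = start;
    // [0-9]+ : consume at least one ASCII digit.
    while (i < input.Length && input[i] >= '0' && input[i] <= '9')
        i++;
    return i > start;
}
```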
In many other languages, "compile" refers to the Parser step for a Regex, but .NET provides a separate Compiled option. It is easy to use incorrectly.
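To spell that out: RegexOptions.Compiled pays a one-time parse and IL-generation cost that only pays off if the Regex instance is reused. A small sketch of the difference (names illustrative):

```csharp
using System.Text.RegularExpressions;

class Example
{
    // Good: construct (and compile) once, reuse for every call.
    private static readonly Regex Word =
        new Regex(@"\w+", RegexOptions.Compiled);

    public static bool HasWordCached(string s) => Word.IsMatch(s);

    // Bad: pays the full parse + IL-generation cost on every call,
    // which can easily dominate the match time itself.
    public static bool HasWordPerCall(string s) =>
        new Regex(@"\w+", RegexOptions.Compiled).IsMatch(s);
}
```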
I have made a serious mistake: I used to think that regex engines were all much the same. Today, after careful testing, I found that this is not true. The C# benchmark's time is spent in the Match process, rather than the Parser process. It seems the C# regex engine is really complicated, and its matching performance needs to be optimized.
Better than that would be to generate the IL (frankly C#) at COMPILE TIME
Just some additional information.
In .NET Framework we support this by calling Regex.CompileToAssembly(..) (http://referencesource.microsoft.com/#System/regex/system/text/regularexpressions/Regex.cs,1193), which produces an assembly that can be referenced at compile time. With that you get IntelliSense and all the other benefits from it, BUT there is still pain involved, as this feature isn't really dynamic at all.
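For reference, .NET Framework usage looks roughly like this (the assembly, namespace, and class names are illustrative):

```csharp
using System.Reflection;
using System.Text.RegularExpressions;

class PrecompileDemo
{
    static void Main()
    {
        // Describe the regex to bake into the assembly.
        var info = new RegexCompilationInfo(
            @"[\w\.+-]+@[\w\.-]+\.[\w\.-]+", // pattern (email, from the benchmark)
            RegexOptions.None,
            "EmailRegex",                    // generated class name
            "MyCompany.Patterns",            // generated namespace
            true);                           // public

        // Writes MyPatterns.dll to disk; reference it at compile time and use
        // MyCompany.Patterns.EmailRegex like any other Regex-derived class.
        Regex.CompileToAssembly(new[] { info }, new AssemblyName("MyPatterns"));
    }
}
```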
In .NET Core we currently don't support this feature and throw PlatformNotSupportedException (with the latest nightlies) when trying to use the API. We do support IL generation at runtime now, but only in netcoreapp applications, not UWP.
What @vancem suggested with NuGet packages sounds interesting. If we wanted to, we could even publish common patterns like email addresses, domains, IP addresses and others as compiled regex assemblies. But that's something for future discussions, as we would first need to bring back the mentioned API in .NET Core (which isn't trivial).
Costing covers making a plan.
We have made improvements here but do not plan more for 2.1 that isn't already tracked. Moving to Future to collect future work.
Some specific use-cases would benefit greatly from faster Regex.
For example: dev tools/extensions tend to use them extensively to find patterns in code (think // TODO), filter filenames, and so on.
The best performing Regex engine is actually a re-usable C library that includes a JIT, maybe it's worth looking into integrating it rather than writing your own Regex JIT. Under BSD license, github mirror: https://github.com/luvit/pcre2
I know every regex engine has its own flavor, but PCRE is compatible with Perl regexes which are rather broad feature-wise!
@vancem idea of referencing a Nuget package to precompile all static regexes would be awesome when it comes to startup time.
EDIT BTW if you want to do some comparisons, someone wrote a mixed assembly wrapper around PCRE, here: https://github.com/ltrzesniewski/pcre-net
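For anyone who wants to experiment, basic usage is close to the built-in API. This is a sketch based on the PCRE.NET README; verify against the current version of the library:

```csharp
using System;
using PCRE;

class PcreDemo
{
    static void Main()
    {
        // PCRE.NET keeps the familiar shape: compile once, match many times.
        var regex = new PcreRegex(@"[\w\.+-]+@[\w\.-]+\.[\w\.-]+");

        foreach (var match in regex.Matches("contact: a@b.com, c@d.org"))
            Console.WriteLine(match.Value);
    }
}
```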
Thanks, we know about both pcre and pcre-net. We had offline discussions about adopting pcre but nothing concrete yet.
Need to mention ripgrep: https://blog.burntsushi.net/ripgrep/. It uses the Rust regex library (based on RE2) and has other optimizations (like an Intel Teddy re-implementation). It is worth studying.
In the PowerShell repo we discovered that Select-String cmdlet performance is very bad. There are two reasons for this.
(frankly my expectation is that > 90% of all RegEx patterns are know at COMPILE TIME.
The @vancem expectation seems to be true. I would add that we can expect most regex patterns to be simple, because complex patterns are difficult to write and difficult to debug. (A developer can always break a complex pattern into pieces and apply them sequentially without loss of performance.) If so, a developer could choose a simpler and faster implementation of the regex engine. I mean that we could have several different implementations and allow the developer to choose the option that best suits their situation. The options could be a simple string search, a fast but feature-limited DFA, and a full-featured NFA engine.
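To illustrate the "break it into pieces" idea (the patterns and names are purely illustrative), two simple patterns applied in sequence can replace one monolithic one:

```csharp
using System.Text.RegularExpressions;

static class LogLine
{
    // Two simple patterns applied in sequence instead of one
    // monolithic pattern that captures everything at once.
    private static readonly Regex Timestamp =
        new Regex(@"^(\d{4})-(\d{2})-(\d{2})");
    private static readonly Regex Level =
        new Regex(@"\b(INFO|WARN|ERROR)\b");

    public static bool TryParse(string line, out string level)
    {
        level = null;
        if (!Timestamp.IsMatch(line))
            return false;

        Match m = Level.Match(line);
        if (!m.Success)
            return false;

        level = m.Value;
        return true;
    }
}
```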
@ViktorHofer would your proposed Span-based API likely fix the string allocation problem? If so, was the reason it is on pause that it was feasible to implement for the interpreted codepath but rather expensive to implement for the compiled codepath? If so, I wonder if the benefit is sufficiently large that we could implement it only for interpreted, and throw (for now) for compiled.
@iSazonov just curious, do your regexes tend to be compiled?
@danmosemsft If your question is about the Select-String cmdlet, then I would expect that we need to compile regex patterns. (But I see in the cmdlet code that we are not doing this, and this should be fixed.)
Compiling a pattern is not necessarily the right choice. I believe they can't be unloaded once compiled.
In general, yes, but PowerShell already compiles script blocks internally for performance, so compiling a regex is not a problem. We could get a performance boost for large files, especially with TC.
Taking a closer look at ripgrep, I discovered that it uses heuristics to select one of 7 (!) search methods: https://github.com/rust-lang/regex/blob/77140e7c1628ac31026d2421c6a4c3b0eb19506c/src/literal/mod.rs#L39. It will definitely take a lot of work for the .NET Core library to catch up with the search leaders.
It seems Span
The .NET Regex engine executes expressions through backtracking. Some other engines build a finite state automaton instead. It is possible to match a string in O(N) time as opposed to possibly exponential time with backtracking.
Two downsides:
Most regex expressions that occur in practice are supported. FSA matching is extremely fast.
If we care about regex performance we should not try to squeeze out a few percent here or there. The switch to an FSA implementation (for supported expressions which are most) can provide an exponential speedup.
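To make the exponential-time risk concrete, here is the classic pathological case in today's backtracking engine (a standalone demo; the match timeout just keeps it from hanging):

```csharp
using System;
using System.Text.RegularExpressions;

class BacktrackingDemo
{
    static void Main()
    {
        // Nested quantifiers plus a forced failure: matching "^(a+)+$"
        // against "aaa...a!" makes a backtracking engine try exponentially
        // many ways to split the run of 'a's before giving up.
        var regex = new Regex(@"^(a+)+$", RegexOptions.None,
                              TimeSpan.FromSeconds(2)); // match timeout

        string input = new string('a', 40) + "!";

        try
        {
            Console.WriteLine(regex.IsMatch(input));
        }
        catch (RegexMatchTimeoutException)
        {
            // An FSA-based engine rejects this input in O(N) instead.
            Console.WriteLine("Timed out: exponential backtracking.");
        }
    }
}
```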
Also, compilation is very important as a constant factor optimization. It was added recently to .NET Core.
I also think that we could have several engines and choose the right one after regex parsing.
I do agree regex performance has fallen behind. The way forward may not be major engineering work in the existing Regex engine. As well as the cost, there is some risk of breaking behavior (possibly necessarily).
My thinking was we may want to do a combination of
For 2. I did some experimentation with PCRE as discussed here https://github.com/ltrzesniewski/pcre-net/issues/12 https://github.com/dotnet/corefxlab/compare/master...danmosemsft:pcre?expand=1 Although in that case I stuck it behind the existing API, I began to think it would make better sense to offer a smaller, more optimized API that better matched PCRE semantics.
cc @ltrzesniewski
The best quality of implementation would be to upgrade the existing API. Exact same syntax and API surface. It might make sense to add a parallel API surface that is more optimized (less allocations etc.).
I understand it's attractive to just drop the existing API and create something new in a greenfield way. The better solution for customers, though, is to avoid that fragmentation and just fix the engine. This requires thorough compatibility work but surely this is possible. Does Microsoft not have access to large code bodies that use Regex? It must be possible to compile a compatibility test suite.
If effort was not a concern we'd create a new fast engine for supported expressions (FSA based with no backtracking) and keep the old engine for unsupported expressions. Almost all practically occurring expressions are formed from a rather simple syntax subset. This would be completely transparent to existing code. Customers do not need to learn new ways of doing things.
just fix the engine
This is not cheap. Writing a new regex engine that is competitive is a big investment and would take time. Where possible it would be good to build on the work others have done and spend our dev dollars in other places. Making it fully compatible is likely an opposing requirement anyway.
Almost all practically occurring expressions are formed from a rather simple syntax subset.
You are quite right - we did an audit of a large quantity of code and this is correct. You can get an idea of the nuances of existing behavior by looking at the tests I disabled in the branch above to get tests passing against PCRE.
I began to think it would make better sense to offer a smaller, more optimized API that better matched PCRE semantics.
That's understandable if performance is the only thing you're going after, but IMO that would be a missed opportunity. PCRE has a very rich feature set which you can leverage to do things that are just not possible in other engines. That's why I decided to support everything in PCRE.NET - it gives the user so many tools.
There's one feature I left out because I deemed it unnecessary and I had provided a suitable equivalent. And sure enough, just yesterday someone posted an issue asking me to implement it. People want these features. :)
Of course, you could provide simpler APIs that would use faster code paths alongside more feature-complete APIs to get the best of both worlds.
Writing a different API for each regex engine makes sense from a feature availability standpoint.
The .NET Regex engine executes expressions through backtracking. Some other engines build a finite state automaton instead. It is possible to match a string in O(N) time as opposed to possibly exponential time with backtracking.
PCRE (and thus PCRE.NET) provides both implementations. But in the case of PCRE, the O(N) algorithm is not necessarily faster. Read here for more details, but in short:
It is substantially slower than the standard algorithm. This is partly because it has to search for all possible matches, but is also because it is less susceptible to optimization.
Also, PCRE doesn't implement compiled regexes with this algorithm. Engines like RE2 that are designed from the ground up with this algorithm in mind are probably faster.
I wonder whether it could help to make changes to the "regular expression pattern parser" rather than in the matching engine itself. Sort of like static analysis for regex patterns, and rewriting the patterns to get maximum performance from the existing engine without changing the match results.
Specifically, I'm thinking of functionality like "unrolling the loop", or automatically wrapping some subexpressions in an atomic group to curb unnecessary backtracking if there is no chance of it finding a match anyway. (Please forgive me for the lack of concrete examples here, I am on mobile atm.)
I guess this would only really help when the original author of the expression didn't already try to optimize it, or decided it would be more maintainable in a simpler form, etc., but it could perhaps be useful for cases where users can provide their own patterns - what do you think? I imagine the development effort would be a lot smaller than trying to rewrite the match engine or supporting third-party engines? It could even be part of a separate API that developers could opt in to, in case that is a concern.
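To give one concrete example of such a rewrite (a sketch; these patterns are the classic "unrolling the loop" illustration, not anything from the .NET source): a quoted-string pattern can be restructured so the engine consumes plain runs in bulk instead of testing an alternation per character, without changing what it matches:

```csharp
using System;
using System.Text.RegularExpressions;

class UnrollingDemo
{
    static void Main()
    {
        string input = @"say ""hello \""world\"""" loudly";

        // Naive quoted-string pattern: tries an alternation per character.
        var naive = new Regex(@"""(\\.|[^""\\])*""");

        // "Unrolled" rewrite (Friedl): consume plain runs with [^"\]* and
        // only fall into the alternation for escape sequences. Same
        // matches, far less backtracking.
        var unrolled = new Regex(@"""[^""\\]*(?:\\.[^""\\]*)*""");

        Console.WriteLine(
            naive.Match(input).Value == unrolled.Match(input).Value); // True
    }
}
```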
That said, one thing I would find useful, and am considering working on myself when/if time permits (and submitting as a NuGet package), would be a "regex AST". (Currently, the patterns that are parsed are stored in classes marked as internal.) This could power the "static analysis" I suggested earlier, and would have the advantage of being able to "rewrite" a pattern originally written for one engine to be suitable for use in another (i.e. whether \h means horizontal whitespace or a hexadecimal digit, etc.).
The reason I mention this, is that if multiple engines will be supported in future, I imagine it would be beneficial to have this functionality as part of the framework. :)
I like the idea of having a new regex engine in .NET. In that sense I like the work proposed in https://github.com/ltrzesniewski/pcre-net/issues/12.
This would not be affiliated with the official .NET libraries, right? It could be an independent project.
If such a powerful library exists then I agree that there is limited need for improving Regex. It seems better to recommend to users: if performance is not a concern, use the most convenient tool; if performance is important, use the alternative engine. It's just a NuGet package away.
As a .NET user I'd rather see the engineering time spent elsewhere given these considerations.
I like the idea of having a new regex engine in .NET. In that sense I like the work proposed in ltrzesniewski/pcre-net#12.
PCRE.NET no longer uses C++/CLI, I've rewritten the interop part entirely so I could target netstandard2.0. It also now supports Windows, Linux and OS X.
I left that issue open for the perf aspect, on which I haven't worked much.
This would not be affiliated with the official .NET libraries, right? It could be an independent project.
I see no problem in principle with it being used in .NET Core itself: the PCRE license is friendly, and it is well maintained. We already depend on the native binary for zlib, built from sources in this repo.
I see no problem in principle with it being used in .NET Core itself
From Russ Cox https://research.swtch.com/deps#avoid_the_dependency
Avoid the dependency. If a dependency seems too risky and you can't find a way to isolate it, the best answer may be to avoid it entirely, or at least to avoid the parts you've identified as most problematic. For example, as we better understood the risks and costs associated with PCRE, our plan for Google Code Search evolved from "use PCRE directly," to "use PCRE but sandbox the parser," to "write a new regular expression parser but keep the PCRE execution engine," to "write a new parser and connect it to a different, more efficient open-source execution engine." Later we rewrote the execution engine as well, so that no dependencies were left, and we open-sourced the result: RE2.
We could give more freedom for improvements if we made the regex parser an open API. The parser could be made configurable (which features to support). That would allow creating new engines (e.g. in CoreFXLab) without regressions in the current engine, and adding new heuristics (in custom code too, not just in the repo).
There is no holy grail here. Regex is a complex subject and any solution is made up of compromises.
PCRE2 alone offers NFA, DFA, JIT and a plethora of options, because none is strictly better than the others in all situations. RE2 is designed to execute untrusted regexes from users in linear time, with caps on resource (memory) usage.
If I have special requirements for regex in my application (whether safety, extreme performance, partial matches, etc.) I don't mind picking up a Nuget package that offers a specific engine for that situation.
System.Text.RegularExpressions is special because it's built into the platform and it will be the go-to solution for most users. It's also the de-facto contract for any library that accepts a regular expression as input (e.g. validation, routing, and more).
IMHO as such it's not important that it fits all needs, but it should be able to fit most common needs with great performance.
So for me a first big improvement would be to make the current API faster.
A further improvement could be to introduce new APIs that are more suited to low-allocation operations (maybe based on Span, that create captures lazily, etc.). Maybe introduce new features if it makes sense.
A more complex goal would be to design a standard Regex interface that would enable inter-operable regex engines as Nuget packages...
For what it's worth, here are my results (on Ubuntu) using the benchmark at the top of this issue, targeting 2.2 and with the compile flag switched back on (it looks like they removed it).
Language | Email | URI | IPv4 | Total
---|---|---|---|---
PHP | 26.28 | 24.16 | 8.12 | 58.56 |
C PCRE2 | 29.38 | 26.50 | 7.24 | 63.12 |
Rust | 31.88 | 29.08 | 8.40 | 69.36 |
Javascript | 94.18 | 66.55 | 2.17 | 162.90 |
Crystal | 174.25 | 155.35 | 19.25 | 348.86 |
D ldc | 174.77 | 173.53 | 4.88 | 353.19 |
D dmd | 259.59 | 242.22 | 7.37 | 509.18 |
Perl | 205.00 | 277.62 | 36.93 | 519.54 |
Python PyPy | 220.50 | 197.99 | 367.01 | 785.50 |
Ruby | 413.85 | 362.56 | 56.83 | 833.24 |
Python 2 | 323.50 | 235.29 | 556.56 | 1115.35 |
Python 3 | 457.60 | 289.60 | 480.00 | 1227.20 |
Kotlin | 441.16 | 542.09 | 356.90 | 1340.15 |
Java | 435.57 | 548.09 | 377.47 | 1361.12 |
Go | 409.63 | 409.72 | 631.65 | 1451.00 |
C# .Net Core | 1506.23 | 1332.92 | 103.60 | 2942.75 |
C# Mono | 3914.31 | 3332.87 | 218.64 | 7465.82 |
As mentioned above, it's best not to read too much into this given this is a simple benchmark. But it's clear there is plenty of scope for improvement.
I notice the input file is UTF-8, and .NET (and perhaps Java?) has to process it as UCS-2. The ability to run regex over UTF-8 would presumably be helpful.
System.Text.RegularExpressions is special because it's built into the platform and it will be the go-to solution for most users. ... So for me a first big improvement would be to make the current API faster. A more complex goal would be to design a standard Regex interface that would enable inter-operable regex engines as Nuget packages...
Perhaps as @iSazonov suggests we can allow other engines to be plugged in to the existing S.T.RE API, so that existing code could run with little or no modification. This is an approach that has been used in Perl: one can plug in PCRE or RE2 using re::engine::PCRE and re::engine::RE2 without otherwise changing code (apparently).
As noted above, many "everyday" patterns would have much the same semantics in different engines, but not all. One could deal with this in two ways: attempt to fall back to the default implementation if this can be reliably determined (this is what the RE2 Perl package apparently does if compilation fails, e.g. because of lookbehind), or opt-in. Opt-in could be done e.g. with a new RegexOptions value or other API, or possibly at some higher level such as a thread static. Some of the API would have to selectively throw, e.g. anything related to Captures objects. It's unfortunate our Regex API surface is so large.
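A purely hypothetical sketch of what such a plug-in seam might look like (none of these types exist; the names are invented for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical plug-in seam for S.T.RE - none of these types exist;
// the names are invented purely for illustration.
public interface IRegexEngine
{
    // Returns false when the engine cannot handle the pattern (e.g.
    // lookbehind on an FSA engine), so the caller can fall back to the
    // default backtracking implementation.
    bool TryCompile(string pattern, RegexOptions options,
                    out ICompiledPattern compiled);
}

public interface ICompiledPattern
{
    bool IsMatch(ReadOnlySpan<char> input);

    // Capture-related members would selectively throw
    // NotSupportedException on engines without capture support.
}
```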
@danmosemsft Could you please share more information about your tests? Is it for Ordinal or OrdinalIgnoreCase?
Re UTF-8: I made an experiment and did not get a perf win in the file-read scenario. I guess it is because file reading is slower than transcoding UTF-8 to UTF-16. I guess we would see more of a perf win in an in-memory-only scenario.
Re design: the main question is whether we should implement a plugin model or separate APIs.
I ported https://github.com/kokke/tiny-regex-c to C#: https://github.com/iSazonov/Regex0. It implements a small regex subset, for matching only. The new IsMatch() works 5-10x faster than C# Regex. I could pull this into corefxlab to discuss development directions.
@iSazonov GitHub gives me a "page not found" error when trying to follow the link to view your C# port of tiny-regex-c - maybe the repo is private atm?
Sorry, fixed.
The .NET regular expression engine is not actually slower. The test includes the RegexOptions.Compiled option, so the benchmark includes the cost of compilation, and it is only run once. If RegexOptions.Compiled were removed, we would see much more reasonable performance.
Note that there have been some big improvements made to Regex perf in master (since 3.1 forked). Perhaps someone is interested in measuring with those.
@danmosemsft Could you please point the PRs?
@iSazonov these. I do not know whether @stephentoub plans more work.
It would be fun to get more benchmark data, maybe I can do the mariomka benchmark at the top when I have access to that machine at home.
These significant recent wins suggest to me the existing engine is not at the limit of where we can get it, and that it may be less important to offer a way to plug in another engine behind the existing API (because the existing API is faster now).
I think the future is likely a combination of both
cc @pgovind fyi.
I do not know whether @stephentoub plans more work.
I do.
recent wins suggest to me the existing engine is not at the limit of where we can get it
I agree. There are still things that can be done to make further meaningful improvements beyond what's already been done for .NET 5.
Cool. I'll try to run the benchmarks next week on my machine to see where we're at.
(been staying off GitHub while on vacation, but I'll be putting up another significant PR early next week)
For what it's worth, here's my results (on Ubuntu) using the benchmark at the top of this issue, targeting 2.2 and with the compile flag switched back on (it looks like they removed it)
@danmosemsft, can you run your test again, but using .NET 5 master?
Pasted from https://github.com/mariomka/regex-benchmark/pull/14#issuecomment-623038745
Language | Email | URI | IPv4 | Total
---|---|---|---|---
PHP | 26.57 | 24.27 | 8.20 | 59.05
C PCRE2 | 29.40 | 25.70 | 7.15 | 62.25
Rust | 30.90 | 29.14 | 8.29 | 68.33
C++ Boost | 62.97 | 62.49 | 23.13 | 148.58
Javascript | 95.32 | 67.16 | 2.19 | 164.68
Perl | 151.18 | 90.58 | 33.98 | 275.74
Crystal | 172.62 | 148.42 | 19.01 | 340.04
C# .Net Core 50 ECMA | 215.35 | 172.11 | 17.35 | 404.80
D ldc | 223.99 | 221.51 | 6.71 | 452.21
C# .Net Core 50 | 314.79 | 254.98 | 17.40 | 587.17
D dmd | 292.91 | 303.55 | 8.84 | 605.29
Ruby | 365.49 | 324.14 | 59.95 | 749.58
Kotlin | 237.89 | 300.62 | 408.42 | 946.93
Java | 233.96 | 317.21 | 429.29 | 980.46
Python 2 | 317.74 | 210.41 | 521.91 | 1050.06
Go | 411.11 | 399.17 | 634.96 | 1445.24
C++ STL | 604.55 | 488.13 | 422.20 | 1514.88
C# .Net Core 31 | 1504.76 | 1256.17 | 104.65 | 2865.57
C# Mono | 3923.71 | 3301.79 | 218.56 | 7444.07
I also offered a PR to add RegexOptions.ECMAScript (i.e., to give the ECMA results above), since none of the C, Rust, C++ and Java versions are doing Unicode matching.
since none of the C, Rust, C++ and Java versions are doing Unicode matching.
I wonder if Rust does not understand UTF-8.
There is a mariomka/regex-benchmark repo that runs regex benchmarks for different programming languages. I'm just wondering why C# Regex performance is last, and way slower than any other programming language. Is there any way to speed up Regex performance in .NET, or any reason why .NET Regex is that slow?