Open am11 opened 2 years ago
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

| Author: | am11 |
|---|---|
| Assignees: | - |
| Labels: | `area-System.Text.RegularExpressions`, `untriaged` |
| Milestone: | - |
The regex generator and RegexOptions.Compiled each emit one large matching method; the larger the pattern, the larger the method. This is an obscenely large pattern 😉. I'd guess the resulting method is so large the JIT ends up disabling important optimizations like inlining.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

| Author: | am11 |
|---|---|
| Assignees: | - |
| Labels: | `area-CodeGen-coreclr`, `untriaged` |
| Milestone: | - |
Thanks. Yes, this is an unusual pattern. 😅
Perhaps there are different large-method thresholds for JIT and R2R/AOT modes, which could be relaxed for methods with `System.CodeDom.Compiler.GeneratedCodeAttribute`? Alternatively, the generator could split the matching method based on the current runtime's large-method limit. 💭
> I'd guess the resulting method is so large the JIT ends up disabling important optimizations like inlining.
Indeed, the jit doesn't even try optimizing.
Compiling 1054 Runner::TryMatchAtCurrentPosition, IL size = 100905, hash=0xb193db74 Tier-0 switched MinOpts
There are different AOT/R2R limits, because time isn't quite so precious.
Not sure there's anything we can do on the codegen side to mitigate this.
cc @dotnet/jit-contrib
Would it be overly difficult for the generator to do some method splitting to help avoid such large sequences of IL?
Just curious - does this also happen with the RFC822 regex for verifying email addresses? http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html
> Would it be overly difficult for the generator to do some method splitting to help avoid such large sequences of IL?
It'd be fairly challenging I expect, and certainly won't happen for .NET 7. For RegexOptions.Compiled, there's an added complexity of dealing with DynamicMethods. But the biggest challenge, for both RegexOptions.Compiled and the source generator, beyond coming up with the right heuristic, would be dealing with all of the gotos used to implement backtracking. Those that remain within a sub-method would remain gotos, but those that needed to jump to outside of the new method boundary would need an alternate mechanism.
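To make the goto problem concrete, here is a minimal hand-written sketch, not output of the actual generator. It matches the toy pattern `^a*ab$` twice: once as a single method where backtracking is a plain `goto`, and once with the tail check moved into a helper. The class and method names (`BacktrackSplitSketch`, `TryMatchTail`, and so on) are hypothetical. The helper cannot jump to the caller's backtracking code, so it reports failure through its return value and the caller performs the jump.

```csharp
using System;

// Hypothetical illustration only; the real generator's output looks different.
public static class BacktrackSplitSketch
{
    // Single-method version of ^a*ab$: backtracking is implemented with gotos.
    public static bool MatchSingleMethod(string s)
    {
        int pos = 0;
        while (pos < s.Length && s[pos] == 'a') pos++; // greedy a*

    TryTail:
        // Try to match the literal "ab" followed by end of input.
        if (pos + 2 == s.Length && s[pos] == 'a' && s[pos + 1] == 'b')
            return true;
        goto Backtrack;

    Backtrack:
        // Give back one 'a' from the greedy loop and retry the tail.
        if (pos > 0) { pos--; goto TryTail; }
        return false;
    }

    // Tail check split into its own method. It cannot "goto Backtrack" in the
    // caller, so failure is signaled through the return value instead.
    private static bool TryMatchTail(string s, int pos) =>
        pos + 2 == s.Length && s[pos] == 'a' && s[pos + 1] == 'b';

    // Split version: the caller turns the helper's failure into the backtrack step.
    public static bool MatchSplit(string s)
    {
        int pos = 0;
        while (pos < s.Length && s[pos] == 'a') pos++; // greedy a*

    TryTail:
        if (TryMatchTail(s, pos)) return true;
        if (pos > 0) { pos--; goto TryTail; }
        return false;
    }
}
```

This only covers the easy direction, where the sub-method fails and control returns to the caller. Jumping back into the middle of a sub-method that holds its own backtracking state would need that state to be saved and restored, which the sketch does not attempt.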
Using RyuJIT, when the expression count exceeds a certain threshold, the atomic comparisons tend to get slower with generated regex objects compared to interpreted ones. NativeAOT, however, maintains the "better than interpreted" characteristic of generated regex.
Here is a benchmark project that uses a huge regex of x86 assembly keywords (taken from the SharpLab frontend app) and matches it against a list of all the keywords multiple times: https://gist.github.com/am11/1bc5abf9560dfa3e3e9e401cbc5fe59d.
runtime version: aafa91036e
NativeAOT (expected)
RyuJIT (unexpected?)
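For readers who want to try the comparison without the gist, the following is a rough sketch of the same idea using only `Stopwatch`. It is not the linked benchmark: the `KeywordRegexTiming` class, its method names, and any keyword list passed to it are assumptions made for illustration, and timings from a loop like this are far less reliable than a proper benchmarking harness.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical micro-benchmark helper; not the code from the linked gist.
public static class KeywordRegexTiming
{
    // Builds an anchored alternation such as ^(?:mov|add|sub)$ from the keywords.
    public static string BuildPattern(string[] keywords) =>
        "^(?:" + string.Join("|", keywords.Select(k => Regex.Escape(k))) + ")$";

    // Runs IsMatch over every input, `iterations` times, and returns the
    // total number of successful matches together with the elapsed time.
    public static (int Matches, TimeSpan Elapsed) Measure(Regex regex, string[] inputs, int iterations)
    {
        int matches = 0;
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            foreach (string input in inputs)
            {
                if (regex.IsMatch(input)) matches++;
            }
        }
        stopwatch.Stop();
        return (matches, stopwatch.Elapsed);
    }
}
```

To compare the two modes, build the pattern once, construct one `Regex` with `RegexOptions.None` and one with `RegexOptions.Compiled`, and pass each to `Measure` with the same inputs and iteration count. The slowdown described above should only appear once the keyword list is large enough for the compiled matching method to hit the JIT's large-method behavior.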
category:implementation theme:ready-to-run skill-level:intermediate cost:medium impact:medium