Closed performanceautofiler[bot] closed 2 years ago
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Seems related to https://github.com/dotnet/runtime/pull/61490. This is the only regression we detected, but it is fairly large (80%)
@stephentoub
This is the only regression we detected, but it is fairly large (80%)
I'm actually pleased it's the only one detected.
I expected and accepted this one in exchange for gains elsewhere. Though looking at it again there's probably somethings we can do to improve it.
The issue here actually isn't what I thought it was.
The core of the issue is: https://github.com/dotnet/runtime/blob/6ff57f1935d1cb3396674bec26268a301e5c0628/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs#L1828-L1832 This test has this sequence:
[\w]+://
Previously we'd use this CanBeMadeAtomic to check whether :
can match \w, find it can't, and thus turn the \w+ loop into an atomic one. However, the use of IgnoreCase in this test is causing an issue. The cited PR optimizes the "://" to be case-sensitive, even in the presence of IgnoreCase, because none of the characters there participate in case conversion. And as a result, CanBeMadeAtomic bails early when it sees that the case-sensitivity of the two nodes differs. So the regression is because the \w+ stays non-atomic and thus causes us to backtrack more than we did previously.
The good news is the problem will just go away with https://github.com/dotnet/runtime/issues/61048. There are workarounds we could put in place in the interim, but I'm tempted to say we should just prioritize fully addressing that issue asap rather than trying to patch things like this.
@joperezr, thoughts?
Let me make sure I understand correctly. The reason why fixing issue #61048 would fix this is because now [\w]+
would now also become case-sensitive even in IgnoreCase because at construction time we would have already translated [\w]
into it's upper-case lower-case equivalent (like transforming [A]
=> [Aa]
) so in theory CanBeMadeAtomic should now return true and reduce again the number of backtracking we do, is that right? If so, I agree it is better to just wait for that to be fixed than trying to undo, or patch the optimizations made with your PR.
@joperezr, exactly. (For this one specific case, it's further interesting because we recognize \w as being case-sensitive even under IgnoreCase, because the categories it maps to already includes both upper and lower. But because of how the parser works, even though [\w] and \w are 100% identical sets, the case-sensitive \w gets added to the set and [\w] ends up as case-insensitive.)
Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions See info in area-owners.md if you want to be subscribed.
Author: | performanceautofiler[bot] |
---|---|
Assignees: | - |
Labels: | `arch-arm64`, `area-System.Text.RegularExpressions`, `os-windows`, `tenet-performance`, `tenet-performance-benchmarks`, `untriaged` |
Milestone: | - |
Run Information
Regressions in System.Text.RegularExpressions.Tests.Perf_Regex_Common
Test Report
Repro