Open joperezr opened 2 years ago
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions See info in area-owners.md if you want to be subscribed.
Author: | joperezr |
---|---|
Assignees: | - |
Labels: | `area-System.Text.RegularExpressions` |
Milestone: | - |
The issue seems to be that when we try to reduce the loops into a simpler pattern like (a)*b we fail to do so given that the a is in a capturing group, so we can't reduce the loops or else that capturing group's result would be wrong in a successful match.
What are the other engines actually doing? I see Java hits catastrophic backtracking. PCRE doesn't. But translating the expression in the suggested manner is invalid, so they can't be doing that. I suspect what they're actually doing is issuing a search for the b
to avoid doing work if there's no chance the expression would match. For example, if you tweak your expression from (a+)*b
to (a+)*[bc]
(just replacing the single b
with a character class for b
or c
), which I'm guessing defeats whatever single-literal search they're doing, then PCRE also hits catastrophic backtracking.
I've considered issuing such a search, but I've been hesitant to do so as in the case of a successful match, it represents useless work; the upside of course is it avoids potentially a lot of work in the case of no match. We might apply additional heuristics to try to decide whether the cost of the literal search is likely to be overshadowed, e.g. if we see non-atomic loops in the pattern, then look for such literals to fail-fast if not found.
While porting PCRE2 tests, one of the tests which is intentionally crafted to ensure that the engine won't end up in a catastrophic backtracking situation is actually not avoiding it when running it against our .NET engines. The test is:
Most of the rest of the engines will quickly find that the input doesn't match the pattern but we fail to do so and instead end up with catastrophic backtracking. The issue seems to be that when we try to reduce the loops into a simpler pattern like
(a)*b
we fail to do so given that thea
is in a capturing group, so we can't reduce the loops or else that capturing group's result would be wrong in a successful match. This is likely something we may just not fix and be just a difference in behavior between our engines and the rest, but logging in here just in case so we can explore if we can do a bit better for these cases.