dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.15k stars 4.71k forks source link

Regex parser is not reducing some loops which causes some patterns to not be able to avoid catastrophic backtracking #73208

Open joperezr opened 2 years ago

joperezr commented 2 years ago

While porting PCRE2 tests, one of the tests which is intentionally crafted to ensure that the engine won't end up in a catastrophic backtracking situation is actually not avoiding it when running it against our .NET engines. The test is:

Regex.IsMatch("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", "(a+)*b");

Most of the rest of the engines will quickly find that the input doesn't match the pattern but we fail to do so and instead end up with catastrophic backtracking. The issue seems to be that when we try to reduce the loops into a simpler pattern like (a)*b we fail to do so given that the a is in a capturing group, so we can't reduce the loops or else that capturing group's result would be wrong in a successful match. This is likely something we may just not fix and be just a difference in behavior between our engines and the rest, but logging in here just in case so we can explore if we can do a bit better for these cases.

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions See info in area-owners.md if you want to be subscribed.

Issue Details
While porting PCRE2 tests, one of the tests which is intentionally crafted to ensure that the engine won't end up in a catastrophic backtracking situation is actually not avoiding it when running it against our .NET engines. The test is: ```c# Regex.IsMatch("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", "(a+)*b"); ``` Most of the rest of the engines will quickly find that the input doesn't match the pattern but we fail to do so and instead end up with catastrophic backtracking. The issue seems to be that when we try to reduce the loops into a simpler pattern like `(a)*b` we fail to do so given that the `a` is in a capturing group, so we can't reduce the loops or else that capturing group's result would be wrong in a successful match. This is likely something we may just not fix and be just a difference in behavior between our engines and the rest, but logging in here just in case so we can explore if we can do a bit better for these cases.
Author: joperezr
Assignees: -
Labels: `area-System.Text.RegularExpressions`
Milestone: -
stephentoub commented 2 years ago

The issue seems to be that when we try to reduce the loops into a simpler pattern like (a)*b we fail to do so given that the a is in a capturing group, so we can't reduce the loops or else that capturing group's result would be wrong in a successful match.

What are the other engines actually doing? I see Java hits catastrophic backtracking. PCRE doesn't. But translating the expression in the suggested manner is invalid, so they can't be doing that. I suspect what they're actually doing is issuing a search for the b to avoid doing work if there's no chance the expression would match. For example, if you tweak your expression from (a+)*b to (a+)*[bc] (just replacing the single b with a character class for b or c), which I'm guessing defeats whatever single-literal search they're doing, then PCRE also hits catastrophic backtracking.

I've considered issuing such a search, but I've been hesitant to do so as in the case of a successful match, it represents useless work; the upside of course is it avoids potentially a lot of work in the case of no match. We might apply additional heuristics to try to decide whether the cost of the literal search is likely to be overshadowed, e.g. if we see non-atomic loops in the pattern, then look for such literals to fail-fast if not found.