dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.27k stars 4.73k forks source link

Regex pattern returns infinite matches #43314

Open dlclark opened 4 years ago

dlclark commented 4 years ago

Description

Example pattern: (?:(?:0?)+?(?:a?)+?)? Example Input: 0a Should return 2 matches: index 0, length 2; index 2, length 0. Actually returns: infinite matches: index 0, length 2; index 1, length 1; index 1, length 1...etc

Other information

The first match seems to work fine. The second match doesn't.

I did some digging--it may not be helpful, but figured I'd drop my notes in here. It appears that the Lazybranchmark operation pushes some data to regex stack assuming that a backtrace will happen. However, in this example the backtrace doesn't happen and the stack is left in a bad state with the extra item. So by the time Capturemark tries to build the match for Group 0 from the stack it has the wrong value and builds a match based on bad data. This causes our match to appear to move backwards in the string and get stuck with infinite matches.

ghost commented 4 years ago

Tagging subscribers to this area: @eerhardt, @pgovind, @jeffhandley See info in area-owners.md if you want to be subscribed.

danmoseley commented 4 years ago

Reproes on 3.1 and on .NET Framework 4.8: not a regression.

am11 commented 4 years ago

Repros in 5.0 too, this program does not return:

using System;
using System.Text.RegularExpressions;
static class Program 
{
    static void Main() => Console.WriteLine(
        new Regex("(?:(?:0?)+?(?:a?)+?)?").Matches("0a").Count);
}
danmoseley commented 4 years ago

@am11 correct - I should have been clearer - was mainly checking this wasn't urgent, as it might be if this was a break relative to 3.1.

jeffhandley commented 3 years ago

Here's another scenario that was reported by @paulomorgado that leads to an infinite loop (ostensibly):

Regex.Match(
    "RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(System.Runtime.CompilerServices.RuntimeHelpers.TryCode,System.Runtime.CompilerServices.RuntimeHelpers.CleanupCode,System.Object)",
    @"(?<=\()((?<a>\w+(\.\w+)*)(,(?<a>\w+(\.\w+)*)*)?)(?=\))",
    RegexOptions.IgnoreCase)

This was repro'd using:

danmoseley commented 3 years ago

Since it's possible to write regexes that take arbitrary amounts of time (at least with our NFA based engine) in my mind a "hang" is interesting if it's a regression from a previous release, and/or the pattern "should" not involve pathological backtracking.

In the case immediately above, I expect there's substantial backtracking going on, because of the nested *. The first one may just be a bug.

@veanes thoughts about the performance of the regex above in your engine?

GSPP commented 3 years ago

An infinite loop is one thing but there also is a case in the opening post that generates infinite matches. That must be a bug.

veanes commented 3 years ago

In DFA mode in feature/regexsrm:

pattern: ""((?:0)+?(?:.)+?)?"" input: "0a" returns 2 matches: index 0, length 2; index 2, length 0.

pattern: "(?:(?:0?)+?(?:a?)+?)?" input: "0a" returns 2 matches: index 0, length 2; index 2, length 0.

regarding the regex @"(?<=\()((?<a>\w+(\.\w+)*)(,(?<a>\w+(\.\w+)*)*)?)(?=\))" DFA mode does not support positive lookbehind (?<= ...) and it does not support positive lookahead (?= ...) (as of now)

stephentoub commented 2 years ago

With the rewritten RegexCompiler in .NET 7, this no longer repros for either RegexOptions.Compiled or RegexGenerator. It still repros for the interpreter.