dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.96k stars 4.65k forks source link

Regex - Support Possessive Quantifiers #24381

Open ViktorHofer opened 6 years ago

ViktorHofer commented 6 years ago

Current popular regex engines like java.util.regex or PCRE support greedy, lazy and possessive quantifiers. The current .NET regex engine does only support the former two. Though possessive quantifiers are syntactic sugar and can be mimicked with atomic grouping today, consider supporting them as they gained popularity over the last years.

Abstract: Possessive quantifiers work the same as greedy quantifiers but without backtracking on the input string. That means that the following pattern D++[A-Z]+ matches the input string DDDDE but not DDDD.

TheConstructor commented 2 years ago

I personally prefer abc*+ over (?>abc*), as it allows switching behaviors quicker and brings the behavior-change closer to the affected operation (the single * in this case).

joperezr commented 2 years ago

Even though Regex doesn't accept that specific notation, you can achieve exactly the same by using Atomic groups, which are supported. As @TheConstructor points out, you could write the example above by doing (?>D+)[A-Z]+. Given there hasn't been a lot of interaction on this issue in the past 5 years, I will go ahead and close it for now until we get more feedback about supporting these quantifier's notation.

TheConstructor commented 2 years ago

@joperezr would you accept a PR that tries to translate x*+ to (?>x*)?

joperezr commented 2 years ago

Our Regex engine at parse time already tries to make groups atomic if possible, so today if you have a pattern like x* we are already translating that to instead be (?>x*), since that is much more performant as it doesn't backtrack. This, however, doesn't happen if the loop can't automatically be made atomic, for example, a pattern like x*\w won't automatically turn the x into an atomic group, since that might conflict with the \w as inputs like xxxx won't match that pattern anymore.

would you accept a PR that tries to translate x+ to (?>x)?

This would require some consideration as it would be a breaking change (Today, this pattern will throw an exception at construction time since today we don't allow nested quantifiers like that unless you use parentheses). I think we should do this only if it is justified by this being a highly used feature, and I don't think that's the case. @stephentoub I suppose that the data you have for those millions of patterns are all used in .NET and hence won't be using this feature, but do we also have some data of non-.NET patterns to see how many of them use possessive quantifiers?

stephentoub commented 2 years ago

@stephentoub I suppose that the data you have for those millions of patterns are all used in .NET and hence won't be using this feature, but do we also have some data of non-.NET patterns to see how many of them use possessive quantifiers?

Right, the .NET syntax doesn't support the possessive quantifer syntactic sugar, so none of the patterns in our collection will use them, since they all parse correctly.

You could look through the multilingual corpus of regexes in the links from https://github.com/dotnet/runtime/issues/62971 to see how popular possessive quantifiers are there.

TheConstructor commented 2 years ago

@joperezr

This would require some consideration as it would be a breaking change (Today, this pattern will throw an exception at construction time since today we don't allow nested quantifiers like that unless you use parentheses). I think we should do this only if it is justified by this being a highly used feature, and I don't think that's the case. @stephentoub I suppose that the data you have for those millions of patterns are all used in .NET and hence won't be using this feature, but do we also have some data of non-.NET patterns to see how many of them use possessive quantifiers?

I put together this short LINQPad-script to get an upper-boundary from the data-set mentioned in #62971:

async Task Main()
{
    var lines = File.ReadLines(Path.Combine(Path.GetDirectoryName(LINQPad.Util.CurrentQueryPath), "uniq-regexes-8.json"));
    var count = 0L;
    var useCount_IStype_to_nPosts = new Dictionary<string, long>();
    var useCount_registry_to_nModules = new Dictionary<string, long>();
    foreach(var line in lines)
    {
        try
        {
            var pattern = JsonSerializer.Deserialize<RegPattern>(line);
            if (pattern.pattern.Contains("++") || pattern.pattern.Contains("?+") || pattern.pattern.Contains("*+"))
            {
                //pattern.Dump(pattern.pattern, collapseTo: 0);
                MergeDictionaries(useCount_IStype_to_nPosts, pattern.useCount_IStype_to_nPosts);
                MergeDictionaries(useCount_registry_to_nModules, pattern.useCount_registry_to_nModules);
                count++;
            }
        }
        catch(JsonException e)
        {
            e.Dump(line, collapseTo: 0);
        }
    }
    count.Dump();
    useCount_IStype_to_nPosts.Dump("useCount_IStype_to_nPosts");
    useCount_registry_to_nModules.Dump("useCount_registry_to_nModules");
}

record RegPattern(string pattern, string[] supportedLangs, string type, Dictionary<string, long> useCount_IStype_to_nPosts, Dictionary<string, long> useCount_registry_to_nModules);

private static void MergeDictionaries(IDictionary<string, long> target, IReadOnlyDictionary<string, long> source)
{
    foreach(var (key,value) in source)
    {
        target.TryGetValue(key, out var targetValue);
        target[key] = targetValue + value;
    }
}

It finds 2088 unique pattern with useCount_IStype_to_nPosts

Key Value
StackOverflowRegexSource 179
RegExLibRegexSource 1

and useCount_registry_to_nModules

Key Value
packagist 889
npm 5830
cpan 293
pypi 159
rubygems 554
maven 113
godoc 146
crates.io 15

What I didn't realize, but the paper found, is that possessive quantifiers using + are supported in Java, PHP, Ruby and Perl, not supported in JavaScript and Go, but worst: the + is seemingly ignored in Rust (see paper page 7 or 449).

I do realize, that supporting this would change the behaviour of the Regular Expression Engine, but I can hardly see where replacing an exception by an actual implementation would yield to severely broken programs. In the end, why would you have a pattern, that always throws an exception in your program? And why is this exception driving your program? There is of course a certain chance of accidental usage in new source code.

Lastly possessive quantifiers are one of the lesser performance hogs (compared to regular quantifiers, recursive patterns or balancing groups), as they rule out backtracking. Having them offers a more concise way of eliminating backtracking than atomic groups, and possessive quantifiers are (in my opinion) easier to maintain, as the are directly obvious at the respective quantifier.

I understand, that they are not high up on your priority list, but I would like to have the opportunity to submit a PR into review.

stephentoub commented 2 years ago

I'm not concerned about taking something that fails to parse and making it valid; that is, for example, the case for pretty much every new C# feature. My concern would be if there's anything that today is valid with one meaning and this would introduce an ambiguity, or if it would lead to any kind of inconsistencies, or if it would cause any meaningful slowdown in the parser. If adding this syntactic sugar doesn't harm anything existing, I'd personally be ok seeing it added, but it's not a priority for our team to implement.

danmoseley commented 1 year ago

@stephentoub since it sounds like we'd take a change for this in principle, can we keep this open in case a community members is interested?

stephentoub commented 1 year ago

If it's doable in non-breaking manner (e.g. something that parses one way today changes tomorrow), sure.