dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.14k stars 4.71k forks source link

Add RegexGenerator support for RegexOptions.NonBacktracking #61903

Open stephentoub opened 2 years ago

stephentoub commented 2 years ago

Currently, if you specify RegexOptions.NonBacktracking with the Regex source generator, e.g.

[RegexGenerator("abc", RegexOptions.NonBacktracking)]
private static partial Regex Example();

you get code like this:

    private static partial Regex Example() => GeneratedRegex_Example_A6B1D425.Instance;

    private sealed class GeneratedRegex_Example_A6B1D425 : Regex
    {
        public static Regex Instance { get; } = new Regex("abc", RegexOptions.NonBacktracking);
    }

In other words, you just get a cached instance of constructing Regex rather than having the implementation actually output in the source. That's because we currently special-case this flag: https://github.com/dotnet/runtime/blob/69b5d67d9418d672609aa6e2c418a3d4ae00ad18/src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs#L107-L110 due to it not yet being implemented: https://github.com/dotnet/runtime/blob/69b5d67d9418d672609aa6e2c418a3d4ae00ad18/src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs#L676-L680

We should implement it :smile:, at least for a subset of expressions, e.g. ones where we can fully construct a reasonably-sized DFA graph at build time and emit an executable representation of it into the source.

cc: @olsaarik, @veanes, @joperezr, @danmoseley

ghost commented 2 years ago

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions See info in area-owners.md if you want to be subscribed.

Issue Details
Currently, if you specify RegexOptions.NonBacktracking with the Regex source generator, e.g. ```C# [RegexGenerator("abc", RegexOptions.NonBacktracking)] private static partial Regex Example(); ``` you get code like this: ```C# private static partial Regex Example() => GeneratedRegex_Example_A6B1D425.Instance; private sealed class GeneratedRegex_Example_A6B1D425 : Regex { public static Regex Instance { get; } = new Regex("abc", RegexOptions.NonBacktracking); } ``` In other words, you just get a cached instance of constructing `Regex` rather than having the implementation actually output in the source. That's because we currently special-case this flag: https://github.com/dotnet/runtime/blob/69b5d67d9418d672609aa6e2c418a3d4ae00ad18/src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs#L107-L110 due to it not yet being implemented: https://github.com/dotnet/runtime/blob/69b5d67d9418d672609aa6e2c418a3d4ae00ad18/src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs#L676-L680 We should implement it :smile:, at least for a subset of expressions, e.g. ones where we can fully construct a reasonably-sized DFA graph at build time and emit an executable representation of it into the source. cc: @olsaarik, @veanes, @joperezr, @danmoseley
Author: stephentoub
Assignees: -
Labels: `area-System.Text.RegularExpressions`
Milestone: 7.0.0
LennoxP90 commented 1 year ago

has any headway happened on this subject?

stephentoub commented 1 year ago

has any headway happened on this subject?

No

danmoseley commented 1 year ago

@LennoxP90 can you share what motivates your interest?

LennoxP90 commented 1 year ago

just the non backtracking feature so we can have the best of both worlds fast startup and non backtracking support

danmoseley commented 1 year ago

@LennoxP90 just curious what is your use case for non backtracking - defense in depth against flawed patterns, use of untrusted patterns, better perf in particular case, etc..

danmoseley commented 1 year ago

(Your post makes complete sense of course just interested to learn more about users of this engine)

LennoxP90 commented 1 year ago

i may not need it but i have a regex that determines if an input is a link or not here is my regex: @"\b(?:https?://)\S+\b" I just blanketly set all my regex to use non backtracking for a possible optimization maybe I am using the feature incorrectly I do not know.

joperezr commented 1 year ago

The RegexOptions.NonBacktracking engine's main use case, is to ensure performance that it is linear to the length of the input, which is something that people care about in order to prevent catastrophic backtracking, also known as a ReDOS attack. In other words, there are patterns that have some characteristics which makes them prone to be exploited by specially crafted short inputs which can make a backtracking engine to loop excessively to the point it may crash the process. A non-backtracking engine on the other hand, won't loop in cycles the same way, so the only way to have one of those engines do more work, is literally by having a super large input.

That said, for the vast majority of cases a backtracking engine will be much faster into looking for a match (especially for the cases when a match exists) which is why most people that care about performance and know what their pattern is at compile time should opt for the source generated or compiled engine. For your specific pattern called above, there are only two loops, the optional s in https (which will be made an atomic loop so this makes this loop no longer relevant), and the \S+ part. One potential thing to consider is to remove the \b at the end, as the \S+ will already be a greedy loop and match as many non-whitespace characters as possible until it reaches a boundary. Another option is to mark that part as atomic loop so you prevent backtracking like (?>\S+). In any case, your pattern is not very likely to hit catastrophic backtracking (unless I'm missing something obvious) so if you want an optimized performance, you'd probably want to use the Source generated engine instead of RegexOptions.NonBacktracking.

I believe this is why @danmoseley above asked if your use case involved running unkown or untrusted patterns, which is a very valid reason why to use RegexOptions.NonBacktracking and sacrifice a bit of perf for cases there is a match, for the advantage of reducing the likelihood of having a ReDOS attack. Our colleague @stephentoub wrote a very good explanation of all of this and a description of the NonBacktracking feature in this blog post, I'd highly recommend checking it out 😉