Open stephentoub opened 2 years ago
Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions See info in area-owners.md if you want to be subscribed.
Author: | stephentoub |
---|---|
Assignees: | - |
Labels: | `area-System.Text.RegularExpressions` |
Milestone: | 7.0.0 |
has any headway happened on this subject?
has any headway happened on this subject?
No
@LennoxP90 can you share what motivates your interest?
just the non backtracking feature so we can have the best of both worlds fast startup and non backtracking support
@LennoxP90 just curious what is your use case for non backtracking - defense in depth against flawed patterns, use of untrusted patterns, better perf in particular case, etc..
(Your post makes complete sense of course just interested to learn more about users of this engine)
i may not need it but i have a regex that determines if an input is a link or not here is my regex:
@"\b(?:https?://)\S+\b"
I just blanketly set all my regex to use non backtracking for a possible optimization maybe I am using the feature incorrectly I do not know.
The RegexOptions.NonBacktracking
engine's main use case, is to ensure performance that it is linear to the length of the input, which is something that people care about in order to prevent catastrophic backtracking, also known as a ReDOS attack. In other words, there are patterns that have some characteristics which makes them prone to be exploited by specially crafted short inputs which can make a backtracking engine to loop excessively to the point it may crash the process. A non-backtracking engine on the other hand, won't loop in cycles the same way, so the only way to have one of those engines do more work, is literally by having a super large input.
That said, for the vast majority of cases a backtracking engine will be much faster into looking for a match (especially for the cases when a match exists) which is why most people that care about performance and know what their pattern is at compile time should opt for the source generated or compiled engine. For your specific pattern called above, there are only two loops, the optional s
in https (which will be made an atomic loop so this makes this loop no longer relevant), and the \S+
part. One potential thing to consider is to remove the \b
at the end, as the \S+
will already be a greedy loop and match as many non-whitespace characters as possible until it reaches a boundary. Another option is to mark that part as atomic loop so you prevent backtracking like (?>\S+)
. In any case, your pattern is not very likely to hit catastrophic backtracking (unless I'm missing something obvious) so if you want an optimized performance, you'd probably want to use the Source generated engine instead of RegexOptions.NonBacktracking.
I believe this is why @danmoseley above asked if your use case involved running unkown or untrusted patterns, which is a very valid reason why to use RegexOptions.NonBacktracking
and sacrifice a bit of perf for cases there is a match, for the advantage of reducing the likelihood of having a ReDOS attack. Our colleague @stephentoub wrote a very good explanation of all of this and a description of the NonBacktracking
feature in this blog post, I'd highly recommend checking it out 😉
Currently, if you specify RegexOptions.NonBacktracking with the Regex source generator, e.g.
you get code like this:
In other words, you just get a cached instance of constructing
Regex
rather than having the implementation actually output in the source. That's because we currently special-case this flag: https://github.com/dotnet/runtime/blob/69b5d67d9418d672609aa6e2c418a3d4ae00ad18/src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs#L107-L110 due to it not yet being implemented: https://github.com/dotnet/runtime/blob/69b5d67d9418d672609aa6e2c418a3d4ae00ad18/src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs#L676-L680We should implement it :smile:, at least for a subset of expressions, e.g. ones where we can fully construct a reasonably-sized DFA graph at build time and emit an executable representation of it into the source.
cc: @olsaarik, @veanes, @joperezr, @danmoseley