dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.17k stars 4.72k forks source link

Skip unnecessary RegexNode reductions in NonBacktracking mode #62681

Closed stephentoub closed 2 years ago

stephentoub commented 2 years ago

As part of parsing a regex, we perform various transformations on the tree, many of which are focused on creating a tree that will yield more efficient matching with a backtracking engine, e.g. we'll transform "this|that" into "th(?:is|at)". But many of these transformations don't affect a finite automata built from this tree, e.g. "this|that" and "th(?:is|at)" yield identical DFAs. Thus, in NonBacktracking mode, we should avoid spending any cycles doing these transformations that won't make a difference.

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions See info in area-owners.md if you want to be subscribed.

Issue Details
As part of parsing a regex, we perform various transformations on the tree, many of which are focused on creating a tree that will yield more efficient matching with a backtracking engine, e.g. we'll transform "this|that" into "th(?:is|at)". But many of these transformations don't affect a finite automata built from this tree, e.g. "this|that" and "th(?:is|at)" yield identical DFAs. Thus, in NonBacktracking mode, we should avoid spending any cycles doing these transformations that won't make a difference.
Author: stephentoub
Assignees: -
Labels: `area-System.Text.RegularExpressions`
Milestone: 7.0.0
danmoseley commented 2 years ago

Does this show up in a profile such that it's worth the special casing?

stephentoub commented 2 years ago

Does this show up in a profile such that it's worth the special casing?

Yes. And we also already special-case out of necessity, e.g. the analysis we do to automatically apply atomicity are disabled because atomic groups aren't supported in the DFA-based implementation today.

stephentoub commented 2 years ago

After discussions with @veanes and @olsaarik, it seems like in general these optimizations for backtracking are still valuable for non-backtracking, just in order to reduce the lazy construction time. So, we'll leave it all and can just disable ones on a case-by-case basis as need proves.