dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.17k stars 4.72k forks source link

System.Text.RegularExpressions work planned for .NET 7 #62758

Closed joperezr closed 2 years ago

joperezr commented 2 years ago

This issue captures the planned work for .NET 7. This list is expected to change throughout the release cycle according to ongoing planning and discussions, with possible additions and subtractions to the scope.

Summary

There are currently several efforts for fixing/improving our RegularExpressions engines. This includes test coverage efforts, perf-enhancing efforts, but a lot of new feature work as well. Even though we have individual issues for most of these efforts, this is meant to serve as an all-up tracking issue which shows all of these efforts and their category. Some issues could probably be placed in more than one category, but I tried to pick whichever category had the highest order bit for each. The list here will be updated with new work as it comes up, as well as with expected shipping milestones when we think some of them will land.

Planned for .NET 7

Main Regex Investments

API Proposals

General Perf Improvements

General Improvements

Improve Test Coverage

Non-Backtracking Engine Fixes and Improvements

Source Generator Fixes and Style Improvements

Performance Regressions That Need Fixing

Performance Investigations

cc: @stephentoub @danmoseley @olsaarik @veanes

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions See info in area-owners.md if you want to be subscribed.

Issue Details
# Summary There are currently several efforts for fixing/improving our RegularExpressions engines. This includes test coverage efforts, perf-enhancing efforts, but a lot of new feature work as well. Even though we have individual issues for most of these efforts, this is meant to serve as an all-up tracking issue which shows all of these efforts and their category. Some issues could probably be placed in more than one category, but I tried to pick whichever category had the highest order bit for each. The list here will be updated with new work as it comes up, as well as with expected shipping milestones when we think some of them will land. ## Main Regex Investments - [Non-backtracking Engine] [Investigate adding capture support for RegexOptions.NonBacktracking](https://github.com/dotnet/runtime/issues/61904) - [All Engines] [Augment Regex extensibility point for better perf and span-based matching](https://github.com/dotnet/runtime/issues/59629) - [All Engines] [Improve Regex support for automatically making loops atomic](https://github.com/dotnet/runtime/issues/62449) - [Source Generator with NonBacktracking] [Add RegexGenerator support for RegexOptions.NonBacktracking](https://github.com/dotnet/runtime/issues/61903) - [All Engines] [Overhaul Regex's handling of RegexOptions.IgnoreCase](https://github.com/dotnet/runtime/issues/61048) - [All Engines] [Determine if/how to handle CultureInfo better in Regex (and RegexGenerator)](https://github.com/dotnet/runtime/issues/59492) - [Source Generator] [Developers can precompile their Regex code for faster startup](https://github.com/dotnet/runtime/issues/44676) ## API Proposals - [[API Proposal]: RegexRunner.IsWordChar](https://github.com/dotnet/runtime/issues/62603) - [Obsolete some protected members of Regex{Runner}](https://github.com/dotnet/runtime/issues/62573) - [[API Proposal]: Regex.Count](https://github.com/dotnet/runtime/issues/61425) ## General Perf Improvements - [RegexNode.ComputeMinLength should do better for some Testref/Testgroup nodes](https://github.com/dotnet/runtime/issues/62723) - [Regex compiler/ source generator can use {Last}IndexOf for strings in greedy char loop](https://github.com/dotnet/runtime/issues/62670) - [Regex compiler / source generator can use IndexOf to speed up lazy loop ](https://github.com/dotnet/runtime/issues/62669) - [Regex UpdateBumpalong logic should be applicable to lazy loops as well](https://github.com/dotnet/runtime/issues/62677) - [Regex UpdateBumpalong logic is flawed and likely costing us perf](https://github.com/dotnet/runtime/issues/62676) - [Avoid RegexWriter for RegexOptions.Compiled, RegexOptions.NonBacktracking, and source generator](https://github.com/dotnet/runtime/issues/62450) - [Extend Regex 32-bit/64-bit read optimization to IgnoreCase as well](https://github.com/dotnet/runtime/issues/62459) - [Add Regex FindFirstChar optimization for a setloopatomic followed by a literal](https://github.com/dotnet/runtime/issues/62445) ## General Improvements - [Clean up some RegexNode code](https://github.com/dotnet/runtime/issues/62651) - [Use static abstract interface methods in SymbolicRegexMatcher](https://github.com/dotnet/runtime/issues/62650) - [Resolve inconsistency between \w and \b determination for what is a word character](https://github.com/dotnet/runtime/issues/62619) - [Document SRM regex approach/algorithms/terminology in README.md](https://github.com/dotnet/runtime/issues/60917) - [System.Text.RegularExpressions: Backport MS Docs documentation to triple slash](https://github.com/dotnet/runtime/issues/48972) - I[ndexOutOfRangeException in RegexInterpreter caused by not always resizing its stack appropriately](https://github.com/dotnet/runtime/issues/58786) ## Improve Test Coverage - [Port re2 regex tests](https://github.com/dotnet/runtime/issues/61896) - [Port nim regex tests](https://github.com/dotnet/runtime/issues/61895) - [Port rust regex tests](https://github.com/dotnet/runtime/issues/61894) - [Port pcre2 regex tests](https://github.com/dotnet/runtime/issues/61893) - [Test failure System.Text.RegularExpressions.Tests.RegexCultureTests.TestIgnoreCaseRelationBorderCasesInNonBacktracking(options: NonBacktracking)](https://github.com/dotnet/runtime/issues/60753) - [[iOS/tvOS] RegexCultureTests.Match_In_Different_Cultures test failures](https://github.com/dotnet/runtime/issues/60697) ## Non-Backtracking Engine Fixes and Improvements - [Skip unnecessary RegexNode reductions in NonBacktracking mode](https://github.com/dotnet/runtime/issues/62681) - [Optimize symbolic regex Antimirov mode to avoid so much allocation](https://github.com/dotnet/runtime/issues/60918) - [StressTestDeepNestingOfConcat takes a lot longer on NonBacktracking than other engines](https://github.com/dotnet/runtime/issues/60645) - [Match_Timeout_Loop_Throws test is not timing out with RegexOptions.NonBacktracking](https://github.com/dotnet/runtime/issues/60623) ## Source Generator Fixes and Style Improvements - [Regex source generator should include capture group name in comment](https://github.com/dotnet/runtime/issues/62715) - [Consider removing goto statements on Regex source-generated code](https://github.com/dotnet/runtime/issues/62657) - [Improve single-range IgnoreCase codegen for RegexCompiler / source generator](https://github.com/dotnet/runtime/issues/62647) - [Avoid need for pragma warning disables in Regex source generator](https://github.com/dotnet/runtime/issues/61666) - [Extend Regex alternation source generator optimization for backtracking](https://github.com/dotnet/runtime/issues/62441) - [Extend Regex alternation source generator optimization to small sets](https://github.com/dotnet/runtime/issues/62438) - [Determine if/how to handle default timeouts in RegexGenerator](https://github.com/dotnet/runtime/issues/59491) ## Performance Regressions That Need Fixing - [Regressions in System.Text.RegularExpressions.Tests.Perf_Regex_Common ](https://github.com/dotnet/runtime/issues/61972) ## Performance Investigations - [Investigate benefit of computing max regex match length](https://github.com/dotnet/runtime/issues/62697) - [Improve Regex's knowledge of captures and atomicity in the tree](https://github.com/dotnet/runtime/issues/62451) - [Explore using Aho-Corasick in Regex's FindFirstChar](https://github.com/dotnet/runtime/issues/62447) - [Experiment with inverting Regex scan loop](https://github.com/dotnet/runtime/issues/62443) cc: @stephentoub @danmoseley @olsaarik @veanes
Author: joperezr
Assignees: -
Labels: `area-System.Text.RegularExpressions`, `tracking`
Milestone: 7.0.0
danmoseley commented 2 years ago

thanks for pulling this together.

A user story is essentially a significant unit of user visible value that can be completed independently.

So this feels like it can probably actually be several issues each labeled with "user story" eg

All the stories would ultimately be parented by a single Theme, not sure what that will be yet.

huan086 commented 2 years ago

Feature request: can a more helpful message be shown when an OutOfMemoeyException is thrown in the regex source generator? Currently it shows OutOfMemoeyException and the generic message that source generation has failed and the code may not work correctly. It would be helpful to link to best practices to simplify the regex, and another link to show how to increase memory limit

Background: I have a 1 million character regex which is a word list in different languages, each word separated by |. It works (slowly) in regex interpreted mode. Hoping to increase performance using source generator but ran into OOM

stephentoub commented 2 years ago

I have a 1 million character regex which is a word list in different languages

Are you able to share your regex? It'd be interesting to see.

huan086 commented 2 years ago

I have a 1 million character regex which is a word list in different languages

Are you able to share your regex? It'd be interesting to see.

This regex labels each location name it finds with the Geoname ID. The interpreted regex typically runs within 5 seconds for a sentence to 60 seconds for paragraph of about 500 words.

Using a regex instead of a loop and IndexOf to take advantage of matching the longest name. E.g. finding the location "New York City" and "New York" will only match "New York City".

Cutting down the number of locations to 1/100 of the list below allows the RegexGenerator to run successfully, creating a source file that is 14MB

https://1drv.ms/t/s!AoUkMG56ao1eq_A4tUTHghOmnndp8A?e=zFPSgH

joperezr commented 2 years ago

How common is it for names of Geoname Id's to be substrings of other geoname Id's? I want to try to understand why you think Regex is the best solution to your problem.

huan086 commented 2 years ago

How common is it for names of Geoname Id's to be substrings of other geoname Id's? I want to try to understand why you think Regex is the best solution to your problem.

You are right, turns out Regex isn't the correct solution. A loop on every location name, finding all positions of the location name in text repeatedly using IndexOf, and handling the few cases where locations are substrings of other locations by looking at the positions matched before, runs faster. The Aho–Corasick algorithm mention in https://stackoverflow.com/a/498349/263003 would be the optimization I could use.

Originally I thought finding a set of strings within an input text is most straightforward with a Regex.

Back to the original comment I made, it would be helpful to offer some advice when bad regex pattern creates OOM.

joperezr commented 2 years ago

it would be helpful to offer some advice when bad regex pattern creates OOM.

That's good feedback, and we can think of something to solve that. I do think it is very rare to have 1 million character RegEx patterns, so I do think that OOM (at least in most of our engines) should be rare and more of an edge case. For .NET 7, we are introducing a new engine: The Non-Backtracking engine, which uses a DFA and NFA based algorithms underneath, which works mainly as a way to ensure linear processing times in terms of the input. This new engine does have a mechanism similar to what you suggest, and will in fact throw an exception when you call the constructor if your pattern is too large and could cause an OOM exception. We can probably think about sharing this concept of catching potential problematic patterns at construction time for the other engines. For more info on the new non-backtracking engine, and for knowledge on the rest of the improvements we've been working on for .NET 7, I would suggest checking out @stephentoub's regex blog post.

jzabroski commented 2 years ago

@stephentoub I read through your blog post, as @joperezr suggested, and I think there's a feature missing here that I am somewhat surprised didn't get even punted to a .NET 8 backlog somewhere, or at least deliberately listed as "out of scope" for time constraints:

Using Regex named groups, create structured DTOs to save the match data:

// Define DTO that holds match results
public partial class RegexDto
{
}

// Define regex
[RegexGenerator("(?<DayOfWeek>Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)", nameof(RegexDto))]
private static partial Regex MyCoolRegex();

// Would generate partial class equal to...
public partial class RegexDto
{
  public Group DayOfWeek { get; set; } 
}

// Perform match operation 
var regexDto = MyCoolRegex().Match(someText);

// New school approach to reading groups
var dayOfWeek = regexDto.DayOfWeek.Success
    ? Enum.Parse(typeof(DayOfWeek), regexDto.DayOfWeek.Value)
    : (DayOfWeek?)null);
// Old school approach to reading groups
var dayOfWeek = regexDto.Groups["DayOfWeek"].Success
    ? Enum.Parse(typeof(DayOfWeek), regexDto.Groups["DayOfWeek"].Value)
    : (DayOfWeek?)null);

The Visual Studio IntelliSense algorithm already has some basic knowledge around this, but it's currently strings-only. Here is a screenshot of some stupid regex I wrote, to munge some odd SQL tool data format:

image

Additional benefit would be possible eventual removal of intermediary allocations to GroupCollection and CaptureCollection.

Additional benefit would be that IntelliSense doesnt catch you if you copy-paste Regex code from somewhere else, and have a group name that doesnt exist in the regex itself.

Additionally, something I have always found rather silly about how Visual Studio pretty prints regexes, is that it doesn't provide an UI to visually aid in seeing each group as a standalone value. It is thus very hard to see whether parens are balanced or not. I've never understood why it just doesn't work like this, with underlines showing the named group zones:

image

Having a slightly different background color for groups would also be pretty helpful. - like Benjamin Moore Yellow Sorbet or Benjamin Moore Sun Dance.

The only reason I can think of, is analysis paralysis, where developers wondered how to visualize nested named groups. But I'd be willing to bet that most regex's in the wild are relatively flat in nature, as most people tend to think of "Named Groups as Parser Tokens", including myself. - Anything not in a named group is just "epsilon" or non-material "whitespace". For example, by far the most common use I have for regex's is a simple way to match filenames. I'm guessing this is one of the main uses people have for regexes: matching things like file name extension plus some filename prefix plus some formatted date stamp/datetime stamp munged somewhere before the extension, or some kind of monotonically increasing sequence id.

jzabroski commented 2 years ago

The Aho–Corasick algorithm mention in https://stackoverflow.com/a/498349/263003 would be the optimization I could use.

This is actually mentioned above: https://github.com/dotnet/runtime/issues/62447

In general, it would be nice if Aho-Corasick itself were a utility class. I had written a post somewhere on reddit on my experiments creating very fast needle in haystack string search using a parallel, C#-based, Aho-Corasick. The actual use case was for something like this. Except, I rather prefer an interface that models the data structure first, and therefore it built a Trie object, vs. generated a switch statement that sliced through the string as though it were a flattened trie. Having a Trie-centric approach would be more useful for things like IDEs where you'd want to be able to redo an algorithm after some inserts.

stephentoub commented 2 years ago

@joperezr, @jeffhandley, I've never been clear on the purpose of these epic issues. Does this need to remain open? As these are just links to other issues, they don't track anything themselves that isn't tracked elsewhere.

joperezr commented 2 years ago

Yes this can be closed. This was more for my personal tracking and planning, but I agree that the remaining work is tracked by individual issues so no point any longer on keeping this open.