dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.22k stars 4.72k forks source link

Regex Match, Split and Matches should support RegexOptions.AnyNewLine as (?=\r\z|\n\z|\r\n\z|\z) #25598

Open jzabroski opened 6 years ago

jzabroski commented 6 years ago

From MSDN:

By default, $ matches only the end of the input string. If you specify the RegexOptions.Multiline option, it matches either the newline character (\n) or the end of the input string. It does not, however, match the carriage return/line feed character combination. To successfully match them, use the subexpression \r?$ instead of just $.

This is by far one of the biggest gotchas with using .NET Regex class.

I suggest adding a RegexOptions.AnyNewLine which treats $ as matching both Windows' Environment.NewLine and UNIX' Environment.NewLine, regardless of the Environment running corefx.

Portability concerns According to Wikipedia, there are a ton of different operating systems, all with different line ending settings. The current implementation hardcodes Unix line-ending style. RegexOptions.AnyNewLine, defined as (?=[\r\n]|\z), would add support for Windows line-ending style.

The advise written in the current docs is actually not portable on Unix, which is becoming a more popular option. As it is suggested, \r?$ will capture one or two lines on Unix, and one on Windows. If you try running Windows assemblies with this hack on Linux, you will change the semantics of programs.

Backward compatibility concerns Fully backward compatible: This RegexOptions enum extension would not be a default, and so it would not break any clients with reasonably written code. The only existing code that might display different behavior would be reflection code that sets every option on RegexOptions enum variable. I really can't envision anyone doing this on purpose.

Here is Petr Onderka (@svick)'s summary:

OS Line-ending style Current Environment.NewLine AnyNewLine
Windows Windows
Windows Unix
Unix Windows
Unix Unix

Api Proposal

edit by @ViktorHofer

namespace System.Text.RegularExpressions
{
    [Flags]
    public enum RegexOptions
    {
        None                    = 0x0000,
        IgnoreCase              = 0x0001, // "i"
        Multiline               = 0x0002, // "m"
        ExplicitCapture         = 0x0004, // "n"
        Compiled                = 0x0008, // "c"
        Singleline              = 0x0010, // "s"
        IgnorePatternWhitespace = 0x0020, // "x"
        RightToLeft             = 0x0040, // "r"

#if DEBUG
        Debug                   = 0x0080, // "d"
#endif

        ECMAScript              = 0x0100, // "e"
        CultureInvariant        = 0x0200,
+       AnyNewLine              = 0x0400 // Treat "$" as (?=[\r\n]|\z)
    }
}

API Review Notes

Video

Looks good. A few comments:

PR Review Notes

After work had started on the approved proposal, @danmosemsft asked if the scope of this feature should be changed to also adjust the meaning \Z anchor. @jzabroski suggested writing how the end user documentation will look after this change, as good docs will determine if it is a function step improvement in usability and reducing gotchas.

Also, during the PR, it seems @shishirchawla also proposed AnyEndZ as a way to use AnyNewLine as an "Anchor Modifier", which will alter the meaning of '\Z' anchor in addition to altering the meaning of '$' anchor. The intent of this improvement appears to be to remove all platform-specific language from the Anchors documentation, which seems like a great improvement.

AnyNewLine as Anchor Modifier to \Z and $ Anchors

flags $ is treated as $ documentation \Z is treated as \Z documentation
neither (?=\n\z\|\z) The match must occur at the end of the string or before \n at the end of the string. (Same as $ with this option.) (Same as $ with this option.)
RegexOptions.Multiline (?=\n\|\n\z\|\z) The match must occur at the end of the string or before \n anywhere in the string. (?=\n\z\|\z) The match must occur at the end of the string or before \n at the end of the string.
RegexOptions.Multiline \| RegexOptions.AnyNewLine (?=\r\n\|\r\|\n\|\r\n\z\|\r\z\|\n\z\|\z) The match must occur at the end of the string or before \r\n, \n or \r anywhere in the string. (?=\r\n\z\|\r\z\|\n\z\|\z) The match must occur at the end of the string or before \r\n, \n or \r at the end of the string.
RegexOptions.AnyNewLine (?=\r\n\z\|\r\z\|\n\z\|\z) The match must occur at the end of the string or before \r\n, \n or \r at the end of the string. (Same as $ with this option.) (Same as $ with this option.)
svick commented 6 years ago

I don't like that this would depend on Environment.NewLine. Transferring files between different OSes is common, so I think there is a good chance you will need to process files with different line endings than those that are native to your OS. Instead, I think a setting that would match any line ending (independent of the current OS) would be better. It could be called something like AnyNewLine.

Here is a table comparing the options (✓ means processing files with the given line-ending style works well on the given OS, ✗ means it does not):

OS Line-ending style Current EnvironmentNewLine AnyNewLine
Windows Windows
Windows Unix
Unix Windows
Unix Unix
jzabroski commented 6 years ago

Petr, that is exactly what I meant if you looked at how I defined it: (?=[\r\n|\z) can match any mix of line endings, even files trashed by mixed line endings. AnyNewline is a MUCH better name than what I came up with. Love teamwork!

Adding a portable sigil for Environment.NewLine probably isn't a bad idea, but the only pragmatic way I could think of doing it would be to export all Regex sigils and let the consumer of the API reprogram the sigil mapping. I thought about this, but felt it was too complicated and also ruin the tooling around current .NET regular expressions. Sites like RegexHero make writing Regex so easy that I wanted a solution that would be easy for tools to adopt.

The only gotchas I typically encounter are forgetting to escape a pipe, and then forgetting about how line ending semantics works. Tools help me with the former, but the latter always trips me up.

On Sat, Mar 24, 2018, 8:32 AM Petr Onderka notifications@github.com wrote:

I don't like that this would depend on Environment.NewLine. Transferring files between different OSes is common, so I think there is a good chance you will need to process files with different line endings than those that are native to your OS. Instead, I think a setting that would match any line ending (independent of the current OS) would be better. It could be called something like AnyNewLine.

Here is a table comparing the options (✓ means processing files with the given line-ending style works well on the given OS, ✗ means it does not): OS Line-ending style Current EnvironmentNewLine AnyNewLine Windows Windows ✗ ✓ ✓ Windows Unix ✓ ✗ ✓ Unix Windows ✗ ✗ ✓ Unix Unix ✓ ✓ ✓

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dotnet/corefx/issues/28410#issuecomment-375884537, or mute the thread https://github.com/notifications/unsubscribe-auth/AAbT_WOjuiqujVh1zPVyNau26LElEe-aks5thjzMgaJpZM4S473p .

jzabroski commented 6 years ago

Actually, I see now that in my original description my description of the behavior is inconsistent. Thank you for that wonderful table. I will update the first post with it.

jzabroski commented 6 years ago

@jnm2 @svick I updated the original description. Please give feedback.

ViktorHofer commented 6 years ago

I like the idea and its cross-platform benefit.

I think a setting that would match any line ending

How is any line ending defined? The name suggests that it can handle any line ending, are there more than Unix and Windows line ending chars? We could probably refine the name.

svick commented 6 years ago

@ViktorHofer

How is any line ending defined?

According to Wikipedia's summary of the Unicode standard, there are 8 character sequences that form a line break (7 single characters, including CR and LF, and the sequence CR+LF).

Though I think those other characters are not widely used, so supporting only either LF and CR+LF (Unix and Windows) or CR, LF and CR+LF (old Mac, Unix and Windows) would be fine, especially if it lead to better performance than fully supporting the Unicode definition of line break.

@jzabroski

(?=[\r\n]|\z)

I don't understand this regex, as far as I can tell, it wouldn't match \r\n, and so \r\n would be reported as two line breaks.

jzabroski commented 6 years ago

I'll write some test cases with XUnit and use that as a specification. We can then tweak the test cases to come up with a final specification/acceptance criteria. Fare could also be used to confirm.

ViktorHofer commented 6 years ago

potential hackathon candidate if we review the API addition in time.

cc @karelz

terrajobst commented 6 years ago

Video

Looks good. A few comments:

svick commented 6 years ago

@terrajobst

The table looks wrong (Windows on Windows on the Current should work IMHO)

I don't think the table is wrong, see the documentation linked in the top post, where it says that you should use \r?$ to match Windows-style line endings in multiline mode.

You can also try running code like the following:

var match = Regex.Match("foo\r\nbar", ".*$", RegexOptions.Multiline);

foreach (char c in match.Value)
    Console.WriteLine($"{c} (0x{(int)c:X2})");

For me, running it on .Net Core 2.1 RTM on Windows 10 prints:

f (0x66)
o (0x6F)
o (0x6F)
 (0x0D)

While if this was "working well", I would expect to get just the three characters of "foo" as the match, not "foo\r".

jzabroski commented 6 years ago

@terrajobst I agree with @svick . Intuitively, all his table is saying is that neither $ nor Environment.NewLine is fully portable solution for the two ubiquitous platforms. AnyNewLine therefore proposes to fill that gap.

jzabroski commented 6 years ago

@terrajobst

I'll update the original post to modify the enum options to be 0x0400 (1024)

Mpdreamz commented 6 years ago

Taking a stab at this for the Hackaton

ViktorHofer commented 6 years ago

Ok, assigned you.

Mpdreamz commented 6 years ago

Had some fun this afternoon going through the internals of System.Text.RegularExpressions

I pushed what I have so far here: https://github.com/Mpdreamz/corefx/commit/4c4d6882802feae0a50e19dd18033ab6af648270

I might pick it up for fun at some point again, feel free to unassign me 😄

ViktorHofer commented 6 years ago

wow that's great progress! it's up to you if you want to continue working on it or pass it on to me / other community members.

danmoseley commented 5 years ago

Unassigning for someone else to look. Thanks for the start!

shishirchawla commented 5 years ago

Hi @danmosemsft, I would like to have a go at this. Can you please get this assigned to me. Thanks.

danmoseley commented 5 years ago

@shishirchawla sounds great, go ahead! BTW I sent you a collaborator invite. When you accept LMK and I will assign this formally.

danmoseley commented 5 years ago

Assigned, it's all yours @shishirchawla work on any cadence you want and just post here if you have any questions or issues.

shishirchawla commented 5 years ago

Hi @danmosemsft / @jzabroski ,

Heres my PR for the new API - https://github.com/dotnet/corefx/pull/41195 . This is my first PR on the project, so please bare with me.

Looking forward to your feedback.

Shishir

jzabroski commented 5 years ago

@danmosemsft @terrajobst @ViktorHofer @svick It seems that during the process of implementing this feature, @shishirchawla discovered an area to improve the API a tiny bit further. If you go to the Anchors documentation, there are two anchors, $ and \Z, which mention Unix-specific \n line ending. I've updated the top post of this issue with the full notes, including @terrajobst API review notes, so that everything is consolidated in one post.

I like @shishirchawla 's improvement, but I think the final focus should be on how this is communicated in the Docs, such as https://docs.microsoft.com/en-us/dotnet/standard/base-types/anchors-in-regular-expressions and ensuring full test coverage.

danmoseley commented 5 years ago

@terrajobst could you please give your take on the \Z change? Does this need to go back to review or is it arguably what was intended?

jzabroski commented 5 years ago

@danmosemsft So sorry I'm a foot dragger on this. Been trying to keep up with my own open source project, life, work, etc.

Relevant examples / test cases:

https://docs.microsoft.com/en-us/dotnet/standard/base-types/anchors-in-regular-expressions#end-of-string-or-line- https://docs.microsoft.com/en-us/dotnet/standard/base-types/anchors-in-regular-expressions#end-of-string-or-before-ending-newline-z https://docs.microsoft.com/en-us/dotnet/standard/base-types/anchors-in-regular-expressions#end-of-string-only-z

The examples are a bit idiosyncratic in that they reference Environment.NewLine, and it probably isn't immediately clear to the documentation writer that the example output will change depending on the runtime OS environment.

danmoseley commented 5 years ago

ping @terrajobst

terrajobst commented 4 years ago

The name AnyNewLine works for me. Unless someone of @dotnet/fxdc disagress, I wouldn't bring this back to API review. Our feedback was mostly around semantics, and that seems to have been take care of.

danmoseley commented 4 years ago

@terrajobst The question wasn't about AnyNewLine (that was what was originally approved) but confirming that you expected the new behavior for \Z as well as $. I cannot imagine you don't...

jzabroski commented 4 years ago

I still owe you all some sample documentation, I think, too... been stuck on various other projects and yak shaving.

danmoseley commented 4 years ago

@jzabroski @shishirchawla , @terrajobst remark is sufficient for me. We can assume \Z is desired as well.

jzabroski commented 4 years ago

@danmosemsft Great - @shishirchawla still needs to update the PR with the review comments. I do think we want to explicitly cover the test case given by @svick earlier, as its exactly the scenario that confuses everyone. My head hurts every time I try to predict the right answer after not looking at this thread for a week.

jzabroski commented 4 years ago

@danmosemsft @shishirchawla Updated the this Issue Title to reflect reality.

danmoseley commented 4 years ago

I was working to rebase #1449 on master, but it needs partly reimplementing at this point. And I realized there are some omissions, but a larger issue is that adding support in the same way for the other 3-5 kinds of newlines in TR18#1.6 will make the code unwieldy. Currently \n is already specially treated in 13 places in the interpreter, compiler, and each of those would need an update to recognize the other line endings, lengthening hot code paths, with on top of that the special complication that \r\n is two characters and should be treated as one.

I thought about this for some time and realized that we could instead "lower" several of these into equivalent primitive components, so that nothing but the parser need know about them. Here is how they would "lower", as best I can determine:

flags ^ lowered to (no capture) $ lowered to (no capture) \Z lowered to (no capture)
neither \A (?=\z\|\n\z) (Same as $ with neither)
RegexOptions.AnyNewLine (new) \A (?=\r\n\z\|[\r\u0085\u2028\u2029]?\z)\|(?<!\r)(?=\n\z) (Same as $ with RegexOptions.AnyNewLine)
RegexOptions.Multiline (?<=\A\|\n) (?=\z\|\n) (Same as $ with neither)
RegexOptions.Multiline \| RegexOptions.AnyNewLine (new) (?<=\A\|\r\n\|[\n\u0085\u2028\u2029])\|(?<=\r)(?!\n) (?=\r\n\|[\r\u0085\u2028\u2029]\|\z)\|(?<!\r)(?=\n) (Same as $ with RegexOptions.AnyNewLine)
flags . lowered to (it captures)
neither [^\n]
RegexOptions.AnyNewLine (new) [^\n\r\u0085\u2028\u2029]
RegexOptions.SingleLine .
RegexOptions.SingleLine \| RegexOptions.AnyNewLine (new) (\r\n\|(?!\r\n).)

The above encapsulates some special behaviors:

TR18#1.6 also recommends a \R to match any newline. That's a separate proposal, but it could be easily accommodated in the parser by lowering it:

flags \R (if added) lowered to (it captures)
any \r\n\|(?!\r\n)[\n\r\u0085\u2028\u2029]

Note this is not the same as the example pattern in TR18.

All the above excludes \v and \f: it's clear from TR18 they MAY match, but it's not clear whether they MUST. The neat thing about lowering is that if we want them, it's just a matter of adding them next to \u0085 etc in the parser and not in 13 other places.

Of course, for efficiency reasons we probably want to leave today's special knowledge of line endings in the runners, because we probably can't currently evaluate (?=\z|\n\z) as efficiently as a simple length and single character check. But the new AnyNewLine behavior could lower, as it's probably OK if it's not quite as fast. The performance without AnyNewLine would not be affected. Later, we could look at optimizing, since for example (?=\r\n|[\r\u0085\u2028\u2029]|\z)|(?<!\r)(?=\n) is not going to be an efficient way to match \n.

In short, I think we should start again and this time only touch the parser.

Thoughts -- @stephentoub @jzabroski ?

svick commented 4 years ago

@danmosemsft

But the new AnyNewLine behavior could lower, as it's probably OK if it's not quite as fast. […] Later, we could look at optimizing […]

I think that the speed of AnyNewLine should be on par with the current behavior, at least as long as we're talking about the common case of dealing only with \n and \r\n. Would it be reasonable to somehow detect the presence of the less common line endings and only switch to the less efficient approach if they are found?

Or are you saying that the performance of AnyNewLine with your proposed implementation could be on par with the current behavior, but it's not necessary for the initial implementation to get there?

danmoseley commented 4 years ago

that the performance of AnyNewLine with your proposed implementation could be on par with the current behavior, but it's not necessary for the initial implementation to get there?

This, although it's never going to be possible to be as good given there's necessarily more tests and branches.

jzabroski commented 4 years ago

@danmosemsft Thanks for putting this together. This issue has been on my mind for some time, but life keeps getting in the way!

The main reason I did not put this on my plate is precisely the risk of causing a regression in behavior that you highlight by the thirteen different places that need to be considered. Perhaps I am too detail-oriented, but as we are nearing .NET Core 5, my thought was, "Maybe this should wait for the next minor version?"

In terms of your design proposal, I agree that making the code long-term easier to understand and maintain is a big win. Having the parser handle the phrase as a token makes sense, and also cuts down the risk of a major regression due to unforeseen feature combinations. It would also be easier to lift back up to the interpreter / compiler the token. Plus, you could have a compiler that analyzes the parse tree to determine if it is an "extended Regular Expression" or actual "regular language Regular Expression." Similar to how .NET Expression tree API allows preferInterpretation. In most cases, it would be convenient to have the high-level language services just transparently do what's optimal.

jzabroski commented 4 years ago

@stephentoub For .NET vNext, can you comment on @danmosemsft ideas to lift this: https://github.com/dotnet/runtime/issues/25598#issuecomment-666955410

stephentoub commented 4 years ago

I like the simplicity and isolated nature of that approach, and I think it's reasonable to start, at least for a prototype, and see a) whether it's functionally correct with all the various new tests we throw at it, and b) what the performance hit is without further optimization. Assuming it's functionally sound and the perf hit is reasonable, it sounds like a good way to go for now; if in the future the perf of it became more of an issue, we could address that in a variety of ways, either by adding more optimizations for the patterns it employs, or by switching away from doing it purely in the parser as a lowering step.

danmoseley commented 4 years ago

Now we just need a volunteer...

jzabroski commented 4 years ago

(Just got back from vacation.) I'm interested in doing it, and think I almost got my two open source projects in a state where they're ready for v5. Anyway, I think it would be a lot of fun to do - if I use that word. :)

danmoseley commented 4 years ago

Great, I assigned you @jzabroski . As @stephentoub mentions, you'll probably want to start by prototyping and getting perf measurements. Thanks!

danmoseley commented 2 years ago

@jzabroski I cleared your assignment.

jzabroski commented 2 years ago

@danmoseley Thanks. Sorry about not finishing this. I caught Covid shortly after you assigned this and my time for additional work outside my normal hours diminished (long recovery).

danmoseley commented 2 years ago

no problem! sorry to hear about that 😿 you're always welcome to contribute here or to anything else in the repo.

danmoseley commented 1 year ago

Returning to this briefly, I wanted to point out that the behavior described in the top post is not what we want. It says that "The match must occur at the end of the string or before \r\n, \n or \r at the end of the string."

That is not desirable, because \r\n is logically a single line break. This how it is treated by all text editors, that is what is intended when code (usually on Windows) writes it. Also, the Unicode regex standard TR18 explicitly indicates that they go together indivisibly. And it is the behavior of PCRE, Perl, Java etc when you use \R.

That's why in my table above, I have $ equivalent to (effectively) (?=\r\n\z|\r?\z)|(?<!\r)(?=\n\z) (in non multiline) and not (?=\r\n\z|\r\z|\n\z|\z) as in the topmost post.

danmoseley commented 1 year ago

@stephentoub I hacked up what lowering might look like -- see https://github.com/dotnet/runtime/compare/main...danmoseley:runtime:anynewline?expand=1#diff-548465bb4dbdd05024f0676c944e3c2b89677c48257bf28dfb0afccde31ded0aR19

I wrapped the pattern in an iterator with the idea that it could show "real" locations in the pattern in error messages, but not sure whether it's worth that abstraction.

Do you think this is a worthwhile direction?

Edit: The approach makes those handful of tests pass. I'd need to do more tests/inspection to see whether there's other positional state that would need correction on inserting into the pattern like this, and update that too.

I suspect perf with this lowering approach may be more of a worry than the parser. Once the flag exists, folks may update existing patterns to use it, which may then be a perf trap if they are "hot". Possibly adding \R would be less of a problem in this respect as it would require modifying the pattern, so less "easy", and by putting \R in a non capturing group or an assertion as appropriate you can achieve some of the same things as AnyNewLine albeit potentially in a clumsy way (but a standard way).

So a possible approach might be to implement \R for .NET 8 then at a future point add AnyNewLine but properly, as it's own node type that the engines accommodate.

(In this model \R would match any new line out of the gate, but that's the same as PCRE: by default in PCRE, . $ \Z recognize only \n but \R recognizes all newline flavors.)

danmoseley commented 1 year ago

I didn't like the parser approach -- it's clumsy and will run slow. I'm working on supporting this properly, but in the generator only -- that way we can do it efficiently plus avoid adding any runtime cost for the regular case.

working in https://github.com/dotnet/runtime/compare/main...danmoseley:anynewline2?expand=1

stephentoub commented 1 year ago

I'm working on supporting this properly, but in the generator only

You mean someone else would add it for the other engines?

danmoseley commented 1 year ago

Potentially, in future, with care to avoid impacting perf in existing cases. As AnyNewLine would be new, I think it's OK if only the new engine supports it (meaning it will only work for static patterns) initially.

Just checking for more than \n is easy, the complexity is dealing with \r\n atomically

danmoseley commented 1 year ago

What are your thoughts, would it be acceptable to only support for static patterns (generator)

stephentoub commented 1 year ago

I think it's OK if only the new engine supports it (meaning it will only work for static patterns)

I'm not a fan of that. At a minimum, when we support it, it should be in all but the NonBacktracking engine, but ideally all of them. There are currently zero differences in supported functionality between the interpreter, compiler, and source generator; I'm not excited to add any. Plus, there are cases where the source generator will fall back to the other engines, eg case-insensitive back references, so whether this new option would work would be very confusing.