Open jzabroski opened 6 years ago
I don't like that this would depend on Environment.NewLine
. Transferring files between different OSes is common, so I think there is a good chance you will need to process files with different line endings than those that are native to your OS. Instead, I think a setting that would match any line ending (independent of the current OS) would be better. It could be called something like AnyNewLine
.
Here is a table comparing the options (✓ means processing files with the given line-ending style works well on the given OS, ✗ means it does not):
OS | Line-ending style | Current | EnvironmentNewLine |
AnyNewLine |
---|---|---|---|---|
Windows | Windows | ✗ | ✓ | ✓ |
Windows | Unix | ✓ | ✗ | ✓ |
Unix | Windows | ✗ | ✗ | ✓ |
Unix | Unix | ✓ | ✓ | ✓ |
Petr, that is exactly what I meant if you looked at how I defined it: (?=[\r\n|\z) can match any mix of line endings, even files trashed by mixed line endings. AnyNewline is a MUCH better name than what I came up with. Love teamwork!
Adding a portable sigil for Environment.NewLine probably isn't a bad idea, but the only pragmatic way I could think of doing it would be to export all Regex sigils and let the consumer of the API reprogram the sigil mapping. I thought about this, but felt it was too complicated and also ruin the tooling around current .NET regular expressions. Sites like RegexHero make writing Regex so easy that I wanted a solution that would be easy for tools to adopt.
The only gotchas I typically encounter are forgetting to escape a pipe, and then forgetting about how line ending semantics works. Tools help me with the former, but the latter always trips me up.
On Sat, Mar 24, 2018, 8:32 AM Petr Onderka notifications@github.com wrote:
I don't like that this would depend on Environment.NewLine. Transferring files between different OSes is common, so I think there is a good chance you will need to process files with different line endings than those that are native to your OS. Instead, I think a setting that would match any line ending (independent of the current OS) would be better. It could be called something like AnyNewLine.
Here is a table comparing the options (✓ means processing files with the given line-ending style works well on the given OS, ✗ means it does not): OS Line-ending style Current EnvironmentNewLine AnyNewLine Windows Windows ✗ ✓ ✓ Windows Unix ✓ ✗ ✓ Unix Windows ✗ ✗ ✓ Unix Unix ✓ ✓ ✓
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dotnet/corefx/issues/28410#issuecomment-375884537, or mute the thread https://github.com/notifications/unsubscribe-auth/AAbT_WOjuiqujVh1zPVyNau26LElEe-aks5thjzMgaJpZM4S473p .
Actually, I see now that in my original description my description of the behavior is inconsistent. Thank you for that wonderful table. I will update the first post with it.
@jnm2 @svick I updated the original description. Please give feedback.
I like the idea and its cross-platform benefit.
I think a setting that would match any line ending
How is any line ending defined? The name suggests that it can handle any line ending, are there more than Unix and Windows line ending chars? We could probably refine the name.
@ViktorHofer
How is any line ending defined?
According to Wikipedia's summary of the Unicode standard, there are 8 character sequences that form a line break (7 single characters, including CR and LF, and the sequence CR+LF).
Though I think those other characters are not widely used, so supporting only either LF and CR+LF (Unix and Windows) or CR, LF and CR+LF (old Mac, Unix and Windows) would be fine, especially if it lead to better performance than fully supporting the Unicode definition of line break.
@jzabroski
(?=[\r\n]|\z)
I don't understand this regex, as far as I can tell, it wouldn't match \r\n
, and so \r\n
would be reported as two line breaks.
I'll write some test cases with XUnit and use that as a specification. We can then tweak the test cases to come up with a final specification/acceptance criteria. Fare could also be used to confirm.
potential hackathon candidate if we review the API addition in time.
cc @karelz
Looks good. A few comments:
#if DBG
in code)AcceptAllLineEndings
?@terrajobst
The table looks wrong (Windows on Windows on the Current should work IMHO)
I don't think the table is wrong, see the documentation linked in the top post, where it says that you should use \r?$
to match Windows-style line endings in multiline mode.
You can also try running code like the following:
var match = Regex.Match("foo\r\nbar", ".*$", RegexOptions.Multiline);
foreach (char c in match.Value)
Console.WriteLine($"{c} (0x{(int)c:X2})");
For me, running it on .Net Core 2.1 RTM on Windows 10 prints:
f (0x66)
o (0x6F)
o (0x6F)
(0x0D)
While if this was "working well", I would expect to get just the three characters of "foo"
as the match, not "foo\r"
.
@terrajobst I agree with @svick . Intuitively, all his table is saying is that neither $ nor Environment.NewLine is fully portable solution for the two ubiquitous platforms. AnyNewLine therefore proposes to fill that gap.
@terrajobst
I'll update the original post to modify the enum options to be 0x0400 (1024)
Taking a stab at this for the Hackaton
Ok, assigned you.
Had some fun this afternoon going through the internals of System.Text.RegularExpressions
I pushed what I have so far here: https://github.com/Mpdreamz/corefx/commit/4c4d6882802feae0a50e19dd18033ab6af648270
I might pick it up for fun at some point again, feel free to unassign me 😄
wow that's great progress! it's up to you if you want to continue working on it or pass it on to me / other community members.
Unassigning for someone else to look. Thanks for the start!
Hi @danmosemsft, I would like to have a go at this. Can you please get this assigned to me. Thanks.
@shishirchawla sounds great, go ahead! BTW I sent you a collaborator invite. When you accept LMK and I will assign this formally.
Assigned, it's all yours @shishirchawla work on any cadence you want and just post here if you have any questions or issues.
Hi @danmosemsft / @jzabroski ,
Heres my PR for the new API - https://github.com/dotnet/corefx/pull/41195 . This is my first PR on the project, so please bare with me.
Looking forward to your feedback.
Shishir
@danmosemsft @terrajobst @ViktorHofer @svick It seems that during the process of implementing this feature, @shishirchawla discovered an area to improve the API a tiny bit further. If you go to the Anchors documentation, there are two anchors, $
and \Z
, which mention Unix-specific \n
line ending. I've updated the top post of this issue with the full notes, including @terrajobst API review notes, so that everything is consolidated in one post.
I like @shishirchawla 's improvement, but I think the final focus should be on how this is communicated in the Docs, such as https://docs.microsoft.com/en-us/dotnet/standard/base-types/anchors-in-regular-expressions and ensuring full test coverage.
@terrajobst could you please give your take on the \Z
change? Does this need to go back to review or is it arguably what was intended?
@danmosemsft So sorry I'm a foot dragger on this. Been trying to keep up with my own open source project, life, work, etc.
Relevant examples / test cases:
https://docs.microsoft.com/en-us/dotnet/standard/base-types/anchors-in-regular-expressions#end-of-string-or-line- https://docs.microsoft.com/en-us/dotnet/standard/base-types/anchors-in-regular-expressions#end-of-string-or-before-ending-newline-z https://docs.microsoft.com/en-us/dotnet/standard/base-types/anchors-in-regular-expressions#end-of-string-only-z
The examples are a bit idiosyncratic in that they reference Environment.NewLine
, and it probably isn't immediately clear to the documentation writer that the example output will change depending on the runtime OS environment.
ping @terrajobst
The name AnyNewLine
works for me. Unless someone of @dotnet/fxdc disagress, I wouldn't bring this back to API review. Our feedback was mostly around semantics, and that seems to have been take care of.
@terrajobst The question wasn't about AnyNewLine
(that was what was originally approved) but confirming that you expected the new behavior for \Z
as well as $
. I cannot imagine you don't...
I still owe you all some sample documentation, I think, too... been stuck on various other projects and yak shaving.
@jzabroski @shishirchawla , @terrajobst remark is sufficient for me. We can assume \Z
is desired as well.
@danmosemsft Great - @shishirchawla still needs to update the PR with the review comments. I do think we want to explicitly cover the test case given by @svick earlier, as its exactly the scenario that confuses everyone. My head hurts every time I try to predict the right answer after not looking at this thread for a week.
@danmosemsft @shishirchawla Updated the this Issue Title to reflect reality.
I was working to rebase #1449 on master, but it needs partly reimplementing at this point. And I realized there are some omissions, but a larger issue is that adding support in the same way for the other 3-5 kinds of newlines in TR18#1.6 will make the code unwieldy. Currently \n
is already specially treated in 13 places in the interpreter, compiler, and each of those would need an update to recognize the other line endings, lengthening hot code paths, with on top of that the special complication that \r\n
is two characters and should be treated as one.
I thought about this for some time and realized that we could instead "lower" several of these into equivalent primitive components, so that nothing but the parser need know about them. Here is how they would "lower", as best I can determine:
flags | ^ lowered to (no capture) | $ lowered to (no capture) | \Z lowered to (no capture) |
---|---|---|---|
neither | \A |
(?=\z\|\n\z) |
(Same as $ with neither) |
RegexOptions.AnyNewLine (new) |
\A |
(?=\r\n\z\|[\r\u0085\u2028\u2029]?\z)\|(?<!\r)(?=\n\z) |
(Same as $ with RegexOptions.AnyNewLine ) |
RegexOptions.Multiline |
(?<=\A\|\n) |
(?=\z\|\n) |
(Same as $ with neither) |
RegexOptions.Multiline \| RegexOptions.AnyNewLine (new) |
(?<=\A\|\r\n\|[\n\u0085\u2028\u2029])\|(?<=\r)(?!\n) |
(?=\r\n\|[\r\u0085\u2028\u2029]\|\z)\|(?<!\r)(?=\n) |
(Same as $ with RegexOptions.AnyNewLine ) |
flags | . lowered to (it captures) |
---|---|
neither | [^\n] |
RegexOptions.AnyNewLine (new) |
[^\n\r\u0085\u2028\u2029] |
RegexOptions.SingleLine |
. |
RegexOptions.SingleLine \| RegexOptions.AnyNewLine (new) |
(\r\n\|(?!\r\n).) |
The above encapsulates some special behaviors:
$
and .
(assuming AnyNewLine) must be satisfied only once in \r\n
, but otherwise everything else is fair game - eg., \r\r
is two matches.$
and ^
are asymmetrical. $
is satisfied in "abc\n" twice - both at the \n
and the end (same as PCRE, but not Python) whereas ^
is only satisfied in "\nabc" one time. (Of course, if you're not matching \n
then you'll never notice.). This is the same with \Z
- it will be satisfied twice in "abc\n".TR18#1.6 also recommends a \R
to match any newline. That's a separate proposal, but it could be easily accommodated in the parser by lowering it:
flags | \R (if added) lowered to (it captures) |
---|---|
any | \r\n\|(?!\r\n)[\n\r\u0085\u2028\u2029] |
Note this is not the same as the example pattern in TR18.
All the above excludes \v
and \f
: it's clear from TR18 they MAY match, but it's not clear whether they MUST. The neat thing about lowering is that if we want them, it's just a matter of adding them next to \u0085
etc in the parser and not in 13 other places.
Of course, for efficiency reasons we probably want to leave today's special knowledge of line endings in the runners, because we probably can't currently evaluate (?=\z|\n\z)
as efficiently as a simple length and single character check. But the new AnyNewLine behavior could lower, as it's probably OK if it's not quite as fast. The performance without AnyNewLine would not be affected. Later, we could look at optimizing, since for example (?=\r\n|[\r\u0085\u2028\u2029]|\z)|(?<!\r)(?=\n)
is not going to be an efficient way to match \n
.
In short, I think we should start again and this time only touch the parser.
Thoughts -- @stephentoub @jzabroski ?
@danmosemsft
But the new AnyNewLine behavior could lower, as it's probably OK if it's not quite as fast. […] Later, we could look at optimizing […]
I think that the speed of AnyNewLine
should be on par with the current behavior, at least as long as we're talking about the common case of dealing only with \n
and \r\n
. Would it be reasonable to somehow detect the presence of the less common line endings and only switch to the less efficient approach if they are found?
Or are you saying that the performance of AnyNewLine
with your proposed implementation could be on par with the current behavior, but it's not necessary for the initial implementation to get there?
that the performance of AnyNewLine with your proposed implementation could be on par with the current behavior, but it's not necessary for the initial implementation to get there?
This, although it's never going to be possible to be as good given there's necessarily more tests and branches.
@danmosemsft Thanks for putting this together. This issue has been on my mind for some time, but life keeps getting in the way!
The main reason I did not put this on my plate is precisely the risk of causing a regression in behavior that you highlight by the thirteen different places that need to be considered. Perhaps I am too detail-oriented, but as we are nearing .NET Core 5, my thought was, "Maybe this should wait for the next minor version?"
In terms of your design proposal, I agree that making the code long-term easier to understand and maintain is a big win. Having the parser handle the phrase as a token makes sense, and also cuts down the risk of a major regression due to unforeseen feature combinations. It would also be easier to lift back up to the interpreter / compiler the token. Plus, you could have a compiler that analyzes the parse tree to determine if it is an "extended Regular Expression" or actual "regular language Regular Expression." Similar to how .NET Expression tree API allows preferInterpretation. In most cases, it would be convenient to have the high-level language services just transparently do what's optimal.
@stephentoub For .NET vNext, can you comment on @danmosemsft ideas to lift this: https://github.com/dotnet/runtime/issues/25598#issuecomment-666955410
I like the simplicity and isolated nature of that approach, and I think it's reasonable to start, at least for a prototype, and see a) whether it's functionally correct with all the various new tests we throw at it, and b) what the performance hit is without further optimization. Assuming it's functionally sound and the perf hit is reasonable, it sounds like a good way to go for now; if in the future the perf of it became more of an issue, we could address that in a variety of ways, either by adding more optimizations for the patterns it employs, or by switching away from doing it purely in the parser as a lowering step.
Now we just need a volunteer...
(Just got back from vacation.) I'm interested in doing it, and think I almost got my two open source projects in a state where they're ready for v5. Anyway, I think it would be a lot of fun to do - if I use that word. :)
Great, I assigned you @jzabroski . As @stephentoub mentions, you'll probably want to start by prototyping and getting perf measurements. Thanks!
@jzabroski I cleared your assignment.
@danmoseley Thanks. Sorry about not finishing this. I caught Covid shortly after you assigned this and my time for additional work outside my normal hours diminished (long recovery).
no problem! sorry to hear about that 😿 you're always welcome to contribute here or to anything else in the repo.
Returning to this briefly, I wanted to point out that the behavior described in the top post is not what we want. It says that "The match must occur at the end of the string or before \r\n, \n or \r at the end of the string."
That is not desirable, because \r\n
is logically a single line break. This how it is treated by all text editors, that is what is intended when code (usually on Windows) writes it. Also, the Unicode regex standard TR18 explicitly indicates that they go together indivisibly. And it is the behavior of PCRE, Perl, Java etc when you use \R.
That's why in my table above, I have $ equivalent to (effectively) (?=\r\n\z|\r?\z)|(?<!\r)(?=\n\z)
(in non multiline) and not (?=\r\n\z|\r\z|\n\z|\z)
as in the topmost post.
@stephentoub I hacked up what lowering might look like -- see https://github.com/dotnet/runtime/compare/main...danmoseley:runtime:anynewline?expand=1#diff-548465bb4dbdd05024f0676c944e3c2b89677c48257bf28dfb0afccde31ded0aR19
I wrapped the pattern in an iterator with the idea that it could show "real" locations in the pattern in error messages, but not sure whether it's worth that abstraction.
Do you think this is a worthwhile direction?
Edit: The approach makes those handful of tests pass. I'd need to do more tests/inspection to see whether there's other positional state that would need correction on inserting into the pattern like this, and update that too.
I suspect perf with this lowering approach may be more of a worry than the parser. Once the flag exists, folks may update existing patterns to use it, which may then be a perf trap if they are "hot". Possibly adding \R would be less of a problem in this respect as it would require modifying the pattern, so less "easy", and by putting \R in a non capturing group or an assertion as appropriate you can achieve some of the same things as AnyNewLine albeit potentially in a clumsy way (but a standard way).
So a possible approach might be to implement \R for .NET 8 then at a future point add AnyNewLine but properly, as it's own node type that the engines accommodate.
(In this model \R would match any new line out of the gate, but that's the same as PCRE: by default in PCRE, . $ \Z recognize only \n but \R recognizes all newline flavors.)
I didn't like the parser approach -- it's clumsy and will run slow. I'm working on supporting this properly, but in the generator only -- that way we can do it efficiently plus avoid adding any runtime cost for the regular case.
working in https://github.com/dotnet/runtime/compare/main...danmoseley:anynewline2?expand=1
I'm working on supporting this properly, but in the generator only
You mean someone else would add it for the other engines?
Potentially, in future, with care to avoid impacting perf in existing cases. As AnyNewLine would be new, I think it's OK if only the new engine supports it (meaning it will only work for static patterns) initially.
Just checking for more than \n is easy, the complexity is dealing with \r\n atomically
What are your thoughts, would it be acceptable to only support for static patterns (generator)
I think it's OK if only the new engine supports it (meaning it will only work for static patterns)
I'm not a fan of that. At a minimum, when we support it, it should be in all but the NonBacktracking engine, but ideally all of them. There are currently zero differences in supported functionality between the interpreter, compiler, and source generator; I'm not excited to add any. Plus, there are cases where the source generator will fall back to the other engines, eg case-insensitive back references, so whether this new option would work would be very confusing.
From MSDN:
This is by far one of the biggest gotchas with using .NET Regex class.
I suggest adding a RegexOptions.AnyNewLine which treats $ as matching both Windows' Environment.NewLine and UNIX' Environment.NewLine, regardless of the Environment running corefx.
Portability concerns According to Wikipedia, there are a ton of different operating systems, all with different line ending settings. The current implementation hardcodes Unix line-ending style. RegexOptions.AnyNewLine, defined as (?=[\r\n]|\z), would add support for Windows line-ending style.
The advise written in the current docs is actually not portable on Unix, which is becoming a more popular option. As it is suggested, \r?$ will capture one or two lines on Unix, and one on Windows. If you try running Windows assemblies with this hack on Linux, you will change the semantics of programs.
Backward compatibility concerns Fully backward compatible: This RegexOptions enum extension would not be a default, and so it would not break any clients with reasonably written code. The only existing code that might display different behavior would be reflection code that sets every option on RegexOptions enum variable. I really can't envision anyone doing this on purpose.
Here is Petr Onderka (@svick)'s summary:
Api Proposal
edit by @ViktorHofer
API Review Notes
Video
Looks good. A few comments:
We cannot use the proposed value of 128 because it's already taken (see #if DBG in code)Spec updated so thatAnyNewLine = 0x400
(1024).The table looks wrong (Windows on Windows on the Current should work IMHO)The table is correct. The fact this trips up experts just speaks to why this is a profound GOTCHA in the Core SDK.May be AcceptAllLineEndings?Some hallway testing I've done indicates AnyNewLine is a good name. Plus, (argument after the final name was chosen) this enumeration will be transliterated into a checkbox on Regular Expression Visualization tools like Regex Hero, so it is preferable to have a concise explanation for the feature to avoid excessive screen space.PR Review Notes
After work had started on the approved proposal, @danmosemsft asked if the scope of this feature should be changed to also adjust the meaning
\Z
anchor. @jzabroski suggested writing how the end user documentation will look after this change, as good docs will determine if it is a function step improvement in usability and reducing gotchas.Also, during the PR, it seems @shishirchawla also proposed AnyEndZ as a way to use AnyNewLine as an "Anchor Modifier", which will alter the meaning of '\Z' anchor in addition to altering the meaning of '$' anchor. The intent of this improvement appears to be to remove all platform-specific language from the Anchors documentation, which seems like a great improvement.
AnyNewLine as Anchor Modifier to \Z and $ Anchors
(?=\n\z\|\z)
\n
at the end of the string.$
with this option.)$
with this option.)RegexOptions.Multiline
(?=\n\|\n\z\|\z)
\n
anywhere in the string.(?=\n\z\|\z)
\n
at the end of the string.RegexOptions.Multiline \| RegexOptions.AnyNewLine
(?=\r\n\|\r\|\n\|\r\n\z\|\r\z\|\n\z\|\z)
\r\n
,\n
or\r
anywhere in the string.(?=\r\n\z\|\r\z\|\n\z\|\z)
\r\n
,\n
or\r
at the end of the string.RegexOptions.AnyNewLine
(?=\r\n\z\|\r\z\|\n\z\|\z)
\r\n
,\n
or\r
at the end of the string.$
with this option.)$
with this option.)