Make enum RegexParseError and RegexParseException public

abelbraaksma commented 4 years ago

Background and Motivation

A regular expression object made with System.Text.Regex is essentially an ad-hoc compiled sub-language that's widely used in the .NET community for searching and replacing strings. But unlike other programming languages, any syntax error is raised as an ArgumentException. Programmers that want to act on specific parsing errors need to manually parse the error string to get more information, which is error-prone, subject to change and sometimes non-deterministic.

We already have an internal RegexParseException and two properties: Error and Offset, which respectively give an enum of the type of error and the location in the string where the error is located. When presently an ArgumentException is raised, it is in fact a RegexParseException which inherits ArgumentException.

I've checked the existing code and I propose we make RegexParseException and RegexParseError public, these are pretty self-describing at the moment, though the enum cases may need better named choices (suggested below) . Apart from changing a few existing tests and adding documentation, there are no substantive changes necessary.

Use cases

Online regex tools may use the more detailed info to suggest corrections of the regex to users (like: "Did you forget to escape this character?").
Debugging experience w.r.t. regular expressions improves.
Currently, getting col and row requires parsing the string, and isn't in the string in Framework. Parsing in i18n scenarios is next to impossible, giving an enum and position helps writing better, and deterministic code
Improve tooling by using the offset to place squiggles under errors in regexes.
Self-correcting systems may use the extra info to close parentheses or brackets, or fix escape sequences that are incomplete.
It is simply better to be able to check for explicit errors than the more generic ArgumentException which is used everywhere.
BCL tests on regex errors now uses reflection, this is no longer necessary.

Related requests and proposals

In this comment (https://github.com/dotnet/runtime/issues/13942#issuecomment-70185134) @terrajobst first proposed to open-up this exception for improving error handling.
Original proposal by @danmosemsft: https://github.com/dotnet/runtime/issues/1080
Tooling improvements with squiggles: https://github.com/dotnet/corefx/pull/29178#discussion_r182842595 and https://github.com/dotnet/runtime/issues/372#issuecomment-559245612

Proposed API

The current API already exists but isn't public. The definitions are as follows:

    [Serializable]
-    internal sealed class RegexParseException : ArgumentException
+    public class RegexParseException : ArgumentException
    {
        private readonly RegexParseError _error; // tests access this via private reflection

        /// <summary>Gets the error that happened during parsing.</summary>
        public RegexParseError Error => _error;

        /// <summary>Gets the offset in the supplied pattern.</summary>
        public int Offset { get; }

        public RegexParseException(RegexParseError error, int offset, string message) : base(message)
        {
+            // add logic to test range of 'error' and return UnknownParseError if out of range
            _error = error;
            Offset = offset;
        }

        public override void GetObjectData(SerializationInfo info, StreamingContext context)
        {
            base.GetObjectData(info, context);
            info.SetType(typeof(ArgumentException)); // To maintain serialization support with .NET Framework.
        }
    }

And the enum with suggested names for a more discoverable naming scheme. I followed "clarity over brevity" and have tried to start similar cases with the same moniker, so that an alphabetic listing gives a (somewhat) logical grouping in tooling.

I'd suggest we add a case for unknown conditions, something like UnknownParseError = 0, which could be used if users create this exception by hand with an invalid enum value.

Handy for implementers: Historical view of this prior to 22 July 2020 shows the full diff for the enum field by field. On request, it shows all as an addition diff now, and is ordered alphabetically.

-internal enum RegexParseError
+public enum RegexParseError
{
+    UnknownParseError = 0,    // do we want to add this catch all in case other conditions emerge?
+    AlternationHasComment,
+    AlternationHasMalformedCondition,  // *maybe? No tests, code never hits
+    AlternationHasMalformedReference,  // like @"(x)(?(3x|y)" (note that @"(x)(?(3)x|y)" gives next error)
+    AlternationHasNamedCapture,        // like @"(?(?<x>)true|false)"
+    AlternationHasTooManyConditions,   // like @"(?(foo)a|b|c)"
+    AlternationHasUndefinedReference,  // like @"(x)(?(3)x|y)" or @"(?(1))"
+    CaptureGroupNameInvalid,           // like @"(?< >)" or @"(?'x)"
+    CaptureGroupOfZero,                // like @"(?'0'foo)" or @("(?<0>x)"
+    ExclusionGroupNotLast,             // like @"[a-z-[xy]A]"
+    InsufficientClosingParentheses,    // like @"(((foo))"
+    InsufficientOpeningParentheses,    // like @"((foo)))"
+    InsufficientOrInvalidHexDigits,    // like @"\uabc" or @"\xr"
+    InvalidGroupingConstruct,          // like @"(?" or @"(?<foo"
+    InvalidUnicodePropertyEscape,      // like @"\p{Ll" or @"\p{ L}"
+    MalformedNamedReference,           // like @"\k<"
+    MalformedUnicodePropertyEscape,    // like @"\p{}" or @"\p {L}"
+    MissingControlCharacter,           // like @"\c"
+    NestedQuantifiersNotParenthesized  // @"abc**"
+    QuantifierAfterNothing,            // like @"((*foo)bar)"
+    QuantifierOrCaptureGroupOutOfRange,// like @"x{234567899988}" or @"x(?<234567899988>)" (must be < Int32.MaxValue)
+    ReversedCharacterRange,            // like @"[z-a]"   (only in char classes, see also ReversedQuantifierRange)
+    ReversedQuantifierRange,           // like @"abc{3,0}"  (only in quantifiers, see also ReversedCharacterRange)
+    ShorthandClassInCharacterRange,    // like @"[a-\w]" or @"[a-\p{L}]"
+    UndefinedNamedReference,           // like @"\k<x>"
+    UndefinedNumberedReference,        // like @"(x)\2"
+    UnescapedEndingBackslash,          // like @"foo\" or @"bar\\\\\"
+    UnrecognizedControlCharacter,      // like @"\c!"
+    UnrecognizedEscape,                // like @"\C" or @"\k<" or @"[\B]"
+    UnrecognizedUnicodeProperty,       // like @"\p{Lll}"
+    UnterminatedBracket,               // like @"[a-b"
+    UnterminatedComment,
}

^{* About IllegalCondition, this is thrown inside a conditional alternation like (?(foo)x|y), but appears to never be hit. There is no test case covering this error.}

Usage Examples

Here's an example where we use the additional info to give more detailed feedback to the user:

public class TestRE
{
    public static Regex CreateAndLog(string regex)
    {
        try
        {
            var re = new Regex(regex);
            return re;
        }
        catch(RegexParseException reExc)
        {
            switch(reExc.Error)
            {
                case RegexParseError.TooFewHex:
                    Console.WriteLine("The hexadecimal escape contains not enough hex characters.");
                    break;
                case RegexParseError.UndefinedBackref:
                    Console.WriteLine("Back-reference in position {0} does not match any captures.", reExc.Offset);
                    break;
                case RegexParseError.UnknownUnicodeProperty:
                    Console.WriteLine("Error at {0}. Unicode properties must exist, see http://aka.ms/xxx for a list of allowed properties.", reExc.Offset);
                    break;
                // ... etc
            }
            return null;
        }
    }
}

Alternative Designs

Alternatively, we may remove the type entirely and merely throw an ArgumentException. But it is likely that some people rely on the internal type, even though it isn't public, as through reflection the contextual information can be reached and is probably used in regex libraries. Besides, removing it will make any future improvements in dealing with parsing errors and proposing fixes in GUIs much harder to do.

Risks

The only risk I can think of is that after exposing this exception, people would like even more details. But that's probably a good thing and only improves the existing API.

Note that:

Existing code that checks for ArgumentException continues to work.
While debugging, people who see the underlying exception type can now actually use it.
Existing code using reflection to get to the extra data may or may not not continue to work, it depends on how strict the search for the type is done.

[danmose: made some more edits]

ghost commented 4 years ago

Tagging subscribers to this area: @eerhardt Notify danmosemsft if you want to be subscribed.

danmoseley commented 4 years ago

Thanks for writing this up.

By exposing the constructors, people can overwrite the exception in favor of a more specific one.

Do you have evidence folks need to do this? If not, we might leave them out - that's up to the API review folks.

For the enum,

is there some ordering that makes sense other than semi-random?
you might want to do some regularization eg, either "Ref" or "Reference" and not both "Ref" and "ref".
maybe "IncompleteSlashP" should say something about unicode categories instead?
"Capnum" is not a publicly meaningful name
"Invalid" is probably better than "Illegal" in new API

Offset should note it's zero based.

If we have the constructor it should probably verify RegexParseError isn't out of range, like FileStream does, but that's implementation. (And offset >= 0)

            if (access < FileAccess.Read || access > FileAccess.ReadWrite)
                throw new ArgumentOutOfRangeException(nameof(access), SR.ArgumentOutOfRange_Enum);

@CyrusNajmabadi any feedback given you've done something similar? specifically on the categories in the enum.

abelbraaksma commented 4 years ago

I wasn't sure about the enum values, to leave them as is, or not. I don't particularly like them, I'll see if I can improve the names.

The order of the enum shouldn't matter, I think? But yeah, it's a bit random. I'll go over your other suggestions too.

pgovind commented 4 years ago

I think this is close to being ready for review, save for the comments that Dan mentioned. Considering that it'd probably not get API reviewed in time for .NET 5, I'm going to mark it as Future, but I'd really like to see this get in for .NET 6. I'll just add that we'll need to document this class(and enum) when we make it public.

danmoseley commented 4 years ago

The order of the enum shouldn't matter, I think?

I think the only reason would be to make it easier on anyone reading it in the docs. In Visual Studio, I think it orders the completion list by usage.

BTW I was going to suggest to remove [Serializable] since we don't want to add that to anywhere new because BinaryFormatter is highly problematic. But it was already there - that would be a breaking change.

SingleAccretion commented 4 years ago

BTW I was going to suggest to remove [Serializable] since we don't want to add that to anywhere new because BinaryFormatter is highly problematic

@danmosemsft The VS snippet for an exception has it as the default, heh. I was actually wondering, should I make exceptions in my library serializable? The recently added AsnContentException and CborContentException are both marked as such.

danmoseley commented 4 years ago

@SingleAccretion we do not recommend future use of BinaryFormatter unless you have no alternative. The primary reason is that it is vital you do not deserialize untrusted input, and it's too easy for system to evolve and end up feeding untrusted input. Eg., you have a game that saves state; users load state from their disk. Later, someone reports a bug and you ask for the saved game to reproduce the bug.

@GrabYourPitchforks is working on all up serialization guidance with a focus on doing it securely. He's also devising a plan for helping the ecosystem progressively move off BinaryFormatter. These should be available on the dotnet/designs repo or similar place in due course. Meantime, we're not adding [Serializable] anywhere new.

danmoseley commented 4 years ago

The VS snippet for an exception has it as the default

@GrabYourPitchforks we should probably ask them to change that, right?

SingleAccretion commented 4 years ago

Meantime, we're not adding [Serializable] anywhere new.

I see, so I will remove it then. I assume the *Content*Exceptions should also be made not serializable (maybe there should be an issue/PR for that)?

danmoseley commented 4 years ago

@bartonjs should answer that.

abelbraaksma commented 4 years ago

@danmosemsft, should the exception remain sealed? It doesn't seem sensible to leave it sealed, and u don't think it breaks anything if we open it up for inheritance.

What do you think?

bartonjs commented 4 years ago

If there's any chance the exception type could be used on .NET Framework, it should still be [Serializable], for exceptions-across-AppDomains concerns. If it's (semi-guaranteed) .NET5+-only, then I think we're now OK with not making it [Serializable].

danmoseley commented 4 years ago

I don't know what the recommendation is on sealed. The API review can correct that if needed.

abelbraaksma commented 4 years ago

@danmosemsft

I've gone over your comments and updated the original text. I went over each enum value to check the actual conditions used to throw it. This was a bit surprising at times (as in: not in line with the expected meaning of the token), which I've tried to reflect in suggested new names for the enum cases.

You comments were:

By exposing the constructors, people can overwrite the exception in favor of a more specific one.

Do you have evidence folks need to do this? If not, we might leave them out - that's up to the API review folks.

Done, removed. I have no evidence, was just a hunch.

For the enum, is there some ordering that makes sense other than semi-random?

For ease of comparison with the original, I've left the order as-is, but chose a naming scheme that should make some sense when ordered alphabetically. I'm fine if the final order would either be alphabetical, or somewhat grouped.

you might want to do some regularization eg, either "Ref" or "Reference" and not both "Ref" and "ref".

Done, I've tried to make same-meaning-same-wording and where it made sense, I chose clarity-over-brevity (in this case, Reference, not Ref). I assume that this will be subject to change during the review.

maybe "IncompleteSlashP" should say something about unicode categories instead?

Done, including other names that benefited from better naming.

"Capnum" is not a publicly meaningful name

Done: NumericCaptureCannotBeZero.

"Invalid" is probably better than "Illegal" in new API

Done: IllegalRange -> ReversedQuantifierRange, in line with the other Reversed token. The word Illegal is gone in all monikers.

Offset should note it's zero based.

Done.

While testing each condition one by one, I've come across a few oddities and an unexpected NullReferenceException, I'll log those as bugs separately.

abelbraaksma commented 4 years ago

Considering that it'd probably not get API reviewed in time for .NET 5, I'm going to mark it as Future, but I'd really like to see this get in for .NET 6

@pgovind, I'd argue it'd be a good, and relatively cheap improvement to the regex-experience to have this in for .NET 5 (cheap, because the code is already there, including relevant tests, except for some renaming). Of course, I cannot judge whether this makes the mark and I realize that .NET 5 is close to being frozen, but if it's somehow still possible to put it on the list, that would be really welcome :).

danmoseley commented 4 years ago

This was a bit surprising at times (as in: not in line with the expected meaning of the token), which I've tried to reflect in suggested new names for the enum cases.

Are there any we should fix - at the same time as exposing the API? We own the code after all 🙂 and we want the enum values to be useful.

danmoseley commented 4 years ago

if it's somehow still possible to put it on the list, that would be really welcome

The 5.0 milestone means "must fix to ship it". This is what Microsoft developers are focusing on. It doesn't mean "not allowed in release" for another few weeks yet. It could get in if we could do API review (likely bottleneck) plus then a community member gets a PR up and we can merge it.

abelbraaksma commented 4 years ago

plus then a community member gets a PR up and we can merge it.

Got it. I'll volunteer, it's a rather trivial change anyway.

abelbraaksma commented 4 years ago

Are there any we should fix - at the same time as exposing the API? We own the code after all :slightly_smiling_face: and we want the enum values to be useful.

The new proposed names in the updated OP are sufficient. There can be some debate on whether we should improve certain errors (as in, raise the more specific one if we have enough information). The few I thought make that threshold are in the linked issue (#39075).

But to be clear, the current behavior wrt to the error text is as one would expect, the clarification is on the choice of enum names (like: where an enum only ever is used within conditional alternation contexts, I've proposed a name that clarifies precisely that).

I've checked the source code and tests for each of them to be sure this is the case, and added a small example of each error to help the reviewers.

abelbraaksma commented 4 years ago

@danmosemsft, is there anything else that needs attention, or can it be marked api-ready-for-review?

danmoseley commented 4 years ago

Missed this, excuse me.

I think it looks good -- would you mind breaking the diff into additions and subtractions? That isn't what we normally do but this is a bit special because the "subtractions" are internal. So they are irrelevant to API review, only to the implementer.

Edit: actually I suggest simply showing the adds, then you can alphabetize which is what they'll likely want.

IllegalCondition

It would be good to either find a way to hit this (and ideally add a test) or remove it, so we can decide whether we need it or not.

One other thing occurred to me -- do we know that the offsets are correct? Those would just be bugs, not influencing the API, but those bugs would be exposed as soon as we offer this API.

@eerhardt @pgovind any other feedback or can we mark this api-ready-for-review?

danmoseley commented 4 years ago

Some other low value comments

some refer to XXXNamedReference, some to XXBackReference. I think named backreferences are still backreferences?
some describe the problem eg ShorthandClassInCharRange, InsufficientOrInvalidHexDigits, AlternationHasTooManyConditions and some the broken rule eg NestedQuantifiersMustBeParenthesized, AlternationCannotHaveComment, ExclusionGroupMustBeLast. Maybe it should stick to one style or the other. I suggest the former as I think ti's more typical for compilers, etc.
ShorthandClassInCharRange contracts Character ?

abelbraaksma commented 4 years ago

One other thing occurred to me -- do we know that the offsets are correct?

I've tested a handful of them while going over each error and generally they appear correct. Not sure whether they remain correct for, let's say multi-line regexes littered with comments and irrelevant whitespace and the like. When implementing this change we may add a few more tests to that effect?

Edit: actually I suggest simply showing the adds, then you can alphabetize which is what they'll likely want.

@danmosemsft I'm not sure I understand. There have been no additions to the enum, only changes (the red and green lines in the diff, the greens replacing the reds). By keeping them paired with the original enum fields I figured it would be easier to track for what's what, but I can reformat it if I knew how to make the diff more readable/understandable ;).

I don't think we should make it alphabetical, as that would change the enum values (though it's internal... maybe it's not that big a deal).

some refer to XXXNamedReference, some to XXBackReference. I think named backreferences are still backreferences?

There's an UndefinedNamedReference (named) and an UndefinedBackReference (numerical ref). They are both back-references, indeed. You want to use UndefinedNamedBackReference? And there's MalformedNamedReference, which only ever applies to named backreferences.

In the case of an alternation, I'm not sure it is a true "backreference". There's AlternationWithMalformedNamedReference and AlternationWithUndefinedReference.

Maybe it should stick to one style or the other. I suggest the former as I think ti's more typical for compilers, etc.

Yeah, I struggled with that, as in I also tried to keep with the original naming a bit. I'll have a look. I like the "describe the problem"-versions better. I'll fix that.

ShorthandClassInCharRange contracts Character ?

I don't understand the question, sorry. If you use a shorthand-class that is not a single-class short-hand (like \r), then you cannot use it in a character class. Therefore, [a-\S] is illegal, but [\t-A] is not. Many other regex engines allow shorthand classes inside character class (bracketed) expressions, but not in .NET, apparently.

danmoseley commented 4 years ago

I can reformat it if I knew how to make the diff more readable/understandable ;).

I'm suggesting to just have the "adds". That's all they need to review. When implementing it, it will be easy to figure out which corresponds to which existing entry.

For XXXreference - what you have is fine by me if you think it's best.

I don't understand the question, sorry.

I wasn't clear, I meant if you have UnrecognizedControlCharacter should it be ShorthandClassInCharacterRange ? Naming guidelines also say to avoid contraction.

abelbraaksma commented 4 years ago

UnrecognizedControlCharacter should it be ShorthandClassInCharacterRange ?

Totally misunderstood "contracts" here in the meaning of "abbreviates" or "shortens". Got it. Missed it originally. I'm aware of the "no contraction" rule (see other renamings). Will fix ;).

abelbraaksma commented 4 years ago

I'm suggesting to just have the "adds". That's all they need to review.

@danmosemsft, something like how it is now? I just edited it. I still have to go over some other naming suggestions, but they require a moment of thought.

danmoseley commented 4 years ago

something like how it is now?

Yup, looks good - although they are all "adds" from the API review point of view

CyrusNajmabadi commented 4 years ago

Quick question: who is this API intended for? i.e. is there an existing customer asking for this? I ask because I think it would make sense for that customer to actually vet this and confirm this will solve whatever use case they had in mind here. Thanks!

abelbraaksma commented 4 years ago

@CyrusNajmabadi, I'm one such customer, and I do have a strong use case for this (a language compiler that supports regexes, in my case XPath, this allows giving better feedback). And there have been a few requests here and there, I assume they're listening in.

It's also of use for tooling like visual studio, to highlight static errors.

CyrusNajmabadi commented 4 years ago

It's also of use for tooling like visual studio, to highlight static errors.

I don't know what that means :)

and I do have a strong use case for this (a language compiler that supports regexes, in my case XPath, this allows giving better feedback)

Could you clarify how you intend to do that here? Thanks! :)

terrajobst commented 4 years ago

It's also of use for tooling like visual studio, to highlight static errors.

I don't know what that means :)

I believe he means for showing syntax errors in regexs in the IDE. BTW, how did you handle regex parsing in Roslyn? I assume you have your own parser. Do you do any validation today? If not, would this API help with that? If not, why not?

CyrusNajmabadi commented 4 years ago

BTW, how did you handle regex parsing in Roslyn? I assume you have your own parser.

Yup. I wrote my own. You can see the code here: https://github.com/dotnet/roslyn/blob/697e6309ab78bd0e36e537cb1be93bdc23af9125/src/Features/Core/Portable/EmbeddedLanguages/RegularExpressions/RegexParser.cs

The existing .net parser is unsuitable for many reasons. First, you can DOS it. Second, it doesn't actually give you a tree back that is needed for providing suitable IDE features.

Do you do any validation today?

Yup. We have many thousands of new tests here, as well as incorporating existing regex test suites. This even helped me find about a half dozen issues with the existing .net impl where it either crashed it didn't behave as per spec :-)

If not, would this API help with that? If not, why not?

I don't think it would be particularly helpful. Primarily because it would now be an additional piece of information I would have to mirror in order to be a drop-in replacement.

danmoseley commented 4 years ago

Not everyone authors their regexes in VS. They may use another tool, or they may accept input, eg., from a file. In such a scenario, this would help suggest where the error is in a potentially large pattern, no different to a compiler error giving an error and a location rather than just saying the code is invalid.

abelbraaksma commented 4 years ago

I don't know what that means :)

@CyrusNajmabadi, if you use one of the Regex overloads that contains a static string for the regex argument, you can get editor-time feedback if the regex is faulty. This suggestion makes implementing such feedback easier. It's certainly not a silver bullet for all scenarios.

I can imagine the wish for feedback in the form of an AST, and splitting parsing and compiling, but that's beyond the scope of this proposal. This merely opens information to the public that was already there, but will now be more deterministically useful. Also, renaming the enum fields makes those cases more understandable.

Could you clarify how you intend to do that here?

That's not simple to answer, because XPath supports Astral plane characters, where BCL regexes only support BMP characters. This is a big limitation and I work around it in various ways.

Other w3c constructs are simply not available or work differently in .NET. Again, I use workarounds.

However, I do support plain .NET regexes as an extension. I used reflection to get to the underlying precise error and enum, and to offset the line and col number. That won't be necessary anymore after this.

CyrusNajmabadi commented 4 years ago

@CyrusNajmabadi, if you use one of the Regex overloads that contains a static string for the regex argument, you can get editor-time feedback if the regex is faulty. This suggestion makes implementing such feedback easier. It's certainly not a silver bullet for all scenarios.

Well.. somewhat. A core problem for the editor case is that the positions of these exceptions doesn't match string index positions. For example "Foo\\bar\n\\nbaz". You need to know how the original textual characters in the doc match the end characters that go into teh regex engine to get anything working properly.

THis is one of hte reasons i do not use the actual .net parser here. But, for simple purposes i can see it being nice to just defer to that :)

danmoseley commented 4 years ago

@pgovind perhaps you can rep this at the API review sometime?

danmoseley commented 4 years ago

@abelbraaksma I made some more tweaks to the names (broke into separate edits - one of them was just sorting). LMK if those look ok.

abelbraaksma commented 4 years ago

@danmosemsft, thanks for doing those changes, I reviewed them, and I agree this is clearer than what I had for some of these fields.

I wasn't sure about the alphabetical ordering, but it sure is easier to read this way :). LGTM.

abelbraaksma commented 4 years ago

You need to know how the original textual characters in the doc match the end characters that go into teh regex engine to get anything working properly.

Sure, some logic needs to be added to do it correctly, but only a handful of escapes are literal escapes, it's rather easy to do. Similar techniques are already used with interpolated strings and multiline strings, the AST has this information for the string type.

But this is not an end all solution, it improves error handling and allows to take corrective steps, or give better messages to users, which is a plus. And it maintains the hierarchy, people aren't required to use it, and existing code keeps working.

CyrusNajmabadi commented 4 years ago

but only a handful of escapes are literal escapes, it's rather easy to do.

I dunno. Things like \u and \U escapes are not fun. Figuring out surrogate characters and the like.

Similar techniques are already used with interpolated strings and multiline strings, the AST has this information for the string type.

The roslyn AST does not contain this information for strings. It's rather unfortunate, but there hasn't been enough demand to expose that yet.

But this is not an end all solution, it improves error handling and allows to take corrective steps, or give better messages to users, which is a plus.

If you intend on using this, then that's enough for me. I just wanted to make sure there was an actual customer that would be able to vet the suitability of this API. I was only bringing that up as Roslyn won't be using this ourselves. So i didn't want this to be speculatively made, and then never used.

Thanks!

abelbraaksma commented 4 years ago

The roslyn AST does not contain this information for strings.

Ah, I didn't know, we recently fixed this for F#, but that has always had parameterized strings with highlighting.

So i didn't want this to be speculatively made, and then never used.

Understand. There are currently already more cases in the wild, including in the tests of regexes, that use reflection to get this information. So this fix is directly applicable to that. Online regex parsers may benefit, but that's indeed speculation. I may personally update the regex compiler (on codeproject, it precompiles a regex into an assembly and saves it for linking).

There's also a request out there for coloring regexes in C# editor in VS, but I guess that wouldn't be implemented before the AST is also updated.

abelbraaksma commented 4 years ago

In addition, there was this decision from an earlier review process: https://github.com/dotnet/runtime/issues/13942#issuecomment-70185134:

P1: We currently throw an ArgumentException for parse errors and we agree we should try to at least add more information like the character index. We should consider adding a RegexParseException that inherits from ArgumentException

And I forgot about the i18n problems (mentioned here: https://github.com/dotnet/runtime/issues/13942#issuecomment-68141257), which makes it virtually impossible currently to provide tailored information on specific errors. I'll add that to the original text.

danmoseley commented 4 years ago

As an aside, not part of this proposal, I wonder whether the RegexParseException should include the pattern - at least if it's not too huge. Akin to how we include the path in IOException. That would be helpful in server logs but also make it easier to interpret the offset without a debugger.

CyrusNajmabadi commented 4 years ago

There's also a request out there for coloring regexes in C# editor in VS,

We already do that: https://github.com/dotnet/roslyn/pull/23984 (https://github.com/dotnet/roslyn/pull/23984#issuecomment-354707104). I was the author of that work :)

but I guess that wouldn't be implemented before the AST is also updated.

The roslyn ast doesn't expose anything here (beyond just 'this is a string literal'). So i hand-wrote a compliant dotnet regex parser to match the one from the runtime. As mentioned above, this was for several reasons:

the runtime system doesn't understand C# escapes. So support is needed for that.
the runtime system didn't expose positions/spans of issues. So support was needed for that.
the runtime system can be DOSed, which makes it unacceptable in a live editing environment.

Effectively, the runtime regex system works fine for running programs. But it's ill suited for an actual editor (where latency and low memory are essential).

abelbraaksma commented 4 years ago

We already do that: dotnet/roslyn#23984 (dotnet/roslyn#23984 (comment)). I was the author of that work :)

Cool! I wasn't aware, my recent projects being in F# and all. I can understand why you took the approach you did. Also, as far as I know, parsing and compiling isn't split at the moment, and doing a compile of a regex just to find if it has an error in an editor doesn't strike me as the right path.

danmoseley commented 4 years ago

parsing and compiling isn't split at the moment,

You can instantiate the regex (either explicitly or through the static method) without passing RegexOptions.Compiled and arguably it doesn't compile it - it will least not emit IL, which is the expensive part. Using the regex later with the Compiled flag will reuse the parsing work.

This describes the moving parts https://github.com/dotnet/runtime/blob/master/src/libraries/System.Text.RegularExpressions/src/README.md

abelbraaksma commented 4 years ago

@danmosemsft, I'm aware of that flag (as mentioned earlier, I use a tool that compiles regexes into an assembly and merges those together), but that wasn't what I meant, sorry for the confusion ;).

I see that creating a regex essentially only calls RegexParser.Parse. I think my point was that if you create a regex, this is quite expensive, and normally only parsing isn't. I was under the impression that more was done here than "just parsing" (and I don't mean the Compile flag). I see that a RegexTree is created, which I assume represents a form of AST.

Anyway, I'm digressing...

(as an aside: that page serves as a good refresher, thanks!)

danmoseley commented 4 years ago

I can't figure out how to hit AlternationHasMalformedCondition either. It's difficult to follow what's going on, but something like (?()a|b) which surely has an invalid condition is treated as a condition that is always true. Likely unintentionally. So it may never be possible to throw for an invalid condition without making a breaking change, unless there is some other condition I haven't tried that's currently treated as invalid that could be changed to the more specific error.

abelbraaksma commented 4 years ago

The only thing that comes close is AlternationHasNamedCapture, which is like a malformed condition, but more specific. I had a look at the code at the time and it looked like the more general case will never hit as it is currently written.

It may be something from a moment during development where certain conditions where indeed invalid, and later changes have allowed more freedom in the condition clause? Making it redundant in the process?

abelbraaksma commented 4 years ago

Another train of thought is that technically one can consider certain conditions malformed, but you get a different error instead, like missing parens, or other errors that could essentially fit in more categories than one.

A change could then be made to make certain imprecise errors more specific, which in turn may lead to the AlternationHasMalformedCondition taking on a useful role.

But you could call this a case of potato-potato ;)

dotnet / runtime