dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Make enum RegexParseError and RegexParseException public #38872

Closed abelbraaksma closed 4 years ago

abelbraaksma commented 4 years ago

Background and Motivation

A regular expression object created with System.Text.RegularExpressions.Regex is essentially an ad-hoc compiled sub-language that is widely used in the .NET community for searching and replacing strings. But unlike in other programming languages, any syntax error is raised as an ArgumentException. Programmers who want to act on specific parsing errors have to parse the error string by hand to get more information, which is error-prone, subject to change, and sometimes non-deterministic.

We already have an internal RegexParseException with two properties, Error and Offset, which respectively give an enum value for the type of error and the location in the pattern string where the error occurred. Whenever an ArgumentException is raised for a parse error today, it is in fact a RegexParseException, which inherits from ArgumentException.

I've checked the existing code and I propose we make RegexParseException and RegexParseError public. They are pretty self-describing at the moment, though the enum cases may need better names (suggested below). Apart from changing a few existing tests and adding documentation, no substantive changes are necessary.
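To illustrate the point that the thrown ArgumentException already carries this information, here is a minimal sketch that inspects the exception through reflection. The class and method names (RegexErrorProbe, Probe) are hypothetical and only for illustration. It assumes the runtime exposes properties named Error and Offset on the thrown type, as in the definition quoted in this proposal; on runtimes where that is not the case, the sketch reports "?" instead.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexErrorProbe
{
    // Hypothetical helper: returns "TypeName|Error|Offset" for an invalid pattern,
    // or null when the pattern parses. Properties that cannot be found are shown as "?".
    public static string Probe(string pattern)
    {
        try
        {
            _ = new Regex(pattern);
            return null;
        }
        catch (ArgumentException ex)
        {
            // Look the properties up by name so this also works while the type is internal.
            Type type = ex.GetType();
            object error = type.GetProperty("Error")?.GetValue(ex);
            object offset = type.GetProperty("Offset")?.GetValue(ex);
            return $"{type.Name}|{error ?? "?"}|{offset ?? "?"}";
        }
    }

    public static void Main()
    {
        Console.WriteLine(Probe("(((foo))") ?? "pattern is valid");
    }
}
```

Relying on reflection like this is exactly the fragile workaround that making the types public would remove.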

Use cases

Related requests and proposals

Proposed API

The current API already exists but isn't public. The definitions are as follows:

    [Serializable]
-    internal sealed class RegexParseException : ArgumentException
+    public class RegexParseException : ArgumentException
    {
        private readonly RegexParseError _error; // tests access this via private reflection

        /// <summary>Gets the error that happened during parsing.</summary>
        public RegexParseError Error => _error;

        /// <summary>Gets the offset in the supplied pattern.</summary>
        public int Offset { get; }

        public RegexParseException(RegexParseError error, int offset, string message) : base(message)
        {
+            // add logic to test range of 'error' and return UnknownParseError if out of range
            _error = error;
            Offset = offset;
        }

        public override void GetObjectData(SerializationInfo info, StreamingContext context)
        {
            base.GetObjectData(info, context);
            info.SetType(typeof(ArgumentException)); // To maintain serialization support with .NET Framework.
        }
    }

And the enum with suggested names for a more discoverable naming scheme. I followed "clarity over brevity" and have tried to start similar cases with the same moniker, so that an alphabetic listing gives a (somewhat) logical grouping in tooling.

I'd suggest we add a case for unknown conditions, something like UnknownParseError = 0, which could be used if users create this exception by hand with an invalid enum value.

Handy for implementers: Historical view of this prior to 22 July 2020 shows the full diff for the enum field by field. On request, it shows all as an addition diff now, and is ordered alphabetically.

-internal enum RegexParseError
+public enum RegexParseError
{
+    UnknownParseError = 0,    // do we want to add this catch all in case other conditions emerge?
+    AlternationHasComment,
+    AlternationHasMalformedCondition,  // *maybe? No tests, code never hits
+    AlternationHasMalformedReference,  // like @"(x)(?(3x|y)" (note that @"(x)(?(3)x|y)" gives next error)
+    AlternationHasNamedCapture,        // like @"(?(?<x>)true|false)"
+    AlternationHasTooManyConditions,   // like @"(?(foo)a|b|c)"
+    AlternationHasUndefinedReference,  // like @"(x)(?(3)x|y)" or @"(?(1))"
+    CaptureGroupNameInvalid,           // like @"(?< >)" or @"(?'x)"
+    CaptureGroupOfZero,                // like @"(?'0'foo)" or @"(?<0>x)"
+    ExclusionGroupNotLast,             // like @"[a-z-[xy]A]"
+    InsufficientClosingParentheses,    // like @"(((foo))"
+    InsufficientOpeningParentheses,    // like @"((foo)))"
+    InsufficientOrInvalidHexDigits,    // like @"\uabc" or @"\xr"
+    InvalidGroupingConstruct,          // like @"(?" or @"(?<foo"
+    InvalidUnicodePropertyEscape,      // like @"\p{Ll" or @"\p{ L}"
+    MalformedNamedReference,           // like @"\k<"
+    MalformedUnicodePropertyEscape,    // like @"\p{}" or @"\p {L}"
+    MissingControlCharacter,           // like @"\c"
+    NestedQuantifiersNotParenthesized, // like @"abc**"
+    QuantifierAfterNothing,            // like @"((*foo)bar)"
+    QuantifierOrCaptureGroupOutOfRange,// like @"x{234567899988}" or @"x(?<234567899988>)" (must be < Int32.MaxValue)
+    ReversedCharacterRange,            // like @"[z-a]"   (only in char classes, see also ReversedQuantifierRange)
+    ReversedQuantifierRange,           // like @"abc{3,0}"  (only in quantifiers, see also ReversedCharacterRange)
+    ShorthandClassInCharacterRange,    // like @"[a-\w]" or @"[a-\p{L}]"
+    UndefinedNamedReference,           // like @"\k<x>"
+    UndefinedNumberedReference,        // like @"(x)\2"
+    UnescapedEndingBackslash,          // like @"foo\" or @"bar\\\\\"
+    UnrecognizedControlCharacter,      // like @"\c!"
+    UnrecognizedEscape,                // like @"\C" or @"\k<" or @"[\B]"
+    UnrecognizedUnicodeProperty,       // like @"\p{Lll}"
+    UnterminatedBracket,               // like @"[a-b"
+    UnterminatedComment,
}

* About AlternationHasMalformedCondition (IllegalCondition in the current internal enum): this is thrown inside a conditional alternation like (?(foo)x|y), but appears to never be hit. There is no test case covering this error.

Usage Examples

Here's an example where we use the additional info to give more detailed feedback to the user:

using System;
using System.Text.RegularExpressions;

public class TestRE
{
    public static Regex CreateAndLog(string regex)
    {
        try
        {
            var re = new Regex(regex);
            return re;
        }
        catch (RegexParseException reExc)
        {
            // Enum names follow the renamed cases proposed above.
            switch (reExc.Error)
            {
                case RegexParseError.InsufficientOrInvalidHexDigits:
                    Console.WriteLine("The hexadecimal escape does not contain enough hex digits.");
                    break;
                case RegexParseError.UndefinedNumberedReference:
                    Console.WriteLine("Back-reference in position {0} does not match any captures.", reExc.Offset);
                    break;
                case RegexParseError.UnrecognizedUnicodeProperty:
                    Console.WriteLine("Error at {0}. Unicode properties must exist, see http://aka.ms/xxx for a list of allowed properties.", reExc.Offset);
                    break;
                // ... etc
            }
            return null;
        }
    }
}
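As a further sketch of what the Offset property enables, the following prints the pattern with a caret under the reported position, which is the kind of feedback an editor or GUI might give. It assumes a runtime where RegexParseException is public (.NET 5 or later). RegexDiagnostics and Describe are hypothetical names, and whether Offset points at the offending character or just past it is not specified here, so the caret is only approximate.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexDiagnostics
{
    // Hypothetical helper: returns "ok" for a valid pattern, otherwise a three-line
    // report with the error kind, the pattern, and a caret near the reported offset.
    public static string Describe(string pattern)
    {
        try
        {
            _ = new Regex(pattern);
            return "ok";
        }
        catch (RegexParseException ex)
        {
            // Clamp defensively so the caret never falls outside the pattern.
            int column = Math.Clamp(ex.Offset, 0, pattern.Length);
            string caret = new string(' ', column) + "^";
            return $"{ex.Error} at offset {ex.Offset}"
                + Environment.NewLine + pattern
                + Environment.NewLine + caret;
        }
    }

    public static void Main()
    {
        Console.WriteLine(Describe("[z-a]"));
    }
}
```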

Alternative Designs

Alternatively, we may remove the type entirely and merely throw an ArgumentException. But it is likely that some people rely on the internal type, even though it isn't public, as through reflection the contextual information can be reached and is probably used in regex libraries. Besides, removing it will make any future improvements in dealing with parsing errors and proposing fixes in GUIs much harder to do.

Risks

The only risk I can think of is that after exposing this exception, people would like even more details. But that's probably a good thing and only improves the existing API.

Note that:

[danmose: made some more edits]

pgovind commented 4 years ago

@pgovind perhaps you can rep this at the API review sometime?

Yup, I'll take this forward at API review. @terrajobst , any idea when we might be able to squeeze this in?

danmoseley commented 4 years ago

@pgovind you can see here https://apireview.net/backlog

abelbraaksma commented 4 years ago

https://apireview.net/backlog gives a 403 today...

danmoseley commented 4 years ago

@abelbraaksma it's working now, @terrajobst fixed the sub it was using. thanks for letting us know.

danmoseley commented 4 years ago

Correction, I misunderstood him and he's not fixed it yet. You'll have to look at https://github.com/dotnet/runtime/issues?q=is%3Aopen+is%3Aissue+label%3Aapi-ready-for-review+sort%3Aupdated-asc -- they do "blocking" first, then everything else oldest first.

pgovind commented 4 years ago

Just FYI, I've been dialing into the API review meetings to keep an eye on this when it comes up. It hasn't come up yet, but it's close (unless something else takes priority).

abelbraaksma commented 4 years ago

@pgovind, tx (though it's near the bottom of the list of 14 open issues ;). Should I be present or available at the meeting, or at least reachable should questions arise?

pgovind commented 4 years ago

Ha, you're right, I should've said "relatively close" :) Even though there are 14 on that list now, some of them may get punted or new ones added for various reasons. And that's why it's hard to pick an exact date when an issue from the backlog will definitely get picked up :) Being present is not necessary, but you are of course always welcome to tune in to the meeting stream, and if this issue gets picked up, I believe you can chat with the reviewers there? Otherwise, any questions or changes to the API that arise will be posted here and we'll have a chance to review them.

abelbraaksma commented 4 years ago

@pgovind, interesting, so I actually did get an invite to the meeting by mail. Unfortunately, it was only two hours before it started and I wasn't around at the time. Would've been nice to listen in :). (and now I'm of course curious of the results).

pgovind commented 4 years ago

it was only two hours

You didn't miss it. It's tomorrow (Aug 11) :)

terrajobst commented 4 years ago

@abelbraaksma

Apologies -- my email was confusing. It's scheduled for tomorrow, August 11, 10 AM PDT, which should be 19:00 CEST (AFAIK your timezone). The meeting is scheduled for two hours and this issue will be discussed an hour into the meeting. Hope this helps :-)

abelbraaksma commented 4 years ago

@terrajobst, @pgovind, Thanks, you're right, I misread: it's indeed today ;) (I actually clicked the meeting link and it said there was nobody, I drew the wrong conclusion :P). As it looks now, I'll be online by then.

And yes, I'm based in Amsterdam, which is CEST.

terrajobst commented 4 years ago

Video

namespace System.Text.RegularExpressions
{
    [Serializable]
    public sealed class RegexParseException : ArgumentException
    {
        public RegexParseException(RegexParseError error, int offset);
        private RegexParseException(SerializationInfo info, StreamingContext context)
        {
            // It means someone modified the payload.
            throw new NotImplementedException();
        }
        public override void GetObjectData(SerializationInfo info, StreamingContext context)
        {
            // We'll serialize as an instance of ArgumentException
        }
        public RegexParseError Error { get; }
        public int Offset { get; }
    }
    public enum RegexParseError
    {
        Unknown,
        AlternationHasComment,
        AlternationHasMalformedCondition,
        AlternationHasMalformedReference,
        AlternationHasNamedCapture,
        AlternationHasTooManyConditions,
        AlternationHasUndefinedReference,
        CaptureGroupNameInvalid,
        CaptureGroupOfZero,
        ExclusionGroupNotLast,
        InsufficientClosingParentheses,
        InsufficientOpeningParentheses,
        InsufficientOrInvalidHexDigits,
        InvalidGroupingConstruct,
        InvalidUnicodePropertyEscape,
        MalformedNamedReference,
        MalformedUnicodePropertyEscape,
        MissingControlCharacter,
        NestedQuantifiersNotParenthesized,
        QuantifierAfterNothing,
        QuantifierOrCaptureGroupOutOfRange,
        ReversedCharacterRange,
        ReversedQuantifierRange,
        ShorthandClassInCharacterRange,
        UndefinedNamedReference,
        UndefinedNumberedReference,
        UnescapedEndingBackslash,
        UnrecognizedControlCharacter,
        UnrecognizedEscape,
        UnrecognizedUnicodeProperty,
        UnterminatedBracket,
        UnterminatedComment
    }
}

terrajobst commented 4 years ago

@danmosemsft, can we consider this for .NET 5? Seems easy enough :-)

abelbraaksma commented 4 years ago

We should probably preserve the current ordering, if only to discourage the urge to make future additions alphabetically sorted.

In Jeremy Barton's words at the meeting: "it would be good if people using reflection currently could fix their code to the public version by doing a x + 1 on each value (due to the new Unknown on top)". I agree that this is reasonable and that we should keep the original order; tooling will do the alphabetizing anyway.
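The "x + 1" migration could look roughly like the sketch below. It assumes a runtime where RegexParseError is public, and it assumes the public enum keeps the original internal ordering with Unknown = 0 prepended, as suggested in this comment; neither assumption is confirmed here. LegacyRegexErrorMapper and FromLegacyValue are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LegacyRegexErrorMapper
{
    // Hypothetical helper: maps a numeric value read from the old internal enum
    // (for example via reflection) to the public enum by shifting it up by one.
    // Values that do not land on a defined member fall back to Unknown.
    public static RegexParseError FromLegacyValue(int legacyValue)
    {
        if (legacyValue < 0 || legacyValue == int.MaxValue)
        {
            return RegexParseError.Unknown; // out of range, and avoids overflow on + 1
        }

        int shifted = legacyValue + 1;
        return Enum.IsDefined(typeof(RegexParseError), shifted)
            ? (RegexParseError)shifted
            : RegexParseError.Unknown;
    }

    public static void Main()
    {
        Console.WriteLine(FromLegacyValue(0));
    }
}
```

The same range check is one way to realize the "return Unknown if out of range" idea mentioned for the exception constructor.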

danmoseley commented 4 years ago

@danmosemsft, can we consider this for .NET 5? Seems easy enough

Team members are all working on work required for 5.0. If a community member volunteers and it can get merged before we branch on Monday, sure.

abelbraaksma commented 4 years ago

@danmosemsft, I can try to get something in by Friday, though I'm not sure about the point of improving the error messages, which is something that tends to need a lot of back and forth to iron out. The renaming part, and fixing the tests should be relatively straightforward.

Of course, then it still depends on whether it can be reviewed and merged in time.

danmoseley commented 4 years ago

Sure. Totally up to you

danmoseley commented 4 years ago

BTW @abelbraaksma the only tricky part of this, I expect, will be adding the test for binary serialization to/from the new type, to address the concern that this must not break. We have tests for this.

The bulk of such testing is done through blobs in this file - they test to and from for .NET Framework and .NET Core: https://raw.githubusercontent.com/danmosemsft/runtime/7a1ff8272bd8afe74ed1b98b8c7d1f6c6a6d2a07/src/libraries/System.Runtime.Serialization.Formatters/tests/BinaryFormatterTestData.cs

It's fiddly to update those blobs. Happily, though, I see there's already a specific test for serializing a RegexParseException on .NET Core and deserializing it as an ArgumentException on .NET Framework! https://github.com/danmosemsft/runtime/blob/7a1ff8272bd8afe74ed1b98b8c7d1f6c6a6d2a07/src/libraries/System.Runtime.Serialization.Formatters/tests/BinaryFormatterTests.cs#L137 It maybe should verify that the message was preserved, but I think that test should be sufficient to protect the behavior they're asking for.

GrabYourPitchforks commented 4 years ago

Adding a test for BinaryFormatter serialization should be relatively straightforward. Serialize an instance of the new exception, then deserialize it, and validate that deserialized.GetType() is ArgumentException.

danmoseley commented 4 years ago

That's what that test does, basically

abelbraaksma commented 4 years ago

@danmosemsft, thanks for the pointers on those tests! I'm actually not sure I can finish today, for the simple reason that I couldn't get the build to succeed (only to find out that I needed to update VS, oops). Behind a relatively slow connection, the download of 5GB for the update can take a while...

Oh, and I get 502's from https://pkgs.dev.azure.com/dnceng/9ee6d478-d288-47f7-aacc-f6e6d082ae6d/_packaging/a8a526e9-91b3-4569-ba2d-ff08dbb7c110/nuget/v3/flat2/runtime.win-x86.microsoft.netcore.coredistools/1.0.1-prerelease-00005/runtime.win-x86.microsoft.netcore.coredistools.1.0.1-prerelease-00005.nupkg, but my hope is that they'll disappear magically.

While I think the changes I made are sound (the renaming part, at least), I'd prefer to at least successfully build it locally again before submitting a PR.

Anyway, I expect to submit a working PR somewhere tonight (which may still be day for you :). We'll see how far we get.

danmoseley commented 4 years ago

@abelbraaksma I don't know your schedule - we don't branch until Monday, if that helps.

abelbraaksma commented 4 years ago

Hmm, strange, I would swear I could build stuff before. After updating everything (I ran into this briefly too: https://github.com/dotnet/runtime/issues/40283), I get:

C:\Users\Abel\.nuget\packages\microsoft.net.compilers.toolset\3.8.0-2.20403.2\tasks\netcoreapp3.1\Microsoft.CSharp.Core.targets(70,5): error : Cannot use stream for resource [D:\Projects\Github\Runtime\.dotnet\shared\Microsoft.NETCore.App\5.0.0-preview.6.20305.6\Microsoft.NETCore.App.deps.json]: No such file or directory [D:\Projects\Github\Runtime\tools-local\tasks\installer.tasks\installer.tasks.csproj]

Since this points to the directory in the local .dotnet path, it appears that either (a) that location is broken or (b) things are not being downloaded/installed properly there. Not quite sure what's happening here, though ~I suspect git clean -xdf, which appears not to be able to clean everything, causes this.~ EDIT: it looks like dotnet.exe doesn't get killed properly when I use build.cmd, leading to open handles, leading to not cleaning properly. Let's see what we get now.

I don't know your schedule - we don't branch until Monday, if that helps.

I may be needing the weekend after all ;).

danmoseley commented 4 years ago

I have never seen that error and if I got it I would probably git clean -fdx on the repo and git pull and then build again explicitly using the dotnet out of the .dotnet folder (which isn't required, but why not)

pgovind commented 4 years ago

I've never seen that error either. If your build still fails, tell us your environment too? Or, even better, try it again by passing in -bl in the command line to generate a bin log.

I can still most likely review your PR over the weekend if you put it up, so you have some time :)

abelbraaksma commented 4 years ago

Thanks for the support. The error is gone (I edited the comment with the cause: dotnet.exe was hanging and made git clean fail in creative ways).

EDIT: Scratch that, what I posted here before about ml64.exe seems to be caused by lib.exe not being found, same as this one: https://github.com/dotnet/runtime/issues/13114. Though I actually did do a restart minutes ago.

I'm in an x64 Native Tools Command Prompt (and tried a normal Administrator cmd prompt too, same error). ~I'll check why it isn't available.~ it's in the path and available in the command window that I use to build this. :thinking:

abelbraaksma commented 4 years ago

Ok, this appears to be caused by a new cmd.exe, which probably doesn't have lib.exe on the same path (it doesn't inherit it?). Since I have basically all Visual Studio's and Windows SDK's from the last 20 years or so, something has gotten amiss here. I need to find the command in the build that fires up a new cmd.exe and have it inherit the parent, or just update the global path and add the latest version on top.

Anyway, 1AM here, tomorrow things will be better. Thanks for the quick replies so far, it is much appreciated!

danmoseley commented 4 years ago

Repairing your VS (the newest one?) might make sure the environment variables are correct -not sure though.

abelbraaksma commented 4 years ago

Repair didn't help, nor did any other things I tried to make the ml64.exe discoverable (it's there, it's on the path, I can run it etc etc).

So I decided to fire up a VM with a clean Win10 and a clean recent VS2019 and go from there. While this took some time, I have managed to get the initial build going. Hopefully the rest will go smooth now ;)


danmoseley commented 4 years ago

Yeesh. Maybe you can fix the env vars on your original machine by looking at the ones in your VM..

abelbraaksma commented 4 years ago

@danmosemsft the starting point is there, but I'll need a little help on how to proceed w.r.t. inline documentation (I assume that's supposed to be added here as well): https://github.com/dotnet/runtime/pull/40902.

Yeesh. Maybe you can fix the env vars on your original machine by looking at the ones in your VM..

I did something more "brute force", I just copied the whole build dir over to my primary dev machine. Turns out this allowed me to build & run tests for System.Text.RegularExpressions. Didn't try much else yet as this is all that's needed right now anyway.

(I also created an Azure VM, but for the last hour or so that has still been building, it is not very fast...)

abelbraaksma commented 4 years ago

For posterity: this has now been implemented, details are in the PR: #40902. Thanks to all for reviewing this proposal and for the valuable feedback. Special thanks to @danmosemsft for helping get this past the finish line, and for assisting with the parts of this process I was not yet familiar with.

It was in the nick of time to make it into .NET 5 🥳.

danmoseley commented 4 years ago

We'd welcome other contributions if you're interested @abelbraaksma . There's plenty up for grabs. If you want another in regex specifically there are opportunities for perf improvement or features eg https://github.com/dotnet/runtime/issues/25598 (that one I think is a bit involved..)