firasdib / Regex101

This repository is currently only used for issue tracking for www.regex101.com

Impossible to use \"" in the .net 7 regex #2286

Open franckleveque opened 1 month ago

franckleveque commented 1 month ago

Bug Description

I need to capture elements that contain a ". When I add the token \" or \"", neither works and I get a pattern error:

\" This token has no special meaning and has thus been rendered erroneous
" An unescaped delimiter must be escaped; in most languages with a backslash (\)

The generated C# code however works like a charm :

using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        string pattern = @"\""\w+\""";
        string input = @"this a ""test""";
        RegexOptions options = RegexOptions.Multiline;

        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
    }
}

It correctly matches the "test" in the target string.

Reproduction steps

Use as a .NET 7 regular expression: \""\w+\""
Use as text: this a "test"

You obtain the following pattern error :

\" This token has no special meaning and has thus been rendered erroneous
" An unescaped delimiter must be escaped; in most languages with a backslash (\)
\" This token has no special meaning and has thus been rendered erroneous
" An unescaped delimiter must be escaped; in most languages with a backslash (\)

Expected Outcome

No error shown as the pattern is correct.

Browser

Microsoft Edge for Business, version 125.0.2535.67 (official build, 64-bit)

OS

Windows 11

firasdib commented 1 month ago

Can you post a regex101 link? I can't replicate it.

md-at-slashwhy commented 3 weeks ago

It seems the problem might be related to your use of verbatim strings, which don't use \" for escaping quotes but use "" instead. Try removing the @ when declaring your strings, replacing each "" with \", and escaping each existing \ as \\. Alternatively, you could use raw literals; those should work with most valid regex patterns without any escaping. Example:

using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        const string verbatimPattern = @"""\w+""";
        const string defaultPattern = "\"\\w+\"";
        const string rawLiteralPattern = """
                                         "\w+"
                                         """;
        var input = @"this a ""test""";
        var options = RegexOptions.Multiline;

        foreach (Match m in Regex.Matches(input, verbatimPattern, options)) Console.WriteLine("[verbatim]'{0}' found at index {1}.", m.Value, m.Index);

        foreach (Match m in Regex.Matches(input, defaultPattern, options)) Console.WriteLine("[default]'{0}' found at index {1}.", m.Value, m.Index);

        foreach (Match m in Regex.Matches(input, rawLiteralPattern, options)) Console.WriteLine("[raw literal]'{0}' found at index {1}.", m.Value, m.Index);
    }
}

working-name commented 3 weeks ago

@md-at-slashwhy Can you confirm this is how you input the string? https://regex101.com/r/XL7xtQ/1

md-at-slashwhy commented 3 weeks ago

Yes, that was my input. On further inspection, I believe the issue report has Markdown formatting problems that make it misleading. I was able to reproduce the error sequence with https://regex101.com/r/L8u10o/1, which gives:

\" This token has no special meaning and has thus been rendered erroneous
" An unescaped delimiter must be escaped; in most languages with a backslash (\)
\" This token has no special meaning and has thus been rendered erroneous
" An unescaped delimiter must be escaped; in most languages with a backslash (\)

I think this is the intended error report. I suspect the original report contained backslashes that weren't escaped and therefore don't show up in the rendered Markdown. An intended report of

Reproduction steps
Use as a .Net7 Regular Expression \""\\w+\""
Use as text : this a "test"

You obtain the following pattern error :

\" This token has no special meaning and has thus been rendered erroneous
" An unescaped delimiter must be escaped; in most languages with a backslash (\)
\" This token has no special meaning and has thus been rendered erroneous
" An unescaped delimiter must be escaped; in most languages with a backslash (\)

would fit. It would also work for the updated code snippet that would be created:

using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        string pattern = @"\""\w+\""";
        string input = @"this is a ""test""";
        RegexOptions options = RegexOptions.Multiline;

        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
    }
}

EDIT: Just saw the title, so I'm pretty sure that markdown formatting is the issue here. Also tried with a test string of this a \"test\" which didn't reproduce anymore, so I'm somewhat confident in the intended report.

franckleveque commented 3 weeks ago

Hi, you're right, I used the same input as described in https://regex101.com/r/L8u10o/1. The C# code produced by regex101 has no issue. The problem lies in the website's interpretation of the regex, which makes debugging quite hard.

md-at-slashwhy commented 3 weeks ago

The problem seems to lie in the fact that the page uses \ to escape the quote character ("), although the prefix @" suggests that a verbatim string is used, which would not use \ as an escape character.

working-name commented 3 weeks ago

TL;DR @" ""\w+\"" "; gets passed as "\w+\" to the engine, so maybe treat "" as a single " in CM's tokenizer when the raw string is @"?

I think I'm getting the point of confusion here. If .NET doesn't care about the \ before an escaped " as "", why does the site?

Firas can probably clear this up since I'd just be making assumptions. I generally treat .NET regex as an outlier/exception to the rule for regex101.com. I know it's javascript parsing the input based on that flavor's expected rules and capabilities.

Here are a couple things that might at least offer some insight into the behavior:

  1. regex101.com does NOT parse your regex input through that flavor's string parser and then pass it down to the engine. You're instructing the regex engine directly. This is usually where the confusion lies when copy pasting a string version of the regex (instead of its parsed output) from your favorite programming language and seeing errors on the site...
  2. for @" ""\w+\"" "; specifically, .NET 7 pushes the string below to the regex engine. The regex engine is correct to not error out for a superfluous escaping of a non-metacharacter ".
    "\w+\" 
  3. The code generator does NOT check your regex's ability to run in that programming language. It just escapes characters or dresses the pattern in whatever function is appropriate for the target you pick - all in javascript.
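
Point 2 can be checked with a small sketch. It assumes the verbatim literal is written without the padding spaces used above for readability; the class name `VerbatimDemo` and the comparison against a regular literal are only for illustration and are not part of regex101's generated code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimDemo
{
    // Verbatim literal: each "" collapses to one ", every \ is kept as-is.
    public const string Verbatim = @"""\w+\""";

    // The same six characters written as a regular literal, for comparison.
    public const string Regular = "\"\\w+\\\"";

    public static void Main()
    {
        // This is the text the regex engine actually receives.
        Console.WriteLine(Verbatim);            // prints "\w+\"
        Console.WriteLine(Verbatim == Regular); // prints True

        // The engine accepts the superfluous \" and treats it as a plain quote.
        Match m = Regex.Match("this a \"test\"", Verbatim);
        Console.WriteLine($"{m.Value} found at index {m.Index}"); // "test" found at index 7
    }
}
```

The backslash in front of the closing quote reaches the engine unchanged, and the engine simply matches a literal ".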

Off topic-ish: Some stuff .NET 7 does with raw strings that I find intriguing.

// this confuses the living snot out of .NET 7. 
// It just picks @" rather than @""" with a single regex token: \w+
// I don't know how it chooses it.
string pattern = @"""\w+""";
// This one refuses to be considered @""" also. Instead, @" which results in 
// a ", followed by a space character, and so on.
string pattern = @""" \w+ """;

franckleveque commented 3 weeks ago

Hi, when you use @ to begin a .net string it takes most of the string as is. Only a quote escape sequence ("") isn't interpreted literally; it produces one double quotation mark.

https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/tokens/verbatim

franckleveque commented 3 weeks ago

Your off-topic question is interesting. If I hadn't checked the Microsoft documentation, I would have said that only two ways to declare a string existed (string interpolation beginning with $ is excluded from the following):

string pattern = "some text";
string pattern = @"some text";

The first way uses standard backslash escaping for special characters such as ", carriage return, Unicode characters, and so on.

The second uses the verbatim escaping explained in the previous message.

It seems that since C# 11 (.NET 7) there is a new way: using """ at the beginning and the end of a string gives a raw string without any escape sequence interpretation, and it is not preceded by @. However, I have never used it.

https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/strings/
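
As a rough illustration of the three declaration styles, the sketch below writes the pattern from this issue (\"\w+\") once in each form and compares the results. The class name `LiteralForms` is made up for the example, and the raw literal assumes a compiler that supports C# 11 or later.

```csharp
using System;

public static class LiteralForms
{
    // Regular literal: backslash is the escape character, so \ and " are both escaped.
    public const string Regular = "\\\"\\w+\\\"";

    // Verbatim literal: backslashes are literal, "" stands for one ".
    public const string Verbatim = @"\""\w+\""";

    // Raw literal (C# 11+): nothing is escaped. The multi-line form is used
    // because the content ends with a quote; the indentation of the closing
    // delimiter is stripped from the content line.
    public const string Raw = """
        \"\w+\"
        """;

    public static void Main()
    {
        Console.WriteLine(Regular == Verbatim && Verbatim == Raw); // prints True
        Console.WriteLine(Raw);                                    // prints \"\w+\"
    }
}
```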

md-at-slashwhy commented 3 weeks ago

Raw literals don't use @ in front. You'd want to use only """. Verbatim strings (@") can be combined with interpolation ($") but not with raw literals ("""). Your example is a verbatim string starting with an escaped ". For raw literals it should be string pattern = """ "\w+" """; (spaces for emphasis).

Off topic-ish: Some stuff .NET 7 does with raw strings that I find intriguing.

// this confuses the living snot out of .NET 7. 
// It just picks @" rather than @""" with a single regex token: \w+
// I don't know how it chooses it.
string pattern = @"""\w+""";
// This one refuses to be considered @""" also. Instead, @" which results in 
// a ", followed by a space character, and so on.
string pattern = @""" \w+ """;

I didn't look into what exactly the definition of "Delimiter" in regex101 is, but that is the source of confusion for me. The UI suggests I'm working in a verbatim string (indicated by the @" prefix). The input \""\w+\"" would therefore be the string \"\w+\", and that should be a valid regex (even with the superfluous escape). It also seems that \" binds more strongly than "", which would be a bug in a verbatim string and would explain the observed errors. Maybe a switch to raw literals would relieve the headache for all parties?
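
To make the suggestion concrete, here is a minimal sketch of how the text typed between @" and " could be decoded with verbatim-string rules before it is handed to the engine. It is not regex101's tokenizer; `VerbatimBody` and `TryDecode` are hypothetical names, and the only rule it applies is that "" means one " while a lone " would end the literal.

```csharp
using System;
using System.Text;

public static class VerbatimBody
{
    // Hypothetical helper: decodes the text typed between @" and " the way the
    // C# compiler would. Returns false when a lone " appears, because that
    // quote would terminate the literal early (an unescaped delimiter).
    public static bool TryDecode(string body, out string pattern)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < body.Length; i++)
        {
            if (body[i] == '"')
            {
                bool doubled = i + 1 < body.Length && body[i + 1] == '"';
                if (!doubled)
                {
                    pattern = string.Empty;
                    return false;
                }
                i++; // skip the first quote of the pair; the second is appended below
            }
            sb.Append(body[i]);
        }
        pattern = sb.ToString();
        return true;
    }

    public static void Main()
    {
        // The input from the issue, written as a raw literal so it appears exactly as typed.
        string typed = """
            \""\w+\""
            """;
        bool ok = TryDecode(typed, out string pattern);
        Console.WriteLine($"{ok}: {pattern}"); // prints True: \"\w+\"
    }
}
```

Under these rules a backslash never affects how the following quote is read, so "" always wins over \", which is the precedence a verbatim string has in C#.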