dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Regex Escape Sequence inconsistency in C# compared to C++ and VBScript #92379

Closed vsfeedback closed 11 months ago

vsfeedback commented 11 months ago

This issue has been moved from a ticket on Developer Community (https://developercommunity.visualstudio.com/t/Regex-Escape-Sequence-inconsistency-in-C/10464033).


Hello,

I discovered unexpected behavior in the C# regex parser, and I need to make a decision about it: either find a way to make it work, or replace the C# regex parser with the C++ one. In our company we have a big project with modules written in many languages (from desktop to web, plus scripting languages). Recently we discovered that the C# parser throws an exception while parsing a regex, while another parser that uses C++ regex does not, and another one in VBScript works just fine.

The used pattern value is \listBox

This is the example:

=========================C#=========================

string pattern = "\\listBox";
Regex.IsMatch("listBox", pattern);

It is throwing: 'Invalid pattern '\listBox' at offset 2. Unrecognized escape sequence \l.'

========================C++=========================

std::string pattern = "\\listBox";
std::regex regex(pattern);
std::cmatch match;
std::regex_match("listBox", match, regex);
std::cout << match.size();

It will print 1, as intended

======================VBScript======================

Set vbRegEx = CreateObject("VBScript.RegExp")
vbRegEx.Pattern = "\listBox"
Set matchList = vbRegEx.Execute("listBox")

print matchList.Count

It will print 1, as intended

So my question is: why is there this inconsistency in C#? Is there a way to avoid it?

I am thinking of removing .NET regex from all our projects and using a C++ wrapper instead, to have consistency across all projects, but first I would really like to understand whether there is any way to overcome this problem.

Kind Regards, Cristian


Original Comments

Feedback Bot on 9/13/2023, 03:49 AM:

(private comment, text removed)


Original Solutions

(no solutions)

ghost commented 11 months ago

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions See info in area-owners.md if you want to be subscribed.

Issue Details
Author: vsfeedback
Assignees: -
Labels: `area-System.Text.RegularExpressions`
Milestone: -
stephentoub commented 11 months ago

> Is there a way to avoid it?

You can specify the RegexOptions.ECMAScript flag when constructing the Regex. One of the things that flag affects is whether an unrecognized escape character throws or is just treated as the literal character.
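To make the suggestion concrete, here is a minimal sketch comparing the default options with RegexOptions.ECMAScript for the pattern from the report. The class name EscapeDemo and the helper TryIsMatch are hypothetical names made up for illustration; only Regex.IsMatch and RegexOptions come from .NET itself. The sketch assumes that pattern parse errors surface as ArgumentException or a type derived from it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Hypothetical helper: returns true/false for match/no match,
    // or null when the regex parser rejects the pattern.
    public static bool? TryIsMatch(string input, string pattern, RegexOptions options)
    {
        try
        {
            return Regex.IsMatch(input, pattern, options);
        }
        catch (ArgumentException)
        {
            // Assumption: parse failures are reported as ArgumentException
            // (or a derived type such as RegexParseException).
            return null;
        }
    }

    public static void Main()
    {
        // Verbatim string, so the regex parser sees \listBox.
        string pattern = @"\listBox";

        bool? withDefaults = TryIsMatch("listBox", pattern, RegexOptions.None);
        bool? withEcma = TryIsMatch("listBox", pattern, RegexOptions.ECMAScript);

        Console.WriteLine(withDefaults is null
            ? "default: pattern rejected"
            : "default: pattern accepted");
        Console.WriteLine(withEcma == true
            ? "ECMAScript: matched"
            : "ECMAScript: no match or rejected");
    }
}
```

Based on the behavior described in this thread, the first line should report the pattern as rejected and the second should report a match. Note that RegexOptions.ECMAScript changes other behaviors as well, so it is worth re-testing existing patterns before switching a whole project over.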

claudiudc commented 11 months ago

@stephentoub I've already tried it, however it does not truly make it consistent.

For example, with another pattern:

string pattern = "\\kstBox";
Regex.IsMatch("kstBox", pattern, RegexOptions.ECMAScript);

It is throwing: System.Text.RegularExpressions.RegexParseException: 'Invalid pattern '\kstBox' at offset 3. Malformed \k<...> named back reference.'

But this works in C++ and it does match.

I remember finding other such scenarios, because I had already tested this in the hope that ECMAScript would make the behavior consistent, but it does not.

stephentoub commented 11 months ago

That's because \k has special meaning: https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference#backreference-constructs
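For context, this is a small sketch of what \k is reserved for in .NET: a named backreference of the form \k<name>. The class name BackreferenceDemo, the helper HasDoubledChar, and the group name ch are hypothetical, chosen only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // (?<ch>\w) captures one word character into the group named "ch";
    // \k<ch> then requires that same character to appear again.
    public const string DoubledChar = @"(?<ch>\w)\k<ch>";

    public static bool HasDoubledChar(string input) =>
        Regex.IsMatch(input, DoubledChar);

    public static void Main()
    {
        Console.WriteLine(HasDoubledChar("book"));   // "oo" repeats a character
        Console.WriteLine(HasDoubledChar("kstBox")); // no adjacent repeated character
    }
}
```

Because the parser expects a group name after \k, a pattern such as \kstBox is reported as a malformed named back reference rather than being read as a literal k, even with RegexOptions.ECMAScript.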

There is no regex syntax that is 100% portable across all languages and environments. See https://arxiv.org/abs/2105.04397 as a very rigorous study of the subject.

claudiudc commented 11 months ago

Has introducing a new regex option that aligns with the C++ standard ever been considered or debated? I am now thinking of no longer using .NET Regex for this feature and using a C++ wrapper everywhere; however, what I have not tested yet is the possible performance overhead from the many calls from managed to native code and back. I hope the overhead will be minimal. Thank you for providing valuable info.

stephentoub commented 11 months ago

> Has introducing a new regex option that aligns with the C++ standard ever been considered or debated?

We have no such plans. There's also not just one such syntax supported in C++, but multiple with varying syntaxes and capabilities.