dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.36k stars 4.74k forks

JavaScript RegEx cannot be used in C# because "--" is not supported #79982

Open jogibear9988 opened 1 year ago

jogibear9988 commented 1 year ago

Browsers now support "--" in Regex, see: https://v8.dev/features/regexp-v-flag#difference

so for example this works:

  ^[_--[0-9]]+$

but this does not work in C#, because "--" is not supported. Would it be possible to add this?

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions See info in area-owners.md if you want to be subscribed.

Issue Details
Author: jogibear9988
Assignees: -
Labels: `area-System.Text.RegularExpressions`, `untriaged`
Milestone: -

jogibear9988 commented 1 year ago

also another difference:

Capture group names in JavaScript can be, for example, "$" or Unicode characters. In C# that is not allowed.

danmoseley commented 1 year ago

also another difference

Please open a separate issue for this if you want to propose it.

In general, it's not a goal to support everything other engines do; they are all different. Although .NET is broadly a subset of the Perl flavor, it has its own features.

To evaluate a feature request then, considerations would likely include

stephentoub commented 1 year ago

Browsers now support "--"

The feature this refers to is subtraction, which .NET's regex already supports, just with a single - instead of --. For example, [a-z-[m-p]] is the same as [a-lq-z], i.e. all the letters a through z except for m through p.
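As a minimal sketch of the point above (the variable names and sample inputs are illustrative, not from the thread), the following compares the subtraction form with its expanded equivalent, and rewrites the JavaScript pattern from the original post using .NET's single `-`:

```csharp
using System;
using System.Text.RegularExpressions;

// .NET character class subtraction: a single '-' before the nested class.
var subtraction = new Regex(@"^[a-z-[m-p]]+$");
// The same set written out by hand: a through l, plus q through z.
var expanded = new Regex(@"^[a-lq-z]+$");

foreach (var input in new[] { "abc", "lqz", "mno", "ap" })
{
    // Both patterns should agree on every input.
    Console.WriteLine($"{input}: {subtraction.IsMatch(input)} / {expanded.IsMatch(input)}");
}

// The JavaScript v-flag pattern ^[_--[0-9]]+$ rewritten with a single '-':
// the set containing '_' minus the digits 0 through 9.
var translated = new Regex(@"^[_-[0-9]]+$");
bool underscoresMatch = translated.IsMatch("___");
bool digitMatch = translated.IsMatch("_1");
Console.WriteLine($"{underscoresMatch} {digitMatch}");
```

Inputs made only of letters outside m through p should match both of the first two patterns, while anything containing m, n, o, or p should match neither.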

jogibear9988 commented 1 year ago

@stephentoub I know that C# already supports it. The issue is more about whether we could additionally support more syntax, so that JavaScript regexes could be used in C# as well. I ran into this while running the test262 test suite against the esprima.net JavaScript parser, because it uses .NET regexes directly instead of its own JavaScript regex engine.

There are a few more issues, for example capture group names, and maybe more. So here I wanted to ask: would we work on supporting more of the JavaScript regex syntax?

stephentoub commented 1 year ago

The issue is more about whether we could additionally support more syntax, so that JavaScript regexes could be used in C# as well.

There are tons of minute differences between regex syntaxes across languages. https://davisjam.medium.com/why-arent-regexes-a-lingua-franca-esecfse19-a36348df3a2 is a nice paper highlighting how incompatible regexes actually are across platforms.

jogibear9988 commented 1 year ago

@stephentoub I know that there are many differences. The question is: are tickets/issues/pull requests to remove them allowed, or planned to be resolved? Or is this not an option?

jogibear9988 commented 1 year ago

For example, for projects such as esprima-dotnet (https://github.com/sebastienros/esprima-dotnet) or Jint (https://github.com/sebastienros/jint), I think it would be a win if regexes that work in JavaScript also worked in their engines.

stephentoub commented 1 year ago

Every such change is almost certainly a breaking change, e.g. if I change your example to ^[!---[0-9]]+$, that's already valid syntax and means something different (the range between ! and - without the digits 0 through 9). There would need to be very strong justification for breaking existing expressions, and making the syntax closer to that used by another language (and further from that used by other languages) is not strong-enough justification.
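A small sketch of why this is a compatibility concern (the sample inputs are illustrative): the pattern below is accepted by .NET today, where the first two dashes form the range from `!` to `-` and the third introduces the subtraction. If `--` were reinterpreted as an operator, the meaning of this existing pattern would change.

```csharp
using System;
using System.Text.RegularExpressions;

// Valid today: the range '!'..'-' (U+0021..U+002D) minus the digits 0 through 9.
var existing = new Regex(@"^[!---[0-9]]+$");

// '!', '#', and '-' all fall inside the '!'..'-' range.
bool insideRange = existing.IsMatch("!#-");
// '_' (U+005F) lies outside that range.
bool outsideRange = existing.IsMatch("_");

Console.WriteLine($"{insideRange} {outsideRange}");
```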

lahma commented 1 year ago

I wonder whether RegexOptions.ECMAScript would allow such changes in behavior. Based on the documentation, it's meant to support ECMAScript behavior after all, and as an end user I would expect JS regexes to work somewhat similarly. It has a bold sales pitch:

Enables ECMAScript-compliant behavior for the expression

I do understand the worry about breaking changes, maybe a new option like ESNext would be needed for cutting-edge behavior 🙂

jogibear9988 commented 1 year ago

I never saw (or tried) the ECMAScript option, but as you said, if it's set, I think we should then support the same regexes. Maybe we should check which of the Test262 regexes do not work (and also disable our hacks). Then we could create an issue for what needs to be fixed.

cyraid commented 1 year ago

I'm not sure if @stephentoub knew about RegexOptions.ECMAScript, but the problem with adding an ESNext option would be: what new option would have to be added for the next version? IMO, ECMAScript should mean just that. If you have it on, your regex should work in ECMAScript compliance mode. If you have a JavaScript script and the interpreter gets upgraded, would your existing script stop working? I'm sure ECMAScript also has regex backwards compatibility. That's just my 2 cents.

danmoseley commented 1 year ago

Changing ECMAScript mode would involve the same breaking change concerns. Apps break when customers upgrade. We'd have to have a strong reason and convince ourselves that very few apps use patterns that would be broken.

jogibear9988 commented 1 year ago

I don't see the point of ECMAScript mode if it doesn't run ECMAScript regexes...

cyraid commented 1 year ago

I don't see the point of ECMAScript mode if it doesn't run ECMAScript regexes...

Exactly. Using a special compliance mode would be your opt-in for that behavior, whatever it entails. Using the normal regex from C# would be the case I would be hesitant to change.

Customers working around a broken solution would be happy to have the ECMAScript mode working as intended and to remove their workarounds, no? Again, my $0.02.

steveharter commented 1 year ago

Closing; based on above conversation the -- is rarely used and at this point not worth a breaking change.

We can re-open this if we get additional asks here.

lahma commented 1 year ago

The whole ECMAScript mode is basically broken when it comes to any JS feature released in the last ten years (or more), so maybe the problem is still present?

steveharter commented 1 year ago

Just to clarify that the request for this issue is to support more EcmaScript features, including --. This could be implemented by using the existing RegexOptions.ECMAScript option but that would still be breaking. We could add a new flag if that is better.

FYI: the current ECMAScript behavior: https://learn.microsoft.com/dotnet/standard/base-types/regular-expression-options#ecmascript-matching-behavior
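For illustration, one behavior the existing option does change is how `\d` is interpreted. This is a minimal sketch; the sample character and variable names are chosen for the example and are not from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

// U+0663 ARABIC-INDIC DIGIT THREE: a Unicode decimal digit that is not in 0-9.
var arabicIndicThree = "\u0663";

// By default, \d matches any Unicode decimal digit.
bool defaultMatch = Regex.IsMatch(arabicIndicThree, @"^\d$");
// With RegexOptions.ECMAScript, \d is restricted to [0-9].
bool ecmaMatch = Regex.IsMatch(arabicIndicThree, @"^\d$", RegexOptions.ECMAScript);

Console.WriteLine($"{defaultMatch} {ecmaMatch}");
```

The option narrows a handful of existing constructs like this one; it does not add newer JavaScript syntax such as the `--` operator discussed in this issue.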

steveharter commented 1 year ago

Re-opened and moved to Future; not clear what the priority of adding additional EcmaScript support is.

jogibear9988 commented 1 year ago

When I searched GitHub for RegexOptions.ECMAScript, I found over 3600 files including it, so the chance of breaking something is not as low as I thought. I think the best way would be to introduce a new flag (if this is done at all).

And if it is done, it would be nice if all the test262 regexp tests were run against it. Maybe this could then be achieved via Jint.