Finalize decision on RegexOptions.IgnoreCase case-equivalence function

stephentoub commented 2 years ago

In .NET 7 we've changed how IgnoreCase is implemented, and in doing so we've taken some corner case breaking changes. Given that, we have an opportunity to push those breaking changes a bit further and further tweak the approach, in particular what do we consider to be case-equivalent. As of now, the implementation in main says that two characters are equivalent if one lower cases to the other or they both lower case to the same character; this was chosen primarily for compat, as it's the closest equivalence function to what was used in .NET 6 and earlier (namely that two characters were considered equivalent if they both ToLower'd to the same character). But there are other options we can consider:

Using upper casing instead of lower casing
Using both upper casing and lower casing
Use CompareInfo.Compare
Using the case-folding tables that are part of the Unicode standard

Whatever the choice, the work is straightforward: we'll just update with the decided-upon equivalence function the tool that generates the casing tables and re-run it.

cc: @GrabYourPitchforks, @joperezr

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions See info in area-owners.md if you want to be subscribed.

Issue Details

In .NET 7 we've changed how IgnoreCase is implemented, and in doing so we've taken some corner case breaking changes. Given that, we have an opportunity to push those breaking changes a bit further and further tweak the approach, in particular what do we consider to be case-equivalent. As of now, the implementation in main says that two characters are equivalent if one lower cases to the other or they both lower case to the same character; this was chosen primarily for compat, as it's the closest equivalence function to what was used in .NET 6 and earlier (namely that two characters were considered equivalent if they both ToLower'd to the same character). But there are other options we can consider: - Using upper casing instead of lower casing - Using both upper casing and lower casing - Using the case-folding tables that are part of the Unicode standard Whatever the choice, the work is straightforward: we'll just update with the decided-upon equivalence function the tool that generates the casing tables and re-run it. cc: @GrabYourPitchforks, @joperezr

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`
Milestone:	7.0.0

danmoseley commented 2 years ago

What do other engines do?

Tornhoof commented 2 years ago

Isn't that decision also coupled to RegexOptions.CultureInvariant?

stephentoub commented 2 years ago

Isn't that decision also coupled to RegexOptions.CultureInvariant?

Not really. They'd all respect RegexOptions.CultureInvariant, which just dictates whether CultureInfo.InvariantCulture should be used instead of CultureInfo.CurrentInfo.

What do other engines do?

It varies, and to my knowledge no one else changes based on a CultureInfo equivalent.

joperezr commented 2 years ago

@stephentoub I took another look at this to make sure the decision we took was the right one. Here are some of the notes I have for the other options:

Using upper casing instead of lower casing

In terms of number of behaviors across cultures and also number of behaviors against NLS vs ICU we have fairly the same number of behaviors to support than with lower casing. The main advantage of using lower casing for us is that we limit the amount of a breaking change this is from previous versions.

Using both upper casing and lower casing

In terms of number of behaviors across cultures this would produce the same number than our current approach, but the problem here is when you consider NLS vs ICU split. Currently, we only have around 3 special characters that vary with respect to NLS and ICU casing behavior, but if we were to also consider upper casing we would get 15 different behaviors, which becomes less ideal and harder to support.

Use CompareInfo.Compare

The problem with doing this is that this behavior doesn't match Unicode's and it is not guaranteed to stay the way it is today. There are also many different behaviors for it depending on the CultureInfo being used.

Using the case-folding tables that are part of the Unicode standard

This would be very similar to using upper casing and lower casing as mentioned in option number 2. The problem here is the difference when working with NLS vs ICU as that would render more inconsistencies.

Given the above, I do feel that the approach we have now is what we should go with for .NET 7, as it is:

Defendable: easy to reason about
Minimal breaking: it closely matches what we did before, so there are less breaking changes.
Less number of behavior special cases: Both between cultures and NLS vs ICU modes.

@stephentoub if you agree, is it ok to close this issue now? Or do you want me to investigate some other options?

stephentoub commented 2 years ago

I'm fine with sticking with what we have.

@GrabYourPitchforks, @tarekgh, strong opinions to the contrary?

tarekgh commented 2 years ago

I am fine sticking with what we have too.

joperezr commented 2 years ago

Thanks for confirming. Closing this for now.

dotnet / runtime

Finalize decision on RegexOptions.IgnoreCase case-equivalence function #69039