Closed stephentoub closed 2 years ago
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions See info in area-owners.md if you want to be subscribed.
Author: | stephentoub |
---|---|
Assignees: | - |
Labels: | `area-System.Text.RegularExpressions` |
Milestone: | 7.0.0 |
What do other engines do?
Isn't that decision also coupled to RegexOptions.CultureInvariant?
Isn't that decision also coupled to RegexOptions.CultureInvariant?
Not really. They'd all respect RegexOptions.CultureInvariant, which just dictates whether CultureInfo.InvariantCulture should be used instead of CultureInfo.CurrentInfo.
What do other engines do?
It varies, and to my knowledge no one else changes based on a CultureInfo equivalent.
@stephentoub I took another look at this to make sure the decision we took was the right one. Here are some of the notes I have for the other options:
Using upper casing instead of lower casing
In terms of number of behaviors across cultures and also number of behaviors against NLS vs ICU we have fairly the same number of behaviors to support than with lower casing. The main advantage of using lower casing for us is that we limit the amount of a breaking change this is from previous versions.
Using both upper casing and lower casing
In terms of number of behaviors across cultures this would produce the same number than our current approach, but the problem here is when you consider NLS vs ICU split. Currently, we only have around 3 special characters that vary with respect to NLS and ICU casing behavior, but if we were to also consider upper casing we would get 15 different behaviors, which becomes less ideal and harder to support.
Use CompareInfo.Compare
The problem with doing this is that this behavior doesn't match Unicode's and it is not guaranteed to stay the way it is today. There are also many different behaviors for it depending on the CultureInfo being used.
Using the case-folding tables that are part of the Unicode standard
This would be very similar to using upper casing and lower casing as mentioned in option number 2. The problem here is the difference when working with NLS vs ICU as that would render more inconsistencies.
Given the above, I do feel that the approach we have now is what we should go with for .NET 7, as it is:
@stephentoub if you agree, is it ok to close this issue now? Or do you want me to investigate some other options?
I'm fine with sticking with what we have.
@GrabYourPitchforks, @tarekgh, strong opinions to the contrary?
I am fine sticking with what we have too.
Thanks for confirming. Closing this for now.
In .NET 7 we've changed how IgnoreCase is implemented, and in doing so we've taken some corner case breaking changes. Given that, we have an opportunity to push those breaking changes a bit further and further tweak the approach, in particular what do we consider to be case-equivalent. As of now, the implementation in main says that two characters are equivalent if one lower cases to the other or they both lower case to the same character; this was chosen primarily for compat, as it's the closest equivalence function to what was used in .NET 6 and earlier (namely that two characters were considered equivalent if they both ToLower'd to the same character). But there are other options we can consider:
Whatever the choice, the work is straightforward: we'll just update with the decided-upon equivalence function the tool that generates the casing tables and re-run it.
cc: @GrabYourPitchforks, @joperezr