dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Incorrect String Comparison Results with ICU #101422

Open danstur opened 6 months ago

danstur commented 6 months ago

Description

With the switch to the ICU library string comparisons do not work as expected. The behavior also differs from what ICU should generally return.

In a case insensitive comparison SS and ß do not compare equal. According to the current Unicode case folding rules they should be equal as I understand it: 00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S (CaseFolding.txt)

Also checking with the ICU Unicode String Comparison it says that the result should be equal.

It seems to me that somehow the specific ICU library used with .NET 6/8 has a bug or that it is used incorrectly.

See also https://stackoverflow.com/questions/78371156/%c3%9f-ss-for-case-insensitive-comparison-with-icu and https://stackoverflow.com/questions/78364649/why-does-%c3%9f-equalsss-stringcomparison-currentcultureignorecase-differ-betw

Reproduction Steps

"ß".Equals("SS", StringComparison.CurrentCultureIgnoreCase); // returns false when using ICU library
// The same is true for Contains and IndexOf

Expected behavior

The above code should return true.

Actual behavior

The above code returns false when the ICU library is used. After setting

<ItemGroup>
  <RuntimeHostConfigurationOption Include="System.Globalization.UseNls" Value="true" />
</ItemGroup>

the code returns true as expected.

Regression?

A regression compared to any version before .NET 5 on Windows.

Known Workarounds

Specify System.Globalization.UseNls with Value="true". Sadly this only works under Windows and does not help with other platforms.
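Because the result depends on which globalization backend is active, it can help to check that at run time before comparing. The following is a minimal sketch of the detection heuristic as I recall it from the .NET globalization documentation; the helper name IcuMode is only illustrative, and the check relies on how the collation version is encoded, so treat it as a diagnostic aid rather than a guaranteed API.

```csharp
using System;
using System.Globalization;

// Heuristic (assumption): under ICU the collation version is also encoded
// in the first four bytes of SortId; under NLS it is not.
static bool IcuMode()
{
    SortVersion sortVersion = CultureInfo.InvariantCulture.CompareInfo.Version;
    byte[] bytes = sortVersion.SortId.ToByteArray();
    int version = bytes[3] << 24 | bytes[2] << 16 | bytes[1] << 8 | bytes[0];
    return version != 0 && version == sortVersion.FullVersion;
}

Console.WriteLine($"ICU in use: {IcuMode()}");

// The comparison from the reproduction steps; the result depends on the backend.
Console.WriteLine("ß".Equals("SS", StringComparison.CurrentCultureIgnoreCase));
```

Running this once with and once without the UseNls setting shows whether the switch actually took effect on the machine in question.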

Configuration

Dotnet SDK: 8.0.104
OS: Windows 11 22H2 (22621.3447)
Architecture: x64

Other information

No response

dotnet-policy-service[bot] commented 6 months ago

Tagging subscribers to this area: @dotnet/area-system-globalization See info in area-owners.md if you want to be subscribed.

jkotas commented 6 months ago

Duplicate of https://github.com/dotnet/runtime/issues/20599#issuecomment-374791597

tarekgh commented 6 months ago

Yes, this is a duplicate. @danstur you may try https://github.com/dotnet/runtime/issues/20599#issuecomment-374791597 to get the desired behavior. Feel free to ask more questions if you still have any. Thanks for your report.
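For readers who do not want to follow the link: the workaround discussed in this thread is to compare through CompareInfo with the IgnoreNonSpace option. A minimal sketch, assuming that IgnoreCase is combined with IgnoreNonSpace to keep the comparison case-insensitive; the helper name EqualsIgnoreCaseAndNonSpace is hypothetical and not taken from the linked comment.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: case-insensitive equality that also ignores nonspacing
// differences. Per this thread, these options make ICU compare at primary strength.
static bool EqualsIgnoreCaseAndNonSpace(string left, string right, CultureInfo cultureInfo)
{
    const CompareOptions Options = CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace;
    return cultureInfo.CompareInfo.Compare(left, right, Options) == 0;
}

CultureInfo current = CultureInfo.CurrentCulture;

// The case from this issue; reported in the thread to compare equal under ICU.
Console.WriteLine(EqualsIgnoreCaseAndNonSpace("ß", "SS", current));

// Side effect of the same options: diacritics are ignored as well.
Console.WriteLine(EqualsIgnoreCaseAndNonSpace("a", "ä", current));
```

The second call is there to show the trade-off: IgnoreNonSpace is documented as ignoring diacritics, so this workaround is broader than "treat ß like ss".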

danstur commented 6 months ago

@tarekgh Can you explain the need for IgnoreNonSpace which says "Indicates that the string comparison must ignore nonspacing combining characters, such as diacritics" and that rather more awkward API?

According to the ICU library ß should be case insensitive equal to ss independent of language, so it's not that I need some collation for this to work. And neither ß nor ss contain any "nonspacing combining characters" but are all single codepoints and ß doesn't have any decompositions listed. Looking at the "based on s" list (https://www.compart.com/en/unicode/U+0073) there's also no ß in that list.

Looking at the stackoverflow post, the commenters are all just as confused by the current behavior as I am.

If this is the expected behavior, it'd be great if you could point to some more exhaustive documentation that explains exactly how the normal String APIs behave. I honestly couldn't say what the expected behavior of Equals(..., StringComparison.CurrentCultureIgnoreCase) is in .NET 8.

Another-Ralf commented 6 months ago

What about documenting that behavior? Like here https://learn.microsoft.com/en-us/dotnet/csharp/how-to/compare-strings#linguistic-comparisons.

The documentation's examples explicitly discuss the "ß" vs. "SS" case, which implies looking at the comparison overloads that take a StringComparison, but that seems to be a dead end, at least to me, for getting the expected standard behavior. None of the StringComparison options translates to or includes the needed CompareOptions.IgnoreNonSpace, right?
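For APIs that accept a comparer rather than a StringComparison value, one option is StringComparer.Create(CultureInfo, CompareOptions), which carries the CompareOptions along. A minimal sketch, assuming the invariant culture and the IgnoreCase | IgnoreNonSpace combination discussed in this thread are acceptable for the use case:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// A comparer built from CompareOptions, usable wherever an
// IComparer<string> or IEqualityComparer<string> is accepted.
StringComparer comparer = StringComparer.Create(
    CultureInfo.InvariantCulture,
    CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace);

// Direct comparison of the pair from this issue.
Console.WriteLine(comparer.Equals("ß", "SS"));

// The same comparer drives lookups in a set.
var names = new HashSet<string>(comparer) { "Straße" };
Console.WriteLine(names.Contains("STRASSE"));
```

This does not help with string methods that only take a StringComparison; those still need a call through CompareInfo.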

tarekgh commented 6 months ago

To give some more info on why you are seeing this behavior: ICU collation works using what is called collation strength. Strength can be Primary, Secondary, Tertiary, or Quaternary. We try to map the .NET comparison options to one of these strengths as much as we can, which works fine except in special cases like this one. Unfortunately, ICU makes ß equal to ss only when the strength is Primary. We cannot switch to that strength by default in .NET because it would break many other things. I have even seen users complaining about ICU itself because of this case. The workaround in .NET is to use the comparison option IgnoreNonSpace, which causes ICU to use the Primary strength. You can play with that in the Collation Demo.
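To see how the options affect a given pair locally, a small probe like the following can be used. It is only a sketch: it prints whatever the active globalization backend returns for each option set and makes no claim about which ICU strength each one maps to.

```csharp
using System;
using System.Globalization;

CompareInfo compareInfo = CultureInfo.InvariantCulture.CompareInfo;

// Option sets to probe; extend as needed.
CompareOptions[] optionSets =
{
    CompareOptions.None,
    CompareOptions.IgnoreCase,
    CompareOptions.IgnoreNonSpace,
    CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace,
};

foreach (CompareOptions options in optionSets)
{
    // Compare returns 0 when the backend considers the strings equal.
    int result = compareInfo.Compare("ß", "ss", options);
    Console.WriteLine($"{options,-28} -> {(result == 0 ? "equal" : "different")}");
}
```

Swapping in other string pairs makes this a quick offline alternative to the Collation Demo for checking a specific case.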

I reactivated this issue to track collation customization for this special case of ss and ß .

danstur commented 6 months ago

@tarekgh Thanks for the explanation, Unicode once again being even more complicated than anticipated.

Is there any documentation on what collation strength the various comparison options use, and how I can figure out at what collation strength two characters compare equal? Or do I just throw them into the collation demo and check what happens that way? I guess for my use cases that's fine too.

Ah Unicode causing headaches as usual.

tarekgh commented 6 months ago

We don't document the mapping between the .NET options and ICU collation strength, as this is more of an implementation detail. But you can see the mapping here https://github.com/dotnet/runtime/blob/40bc2d88b6606324f5774cc972d480a1d26084f8/src/native/libs/System.Globalization.Native/pal_collation.c#L276 if you are interested.

Note, the default strength is always Tertiary.

andjc commented 6 months ago

As far as I can tell, it's working as expected. You are using a collation comparison, not a comparison of case-folded strings. The Unicode Collation Algorithm follows the relevant DIN standard for sorting ß.

The UCA maintains compatibility with the DIN standard for sorting German by having the German sharp-s (U+00DF (ß) LATIN SMALL LETTER SHARP S) sort as a secondary difference with "SS", instead of having ß and SS match at the secondary level.

If you are comparing strings using full case folding, they will match. If you are comparing using simple casefolding, they will not match. If you compare using a collator set to Primary strength they will match. If you use a different strength for the collator, they will not match.

Yes, under full casefolding, they should be treated the same, but they are intended to sort differently.
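The difference between simple and full case folding can be illustrated without any collation at all. The sketch below is a toy: FoldSharpS is a hypothetical helper that applies only the full case folding entries for sharp s (U+00DF and U+1E9E both fold to "ss" in CaseFolding.txt) and is not a general case folding implementation.

```csharp
using System;

// Hypothetical helper: expands sharp s the way full case folding does.
// string.Replace(string, string) performs an ordinal replacement.
static string FoldSharpS(string s) =>
    s.Replace("\u00DF", "ss").Replace("\u1E9E", "ss");

// Per-character (simple) case mapping: ß is never expanded to two letters,
// so the strings differ in length and cannot match.
bool simple = string.Equals("ß", "SS", StringComparison.OrdinalIgnoreCase);

// After expanding ß first, an ordinal case-insensitive comparison matches.
bool full = string.Equals(FoldSharpS("ß"), FoldSharpS("SS"), StringComparison.OrdinalIgnoreCase);

Console.WriteLine($"simple: {simple}, full: {full}"); // simple: False, full: True
```

This is equality only; as noted above, sorting is a separate question, and the two strings are still intended to sort differently.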

tarekgh commented 6 months ago

@andjc Indeed, we recognize that this behavior aligns with what Unicode collation defines. However, the issue arises from the fact that .NET Framework previously utilized Windows collation (NLS), where ss equated to ß. With the transition to ICU in .NET Core, some users have voiced concerns about this new behavior. The question at hand is whether we should adhere to the Unicode collation behavior or make a special case for this particular scenario.

andjc commented 6 months ago

I assume there are other differences as well.

As a rule I use multiple operating systems and multiple programming languages.

System based locale data and locale operations differ across implementations. I use ICU when I want consistent results across different platforms and different programming languages.

ICU by default uses the CLDR collation Algorithm, what is referred to as the root collation, and tailors that as required per locale, with some locales having multiple collation tailorings.

Changing the ICU collation to match NLS breaks that benefit of ICU. It also raises the question of whether German collation should be system dependent, i.e. using NLS rules on Windows and platform-specific rules on other platforms, which means results diverge by platform.

What comes to mind is that ICU supports multiple tailorings, including for German, ie standard vs phonebook style collation. This can be enabled by using a variant locale, either using POSIX or BCP47 identifiers. Given that alternative collation rules are already available, a logical approach would be to add another locale variant to kick in NLS compatible collation.

That way you can retain ICU's collation and add a tailored collation that changes the collation weight of ß to match previous implementations.

Also, given that it was noted above that the collation strength is set to Tertiary, how do you handle Japanese? I was under the impression that Excel and other apps used a sort that would be equivalent to Quaternary strength.

tarekgh commented 6 months ago

> What comes to mind is that ICU supports multiple tailorings, including for German, ie standard vs phonebook style collation. This can be enabled by using a variant locale, either using POSIX or BCP47 identifiers. Given that alternative collation rules are already available, a logical approach would be to add another locale variant to kick in NLS compatible collation.

Regrettably, the behavior of equating 'ss' to 'ß' functioned uniformly across all locales in NLS, which is what users requesting this behavior expect. As previously noted in this issue, there exists a workaround for users who wish to utilize it, namely employing the 'IgnoreNonSpace' comparison option.

In principle, we endeavor to adhere closely to CLDR/Unicode behavior, which is why no action has been taken on this issue thus far. However, users persist in expressing discontent regarding it. These are mostly users who have used .NET Framework for a while.