dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.95k stars 4.65k forks source link

StringComparison.OrdinalIgnoreCase does not consistently match ToUpperInvariant #67873

Open GrabYourPitchforks opened 2 years ago

GrabYourPitchforks commented 2 years ago

Per the following two documents, comparing strings using StringComparison.OrdinalIgnoreCase is explicitly documented as equivalent to calling ToUpperInvariant on each string, then performing an ordinal comparison against the contents.

These statements within the docs imply that ToUpperInvariant and ToLowerInvariant are ordinal case conversions ("simple case mapping"), not linguistic case conversions. However, it looks like we're not consistenly following this pattern.

string s1 = "s";
string s2 = "\u017f"; // Latin Sharp S, which uppercase-maps to a normal ASCII "S"
Console.WriteLine(s1.Equals(s2, StringComparison.OrdinalIgnoreCase)); // False
Console.WriteLine(s1.ToUpperInvariant() == s2.ToUpperInvariant()); // True

This has collateral impact. For example, recent PRs like https://github.com/dotnet/runtime/pull/67758 assume that non-ASCII characters cannot case-map to ASCII characters, which is not a guarantee offered by Unicode, but which might be a guarantee we'd be willing to make separately within the runtime by munging the Unicode tables.

See also https://github.com/dotnet/runtime/issues/30960 for further discussion on case mapping as a more general Unicode concept.

/cc @tarekgh, who had thoughts on this offline.

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-globalization See info in area-owners.md if you want to be subscribed.

Issue Details
Per the following two documents, comparing strings using `StringComparison.OrdinalIgnoreCase` is explicitly documented as equivalent to calling `ToUpperInvariant` on each string, then performing an ordinal comparison against the contents. - https://docs.microsoft.com/dotnet/api/system.stringcomparer.ordinalignorecase#remarks - https://docs.microsoft.com/dotnet/standard/base-types/best-practices-strings#ordinal-string-operations These statements within the docs imply that _ToUpperInvariant_ and _ToLowerInvariant_ are ordinal case conversions ("simple case mapping"), not linguistic case conversions. However, it looks like we're not consistenly following this pattern. ```cs string s1 = "s"; string s2 = "\u017f"; // Latin Sharp S, which uppercase-maps to a normal ASCII "S" Console.WriteLine(s1.Equals(s2, StringComparison.OrdinalIgnoreCase)); // False Console.WriteLine(s1.ToUpperInvariant() == s2.ToUpperInvariant()); // True ``` This has collateral impact. For example, recent PRs like https://github.com/dotnet/runtime/pull/67758 assume that non-ASCII characters cannot case-map to ASCII characters, which is [not a guarantee offered by Unicode](https://unicode.org/faq/casemap_charprop.html), but which might be a guarantee we'd be willing to make separately within the runtime by munging the Unicode tables. See also https://github.com/dotnet/runtime/issues/30960 for further discussion on case mapping as a more general Unicode concept. /cc @tarekgh, who had thoughts on this offline.
Author: GrabYourPitchforks
Assignees: -
Labels: `area-System.Globalization`
Milestone: -
ericstj commented 1 month ago

@tarekgh is there anything planned for this issue in .NET 9.0?