Breaking change proposal: OrdinalIgnoreCase string comparison, ToUpperInvariant, and ToLowerInvariant to use ICU on all platforms

GrabYourPitchforks commented 5 years ago

Proposal

.NET Core provides APIs to compare strings for ordinal case-insensitive equality (such as via StringComparer.OrdinalIgnoreCase). The current implementation of this API is to call ToUpperInvariant on each string, then compare the resulting uppercase strings for bitwise equality.

.NET Core also provides methods to convert chars, Runes, and strings to uppercase or lowercase using the "invariant" culture (ToUpperInvariant / ToLowerInvariant). The current implementation of this API is to p/invoke NLS on Windows or ICU on non-Windows.

I propose changing the logic so that .NET Core carries its own copy of the ICU "simple" casing tables, and we consult our copies of those tables on all operating systems. This would affect string comparison only when using an OrdinalIgnoreCase comparison, and it would affect string casing only when using CultureInfo.InvariantCulture.

Justification

Today, when processing UTF-8 data in a case-insensitive manner (such as via Equals(..., OrdinalIgnoreCase)), we must first transcode the data to UTF-16 so that it can go through the normal p/invoke routines. This transcoding and p/invoke adds unnecessary overhead. With this proposal, we'd be able to consult our own local copies of the casing table, which eliminates much of this overhead and streamlines the comparison process. This performance boost should also be applicable to existing UTF-16 APIs such as string.ToUpperInvariant and string.ToLowerInvariant since we'd be able to optimize those calls.

As mentioned earlier, today's casing tables involve p/invoking NLS or ICU, depending on platform. This means that comparison / invariant casing APIs could provide different results on different operating systems. Even within the same operating system family, the casing tables can change based on OS version. (Windows 10 1703 has different casing tables than Windows 10 1903, for instance.)

Here are some concrete examples demonstrating the problems:

// 'ß' is U+00DF LATIN SMALL LETTER SHARP S
// 'ẞ' is U+1E9E LATIN CAPITAL LETTER SHARP S

string toUpper = "ß".ToUpperInvariant(); // returns "ß" on all OSes
string toLower = "ẞ".ToLowerInvariant(); // returns "ẞ" on Windows, otherwise "ß"
bool areEqual = "ß".Equals("ẞ", StringComparison.OrdinalIgnoreCase); // returns "False" on Windows, otherwise "True"

With this proposal, the code above will behave the same across all OSes. They would follow what is today's non-Windows behavior. They'd be locked to whatever version of the Unicode data we include in the product as part of the CharUnicodeInfo class. This data changes each release to reflect recent modifications to the Unicode Standard. As of this writing, the data contained within the CharUnicodeInfo class follows the Unicode Standard 11.0.0.

Breaking change discussion

Affected APIs:

string / char / Rune equality methods or hash code generation routines which take StringComparison.OrdinalIgnoreCase as a parameter. All other comparisons are unchanged.
string / char / Rune case changing methods when CultureInfo.InvariantCulture is provided. All other cultures are unchanged.
Extension methods on ReadOnlySpan<char> which provide equivalent functionality to the above.
StringComparer.OrdinalIgnoreCase. All other StringComparer instances are unchanged.
Case changing methods on CultureInfo.InvariantCulture.TextInfo.

If GlobalizationMode.Invariant is specified, the behavior will be the same as it is today, where non-ASCII characters remain unchanged.

Applications which depend on OrdinalIgnoreCase equality being stable may be affected by this proposed change. That is, if an application relies on "ß" and "ẞ" being not equal under an OrdinalIgnoreCase comparer, that application is likely to experience odd behavior in the future.

In general, applications cannot rely on such behavior anyway, because as previously mentioned the operating system historically has updated casing tables under the covers without the application getting a say. For example, after installing a new Windows version, a comparison which previously returned false might start returning true:

string a = "ꝍ"; // U+A74D
string b = "Ꝍ"; // U+A74C

// today, may be "True" or "False" depending on which Windows version the app is running on.
// with this proposal, always returns "True"
bool areEqual = string.Equals(a, b, StringComparison.OrdinalIgnoreCase);

Furthermore, the string equality and case mapping information might be different between a web frontend application and the database it's using for backend storage. So performing such checks at the application level was never 100% reliable to begin with.

There is a potential oddity with this proposal: depending on operating system, two strings which compare as equal using OrdinalIgnoreCase might compare as not equal using InvariantCultureIgnoreCase. For example:

// with this proposal, returns "True" across all OSes
bool equalsOIC = "ß".Equals("ẞ", StringComparison.OrdinalIgnoreCase);

// with this proposal, returns "False" on Windows, "True" otherwise
bool equalsICIC = "ß".Equals("ẞ", StringComparison.InvariantCultureIgnoreCase);

I don't expect this to trip up most applications because I don't believe it to be common for an application to compare a string pair using two different comparers, but it is worth pointing out as a curious edge case.

This may also lead to a discrepancy between managed code which uses StringComparison.OrdinalIgnoreCase and unmanaged code (including within the runtime) which uses CompareStringOrdinal on Windows. I cannot think offhand of any components which do this, but we need to be mindful that such a discrepancy might occur.

stephentoub commented 5 years ago

What would be the size impact?

GrabYourPitchforks commented 5 years ago

@stephentoub To System.Private.CoreLib.dll? Around 3 KB to carry the data.

Edit: This number is derived from adding 4 casing tables to CharUnicodeInfoData.cs. Each table is a RVA static ROS<byte> consisting of approximately 200 32-bit integers each. 200 entries per table 4 bytes per entry 4 tables = approx. 3 KB. We can reuse the 8:4:4 Level1 / Level2 / Level3 table structure already present in that file.

danmoseley commented 5 years ago

cc @tarekgh

Just for my own education, why do we carry tables at all rather than calling into the OS/ICU? Performance? Or code that must run before ICU loaded?

tarekgh commented 5 years ago

Just for my own education, why do we carry tables at all rather than calling into the OS/ICU? Performance? Or code that must run before ICU loaded?

There are some Unicode data Windows doesn't expose. so we had to carry such tables. When we fully switch to use ICU, we can look at not carrying these table but I don't think we can do that today as we need to run on OS versions not having ICU.

scalablecory commented 5 years ago

How will sort keys work?

CultureInfo.InvariantCulture.CompareInfo.GetSortKey(str, CompareOptions.OrdinalIgnoreCase)

Will this be incompatible with String's sorting, or will it use the same data?

GrabYourPitchforks commented 5 years ago

@danmosemsft Want to make Windows 10, 1703 a minbar for .NET Core 5? :) That exposes the APIs we'd need.

@scalablecory CultureInfo.InvariantCulture.CompareInfo is unchanged. This proposal instead affects the behavior of CultureInfo.InvariantCulture.TextInfo.ChangeCase. I misspoke in the original issue description and will clarify it shortly.

tarekgh commented 5 years ago

GetSortKey doesn't allow passing CompareOptions.OrdinalIgnoreCase. It throws.

GrabYourPitchforks commented 5 years ago

Here's a bit more on this, to help fill in some missing details. The Unicode Standard defines four types of "simple" case conversion for every code point: toLower, toUpper, toTitle, and toCaseFold. toLower and toUpper are fairly self-explanatory. I'll skip toTitle since it's not particularly relevant to this discussion. toCaseFold is interesting for reasons we'll see shortly.

First, a quick definition. Simple case conversion simply means that a single code point case-maps to a single code point. This is contrast to full case conversion, where a single code point might case-map to multiple code points. In .NET, our ToUpper and ToLower routines perform simple case conversion, as demonstrated below:

// 'ﬃ' is U+FB03 LATIN SMALL LIGATURE FFI

string toUpperSimple = "ﬃ".ToUpperInvariant();
string toUpperFull = "ﬃ".ToUpperInvariantFull(); // assume such a method exists

Console.WriteLine(toUpperSimple); // prints "ﬃ" (the single-code point string)
Console.WriteLine(toUpperFull); // prints "FFI" (the three-code point string)

Unicode defines no single code point which represents a "large" version of the FFI ligature. So the ToUpperInvariant framework API no-ops on the input (since it uses simple case mapping). If we had a theoretical API which instead performed full case mapping, it would return the three-character string "FFI", as shown above. This is also reflected in Unicode's CLDR data files via the Uppercase_Mapping field in the table at https://unicode.org/cldr/utility/character.jsp?a=FB03.

The big benefit of "simple" case mapping is that under UTF-16 or UTF-32, the output string has a length which exactly matches the input string's length. For example, an 8-char input string always results in an 8-char output string. This makes such APIs very easy to reason about. "Full" case mapping (which the Framework doesn't support) could instead result in an output string whose length is different than the input string, complicating application logic.

(In UTF-8, even under "simple" case mapping, the byte length of the output string isn't necessarily equal to the byte length of the input string. The details of this aren't particularly interesting to this discussion, but I wanted to call it out here for completeness.)

Now, to case folding.

For various reasons, certain code points do not map to their logical uppercase or lowercase versions. For example, even though the Unicode Standard defines an uppercase 'ẞ', the return value of "ß".ToUpperInvariant() is "ß", not "ẞ". (See Michael Kaplan's blog for the whole sordid backstory.) This means that if we naively define "case-insensitive" equality as "does a.ToUpperInvariant() equal b.ToUpperInvariant()?", certain strings like "ß" and "ẞ" will compare as not equal! And this is in fact how Windows defines the Win32 CompareStringOrdinal method.

To work around this, Unicode defines a case mapping called case fold mapping, which is a not-quite-uppercase / not-quite-lowercase mapping that can be applied to every scalar value within a string to generate a new string that can be compared in a case-insensitive fashion. That is, Unicode defines simple case-insensitive equality as "does a.ToSimpleCaseFold() equal b.ToSimpleCaseFold()?" Since "ß" and "ẞ" both fold to "ß" (see CaseFolding.txt), they should compare as equal under a case-insensitive comparison following Unicode's rules.

So, in summary, here's what the various Framework APIs are intended to mean, along with how this proposal will affect them:

StringComparison.Ordinal: Compare two strings for true bitwise equality. (No change as part of this proposal.)
StringComparison.OrdinalIgnoreCase: Perform simple culture-agnostic case folding of two strings, then compare the results for true bitwise equality. (Windows currently performs simple culture-agnostic uppercase conversion instead of simple case folding; this proposal seeks to modify the Framework's behavior to match the Unicode Standard instead of Windows's behavior.)
StringComparison.<AnythingElse>: Perform full normalization (and case folding if appropriate) of the input strings using the rules of the specified culture, then compare the results for true bitwise equality. These extra pre-processing steps are why "ß" and "ss" compare as equal under StringComparison.InvariantCulture, for instance, even though they're of different string lengths. (No change as part of this proposal.)
ToUpperInvariant / ToLowerInvariant: Perform simple case mapping of the input string to uppercase or lowercase, ignoring any culture-specific rules. (Windows currently omits entries from the tables provided by the Unicode Standard; this proposal seeks to modify the Framework's behavior to match the Unicode Standard instead of Windows's behavior.)
ToUpper / ToLower for any other culture: Perform simple case mapping of the input string using the rules of the specified culture. (No change as part of this proposal.)

nguerrera commented 5 years ago

Am I remembering correctly that matching Windows behavior was sold as beneficial for OrdinalIgnoreCase to match how the OS compares file names and other symbols like user names? Is that still relevant if so?

jkotas commented 5 years ago

compare strings for ordinal case-insensitive equality (such as via StringComparer.OrdinalIgnoreCase). The current implementation of this API is to call ToUpperInvariant on each string, then compare the resulting uppercase strings for bitwise equality.

Where is this code? I am not able to find it.

If GlobalizationMode.Invariant is specified, the behavior will be the same as it is today, where non-ASCII characters remain unchanged.

I think we should change GlobalizationMode.Invariant to match too. I believe that it would make GlobalizationMode.Invariant to be so much more usable.

I cannot think offhand of any components which do this

Reflection: Part of reflection is implemented in unmanaged runtime. The globalization in the unmanaged runtime is not 100% in sync with the managed BCL, and thus we have mismatch for some corner cases already. This can be fixed by rewriting the relevant parts of the reflection in C#.
Windows file system: How is the case-insensitive comparison done by Windows file system? Are there going to be situations where the mismatch would introduce "interesting" problems?

To System.Private.CoreLib.dll? Around 3 KB to carry the data.

Note that we carry this data in Unix PAL already: https://github.com/dotnet/coreclr/blob/master/src/pal/src/locale/unicodedata.cpp. There is opportunity to dedup this (on Unix at least).

GrabYourPitchforks commented 5 years ago

Am I remembering correctly that matching Windows behavior was sold as beneficial for OrdinalIgnoreCase to match how the OS compares file names and other symbols like user names? Is that still relevant if so?

@nguerrera Somewhat, but this is becoming a relic of a time where everything takes place on a single machine. Consider what I mentioned earlier where the Windows operating system updates their internal casing information with each release. In today's world your application is almost certainly talking to a machine on a network somewhere, and it's probably the case that the machine on the other end of the line is running a different operating system from you. It's almost certainly possible to craft two strings a and b where your machine and the remote machine you're talking to disagree on whether a and b are equal under an OrdinalIgnoreCase comparer.

Where is this code? I am not able to find it.

@jkotas It's implemented by the Win32 function CompareStringOrdinal.

How is the case-insensitive comparison done by Windows file system? Are there going to be situations where the mismatch would introduce "interesting" problems?

Depends on the file system. From discussions with @JeremyKuhne, NTFS can operate in a case-sensitive fashion, so it's already possible today to have two files whose names differ only in case. I'll dig up further information.

jkotas commented 5 years ago

The current implementation of this API is to call ToUpperInvariant on each string, then compare the resulting uppercase strings for bitwise equality.

It's implemented by the Win32 function CompareStringOrdinal.

This is not how Win32 function CompareStringOrdinal is implemented internally. To avoid confusion, it would be better to just say that this just p/Invokes Win32 on Windows and ICU elsewhere.

this is becoming a relic of a time where everything takes place on a single machine.

There are still many scenarios that take place on a single machine. For example, one can have a local disk cache - that is a perfectly reasonable thing to do even with cloud - and the implementation of the cache may depend on the CompareStringOrdinal to be in sync with the local NTFS file system for security.

GrabYourPitchforks commented 5 years ago

and the implementation of the cache may depend on the CompareStringOrdinal to be in sync with the local NTFS file system for security.

This is the thing that I don't think is correct. Say that an adversary can choose the filenames to be stored on the file system or the values of the primary keys to be stored in a database. I happen to know that you're running OS version vCurrent, but OS version vNext is releasing imminently. So I craft two strings which under vCurrent compare as not equal under a case-insensitive comparison, you happily commit them to storage, then you update to vNext and the strings suddenly start colliding. This is already a potential point of failure in today's world, even without the proposal here.

GrabYourPitchforks commented 5 years ago

BTW "no" is also an appropriate response to this proposal. :)

Part of this proposal was to speed up OrdinalIgnoreCase processing for UTF-8 data. There are other ways to do that. One crazy idea is to query the underlying OS and generate our own table the first time a UTF-8 OrdinalIgnoreCase equality comparison is requested. It's a bit unorthodox, so we'd have to show significant perf improvement before committing to such a plan, but at least the behavior would be identical to the underlying OS.

jkotas commented 5 years ago

This is already a potential point of failure in today's world, even without the proposal here.

Just trying to come up with a hypothetical worst-case example. If the on-disk cache was flushed on OS updates, this would work fine today.

tarekgh commented 5 years ago

Am I remembering correctly that matching Windows behavior was sold as beneficial for OrdinalIgnoreCase to match how the OS compares file names and other symbols like user names? Is that still relevant if so?

Usually the file systems snaps their own casing table to be isolated from the running system to ensure if you moved the disk from one system and mounted to other system will continue work regardless of the OS version. For User names, I am not sure be consistent with the OS will mitigate this issue. Imagine today someone created a User Name and then upgraded Windows with more casing support. How this will be different? Levi is the security guy here to advice :-)

Just trying to come up with a hypothetical worst-case example. If the on-disk cache was flushed on OS updates, this would work fine today.

Does the OS will use its own casing to flush the cache created by .NET?

GrabYourPitchforks commented 5 years ago

From a security perspective, as much as everybody hates to admit it, the absolute safest thing to do is to ensure that unique identifiers (user names, file names sans extension, etc.) consist only of the characters [A-Za-z0-9-_]. If this is impractical, the next safest thing is to ensure that unique identifiers consist only of known and assigned code points restricted to certain ranges. For example, servers written with security in mind should block username registration if the username contains any of the following code points.

U+30000, because as of this writing it is not assigned.
U+FFFE, because it is a non-character.
U+007F, because it is a C1 control code.
U+E000 because it is intended for private use.

The rules above are demonstrative and not intended to be exhaustive.

Due to the Unicode stability policy, since a code point's case folding is permanently locked at the time of its introduction to the standard, blocking code points such as those listed above would be sufficient to prevent introducing data which might not collide now but which has potential to collide in the future.

For applications which use ToUpperInvariant instead of simple case folding, blocking code points such as those listed above - and making sure that ToUpperInvariant is re-run over the existing persisted data if the OS ever changes - should be sufficient to prevent introduction of such data.

iSazonov commented 5 years ago

Moving to ICU is great plan! As PowerShell Core repo maintainer I am thinking about remoting scenarios. Today users can connect from any OS to any OS (Windows, Linux, MacOs). If PowerShell will be based on .Net Core 5.0 and ICU on all OSes all remote scenarios will be "consistent by design"! It is very important because PowerShell is for management that assumes sometimes manipulating sensitive data.

I guess not all projects will be ready to move to ICU so it would be nice to have opt-in on early preview and opt-out later to switch back to NLS. It is important for testing too. In PowerShell repo we could run tests for ICU and then for NLS to ensure that they pass both and our users is not break.

I am interesting in using SCF in PowerShell Core for performance reasons (possible scenarios: identifier and file names comparisons, log searching...). Preserving string lengths and using table-based mapping allow us to achieve performance comparable to Ordinal. I already made SCF implementation in CoreFXlab repo. I started with 3-level table mapping as @tarekgh adviced. I found this very slow and switch to 2-level table with good results (near Ordinal!). Of course, the size of the table is no longer 3 Kb but more, but if performance is important, we could accept this.

/cc @joeyaiello @SteveL-MSFT @daxian-dbw @mklement0 for information.

danmoseley commented 5 years ago

Cc @mjsabby

mjsabby commented 5 years ago

cc @timmydo

As long as we're converging on ICU and can have it be on Windows, that sounds good.

GrabYourPitchforks commented 5 years ago

Follow-up question to anybody interested: for the specific APIs mentioned here (OrginalIgnoreCase comparison and ToUpperInvariant / ToLowerInvariant), is it acceptable for a particular .NET Core version to be tied to a particular Unicode version, or do folks need to be able to swap out the version of ICU used under the covers?

That is, it is acceptable to say ".NET 5's implementation of these specific methods uses data from the Unicode Standard 12.1" (and later versions of the framework would be pegged to later standards)? Or is there a requirement to always use the latest available data available from the local system? There are obviously trade-offs involved, but I'd like to gauge the community's pulse on this.

tarekgh commented 5 years ago

Or is there a requirement to always use the latest available data available from the local system?

I am inclining to this option more. I am think in the near future we should fully use ICU for all needed functionality across all OS's. Also, there will be some consistency between the native and managed apps.

jkotas commented 5 years ago

I am think in the near future we should fully use ICU for all needed functionality across all OS's

What are we going to do about the Windows version that do not include ICU? I doubt that we are going to drop the support for those in near future.

tarekgh commented 5 years ago

What are we going to do about the Windows version that do not include ICU? I doubt that we are going to drop the support for those in near future.

We are talking to Windows team who are going to publish a NuGet packages for ICU. These packages can be used on Windows down level versions. Also, we are going to fallback to NLS in case we couldn't find any ICU on the system. I am going to write some doc with more details and I'll share it as soon as I have it ready.

iSazonov commented 5 years ago

Will InvariantCulture be the same as locale "C"/"POSIX" ?

tarekgh commented 5 years ago

Will InvariantCulture be the same as locale "C"/"POSIX" ?

I am not sure what you mean by be the same, in 3.0 we already map "C" locale to InvariantCulture. "C" locale behavior is not desirable at all and that is why we did this mapping.

iSazonov commented 5 years ago

@tarekgh Sorry, my question was about Windows.

tarekgh commented 5 years ago

my question was about Windows.

Yes, "C" on Windows is mapped to Invariant too (in net core 3.0).

jaykrell commented 4 years ago

For reference, "applications" that require stable mappings, carry them themselves, even recording them with the data sets.

For example, NTFS stores an upcase table on volumes when it formats them.

And btw this kinda contradicts: https://github.com/dotnet/corefx/issues/41333#issuecomment-535566914

You'd have to read the upcase table from the relevant volume.

(and indicated in https://github.com/dotnet/corefx/issues/41333#issuecomment-535601504)

mjsabby commented 4 years ago

@GrabYourPitchforks does it still make sense to make Invariant Mode OrdinalIgnoreCase not look at non-ASCII given it is in managed code? Given that we're now carrying the C# data? Should this proposal be updated?

I would love to have our application run in Invariant mode but still be able to do OrdinalIgnoreCase.

GrabYourPitchforks commented 4 years ago

@mjsabby We're not yet carrying the C# data. We're also still doing case comparisons incorrectly, even when invariant mode is not enabled, since we're always doing a toupper conversion rather than a case folding conversion. In an ideal world, this is what I'd want to happen:

We introduce a proper case folding API (tracked by https://github.com/dotnet/runtime/issues/20674)
We change string.Equals(..., OrdinalIgnoreCase) to use case folding as prescribed by Unicode (tracked by this issue)
We carry our own data when invariant globalization mode is enabled (tracked by https://github.com/dotnet/dotnet-docker/issues/1877)

tarekgh commented 4 years ago

I would to say, it is better to define ordinal case by the casing provided in UnicodeData.txt and not to change that. In the future, we can expose case folding casing/comparison APIs. The apps will have the flexibility to choose the way want the comparison to be performed. The reason here is case folding can be more complicated and sometime users want to choose to do case folding using specific culture which will need normalization at that time. We should think about the case folding from more wider angle and not constrain us to ordinal operations.

GrabYourPitchforks commented 4 years ago

The reason here is case folding can be more complicated and sometime users want to choose to do case folding using specific culture which will need normalization at that time.

You're right, but I was thinking that "OrdinalIgnoreCase" comparisons would always use simple / non-culture-aware folding. This is already the recommendation from Unicode regarding how to perform caseless matching (see the case mapping FAQ and Unicode Ch. 5, Sec. 5.18).

tarekgh commented 4 years ago

I was thinking that "OrdinalIgnoreCase" comparisons would always use simple / non-culture-aware folding

The only concern I have with that is Ordinal by name indicating 1:1 mapping/comparison. Case folding will not guarantee that. This is why I am preferring new APIs for case folding. Anyway, we'll visit this in later releases and we can come up with the best proposal together :-)

iSazonov commented 4 years ago

I'd say that following to Unicode standard is better than splitting API while we have a switch to old - this last will confuse users and open a way to have applications with mixing old and new case folding that is my main concern.

talktovishal commented 3 years ago

hi @GrabYourPitchforks ,

Do you have any recommendations for someone in this situation? In our situation, we used the ToUpperInvariant() with training a machine learning model. Now when we use the same model in a Linux env, when we run ToUpperInvariant(), we get different response for the same input character which impacts how the model behaves. Our model training and hosting envs are mixed mode, Windows in training, Linux and Windows in hosting. What can we do to port the Windows mapping to Linux? Can we use the case folding table in this case?

Example: On windows:

$a= "ς" $a.ToUpperInvariant() ς

On Linux/WSL

$a ="ς" $a.ToUpperInvariant() Σ

What we would like is the Linux call to also return the same char as Windows.

tarekgh commented 3 years ago

@talktovishal I assume you are not using .NET 5.0. right? if so, would it be good if you can migrate to .NET 5.0 which I expect you will get the exact behavior when running on Windows or Linux.

talktovishal commented 3 years ago

hi @tarekgh ,

Thanks for your response. Your observation is correct. We haven't migrated to .Net 5.0 yet. We have plans to do so.

Do you have any recommendations for the models that are already trained with .net Fx 4.*? What do we do in that case? I want the same mappings from Windows to work in Linux. How can I achieve that for the short term while we migrate and re-train our models?

tarekgh commented 3 years ago

@talktovishal I don't think there anyway guarantee to get the NETFX casing behavior when running on Linux except if you generate your own casing table from Windows and then manually doing the casing operation using the generated casing table. if you want, I can try to provide some code doing that if this is feasible option to you and how urgent you'll need that?

talktovishal commented 3 years ago

@tarekgh if you can provide sample code, that would be great. We have a deadline in the end of next week and hence we are trying to figure out what are our options. Really appreciate your help and prompt response.

tarekgh commented 3 years ago

@talktovishal I'll try to provide you the code in next couple of days.

tarekgh commented 3 years ago

@talktovishal please try the code https://gist.github.com/tarekgh/de1a1b29b03d048ad580c0a338c44fbb and let me know if you have any question. Note, I did some quick testing on it so I hope you can do some more testing. The behavior should match the full framework behavior.

usage would be something like

    string uppercasedString = "abcd".NetFxToUpperInvariant();
    char uppercasedChar = 'a'.NetFxToUpperInvariant();

talktovishal commented 3 years ago

Wow, this is great @tarekgh ! Can you please give a summary of the approach? I will def do some testing at my end.

tarekgh commented 3 years ago

I just run string.ToUpperInvariant on all Unicode characters range to collect all cased characters and then from this list I created the 8-4-4 table (for the sake optimizing the size of the table) which can be used for the lookup. if you want to see how to generate such table you may look at https://gist.github.com/tarekgh/55dfaf0f44689738c3a6ca67941ccdc2#file-casefolding-cs-L87.

aaronpburke commented 3 years ago

@talktovishal please try the code https://gist.github.com/tarekgh/de1a1b29b03d048ad580c0a338c44fbb and let me know if you have any question. Note, I did some quick testing on it so I hope you can do some more testing. The behavior should match the full framework behavior.

usage would be something like
    string uppercasedString = "abcd".NetFxToUpperInvariant();
    char uppercasedChar = 'a'.NetFxToUpperInvariant();

FYI, I tested your code and it appears to provide consistent results across:

Windows with .NET Framework 4.7.2
Windows with .NET Core 3.1
Linux with .NET Core 3.1

I ran both your provided .NetFxToUpperInvariant() method and the built-in ToUpperInvariant() across all (non-control) Unicode code points in each environment, and as expected, the only outlier is the built-in ToUpperInvariant() on .NET Core 3.1 on Linux, which contains 307 character-level differences in its results. For those interested, you can find the affected unicode code points in this gist: https://gist.github.com/aaronpburke/65d615a91adc181eb5b3dfa258c96570

StachuDotNet commented 2 years ago

I tried the gist in .NET 6 and .NET 3.1 and found that strings "և" and "ﬁﬂ" did not convert to upper case properly (I was expecting "ԵՒ" and "FIFL", respectively). Here's a repo that replicates the issue. (shows .NET 6. can provide .NET 3.1 repo if relevant)

Should I expect this to work?

cc @tarekgh

tarekgh commented 2 years ago

@StachuDotNet the gist you pointed at was created to mimic the behavior of the casing when running on the .NET Framework and not to be used for .NET 5.0 or 6.0. The result given by the gist code is exactly the result when using ToUpperInvariant on the .NET Framework. So, what you are getting is the expected result.

Also, the file Unicode standard data UnicodeData.txt is not showing these characters have casing form.

FB01;LATIN SMALL LIGATURE FI;Ll;0;L;<compat> 0066 0069;;;;N;;;;;
FB02;LATIN SMALL LIGATURE FL;Ll;0;L;<compat> 0066 006C;;;;N;;;;;
0587;ARMENIAN SMALL LIGATURE ECH YIWN;Ll;0;L;<compat> 0565 0582;;;;N;;;;;

Please note, casing operation is one-to-one mapping in the .NET.

znakeeye commented 5 days ago

Here are some concrete examples demonstrating the problems:

// 'ß' is U+00DF LATIN SMALL LETTER SHARP S // 'ẞ' is U+1E9E LATIN CAPITAL LETTER SHARP S

string toUpper = "ß".ToUpperInvariant(); // returns "ß" on all OSes string toLower = "ẞ".ToLowerInvariant(); // returns "ẞ" on Windows, otherwise "ß"

On my system, toLower is not "ẞ". Running Windows 11 23H2, .NET 8.

Sample code

using System.Runtime.CompilerServices;

Console.OutputEncoding = System.Text.Encoding.UTF8;

const string small = "ß";
const string capital = "ẞ";

string toUpper = small.ToUpperInvariant();
string toLower = capital.ToLowerInvariant();

Print(small);
Print(toUpper);

Print(capital);
Print(toLower);

static void Print(string s, [CallerArgumentExpression(nameof(s))] string? paramName = null)
{
    Console.WriteLine($"{paramName,-10}: {s} ({(int)s.First()})");
}

Output:

small     : ß (223)
toUpper   : ß (223)
capital   : ẞ (7838)
toLower   : ß (223)

Can somebody explain this?

tarekgh commented 5 days ago

@znakeeye

Look at the Unicode Data file you'll see following two entries:

00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;       <---- ß

1E9E;LATIN CAPITAL LETTER SHARP S;Lu;0;L;;;;;N;;;;00DF; <---- ẞ

You will see the small letter U+00DF ß doesn't define upper casing mapping. But capital U+1E9E ẞ has lower case mapping to U+00DF ß. The section Development of a capital form has useful information. Feel free to complain to Unicode if you think the behavior is not right.

znakeeye commented 5 days ago

Interesting. Thanks!

I was mostly concerned about my results not matching the results described in the original post. Did they change the behavior some time after 2019?

dotnet / runtime