Improving the developer experience with regard to default string globalization

GrabYourPitchforks commented 3 years ago

Improving the developer experience with regard to default string globalization

Note: This is a companion document to https://github.com/dotnet/docs/issues/21249, which discusses the behaviors as they currently exist. It's assumed the reader is generally familiar with that document.

Summary

Since .NET 5 standardized on using ICU instead of NLS for globalization across all supported platforms (see breaking change notice), we've received a few reports of string and globalization APIs not behaving as expected.

These reports generally fall into one of two buckets:

The developer wasn't intending to make their code globalization-aware, and the switch to ICU exposed an unintentional dependency in the developer's code which led to an unwanted behavioral change. See: https://github.com/dotnet/runtime/issues/43736, https://github.com/dotnet/runtime/issues/42234, https://github.com/dotnet/runtime/issues/40922, https://github.com/dotnet/runtime/issues/40258, https://github.com/dotnet/runtime/issues/36177, https://github.com/dotnet/runtime/issues/33997, https://github.com/dotnet/runtime/issues/43802, https://github.com/dotnet/runtime/issues/36891, https://github.com/dotnet/runtime/issues/43772, https://github.com/dotnet/runtime/issues/44687, https://github.com/dotnet/runtime/issues/45912, https://github.com/dotnet/runtime/discussions/46265, https://github.com/dotnet/runtime/issues/46569, https://github.com/dotnet/runtime/issues/47077, https://github.com/dotnet/runtime/issues/59120, https://github.com/dotnet/runtime/issues/59661.
The developer intended to use a globalization feature, and the switch from NLS to ICU introduced an unexpected behavior. See: https://github.com/dotnet/runtime/issues/43795, https://github.com/dotnet/runtime/issues/39523.

For scenarios which fall into the second bucket, the runtime offers a compat switch to restore the old behavior. The remainder of this document will focus on the first bucket. This bucket is where most of the reports seem to fall.

To address these, we plan a three-pronged approach: improve documentation in this area, audit existing tutorials and code samples, and change the new project experience to reduce the "pit of failure" surface for .NET developers. We are soliciting the community's feedback on all of these. Please use this issue to discuss.

Improving documentation regarding string APIs

There are currently breaking change and compatibility noticed posted at the following locations:

We are additionally tracking through https://github.com/dotnet/docs/issues/21249 improvements to the string docs all-up, including recommendations for which Roslyn analyzer rules to enable and updating the string docs to include a table of the default globalization mode for each of the APIs.

This work alone won't make the experience better, but it should help make information more complete and accessible to developers who are searching for it. This does not solve the problem of "How would somebody know to seek out this information in the first place?" - later sections of this proposal should address those.

Reviewing samples and tutorials for proper string handling patterns

We should review samples and tutorials to ensure they're not ingraining incorrect code patterns in our audience's minds. This is potentially a very large undertaking due to .NET samples being scattered across many different sites, some of which haven't been updated in over a decade.

At the very least, the samples that accompany the API documentation should be clarified so that they avoid performing linguistic operations when ordinal operations were likely more appropriate. A not at all exhaustive list is provided below.

System.String docs (shows using linguistic IndexOf overload)
Search strings (encouraging linguistic StartsWith call)
Deferred execution example (LINQ to XML) (shows culture-aware ToUpper)

A simple search for these patterns will likely produce many false positives. We also shouldn't assume that every such instance of a culture-aware comparison is incorrect. More on this later.

Dev experience changes to reduce the "pit of failure"

The document https://github.com/dotnet/docs/issues/21249 suggests that developers manually enable the Roslyn analyzer rules CA1307 and CA1309 in their code bases. That rule will flag calls to string.IndexOf(string) and other culture-aware APIs, requesting that the developer explicitly pass StringComparison.CurrentCulture to indicate "yes, I really did intend for this to be culture-aware."

This helps, but it requires an active gesture on the part of the developer. Ideally we would instead alert developers to potential problems (or even fix these problems automatically!) without requiring the developer to have first sought help.

There are some various options we can take here, each with their own pros and cons. I'll describe some potential paths in a section below.

A bit of historical context

When .NET Framework was introduced two decades ago (!!!), the killer app was creating rich UI-based applications. .NET Framework introduced WinForms as the successor to VB6's rapid application development model. It also introduced WebForms as a way to create web-based GUIs with similar fidelity to native WinForms apps. End users interface directly with these app models, which led to rich localization and globalization support being weaved throughout these app models from a very early stage.

Part of this early work involved ensuring that string instances could unambiguously hold data in any supported language. Historically this had been accomplished by storing the string as a sequence of 8-bit C-style chars (LPSTR), leaving their intepretation up to the active Windows code pages. .NET uses UTF-16 for its string representation, removing the reliance on code pages.

At the same time, since user interaction was such a crucial component of early .NET applications, it was important that applications behave according to the user's expectations. This is especially pronounced in applications that perform searching and collation, such as a personnel system which lists all employees' names alphabetically. The end user might expect ordering to be performed according to the conventions of U.S. English, or of Hungarian, or of Turkish, or of another language, depending on how they've configured their system. (The rules for performing Hungarian or Turkish collation are non-trivial.)

To support these scenarios, the .NET Framework APIs which search for one substring within another string or which compare two strings use the thread's current culture (StringComparison.CurrentCulture) by default. This includes APIs like string.IndexOf(string), string.CompareTo(string), and similar. Contrarily, .NET Framework APIs which search for individual chars within a string use ordinal (StringComparison.Ordinal) searching by default. This includes APIs like string.IndexOf(char) and string.StartsWith(char).

string.Contains is the exception to this rule. It was introduced in .NET Framework 2.0 - after the other string APIs - and does not follow the same convention. For string.Contains, both the string-based and the char-based overloads use ordinal behavior by default.

An important aspect of globalized behavior is that it's not stable across platforms. Language itself is fluid, and conventions change. The globalization data that ships with the operating system encompasses not just language conventions, but also geopolitical concerns such as the default currency symbol, and the OS regularly updates this data. While these updates are not intended to be breaking, they make no guarantee of behavioral compatibility.

This globalized-by-default behavior might be appropriate for UI-based applications where an end user is interacting directly with the app, it's often not appropriate for all other scenarios. Web and backend services usually need to process data in a manner that remains consistent across runs and is not influenced by any linguistic conventions. Command-line tools similarly should usually exhibit consistent behavior regardless of the language of the user who launched the tool. Even within a GUI app running on a user's local machine, any underlying business logic should usually run uninfluenced by the user's culture settings.

Now that .NET has adopted Span<T> as a first-class citizen (and ReadOnlySpan<char> as the convention for a cheap string slice), there are also consistency issues to deal with. All Span<T>-based extension methods (including extension methods that operate on ReadOnlySpan<char>) are ordinal by default, unless an explicit StringComparison has been provided. As developers begin using span-based code more frequently, the risk of mixing and matching linguistic and non-linguistic operations on the same text increases.

string str = GetString();
bool b1 = str.StartsWith("Hello"); // uses 'CurrentCulture' by default

ReadOnlySpan<char> span = str.AsSpan();
bool b2 = span.StartsWith("Hello"); // uses 'Ordinal' by default

This mismatch of expectations could cause developers to introduce latent bugs into their code bases.

Possible paths forward

Option A: Enable Roslyn analyzer warnings by default

As mentioned earlier, the Roslyn analyzer rules CA1307 and CA1309 are intended to alert developers when they're invoking an string-based API that uses linguistic behavior by default. We can go further and enable these rules by default in applications targeting .NET 6+, producing compiler warnings when these patterns are observed. We can also mark APIs like string.IndexOf(string) as [EditorBrowsable(Never)], effectively hiding them from Intellisense and guiding developers toward the StringComparison-consuming overloads.

The developer would see the warnings both within the Visual Studio IDE and on the console during compilation.

string str = GetString();
if (str.StartsWith("Hello")) // This line produces warning CA1307.
{
    /* do something */
}

if (str.StartsWith("Hello", StringComparison.CurrentCulture)) // Explicit comparison specified, no warning produced.
{
    /* do something */
}

Pros: The developer is alerted to the problem early, potentially before they even observe the problem in production.

Cons: This may introduce noise in code bases where the developer truly did intend to call globalization-aware APIs, including within enterprise code bases which have been brought forward across several .NET Framework versions. It also risks introducing a very steep learning curve for new .NET developers, who are now confronted with globalization-related issues while still within the first few minutes of writing their first "Hello, world!" application.

Alternative proposal: Enable these rules in all application types except WinForms and WPF. This assumes that calls to methods like string.IndexOf(string) where the user intended the default globalization behavior are very rare outside of WinForms and WPF projects.

Option B: Provide a compatibilty switch to change the string defaults

Under this proposal, we provide a switch which forces all string APIs to default to Ordinal unless an explicit StringComparison has been provided. This encompasses string.IndexOf(string), string.Compare, and similar APIs. Globalization-specific APIs like System.Globalization.CompareInfo would be unaffected by this switch.

This switch would be application-wide, just like the existing globalization switches. There would be no facility for individual libraries to control this behavior. Library developers would still need to call APIs which take a StringComparison parameter if they want a strong guarantee on what behavior they'll get. (Library devs may want to enable the Roslyn analyzer rules to help flag non-compliant call sites.)

This switch would default to Ordinal and would be distinct from the UseNls switch that .NET 5 already provides. The two switches could be toggled independently. Defaulting string APIs to Ordinal matches how strings behave in other languages like C/C++, Java, Python, Rust, and others. Interestingly, Silverlight 2 and 3 also shipped with "string defaults to Ordinal" behavior, but this was later reverted with Silverlight 4. This switch would also mean that string.ToUpper and string.ToLower become equivalent to string.ToUpperInvariant and string.ToLowerInvariant.

Underlying this proposal is an assumption that stringy operations should be ordinal unless the call site explicitly requests otherwise. This makes writing globalization-friendly code a deliberate action rather than an automatic behavior. WinForms UI controls like list boxes could still behave in a manner appropriate for their own scenarios.

Pros: Provides uniformity across the API surface. Also provides significant performance increases since ordinal operations are considerably faster than linguistic operations.

Cons: This could be a substantial breaking change, especially for large applications which can't audit every line of code within third-party dependencies. It also deviates from documented defaults. This could cause confusion if somebody is following an older tutorial or if somebody really did intend to invoke a linguistic operation. We'd also need to figure out how it affects globalization-aware APIs like int.ToString and int.Parse.

Option C: Shenanigans with reference assemblies

This is akin to Option B above but is intended to be less breaking to the .NET ecosystem. Here, we introduce no globalization switch, and we don't change any existing runtime behavior. Instead, we make two changes to .NET 6's reference assemblies.

Remove string API overloads that don't take StringComparison.
Change existing string API overloads which take StringComparison to default these parameters to Ordinal.

Consider overloads of string.StartsWith. Here is how the overloads currently appear in the reference assemblies and how they would appear after this proposal.

//
// .NET 5 reference assemblies
//
public sealed class string
{
    public bool StartsWith(char value);
    public bool StartsWith(string value);
    public bool StartsWith(string value, bool ignoreCase, CultureInfo? culture);
    public bool StartsWith(string value, StringComparison comparisonType);
}

//
// .NET 6 proposed reference assemblies
//
public sealed class string
{
    public bool StartsWith(char value);
    // public bool StartsWith(string value); // (REMOVED)
    public bool StartsWith(string value, bool ignoreCase); // (ADDED, to accelerate OrdinalIgnoreCase scenarios)
    public bool StartsWith(string value, bool ignoreCase, CultureInfo? culture);
    public bool StartsWith(string value, StringComparison comparisonType = StringComparison.Ordinal); // default value added
}

The end effect of this is that if a call site reads someString.StartsWith("Hello"), the .NET 6 compiler will no longer bind the call site to string.StartsWith(string). It will instead be bound to string.StartsWith(string, StringComparison) with the value Ordinal burned in at the call site. Existing assemblies which were compiled against .NET 5 or earlier will continue to call the original method, which still exists within the runtime and still has its old behavior.

Pros: Provides uniformity across the API surface, while retaining behavioral compatibility for assemblies which don't target .NET 6.

Cons: This feels like an abuse of the reference assembly system. It also means that if you're inspecting code, you need to know its target framework (by cracking open the .csproj!) to deduce what the runtime behavior will be. There may also be issues with dynamic compilation and other scenarios where the runtime assemblies are used directly instead of using reference assemblies.

Option D: Revert back to NLS by default when running on Windows

We flip the switches so that ICU is no longer the default globalization stack when .NET apps run on Windows. This does not back out the "ICU everywhere" feature; apps running on Windows can still opt-in to using ICU if desired.

This needn't be exclusive of other options. For example, this can be undertaken jointly with obsoleting APIs which are culture-aware by default. The goal of this proposal is to act as a compat shim rather than to address any latent bugs which might exist in today's callers.

Pros: .NET Framework and .NET Core applications which were built and tested on Windows will continue to work the same way when ported to .NET on Windows.

Cons: Like .NET Core, .NET applications will behave differently across different OSes. Without compile-time alerts, it does not prevent new incorrect call sites from being introduced into the wider .NET ecosystem.

Option E: Do nothing

We take no proactive measures regarding the developer experience. All of our efforts are focused solely on documentation, samples, and similar developer education. Basically, leave the world as it exists today in .NET 5.

Pros: We understand the world as it exists today. Once developers observe a misbehavior in their applications, they can consult our documentation or third-party channels like StackOverflow to self-assist.

Cons: It leaves the "pit of failure" fairly wide and relies on developers to experience a problem before seeking assistance. This potentially leads to the continued introduction of fragile code into the wider .NET ecosystem.

Follow-up work

If we could answer the following questions, that might help inform our decision on what path to take. This issue does not propose a way to discover the answers to these questions.

How often are developers writing UI-layer code vs. business logic or other non-UI code? How can we detect this layering even within a single project?
What percentage of calls to string.IndexOf(string) would in practice return different results if we were to flip the default from CurrentCulture to Ordinal?
Do we need to address APIs like int.Parse at the same time? Example: Does a proposed 'ordinal by default' switch also mean that int.Parse and decimal.ToString are invariant by default?
What other options are missing from the above list?

ericsampson commented 3 years ago

Not sure if this is identical to option D, but to start with:

Make ICU opt-in instead of opt-out for .NET 5. Use the next year to educate people, allowing them time to fix their stuff. The can opt-in to the new behavior in order to fix their applications on their own schedule. In combination with some of the above proposals. Make ICU opt-out for .NET 6 (if that's even going to be an option, I guess). (edit - I think this is basically option D.)

huoyaoyuan commented 3 years ago

Related: Parsing a number under invariant culture info is a bit slower and more complex in code. (querying NumberFormatInfo) This is highly unexpected, as converting between numbers and strings are likely a hot path in any type of application.

There must be a path for number that doesn't touch globalization totally. NumberFormatInfo.Invariantainfo is still costful.

ericsampson commented 3 years ago

The challenge I see with A,B,C is that different people will have equally valid opinions on what the default behavior should be when the methods that don't take StringComparison are used. One of the camps is going to be unhappy, either way.

For option B, it isn't 100% clear to me - are you saying that the default OOTB setting will be Ordinal (ie change in behavior) and that people who want current behavior will have to opt-out? Or that people would have to opt-in. For something like Option B, if it's opt-in how will this be discoverable for the average Jane/Joe in order to get good % coverage of people who migrate.

For some reason, I kind of like the idea of Option B over Option C. In my mind it's basically like saying "here's a switch that controls the default value of StringComparison if it's unspecified, choose what makes sense for your project type".

aolszowka commented 3 years ago

I strongly preference Option A.

To address these concerns:

Con: This may introduce noise in code bases where the developer truly did intend to call globalization-aware APIs, including within enterprise code bases which have been brought forward across several .NET Framework versions.

If the intent was truly to call globalization-aware API's I don't think that ask is too steep to explicitly declare your intention, even when pulling code forward.

We have made gigantic leaps forward previously with breaking API in other projects. One such example that comes to mind is when NUnit3 broke several NUnit2 behaviors. We used the excellent work from @wachulski and this Roslyn Analyzer/CodeFix Project https://github.com/nunit/nunit.analyzers. While the transition was not pleasant it was made possible through the hard work of the open source community and Roslyn's rewrite abilities.

I cannot stress how critical it is to have the code fix offer to correct this, this turns an impossible task into a much more palatable one.

Con: It also risks introducing a very steep learning curve for new .NET developers, who are now confronted with globalization-related issues while still within the first few minutes of writing their first "Hello, world!" application.

While I agree that it is not pleasant to encounter this out of the gate, this might be a great learning opportunity for developers on additional considerations when programming cross platform. There are plenty of practices that are sub-optimal when learning to program in a new language, throwing an additional warning at that time I think is a "good thing" as it enforces "good" coding habits out of the gate.

One example of such practice that comes to mind is that you can find hundreds of thousands of examples out on the internet of people performing ad-hoc Sql using SqlCommand(string) performing string concatenation as opposed to using any of the widely available SQL Injection Sanitation libraries or SqlPreparedStatements. This problem is not unique to C# (in fact I would argue it is much more prevalent on the PHP side). This doesn't mean that out of the box the Compiler/IDE/Analyzer set shouldn't alert you to this practice being considered bad, it is just another place where a learning experience can occur. The onerous is then put forth on the developer to make a decision on if they will change their code or if they will simply mark it as ignored.

I would say that this:

We can also mark APIs like string.IndexOf(string) as [EditorBrowsable(Never)], effectively hiding them from Intellisense and guiding developers toward the StringComparison-consuming overloads.

Might be a bridge too far, however I am willing to accept it for now, this is ideal because copy and paste programmers from something like StackOverflow will continue to work, albeit throw a warning that can then be addressed.

GrabYourPitchforks commented 3 years ago

Parsing a number under invariant culture info is a bit slower and more complex in code.

Presumably if we had an API whose behavior was known to be invariant (see Utf8Parser for an example), we'd hard-code knowledge of the expected syntax and wouldn't consult globalization tables at all. For example, we wouldn't need to look up the invariant culture's negative number sign or thousands separator. We'd just hard-code the literals '-' and ','.

huoyaoyuan commented 3 years ago

Another concern: how about other entries that cannot pass the choice, namely interfaces IComparable<T>? It seems that current IComparable<T> implementation of string defaults to culture aware path. That means using string for key in ordered collection, changing the thread's culture later will break the collection. In this case, changing the default behavior is actually a "fix".

Generally speaking, string globalization problem has a very wide impact, and very long history. It worth a specially designed mechanism.

Dweeberly commented 3 years ago

There is no happy path here. The root of the problem is that developers want to think of a string as a char[]. Despite underlying implementation details those aren't the same. In fact, even a char isn't always a character (since some might require four bytes to represent). Strings are really an array of graphemes, but perhaps that ship has sailed. If you have a single grapheme then a method like ToUpper() makes sense. If all you have is a UTF16 (char) then ToUpper() might make sense, or not. Same with IndexOf(), etc. Strings should probably be defined along the lines of String<"en-US", UTF16, CompareOptions.None> (Options F?), but I'm guessing that ship has also sailed. If the string carried along the necessary information then all string functions could deal in a language appropriate way and if you wanted ordinal behavior you would cast or copy the data into an array.

Sadly, "you don't always get what you want ...", or so the Rolling Stones tell me. I'd prefer consistence and the move to ICU has already introduced breaking changes. For that reason I would choose Option B. Let everything be ordinal unless explicit parameters indicate otherwise. Make it a global option, and generate warning when possible indicating possible bad assumptions in the original code. As noted this would also make it easier for people moving from other languages to C#.

En3Tho commented 3 years ago

Can you make every method in String/char Span class default to 1 single mode which then could be switched by runtime config?

Like for UI people will set/or maybe projects will be created with this setting by default { "DefaultStringBehaviour": CurrentCulture } and for backend there will be { "DefaultStringBehaviour": Ordinal }? Setting will exist in appsettings or runtimeconfig, could be set from IDE.

Then explicitly set behaviour when calling methods will always be preserved, but default could be changed by users depending on their needs.

Furthermore A Roslyn Analyzer could read appsettings and give warnings based on current default behavior mode if it sees that comparison won't ever work in selected mode (for const strings for example)

KalleOlaviNiemitalo commented 3 years ago

Can you make every method in String/char Span class default to 1 single mode which then could be switched by runtime config?

Would "every method" include Equals(String) and op_Equality(String, String), so that the runtime config would be able to make them use the CurrentCulture mode? If so, I think developers would expect C# switch on String to do the same, but the C# compiler would then have to generate two versions of each such statement. The JIT compiler could delete one of them, though.

lupestro commented 3 years ago

My biggest concern is that str.contains(something) and -1 != str.indexOf(something) always produce the same result, even if that result changes. Making those behave differently is likely to break more programs more dramatically than a subtle behavior change in what they both match.

stephentoub commented 3 years ago

My biggest concern is that str.contains(something) and -1 != str.indexOf(something) always produce the same result, even if that result changes. Making those behave differently is likely to break more programs more dramatically than a subtle behavior change in what they both match

Just to make sure it's clear, this is already not the case; they already may behave differently (which is part of the problem).

lupestro commented 3 years ago

"Just to make sure it's clear, this is already not the case; they already may behave differently (which is part of the problem)."

Ah, that's useful information!

Under what circumstances (any concrete example will do) would they behave differently from one another before .NET Core 3.1 and the changes to ICU that it carried with it (at least on some platforms)?
If the default used for indexOf(mySubstring) is a better choice, why isn't this same default equally appropriate for contains(mySubstring), which performs no more than a portion of the same operation?

For ensuing discussion, it would help to understand this.

aolszowka commented 3 years ago

@lupestro

Under what circumstances (any concrete example will do) would they behave differently from one another before .NET Core 3.1 and the changes to ICU that it carried with it (at least on some platforms)?

Several of them are linked in the above document in the first few lines of the issue:

The developer wasn't intending to make their code globalization-aware, and the switch to ICU exposed an unintentional dependency in the developer's code which led to an unwanted behavioral change. See: #43736, #42234, #40922, #40258, #36177, #33997, #43802, #36891, #43772.

I can speak to this one specifically since I reported it: https://github.com/dotnet/runtime/issues/43802

Based on comments and my understanding when the switch to ICU occurs the behavior that is currently experienced in Linux and MacOS will then be experienced in Windows. The jury is out on what the expected behavior was. I can say for that specific example our use case was an internal tool designed to sort Solution Files in a deterministic manner.

After upgrading to .NET 5 on Windows (where all of our developers are) this would result in several hundred solution files now sorting in a different deterministic manner, causing several developers to question why there are changes in portions of the monolithic code base they did not touch.

TODAY they would have encountered this as well, IF they had been developing on alternative platforms (Linux or MacOS), however we've been getting away "for free" for a long time due to our heavy investments in Windows.

stephentoub commented 3 years ago

Under what circumstances (any concrete example will do) would they behave differently from one another before .NET Core 3.1 and the changes to ICU that it carried with it (at least on some platforms)?

using System;
using System.Globalization;

class Program
{
    static void Main()
    {
        CultureInfo.CurrentCulture = new CultureInfo("de-DE");
        Console.WriteLine("ss".IndexOf("ß") != -1); // culture-sensitive
        Console.WriteLine("ss".Contains("ß")); // culture-insensitive
    }
}

If the default used for indexOf(mySubstring) is a better choice, why isn't this same default equally appropriate for contains(mySubstring), which performs no more than a portion of the same operation?

From my perspective with 20/20 hindsight, the inconsistency is a mistake of history; Contains(string) was introduced after IndexOf(string), and I don't know whether it using ordinal instead of being culture-aware as is IndexOf(string)/StartsWith(string)/EndsWith(string) was a deliberate difference (e.g. "We don't like that IndexOf(string) is culture-aware, so let's make new things we add be ordinal instead") or an accident.

ericsampson commented 3 years ago

@GrabYourPitchforks , to clarify option B: are you saying that the default OOTB value for this proposed setting would be Ordinal (ie change in behavior) and that people who want current behavior will have to opt-out? It would be good to clarify this in the original text, because I think I know the answer based on the phrasing but I'm not sure. Thanks again for the energy you've put into these proposals! And improving the docs/messaging/visibility.

GrabYourPitchforks commented 3 years ago

@ericsampson The assumption is Ordinal by default. The text implies this but doesn't state it outright. I'll clarify in the next rev.

ericsampson commented 3 years ago

I for one like the sound of Option B, if the decision was made that it's time to do something significant to make things better in the long run, since this ICU change is already going to break some eggs :)

Levi, if that was the chosen path, would you envision two separate switches like this (naming/polarity TBD) having defaults as shown:

System.Globalization.UseNls = false;
System.Globalization.OrdinalStringComparison = true;

GrabYourPitchforks commented 3 years ago

@ericsampson I've clarified the Option B proposal above. Your assumption re: the default behaviors is correct.

lupestro commented 3 years ago

I think I would prefer option B coupled with a correction to the historical inconsistency around contains() - its behavior should track the behavior of indexOf, regardless of which behavior is chosen.

pyrocumulus commented 3 years ago

I do not have a clear preference regarding the options yet, but I can look at it from an end user perspective. I do think that the direction should definitely be towards making the behaviour of the default overloads of Contains() and IndexOf() more alike and not further apart. Yes there has always been an inconsistency, but I fear that increasing this inconsistency will lead to a less developer friendly environment. In an ideal scenario, the calls to IndexOf() should always return higher than -1, when Contains() returns true.

I'm not familiar with how other languages deal with this, but that's just common sense talking. Especially beginning developers will likely check string content through Contains() before retrieving the subsequent index of found string. Obviously this is not necessary if you just check for the value -1 instead using Contains() beforehand but it is likely to happen (I've seen it when training new developers). Increasing the difference between the two calls will make for bugs which aren't obvious by looking at the code.

Using default compiler warnings to solve this issue seems less friendly. I feel like making applications globalization aware is a choice a developer has to make at some point in time, but it's certainly not required for all applications. I agree with the point that adding default warnings for this, will burden beginning developers with something they have zero knowledge about. Obviously learning about this is good but if you introduce it too early, it'll be more of an annoyance than anything else. Adding the warnings won't ensure that the developer learns proper handling either, he/she might just glance at MSDN and just pick an option and paste that single option in every call it's needed without thinking much of it. Just to be done with it.

Having the behaviour Ordinal by default (which would be my preference), will guarantee consistent behaviour across platforms. If a developer starts to make his/her application culture aware, obviously at that point enforcing explicit StringComparison argument is pretty much necessary. It's not practical to rely on default overloads in that case, in part because it makes code reviews harder (you have to keep the defaults in mind every time you see a call). So at that point I feel the warnings are less of a burden and actually prove valuable. That how it's been and it seems logical to continue that approach (no default warnings for this, manually opt-in to them). The company I work is busy with making a large application globalization aware and when that direction became clear we just enabled those warnings. There were tons of them to handle but in the end the result was very clear and explicit.

pinkli commented 3 years ago

In my experience, most of the time, embeded globalization of basic string methods seems to be a premature optimization: we will do it if we really intent to; but for now, please just run the code as it appears. Just as @pyrocumulus points out. If we have to append StringComparison argument everytime to make it correct, will just frustrate many developers. And the code will read tedious. We average devs expect IndexOf(string) and Contains(string) to be semantically exchangable.

Seabizkit commented 3 years ago

What should happen is, if behavior is different and may change then its ambitious code, this should be a build error, and dev should be forced to switch to non-ambitious . I am not a fan of building code, which complies successfully but its actually not what i thought it was, i would rather it hard error and give suggestions...for replacing. ambitious code should just be removed... force this as steps for the migration path update to .net 5, not doing this will just create more confusion.... i see no other option which does not create frustration.

ericsampson commented 3 years ago

@GrabYourPitchforks , given that 5.0 is coming out very soon, which of options A,B,C,D are going to be implemented for 5.0 GA? Thanks : )

GrabYourPitchforks commented 3 years ago

@ericsampson This issue is tracking potential Future (.NET 6) work. For .NET 5, we improved the docs, and there's a global runtime switch to re-enable the legacy Windows behavior. That switch has no effect on non-Windows platforms.

naine commented 3 years ago

My two cents:

Option A: This option seems good. The con that concerns me most is the learning curve issue, but I feel this would be largely mitigated if [EditorBrowsable(Never)] were employed. Ultimately this would be a bit of a pain point, but one I feel would be worth it in the long run.

Option B: I feel like this compatibility switch will break more code than it will fix. Library authors will suddenly be forced to always be explicit with no guarantees to the defaults, and will have to go through all their existing code to ensure this. Even application authors when turning this switch on may have to go through large amounts of existing code to look for callsites that may have intentionally been using APIs that are culture-sensitive by default. Ordinal behaviour by default would have been great if it were like that from the beginning, but unfortunately it wasn't.

Option C: While this is certainly an interesting option that seems good in theory, it feels like a hack, and I'm sure it will cause all sorts of obscure issues that are hard to foresee. Plus, similar to the issues above with B, this could cause previously-correct code to break once retargeted and recompiled.

Option D: Certainly not in favour of this. The most important thing to me is to keep consistent behaviour across OSes. Depending on how easy it is to opt-in to preferring ICU (ideally there would be an MSBuild property), this might not be so bad. But ideally all developers, and especially library authors, should be pushed towards ensuring compatibility with ICU.

Option E: Honestly not a bad option.

ericsampson commented 3 years ago

@GrabYourPitchforks oh. I'm a little bit bummed then that ICU was kept opt-out vs opt-in for 5, which would have eased the transition.

jyrkive commented 3 years ago

An ordinal comparison mode ("option B") would be convenient even if it's disabled by default. As the OP points out, there are applications which simply don't care about globalization, such as company-internal tools. Executing System.Globalization.CultureInfo.CurrentCulture = System.Globalization.CultureInfo.InvariantCulture; some time in early startup is useful for them, but an "ordinal culture" would be even more convenient in applications where most string processing is intended to use ordinal comparisons.

Zintom commented 3 years ago

An ordinal comparison mode ("option B") would be convenient even if it's disabled by default. As the OP points out, there are applications which simply don't care about globalization, such as company-internal tools. Executing System.Globalization.CultureInfo.CurrentCulture = System.Globalization.CultureInfo.InvariantCulture; some time in early startup is useful for them, but an "ordinal culture" would be even more convenient in applications where most string processing is intended to use ordinal comparisons.

I like the idea of an "ordinal culture", just a really barebones way of working on strings.

fubar-coder commented 3 years ago

@ericsampson

* Make ICU opt-in instead of opt-out for .NET 5. Use the next year to educate people, allowing them time to fix their stuff.

This will not work, because people will usually only listen to education when they run into problems.

naine commented 3 years ago

Another thing about option D is that it essentially amounts to damage control so delayed that it becomes unnecessary.

Presumably the motivation behind reverting the default on Windows is to improve compatibility with pre-.NET 5 software. But the damage to that software is done. .NET 5 is out. Over the next year, people are going to have to fix this software, or find ways to work around the compatibility issues. After this, the benefits of reverting the default to NFS diminish significantly.

GSPP commented 3 years ago

This is a very thorough analysis of the options which is appreciated.

I do not see how the defaults could possibly be changed. This causes so much breakage. Much smaller compat changes are routinely rejected. This would be a massive one. I believe there has been a change in the past and, if I'm correct on that, I consider that to be a mistake.

A config switch would affect the application and its libraries together. Code from multiple teams/vendors lives in the same process. They might require different treatment. I don't think a switch can work.

Real apps consist of hundreds of thousands of lines of code spread over lots of projects and libraries. Nobody can know what string processing is happening in all of that. It is not feasible to upgrade such a body of code.

I'd do this:

Carry all APIs unchanged for the next 20 years. Compat is a key value proposition of .NET.
Guide towards always specifying a StringComparison. We have multiple tools available to guide in that way.
Make new APIs maximally consistent with the existing APIs.

I do not quite understand why the span APIs were designed in an inconsistent way. This ticket (and others) lament the fact that there's lots of inconsistency and now we have additional inconsistency. Now the inconsistency is even based on the specific type and syntax used. How was this decision reached?

I appreciate the search for solutions on this issue a lot. It is a vexing problem in a core part of the BCL design. In my opinion, we need to accept that not much can be done... And that's fine.

zvrba commented 2 years ago

This is especially pronounced in applications that perform searching and collation, such as a personnel system which lists all employees' names alphabetically. The end user might expect ordering to be performed according to the conventions of U.S. English, or of Hungarian, or of Turkish,

I cannot believe how short-sighted string APIs were designed ("CurrentCulture fits all"), given they're from early 2000s. HR person in an international corporation with HQ in USA viewing personnel list containing employees all around the world. What is the "user's expectation" about anything given a mix of English, French and, say, Japanese names written in Kanji???

That said, I'd prefer option B: a switch to change defaults to "ordinal everywhere".

ygoe commented 1 year ago

The trouble really is that nobody knows about this. I was checking some PowerShell 7 documentation and accidently stumbled over a side note that affects all .NET! After 20 years of using .NET. Even if an analyser might tell me that a specific use might possibly be unsafe, the problem lies more deeply. Strings can be two entirely different things in .NET. They can be a sequence of characters, or a language-specific thing. Both behave fundamentally differently. It should always be made clear which you intend to do. Like with HtmlString in ASP.NET. It's a string, too, but with a different meaning. Different enough to make a separate class for it. Language-specific strings have deserved the same, considering these massive consequences.

When you see a method, you have no clue what it does. And the situation is so inconsistent that it's almost impossible to remember. (Like with PHP functions.) Some methods already have overloads that allow specifying the intent. Some like string.ToLowerInvariant or static string.CompareOrdinal exist as clear alternatives. Some still cannot be configured at all, like string.CompareTo. When I'm dealing with this in my code, I add a bunch of extension methods that make this mud clear again. One of them is object.ToStringInvariant (also for float, double and decimal). More will have to follow, like string.IndexOfOrdinal (with an ignoreCase parameter, or a separate string.IndexOfOrdinalIgnoreCase method). The overloads with StringComparison or CultureInfo are too long to write everywhere. It causes too much distraction when reading the code.

So I suggest that these additional methods be added to .NET and preferred everywhere. The old methods and overloads with unclear behaviour could be made obsolete.

As a note on adding analysers causing noise in existing projects: They already do! When a new SDK or runtime version comes out, I regularly see strange notices in my code and simply disable some of them in my global .editorconfig file. So if you're scared of negative effects of new analysers – it's too late already.

And a note on making globalisation-aware code an option: It needs to be! You North-Americans probably never had much trouble with decimal commas. Some big companies even release their products and public file formats with this locale-dependent behaviour. I've seen XML files with both decimal points and commas in them. Those guys have never heard of portable file formats. Or for example Microsoft when it comes to opening CSV files in Excel. Impossible to use worldwide. What a mess.

rjgotten commented 3 months ago

The trouble really is that nobody knows about this.

The trouble really really is that the entire globalization system in .NET is still derivative of Windows OS localization settings rather than using any kind of real rigor that adheres to an all-encompassing solution, like e.g. the Intl ECMAScript Internationalization APIs which properly use the CLDR information rather than trying to hamfist it into something else.

You have base string operations. Those should be ordinal - period. And then you have collation-backed string operations, which should be dependent on internationalization and which should be explicit.

JavaScript got this right. .NET got this wrong on the first foot, and now refuses to let go of its Windows-only past.

And this shows everywhere - not just in string comparison. .NET's globalization system falls criminally short and:

can't do plural forms / plurality rules
can't do ordinal numbers
can't do proper conjuctive and disjunctive list formatting
can't do relative time
can't do units (corollary of not having support for plural forms)
can't do equality comparison or relational comparison on different collations that e.g. fold diacritics
can't do currency properly (no- it's not supposed to be tied to regions)

-- and the list goes on. And it basically will never have these without extending CultureInfo and related data, because all that is is a limited subset of the real CLDR information.

What's basically needed is to dump that entire ecosystem based on CultureInfo and treat the entire thing as a write-off. And then turn over a new leaf and start over based on the CLDR. But fat chance that that'll ever happen. So you'll continue to have these problems as artefacts of trying to shoehorn the CLDR into this old limited and constrained API that was inspired by Windows OS locales.

dotnet / runtime