dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.98k stars 4.66k forks source link

Breaking change with string.IndexOf(string) from .NET Core 3.0 -> .NET 5.0 #43736

Closed jbogard closed 3 years ago

jbogard commented 3 years ago

Description

I'm extending a package to support .NET 5.0 and ran into a breaking change. Given the console application:

using System;

namespace ConsoleApp
{
    class Program
    {
        static void Main(string[] args)
        {
            var actual = "Detail of supported commands\n============\n## Documentation produced for DelegateDecompiler, version 0.28.0 on Thursday, 22 October 2020 16:03\n\r\nThis file documents what linq commands **DelegateDecompiler** supports when\r\nworking with [Entity Framework Core](https://docs.microsoft.com/en-us/ef/core/) (EF).\r\nEF has one of the best implementations for converting Linq `IQueryable<>` commands into database\r\naccess commands, in EF's case T-SQL. Therefore it is a good candidate for using in our tests.\r\n\r\nThis documentation was produced by compaired direct EF Linq queries against the same query implemented\r\nas a DelegateDecompiler's `Computed` properties. This produces a Supported/Not Supported flag\r\non each command type tested. Tests are groups and ordered to try and make finding things\r\neasier.\r\n\r\nSo, if you want to use DelegateDecompiler and are not sure whether the linq command\r\nyou want to use will work then clone this project and write your own tests.\r\n(See [How to add a test](HowToAddMoreTests.md) documentation on how to do this). \r\nIf there is a problem then please fork the repository and add your own tests. \r\nThat will make it much easier to diagnose your issue.\r\n\r\n*Note: The test suite has only recently been set up and has only a handful of tests at the moment.\r\nMore will appear as we move forward.*\r\n\r\n\r\n### Group: Unit Test Group\n#### [My Unit Test1](../TestGroup01UnitTestGroup/Test01MyUnitTest1):\n- Supported\n  * Good1 (line 1)\n  * Good2 (line 2)\n\r\n#### [My Unit Test2](../TestGroup01UnitTestGroup/Test01MyUnitTest2):\n- Supported\n  * Good1 (line 1)\n  * Good2 (line 2)\n\r\n\r\n\nThe End\n";

            var expected = "\n#### [My Unit Test2](";

            Console.WriteLine($"actual.Contains(expected): {actual.Contains(expected)}");
            Console.WriteLine($"actual.IndexOf(expected): {actual.IndexOf(expected)}");
        }
    }
}

I get different results based on the runtime from .NET Core 3.0 -> .NET 5.0:

.NET Core 3.0:

actual.Contains(expected): True
actual.IndexOf(expected): 1475

.NET 5.0:

actual.Contains(expected): True
actual.IndexOf(expected): -1

Configuration

Windows 10 Pro Build 19041 x64 .NET Core 3.1.9 .NET 5.0.0-rc.2.20475.5

Regression?

Yes, it worked through .NET Core 3.1.9

danmoseley commented 3 years ago

@tarekgh

ghost commented 3 years ago

Tagging subscribers to this area: @tarekgh, @safern, @krwq See info in area-owners.md if you want to be subscribed.

tarekgh commented 3 years ago

This is by design as in .NET 5.0 we have switched using ICU instead of NLS. You can look at https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/globalization-icu for more details.

You have the option to use the config switch System.Globalization.UseNls to go back to old behavior but we don't recommend doing that as ICU is more correct and moving forward using ICU will give consistency across OS's.

tarekgh commented 3 years ago

forgot to say, if you want IndexOf behave as Contains, you should use Ordinal comparisons at that time.

actual.IndexOf(expected, StringComparison.Ordinal)
safern commented 3 years ago

You have the option to use the config switch System.Globalization.UseNls to go back to old behavior but we don't recommend doing that as ICU is more correct and moving forward using ICU will give consistency across OS's.

Yeah, if you run this code in Unix targeting netcoreapp3.1 you will see this same behavior:

 santifdezm  DESKTOP-1J7TFMI  ~  experimental  indexof  $   /home/santifdezm/experimental/indexof/bin/Debug/netcoreapp3.1/linux-x64/publish/indexof
actual.Contains(expected): True
actual.IndexOf(expected): -1

and as @tarekgh with Ordinal comparison it returns the expected result.

 santifdezm  DESKTOP-1J7TFMI  ~  experimental  indexof  $  /home/santifdezm/experimental/indexof/bin/Debug/net5.0/linux-x64/publish/indexof
actual.Contains(expected): True
actual.IndexOf(expected): 1475
ericstj commented 3 years ago

I think this is failing because the mix of \r\n and \n in the source string. If I replace all instances of \r\n with \n it works. Same is true if I make everything \r\n. It's just the mix that is resulting in different results from ICU's comparison.

GrabYourPitchforks commented 3 years ago

Issue as reported on Twitter was that within the 5.0 runtime, repeated calls to string.IndexOf with the same inputs was giving different results on each call.

https://twitter.com/jbogard/status/1319381273585061890?s=21

Edit: Above was a misunderstanding.

safern commented 3 years ago

@GrabYourPitchforks can we update the issue title then? As this is technical a breaking change, but this happens on Windows and Unix... right?

GrabYourPitchforks commented 3 years ago

I pinged Jimmy offline for clarification. It's possible I misunderstood his original issue report. 280-char forums aren't always efficient at communicating bugs clearly. ;)

tarekgh commented 3 years ago

Just to clarify, Contains API is performing ordinal operation. IndexOf without any string comparison flags is linguistic operation and not ordinal. If want to compare Contains behavior with IndexOf, need to use IndexOf(expected, StringComparison.Ordinal). If need to learn more about the difference, https://docs.microsoft.com/en-us/dotnet/csharp/how-to/compare-strings is useful link.

GrabYourPitchforks commented 3 years ago

I received clarification on Twitter. The app isn't calling IndexOf in a loop. This is just a standard 3.0 vs. 5.0 behavioral difference report.

tarekgh commented 3 years ago

@GrabYourPitchforks could you share the link https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/globalization-icu on your twitter replies and mention we have a config switch to go back to old behavior?

safern commented 3 years ago

I received clarification on Twitter. The app isn't calling IndexOf in a loop. This is just a standard 3.0 vs. 5.0 behavioral difference report.

Thanks, @GrabYourPitchforks... based on this, closing it as by design.

tarekgh commented 3 years ago

To add more here, if you want to get the old behavior without switching back to NLS, you can do

    CultureInfo.CurrentCulture.CompareInfo.IndexOf(actual, expected, CompareOptions.IgnoreSymbols)

or

    actual.IndexOf(expected, StringComparison.Ordinal)

instead of

    actual.IndexOf(expected)

and you should get the desired behavior.

ForNeVeR commented 3 years ago

I can't see anything about \r\n vs \n in the linked ICU-related documentation (https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/globalization-icu).

Was it really a planned change?

tarekgh commented 3 years ago

@ForNeVeR it will be hard to list every single difference between ICU and NLS. the document is talking about the main change of switching to ICU. As I pointed before earlier, it is not right to compare the results of Contains with IndexOf without the StringComparison parameters. I have listed above a couple of ways you can get the previous behavior if desired to. From the report of this issue, it looks to me the usage of IndexOf should use Ordinal option and it is incorrect to use the linguistic comparisons in such case. Using linguistic comparison in such case can depend on the current culture which possible to give different results on different environments.

Was it really a planned change?

Yes switching to ICU is intentional change for different reasons. Windows currently promoting using ICU over NLS. ICU is the future anyway. Also, ICU will give the opportunity to have consistent behaviors across Windows/Linux/OSX or any supported platforms. Using ICU will give the opportunity to the apps to customize the globalization behavior if they desired to.

As the document indicated, you still have the option to switch back to old behavior if you want to.

hypersw commented 3 years ago

Ouch, the referenced doc says that ICU/NLS behavior on Windows might silently switch based on the icu.dll availability in the actual environment. This might come as a big surprise for published self-contained apps. I'd expect .NET to ship ICU to work around this issue if the switch were decided on, and as ICU is not available on all target environments. This optional runtime dependency makes things even funnier.

tarekgh commented 3 years ago

I'd expect .NET to ship ICU to work around this issue if the switch were decided on, and as ICU is not available on all target environments. This optional runtime dependency makes things even funnier.

The ICU now is published as NuGet package. The apps can use such packages to have the self contained app ensure having ICU. look at the app-local section in the doc. In short, the app has full control on the behavior want to get.

ForNeVeR commented 3 years ago

@tarekgh, I agree that the different results between Contains and IndexOf aren't the problem per se.

The problem is clearly IndexOf which is unable to find one ASCII-only string inside of another ASCII-only string (I'm not sure there ever was any locale-dependent behavior imposed on ASCII-only strings!).

This isn't something I would expect from any locale/NLS/ICU-related changes; in fact, I couldn't think of any other programming language/runtime behaving like that.

Here's a simplified test case, broken (I mean, giving me totally unexpected result) on .NET 5 RC 2:

var actual = "\n\r\nTest";
var expected = "\nTest";

Console.WriteLine($"actual.IndexOf(expected): {actual.IndexOf(expected)}"); // => -1

Should it really work like that? Also, why? What does it trying to do actually?

Was it really a planned change?

Yes switching to ICU is intentional change for different reasons.

I'm sorry, but I don't believe this was a planned change, so I'd like to emphasize: I couldn't imagine anyone planning such a change. Like, folks from .NET team sat together and discussed:

Does string "\n\r\nTest" contain "\nTest" with ICU enabled? No, clearly doesn't!

And nobody complained? Not a chance!

This doesn't look like a planned or expected change, and instead looks like a very serious bug, a big compatibility blocker. Because of that, new and ported .NET applications won't properly work on the new runtime, because they won't be able to find substrings inside of string!

Why does ICU care about the line endings anyway? Do some locales have their own locale-specific line endings?

P.S. Yes, you could argue that one should really always call some variant of culture-independent IndexOf, like with the ordinal flag. But, if you've decided to break it that hard in .NET 5, couldn't you just make it to use the sane, ordinal default then? I think it would break less applications than the current change we're seeing in .NET 5 RC 2.

Also, I think we all understand that, despite IndexOf always behaving in a culture-specific manner, there're tons of code in the wild that use IndexOf without the ordinal flags, and that code used to work (in some/most cases, at least). And it will stop working after .NET 5 update.

tarekgh commented 3 years ago

The problem is clearly IndexOf which is unable to find one ASCII-only string inside of another ASCII-only string (I'm not sure there ever was any locale-dependent behavior imposed on ASCII-only strings!).

It is not true ASCII is locale-independent. look at the link http://userguide.icu-project.org/collation/concepts as an example how the behavior of ASCII characters can differs for different cultures.

For example, in the traditional Spanish sorting order, "ch" is considered a single letter. All words that begin with "ch" sort after all other words beginning with "c", but before words starting with "d".
Other examples of contractions are "ch" in Czech, which sorts after "h", and "lj" and "nj" in Croatian and Latin Serbian, which sort after "l" and "n" respectively.

Also, I want to clarify that ICU picks its data and behavior from Unicode Standard which is well thought by many experts. @GrabYourPitchforks is going to post more details about \r\n\ case we are talking about here. but meanwhile, you may familiarize yourself with the doc https://unicode.org/reports/tr29/ especially in the sections mentioning the following:

Do not break between a CR and LF. Otherwise, break before and after controls.
--
GB3 | CR | × | LF
GB4 | (Control \| CR \| LF) | ÷ |  
GB5 |   | ÷ | (Control \| CR \| LF)

Why does ICU care about the line endings anyway? Do some locales have their own locale-specific line endings?

This is addressed in last paragraph.

I'm sorry, but I don't believe this was a planned change, so I'd like to emphasize: I couldn't imagine anyone planning such a change. Like, folks from .NET team sat together and discussed: This doesn't look like a planned or expected change, and instead looks like a very serious bug, a big compatibility blocker. Because of that, new and ported .NET applications won't properly work on the new runtime, because they won't be able to find substrings inside of string!

This is well planned worked and thought deeply in it. you may look at the issue https://github.com/dotnet/runtime/issues/826 which we published long ago and shared it publicly. I want to emphasize too that globalization behavior can change at anytime not only for the .NET but for the OS's and other platforms. This is also why we supported ICU app-local feature to allow apps use specific ICU version to ensure the behavior they use is not going to change. Another thing, Windows itself is in the process promoting using ICU and one day ICU behavior will be what the majority of the users is going to use.

like with the ordinal flag. But, if you've decided to break it that hard in .NET 5, couldn't you just make it to use the sane, ordinal default then? I think it would break less applications than the current change we're seeing in .NET 5 RC 2.

Actually we have tried before to make the Ordinal behavior as the default before during Silverlight releases and that caused much more problems than what reported here. We are looking at more ways to help the developers to be conscious when calling something like IndexOf and be intentionally provide StringComparison flags to express the intention. We welcome any ideas you may come up with too.

Also, I think we all understand that, despite IndexOf always behaving in a culture-specific manner, there're tons of code in the wild that use IndexOf without the ordinal flags, and that code used to work (in some/most cases, at least). And it will stop working after .NET 5 update.

That is why we are providing a config switch to switch back to old behavior if you desire too. look at System.Globalization.UseNls

ForNeVeR commented 3 years ago

@tarekgh, thanks for thorough explanation!

For now, I think it's better for me to wait for details on this particular \r\n behavior. It isn't clear to me how IndexOf uses "Grapheme Cluster Boundary Rules" (and whether it should do that) when performing this particular search).

Also, even if the Unicode spec is relevant here (which may very well be!), from reading https://unicode.org/reports/tr29/, I'm not sure it forbids to match that last \n. As I read the spec, it says CR | × | LF, where × means "No boundary (do not allow break here)". So, when breaking a sequence \r\n\n, it only forbids to place the "break" between the first and the second characters, but it should be okay to place the "break" before the third one, no? So, I read \r\n\n as two separate "grapheme clusters", and, even if IndexOf has to only match full grapheme clusters and never touch parts of clusters, it should still find substring \nTest inside of string \n\r\nTest.

Also, are you telling that other runtimes/programming languages relying on ICU and/or Unicode spec should behave the same w.r.t. this particular example?

We are looking at more ways to help the developers to be conscious when calling something like IndexOf and be intentionally provide StringComparison flags to express the intention. We welcome any ideas you may come up with too.

(A necessary disclaimer: I work for JetBrains on several projects, including ReSharper.)

I didn't initially wanted to bring this point here to not sound like an advertisement, but I guess this is very relevant, so I'll have to do it. ReSharper by default will show the following warning for the user code calling IndexOf: image

(please note that I wasn't aware of this particular ReSharper diagnostic before being involved in this thread, so I'm not participating here just to bring this argument)

So, I think it would be a very good idea to show such notification by default in every other tool, too, or maybe even totally deprecate this bogus method with such notice.

tarekgh commented 3 years ago

@ForNeVeR

Also, even if the Unicode spec is relevant here (which may very well be!), from reading unicode.org/reports/tr29, I'm not sure it forbids to match that last \n. As I read the spec, it says CR | × | LF, where × means "No boundary (do not allow break here)". So, when breaking a sequence \r\n\n, it only forbids to place the "break" between the first and the second characters, but it should be okay to place the "break" before the third one, no? So, I read \r\n\n as two separate "grapheme clusters"

That is correct. \r\n\n will be 2 clusters as \r\n and \n.

and, even if IndexOf has to only match full grapheme clusters and never touch parts of clusters, it should still find substring \nTest inside of string \n\r\nTest.

That is incorrect. \n\r\nTest will be split into the parts. \n, \r\n, and Test. it is obvious \nTest cannot part of this string. think about it of replacing the cluster \r\n with some symbol X. now the source string will be \nXTest which doesn't contains \nTest.

Also, are you telling that other runtimes/programming languages relying on ICU and/or Unicode spec should behave the same w.r.t. this particular example?

if using default collation strength level, then the answer is yes. ICU can allow changing the strength level which can affect the result. For example, as I mentioned earlier, doing something like:

CultureInfo.CurrentCulture.CompareInfo.IndexOf(actual, expected, CompareOptions.IgnoreSymbols)

will change the strength level and will make the operation ignore the symbols (which will change the behavior of \n and \r as it will be ignored at that time)

Also, I have wrote pure ICU native C app and ran the same case:

void SearchString(const char *target, int32_t targetLength, const char *source, int32_t sourceLength)
{
    static UChar usource[100];
    static UChar utarget[100];

    u_charsToUChars(source, usource, sourceLength);
    u_charsToUChars(target, utarget, targetLength);

    UErrorCode status = U_ZERO_ERROR;
    UStringSearch* pSearcher = usearch_open(utarget, targetLength, usource, sourceLength, "en_US", nullptr, &status);
    if (!U_SUCCESS(status))
    {
        printf("usearch_open failed with %d\n", status);
        return;
    }

    int32_t index = usearch_next(pSearcher, &status);
    if (!U_SUCCESS(status))
    {
        printf("usearch_next failed with %d\n", status);
        return;
    }

    printf("search result = %d\n", index);
    usearch_close(pSearcher);
}

int main()
{
    SearchString("\nT", 2, "\r\nT", 3);
    SearchString("\nT", 2, "\n\nT", 3);
}

this app will output the results:

search result = -1
search result = 1

which is identical to the behavior you are seeing with .NET.

For now, I think it's better for me to wait for details on this particular \r\n behavior. It isn't clear to me how IndexOf uses "Grapheme Cluster Boundary Rules" (and whether it should do that) when performing this particular search).

For sure clustering is affecting the collation operation. If you look at http://unicode.org/reports/tr29/tr29-7.html, it is clearly states the following:

Grapheme clusters include, but are not limited to, combining character sequences such as (g + °), digraphs such as Slovak “ch”, and sequences with letter modifiers such as kw. Grapheme cluster boundaries are important for collation, regular-expressions, and counting “character” positions within text. Word boundaries, line boundaries and sentence boundaries do not occur within a grapheme cluster. In this section, the Unicode Standard provides a determination of where the default grapheme boundaries fall in a string of characters. This algorithm can be tailored for specific locales or other customizations, which is what is done in providing contracting characters in collation tailoring tables.

I am not sure if there will be more details but I'll let @GrabYourPitchforks comment if he has more to add here.

So, I think it would be a very good idea to show such notification by default in every other tool, too, or maybe even totally deprecate this bogus method with such notice.

Thanks! This is same direction we are thinking in too.

jbogard commented 3 years ago

For my own sake, I compared the various overloads between versions:

Method netcoreapp3.1 net5.0
actual.Contains(expected) True True
actual.IndexOf(expected) 1475 -1
actual.Contains(expected, StringComparison.CurrentCulture) True False
actual.IndexOf(expected, StringComparison.CurrentCulture) 1475 -1
actual.Contains(expected, StringComparison.Ordinal) True True
actual.IndexOf(expected, StringComparison.Ordinal) 1475 1475
actual.Contains(expected, StringComparison.InvariantCulture) True False
actual.IndexOf(expected, StringComparison.InvariantCulture) 1475 -1
rcollina commented 3 years ago

Please do include an analyzer for this.

Aaronontheweb commented 3 years ago

This seems like one of those changes that, while good in the long run, while create a massive amount of churn once .NET 5 launches. So if the behavior of these methods differs between .NET 5 and .NET Core 3.1, what's going to happen when a .NET 5 calls an object defined inside a .NET Standard 2.0 library that manipulates a string passed into it from the .NET callsite? Does the old behavior get used or the new behavior?

jbogard commented 3 years ago

@Aaronontheweb new behavior. I saw this initially from an assertion in NUnit3, which targets netstandard2.0. After upgrading, my test started failing when I only changed the target framework.

Aaronontheweb commented 3 years ago

That's not great - can't control what older libraries I'm referencing do if I want to upgrade.

petarrepac commented 3 years ago

How many apps will not detect that in unit tests and put that into production? Did .NET team consider the pain and problems and cost this could cause?

pr0vieh commented 3 years ago

That's not great - can't control what older libraries I'm referencing do if I want to upgrade.

nice Trap! happy time wasting by hunting weird bugs 🎉 but why InvariantGlobalization doesn't help out with this problem ?

nbevans commented 3 years ago

Please do include an analyzer for this.

Yeah. And don't forget F# support.

How many apps will not detect that in unit tests

Zero - since unit tests aren't meant to test external libs/frameworks like the .NET BCL itself.

My view is that there should be an Assembly-level Attribute which can control the mode of this behaviour change. At least then, you can opt-in/out of it at a per-assembly level. This then means that the .NET Standard problem goes away too.

reflectronic commented 3 years ago

That's not great - can't control what older libraries I'm referencing do if I want to upgrade.

I don't see why you can't control it. All string comparisons are either using ICU or NLS. You can opt-out of ICU using the compat switch if you would like, and all of your libraries will revert to the old behavior.

You cannot rely on globalization data staying stable over time. Take it from the Windows team that they are not afraid to break people who depend on stable globalization data. Globalization functions should be a black box; I do not think it makes sense for libraries (particularly ones that target .NET Standard) to say that they are depending on implementation details like this.

I'm sure that just as many people have complained about .NET globalization functions returning different results on Windows vs. Linux (and perhaps even more people haven't even noticed yet). It's better to unify the behavior between Windows and other platforms, and any correct code should not be relying on globalization data being immutable, regardless.

Tyrrrz commented 3 years ago

Would you consider to make a breaking change to also make StringComparison.Ordinal the default comparison strategy? Seeing as globalization is so unstable, it makes sense that at least the default implementation uses a stable algorithm instead. I'm willing to bet that 99.9% of people who use string.Equals(...) or string.Contains(...) etc. without passing StringComparison are not doing it with the explicit intention of handling weird quirks related to locales.

Edit: I guess my question has already been answered:

Actually we have tried before to make the Ordinal behavior as the default before during Silverlight releases and that caused much more problems than what reported here. We are looking at more ways to help the developers to be conscious when calling something like IndexOf and be intentionally provide StringComparison flags to express the intention. We welcome any ideas you may come up with too.

Aaronontheweb commented 3 years ago

You can opt-out of ICU using the compat switch if you would like, and all of your libraries will revert to the old behavior.

I make libraries for a living, not applications. I'd prefer to have a compile-time way of handling this.

Most of the work we do is InvariantCulture, which per my previous understanding is supposed to be immutable by design. Looks like IndexOf's behavior is different between .NET 5.0 and .NET Core 3.1 under those circumstances too.

petarrepac commented 3 years ago

How can any analyzer help for existing projects?

isaacabraham commented 3 years ago

@petarrepac that's also a C#-only thing.

Tyrrrz commented 3 years ago

@isaacabraham and VB too ;)

safern commented 3 years ago

I'd prefer to have a compile-time way of handling this.

@Aaronontheweb you do have a compile-time way of handling this (when compiling apps). You can add this to your project:

<ItemGroup>
  <RuntimeHostConfigurationOption Include="System.Globalization.UseNls" Value="true" />
</ItemGroup>

EDIT: this is only for apps consuming a library, unfortunately you can't control this when writing a library.

In the long run the Windows team is promoting to move to ICU, so at some point ICU will be the globalization story, and we're just a thin wrapper around OS libraries.

what's going to happen when a .NET 5 calls an object defined inside a .NET Standard 2.0 library that manipulates a string passed into it from the .NET callsite? Does the old behavior get used or the new behavior?

The new behavior will get used, because it depends on the runtime and .NET Standard is just a standard for runtimes to implement. However, notice that this is bringing platform consistency in between Unix and Windows, if you run this same libraries which people have concerns about in Unix, you will get the ICU result as ICU is the backing library in Unix.

tarekgh commented 3 years ago

@reflectronic is right in his all points https://github.com/dotnet/runtime/issues/43736#issuecomment-716681586.

to comment on @jbogard results mentioned here https://github.com/dotnet/runtime/issues/43736#issuecomment-716527590, you can just summarize the results by comparing between the Linguistic behavior between Windows and ICU. By the way, ICU now used by many applications on Windows and it is expected the usage to increase. Also, these results is excluding Linux with .NET Core 3.1 and below. which will show consistency between .NET 5.0 and previous versions on Linux.

The point about the library that used to work will be broken, this is not fully true because such libraries were already broken on Linux.

Would you consider to make a breaking change to also make StringComparison.Ordinal the default comparison strategy?

I mentioned earlier we already tried that before but the size of complains was very big that we couldn't apply it. We are looking at the ways to help developers to be conscious specifying the string comparison flags when calling the collation APIs.

I make libraries for a living, not applications. I'd prefer to have a compile-time way of handling this.

Yes, some analyzer can help in such case. usually look at the collation APIs calls and look which one was not showing the intent of using ordinal or linguistic operation.

How can any analyzer help for existing projects?

Analyzers scan the code and detect the calls of collations APIs that don't pass flags would help to look at such calls and fix if detected have a problem.

that's also a C#-only thing.

The change is inside the .NET runtime which should be global and not restricted to C#.

isaacabraham commented 3 years ago

@tarekgh how does a C# analyser manifest itself in a VB or F# codebase?

reflectronic commented 3 years ago

Most of the work we do is InvariantCulture, which per my previous understanding is supposed to be immutable by design.

Critically, before this change, it was impossible to actually depend on this for cross-platform code; what NLS and ICU behave like, even when using invariant culture, are not always the same (as evidenced by this issue). Like @tarekgh said, this code has been acting differently on Linux this whole time. Now that this change has been made, for any up-to-date Windows install, what "invariant culture" means should actually be consistent on all platforms.

Aaronontheweb commented 3 years ago

This may come as a surprise, but we did eventually find and fix dozens of bugs related to platform differences over the years as a result of user reporting, as I'm sure many other library authors have done for years and years.

I'm just not excited at the prospect of .NET 5 introducing a new crop of unknown unknowns and revisiting those fixes for not just my libraries, but our downstream dependencies too. That's significant economic cost to us that doesn't create new productivity improvements for our users. Someone in MSFT ought to give that some consideration in their costs / benefits discussion on making this change.

Edit: like, I get the technical merits already - yes, thank you. Please help sell the non-obvious economic benefits of making this change as opposed to the costs. Why should users take the leap and upgrade to .NET 5 anyway given what a rat's nest this issue appears to be?

jeffhandley commented 3 years ago

@tarekgh how does a C# analyser manifest itself in a VB or F# codebase?

We can cover C# and VB with a single analyzer (we strive to make our analyzers language-agnostic wherever possible), but we can't get analyzer coverage for F# at this time.

tarekgh commented 3 years ago

@Aaronontheweb anyone using cultural functionality and assume it is not going to change is already broken even if we didn't do this ICU changes. Look at the blog https://docs.microsoft.com/en-us/archive/blogs/shawnste/locale-culture-data-churn from Windows team which telling even NLS behavior is changing for the sake of improvements too. So, the issue here is not about moving to ICU more than just catching any wrong assumptions from the apps/libraries perspective. upgrading to 5.0 is same as upgrading to other previous versions. apps will get a lot of new cool features and apps need to be tested as could be some breaking changes between releases. I am not considering globalization behavior change is really breaking as we always telling globalization can change at anytime between OS's and OS versions. As indicated before, we have the config switch to choose to upgrade to 5.0 with keep using NLS. that will make ICU is not really a factor in the upgrade decision. now with ICU, the apps will have the opportunity to get more consistency across OS's and even can have more control on the globalization behavior if decided to use the app-local ICU. We are providing much more control to the apps than before.

GrabYourPitchforks commented 3 years ago

Related: https://github.com/dotnet/runtime/issues/43802

(That issue doesn't track IndexOf per se. Rather, it discusses the unintended consequences of the the Compare routine defaulting to a culture-aware comparer.)

petarrepac commented 3 years ago

How can any analyzer help for existing projects?

Analyzers scan the code and detect the calls of collations APIs that don't pass flags would help to look at such calls and fix if detected have a problem.

I guess analyzers must be added to the csproj. This will not happen automatically. So, a lot of existing projects will be moved to .NET 5 without those analyzers. Plus, as already mentioned, it will not help for F# projects.

isaacabraham commented 3 years ago

@jeffhandley thank you, that confirms what I already understood. So, it's important to acknowledge that the possible "workaround" will not help F# users (a small market but one that is nonetheless fully supported by MS as a first-class citizen of .NET). I have no idea what this means:

The change is inside the .NET runtime which should be global and not restricted to C#.

tarekgh commented 3 years ago

I have no idea what this means: The change is inside the .NET runtime which should be global and not restricted to C#.

I meant any language using .NET runtime will be affected and not only C#.

aolszowka commented 3 years ago

I am going to strongly voice my opinion again that out of the box there should be an analyzer to uncover these gotchas in the new world, especially if the plan is to change current behavior.

I completely understand the technical merits of doing so and am not suggesting that you not make a change (long term it seems like this is the correct move). Nor am I saying that it was not already documented as best practice. What I am saying is that we really need this to be a big, red, flashing error to developers attempting to lift to .NET 5. Out of the box Developers are going to assume this "just works" otherwise.

Right now today you can use this Roslyn Analyzer Library from @meziantou to find areas that are affected: https://github.com/meziantou/Meziantou.Analyzer/tree/master/docs.

In this particular case this will throw a MA0074 - Avoid implicit culture-sensitive methods

image

That being said this really needs to be out of the box, I have opened up this Roslyn issue here: https://github.com/dotnet/roslyn-analyzers/issues/4367

isaacabraham commented 3 years ago

@tarekgh Thanks for clarifying. My original point was that analysers are not the answer here if you are looking for a solution that works for all .NET users.

fschmied commented 3 years ago

We are looking at more ways to help the developers to be conscious when calling something like IndexOf and be intentionally provide StringComparison flags to express the intention. We welcome any ideas you may come up with too.

How about deprecating the old methods (using an [Obsolete] attribute)?