dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/

String comparisons with the CompareOptions.StringSort value produce incorrect results under .NET 5 and later #102579

Open daverayment opened 3 months ago

daverayment commented 3 months ago

Description

When comparing strings, CompareOptions.StringSort should sort hyphens, apostrophes, and other non-alphanumeric symbols before alphanumeric characters, whereas the default comparison (CompareOptions.None) gives hyphens and apostrophes a very low weight. This works in .NET Framework projects. In .NET 5 and later, however, sorting with CompareOptions.StringSort produces the same results as sorting with CompareOptions.None.

Note: I am using the default ICU Unicode processing for .NET 5+ testing.

Reproduction Steps

This code is adapted from the example on the CompareOptions Enum documentation page. The word list has been copied verbatim.

using System;
using System.Collections.Generic;
using System.Globalization;

public class SamplesCompareOptions
{
    public static void Main()
    {
        var wordList = new List<string> { "cant", "bill's", "coop", "cannot", "billet", "can't", "con", "bills", "co-op" };

        wordList.Sort((x, y) => string.Compare(x, y, CultureInfo.CurrentCulture, CompareOptions.None));
        Console.WriteLine("\nAfter default sort (CompareOptions.None):");
        foreach (string word in wordList)
        {
            Console.WriteLine(word);
        }

        wordList.Sort((x, y) => string.Compare(x, y, CultureInfo.CurrentCulture, CompareOptions.StringSort));
        Console.WriteLine("\nAfter sorting with CompareOptions.StringSort:");
        foreach (string word in wordList)
        {
            Console.WriteLine(word);
        }
    }
}

DotNetFiddle for the code here.

Expected behavior

CompareOptions.StringSort should sort hyphens and apostrophes ahead of letters, giving a different order from CompareOptions.None for this word list. The results are correct in .NET Framework 4.7.2 and Roslyn 4.8:

After default sort (CompareOptions.None):
billet
bills
bill's
cannot
cant
can't
con
coop
co-op

After sorting with CompareOptions.StringSort:
bill's
billet
bills
can't
cannot
cant
co-op
con
coop

Actual behavior

In .NET 5 and later, CompareOptions.StringSort has no visible effect and produces the same results as CompareOptions.None:

After default sort (CompareOptions.None):
bill's
billet
bills
can't
cannot
cant
co-op
con
coop

After sorting with CompareOptions.StringSort:
bill's
billet
bills
can't
cannot
cant
co-op
con
coop

Regression?

According to testing on dotnetfiddle.net, the correct results were produced in .NET Framework 4.7.2 and Roslyn 4.8. .NET 5 and later produce the incorrect sort order.

Known Workarounds

A possible workaround is to switch from ICU to NLS, but I have not tested this.
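
For reference, the documented switch for opting a .NET 5+ app into NLS is the System.Globalization.UseNls runtime setting. The fragment below is a minimal sketch of the runtimeconfig.json form. NLS is only available on Windows, and nothing in this thread confirms that enabling it restores the .NET Framework ordering for the word list above, so treat it as something to try rather than a verified fix.

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.Globalization.UseNls": true
    }
  }
}
```

The same setting can also be supplied from the project file as a RuntimeHostConfigurationOption item named System.Globalization.UseNls with the value true.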

Configuration

My system:

I don't think the issue is specific to my OS or architecture, as the same problem can be seen via dotnetfiddle.

Other information

No response

dotnet-policy-service[bot] commented 3 months ago

Tagging subscribers to this area: @dotnet/area-system-globalization
See info in area-owners.md if you want to be subscribed.

tarekgh commented 3 months ago

In .NET 5.0 and later, we switched to using the ICU library. For more information, please refer to this article.

You may notice some behavioral differences between the legacy NLS (used in .NET Framework) and ICU. In ICU, the StringSort behavior is enabled by default, rendering the StringSort option ineffective. This default setting is why you consistently see the following order:

bill's
billet
bills
can't
cannot
cant
co-op
con
coop

This behavior is explained in the comment in the code here. We do not plan to change this behavior in the future as we adhere to ICU behavior, which aligns with the Unicode Standard.

We may add a note about this specific case to the article.
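
A quick way to see the behavior described above on a given runtime is to sort the word list from the reproduction steps with both options and compare the results. The following is a sketch only: the class name StringSortCheck and the helper SortWith are hypothetical, it uses the invariant culture rather than CultureInfo.CurrentCulture to keep runs comparable across machines, and the IsIcuMode check is adapted from the detection snippet in the .NET globalization and ICU documentation. No expected output is shown because it depends on the runtime and globalization mode.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class StringSortCheck
{
    // The word list from the reproduction steps in this issue.
    private static readonly string[] Words =
    {
        "cant", "bill's", "coop", "cannot", "billet", "can't", "con", "bills", "co-op"
    };

    // Heuristic adapted from the ICU documentation: under ICU, the SortId of the
    // sort version encodes the same value as FullVersion.
    public static bool IsIcuMode()
    {
        SortVersion sortVersion = CultureInfo.InvariantCulture.CompareInfo.Version;
        byte[] bytes = sortVersion.SortId.ToByteArray();
        int version = bytes[3] << 24 | bytes[2] << 16 | bytes[1] << 8 | bytes[0];
        return version != 0 && version == sortVersion.FullVersion;
    }

    // Returns a sorted copy of the word list using the given compare options.
    // Assumption: invariant culture, not the current culture used in the repro.
    public static string[] SortWith(CompareOptions options)
    {
        CompareInfo compareInfo = CultureInfo.InvariantCulture.CompareInfo;
        string[] copy = (string[])Words.Clone();
        Array.Sort(copy, (x, y) => compareInfo.Compare(x, y, options));
        return copy;
    }

    public static void Main()
    {
        string[] withNone = SortWith(CompareOptions.None);
        string[] withStringSort = SortWith(CompareOptions.StringSort);

        Console.WriteLine($"ICU mode:   {IsIcuMode()}");
        Console.WriteLine($"None:       {string.Join(", ", withNone)}");
        Console.WriteLine($"StringSort: {string.Join(", ", withStringSort)}");
        // Per the explanation above, the two orders are expected to match under ICU.
        Console.WriteLine($"Same order: {withNone.SequenceEqual(withStringSort)}");
    }
}
```

Running this once under .NET Framework and once under .NET 5 or later should make the difference between NLS and ICU visible without relying on the machine's current culture.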

daverayment commented 3 months ago

@tarekgh Thank you for the quick response.

Sorry, I do see now that the StringSort option is being applied by default in .NET 5+ rather than not being applied at all.

This still means the CompareOptions documentation is incorrect for .NET 5 and later. The example code says to expect different outputs for None and StringSort options.

I will raise a separate documentation issue for that page and refer back here. Thank you also for suggesting an update to the ICU article to mention the CompareOptions change. That would be very useful, as I read that article myself while troubleshooting.

Thanks again!

daverayment commented 3 months ago

I've raised a new documentation issue for the CompareOptions enum page: https://github.com/dotnet/docs/issues/41052