dotnet / core

.NET news, announcements, release notes, and more!
https://dot.net

Text ordering in .Net core #7668

Closed (armandsuiska closed this issue 2 years ago)

armandsuiska commented 2 years ago

I have a problem with text ordering in .NET Core.

List.Sort() and LINQ's OrderBy() do not order accented (non-ASCII) characters correctly.

Here is a simple .NET 6 console application:

using System;
using System.Collections.Generic;

namespace Tests
{
    internal class Program
    {
        static void Main(string[] args)
        {
            Console.OutputEncoding = System.Text.Encoding.UTF8;
            var words = new List<string>() { "Ādb", "Aug", "Ārz", "Alū", "Ada", "Aiz", "Āda", "Ārv" };
            words.Sort();
            Console.WriteLine(String.Join(", ", words));
            words = new List<string>() { "Čdb", "Cug", "Črz", "Clū", "Cda", "Ciz", "Čda", "Črv" };
            words.Sort();
            Console.WriteLine(String.Join(", ", words));
            Console.ReadKey();
        }
    }
}

Output:

Ada, Āda, Ādb, Aiz, Alū, Ārv, Ārz, Aug
Cda, Ciz, Clū, Cug, Čda, Čdb, Črv, Črz

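I see that the letters C and Č are ordered correctly (all C words come before all Č words), but A and Ā are not: the Ā words are interleaved with the A words. The same code orders them correctly in .NET Framework 4.8, so this feels like a bug to me. Can someone check and confirm?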
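For context, `List<string>.Sort()` with no arguments uses the default string comparison, which is culture-sensitive and follows the current culture, so the result depends on that culture and on the collation data the runtime has available. The following is a minimal sketch, not a fix, that makes the comparer explicit so the three behaviours can be compared side by side. The class name `SortDemo` and the helper `SortWith` are hypothetical names introduced only for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class SortDemo
{
    // Hypothetical helper: returns a sorted copy using the comparer passed in.
    public static List<string> SortWith(IEnumerable<string> words, IComparer<string> comparer)
    {
        var copy = new List<string>(words);
        copy.Sort(comparer);
        return copy;
    }

    public static void Main()
    {
        Console.OutputEncoding = System.Text.Encoding.UTF8;
        var words = new[] { "Ādb", "Aug", "Ārz", "Alū", "Ada", "Aiz", "Āda", "Ārv" };

        // Culture-sensitive, current culture: the same rules as words.Sort() with no arguments.
        // The output depends on the machine's culture and collation data.
        Console.WriteLine(string.Join(", ", SortWith(words, StringComparer.CurrentCulture)));

        // Culture-sensitive, invariant culture: independent of the user's culture,
        // but still dependent on the collation library in use.
        Console.WriteLine(string.Join(", ", SortWith(words, StringComparer.InvariantCulture)));

        // Ordinal: compares UTF-16 code units, so 'A' (U+0041) sorts before 'Ā' (U+0100).
        Console.WriteLine(string.Join(", ", SortWith(words, StringComparer.Ordinal)));
        // prints: Ada, Aiz, Alū, Aug, Āda, Ādb, Ārv, Ārz
    }
}
```

Ordinal comparison happens to place every A word before every Ā word for this input, but it is not a linguistic sort (for example, all uppercase ASCII letters sort before all lowercase ones), so it is only a stand-in when culture rules are not required.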

pratikkabade commented 2 years ago

(screenshot) This is what I got. It seems like a bug in itself.

bartonjs commented 2 years ago

All of our culture-aware data on Linux comes from ICU (https://icu.unicode.org/home).

If you put the input data here into their online tool at https://icu4c-demos.unicode.org/icu-bin/collation.html, you'll see the same output:

input:

Ādb
Aug
Ārz
Alū
Ada
Aiz
Āda
Ārv
Cda
Ciz
Clū
Cug
Čda
Čdb
Črv
Črz

output:

<1 [5] Ada
<2 [7] Āda
<1 [1] Ādb
<1 [6] Aiz
<1 [4] Alū
<1 [8] Ārv
<1 [3] Ārz
<1 [2] Aug
<1 [9] Cda
<2 [13] Čda
<1 [14] Čdb
<1 [10] Ciz
<1 [11] Clū
<1 [15] Črv
<1 [16] Črz
<1 [12] Cug

But if you switch the locale to cs (Czech), it becomes:

<1 [5] Ada
<2 [7] Āda
<1 [1] Ādb
<1 [6] Aiz
<1 [4] Alū
<1 [8] Ārv
<1 [3] Ārz
<1 [2] Aug
<1 [9] Cda
<1 [10] Ciz
<1 [11] Clū
<1 [12] Cug
<1 [13] Čda
<1 [14] Čdb
<1 [15] Črv
<1 [16] Črz

So this looks like just a difference between Windows NLS data and Linux ICU data, and that is out of .NET's hands. See https://docs.microsoft.com/en-us/dotnet/core/extensions/globalization-icu
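To check which collation library a given process is actually using, the linked page describes a probe based on the sort version reported by `CompareInfo`. The following is a minimal sketch along those lines. The class name `GlobalizationProbe` and method name `IsIcuMode` are hypothetical, and the check assumes that under ICU the first four bytes of `SortVersion.SortId` match `SortVersion.FullVersion`.

```csharp
using System;
using System.Globalization;

public static class GlobalizationProbe
{
    // Returns true when the runtime's collation data appears to come from ICU.
    public static bool IsIcuMode()
    {
        SortVersion sortVersion = CultureInfo.InvariantCulture.CompareInfo.Version;
        byte[] bytes = sortVersion.SortId.ToByteArray();

        // Rebuild a 32-bit value from the first four bytes of the sort ID (little-endian).
        int version = bytes[3] << 24 | bytes[2] << 16 | bytes[1] << 8 | bytes[0];

        // Assumption: under ICU this value is non-zero and equals FullVersion.
        return version != 0 && version == sortVersion.FullVersion;
    }

    public static void Main()
    {
        Console.WriteLine("ICU collation data in use: " + IsIcuMode());
    }
}
```

According to the same page, .NET 5 and later also use ICU on recent Windows versions, and an app can opt back into NLS on Windows by setting the runtime configuration option `System.Globalization.UseNls` to `true` or the environment variable `DOTNET_SYSTEM_GLOBALIZATION_USENLS` to `1`. Whether that restores the .NET Framework 4.8 ordering for the words above was not verified in this thread.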