dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

string.IndexOf bug when using Thai culture #75616

Closed dnickless closed 2 years ago

dnickless commented 2 years ago

How to reproduce:

Create a net472 Console project and paste the following code:

using System;
using System.Globalization;
using System.Threading;
public class Program
{
    public static void Main()
    {
        Thread.CurrentThread.CurrentCulture = new CultureInfo("th-th");
        Console.WriteLine("#".IndexOf("["));
    }
}

Output (as expected): -1

Switch the .csproj to net6 TFM and run again.

Output (not expected): 0

gfoidl commented 2 years ago

Most likely: Globalization APIs use ICU libraries on Windows 10 (starting with .NET 5).
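
To check whether a given process is actually running on ICU rather than NLS, a heuristic along the lines of the one in Microsoft's globalization documentation can be used. This is only a sketch: the class and method names are hypothetical, and the check relies on how the runtime encodes its collation version in `SortVersion`, which is an assumption about runtime internals rather than a guaranteed contract.

```csharp
using System;
using System.Globalization;

public static class IcuProbe
{
    // Hypothetical helper: guesses whether collation is backed by ICU.
    // Assumption: under ICU the first four bytes of SortId carry the same
    // value as FullVersion, which is not the case under NLS.
    public static bool IsIcuMode()
    {
        SortVersion sortVersion = CultureInfo.InvariantCulture.CompareInfo.Version;
        byte[] bytes = sortVersion.SortId.ToByteArray();
        int version = bytes[3] << 24 | bytes[2] << 16 | bytes[1] << 8 | bytes[0];
        return version != 0 && version == sortVersion.FullVersion;
    }

    public static void Main()
    {
        Console.WriteLine("ICU in use: " + IsIcuMode());
    }
}
```

Running this once under each TFM from the repro should show whether the two builds use different globalization back ends.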

tarekgh commented 2 years ago

@dnickless

Thai has language-specific collation behavior, which is what you are seeing here. It gives some characters, such as #, [, and -, zero sort weight, which means a linguistic comparison treats them as if they were not in the string at all. Searching for "[" then behaves like searching for an empty string, which is why you get 0. Starting with .NET 5, .NET switched to the ICU library for globalization support in order to conform more closely to the Unicode Standard. If you don't need linguistic behavior for your string search, you can call IndexOf with the StringComparison.Ordinal option. If you want to revert to the old search behavior (which we don't recommend), you can use the approach described in the doc.
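
A minimal sketch of the ordinal workaround, reusing the repro from the original report. The class name `OrdinalSearchDemo` and the helper `FindOrdinal` are hypothetical and exist only for illustration. The culture-sensitive call is left without an expected value because its result depends on the runtime's collation data.

```csharp
using System;
using System.Globalization;
using System.Threading;

public static class OrdinalSearchDemo
{
    // Hypothetical helper: compares raw UTF-16 code units, so the
    // current culture's collation rules play no part in the search.
    public static int FindOrdinal(string text, string value)
    {
        return text.IndexOf(value, StringComparison.Ordinal);
    }

    public static void Main()
    {
        Thread.CurrentThread.CurrentCulture = new CultureInfo("th-TH");

        // Culture-sensitive search: uses the current culture's collation.
        Console.WriteLine("#".IndexOf("["));

        // Ordinal search: "[" does not occur in "#".
        Console.WriteLine(FindOrdinal("#", "["));   // -1

        // Ordinal search: "[" is at index 1 in "a[b".
        Console.WriteLine(FindOrdinal("a[b", "[")); // 1
    }
}
```

When the search value is a single character, the `IndexOf(char)` overload is another option, since it also performs an ordinal search.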

I am closing the issue, but feel free to ask any questions and we'll be happy to help answer them.

dnickless commented 2 years ago

@tarekgh, thanks for the explanation which makes a lot of sense. Thanks even more for the workaround which we will need to apply since we do not own the problematic source code (https://github.com/ClosedXML/ClosedXML/blob/78150efbbd4a36d65e95ef3c793f12feb12c1a9c/ClosedXML/Excel/XLWorkbook_Load.cs#L1249).

I realize that you've answered very similar questions already here: https://github.com/dotnet/runtime/issues/43772 and here https://developercommunity.visualstudio.com/t/stringstartswith-and-stringendwith-returns-wrong-v/1218489.

...and I suspect that tons of applications will stop functioning in Thailand (and probably elsewhere, too) as we speak due to this change...

In case anyone reading this cares, here's the Thai alphabet in UTF-8 (nota bene it does indeed lack the square bracket or any other special character...): https://www.utf8-chartable.de/unicode-utf8-table.pl?start=3584&number=128&utf8=0x

tarekgh commented 2 years ago

Thanks @dnickless for the feedback.

> we do not own the problematic source code (ClosedXML/ClosedXML@78150ef/ClosedXML/Excel/XLWorkbook_Load.cs#L1249).

I have opened an issue for that library to get this fixed on their side: https://github.com/ClosedXML/ClosedXML/issues/1862. If you see similar issues elsewhere, I suggest you open issues for those cases or contact us and we can follow up.

> In case anyone reading this cares, here's the Thai alphabet in UTF-8 (nota bene it does indeed lack the square bracket or any other special character...): utf8-chartable.de/unicode-utf8-table.pl?start=3584&number=128&utf8=0x

Unicode lists characters for different languages, and a language's character list does not need to include all ASCII characters. The collation for the language, however, decides how characters from the ASCII range behave.
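
One way to see which characters a culture's collation ignores is to compare each candidate against the empty string. The sketch below assumes that a result of 0 from a linguistic comparison means the character carries no sort weight. The names `CollationProbe` and `IsIgnored` are hypothetical, and the linguistic result is printed rather than predicted because it depends on the collation data the runtime uses.

```csharp
using System;
using System.Globalization;

public static class CollationProbe
{
    // Hypothetical helper: true when the culture's linguistic comparison
    // treats the text as equal to the empty string.
    public static bool IsIgnored(CompareInfo compareInfo, string text)
    {
        return compareInfo.Compare(text, "", CompareOptions.None) == 0;
    }

    public static void Main()
    {
        CompareInfo thai = CultureInfo.GetCultureInfo("th-TH").CompareInfo;

        foreach (string candidate in new[] { "#", "[", "-", "a" })
        {
            // Linguistic check: depends on the collation rules for th-TH.
            bool ignored = IsIgnored(thai, candidate);

            // Ordinal check: any non-empty string compares greater than "".
            bool ordinalDiffers =
                thai.Compare(candidate, "", CompareOptions.Ordinal) > 0;

            Console.WriteLine(
                candidate + ": ignored=" + ignored + ", ordinalDiffers=" + ordinalDiffers);
        }
    }
}
```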