Closed dnickless closed 2 years ago
Most likely: Globalization APIs use ICU libraries on Windows 10 (starting with .NET 5).
Tagging subscribers to this area: @dotnet/area-system-globalization See info in area-owners.md if you want to be subscribed.
Author: | dnickless |
---|---|
Assignees: | - |
Labels: | `area-System.Globalization` |
Milestone: | - |
@dnickless
The Thai language has specific collation behavior, which is what you are seeing here. It gives some characters, such as `#`, `[`, `-`, etc., zero sort weight, which means they are treated as if they did not exist in the string at all. That is why you are getting `0`.
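To make the zero-weight effect concrete, here is a minimal sketch that contrasts a linguistic search with an ordinal one. The culture name `th-TH` and the values `"abc"` and `"["` are illustrative assumptions, not the input from the original report.

```csharp
using System;
using System.Globalization;

class Program
{
    static void Main()
    {
        // Assumption: "th-TH" stands in for the Thai culture of the affected machine.
        CultureInfo.CurrentCulture = new CultureInfo("th-TH");

        // Illustrative values, not taken from the original report.
        string text = "abc";
        string value = "[";

        // Linguistic search: the result depends on the collation data in use.
        // This thread reports 0 on net6 and -1 on net472 for its own input;
        // these values are assumed to behave the same way.
        Console.WriteLine(text.IndexOf(value, StringComparison.CurrentCulture));

        // Ordinal search: compares UTF-16 code units, so it prints -1 here
        // because '[' does not occur in "abc".
        Console.WriteLine(text.IndexOf(value, StringComparison.Ordinal));
    }
}
```

If a character has no collation weight, a linguistic search treats the search string as empty, and an empty string is found at index 0 of any string.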
Starting with .NET 5, .NET switched to the ICU library for globalization support in order to conform more closely to the Unicode Standard. If you don't really need linguistic behavior in your string search, you can call `IndexOf` with the `StringComparison.Ordinal` option. If you want to revert to the old search behavior (which we don't recommend), you can use the approach described in the doc.
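For readers looking for the revert option, the switch I am aware of in the .NET globalization documentation is the `System.Globalization.UseNls` runtime setting, which applies on Windows only. The fragment below is a sketch of a `runtimeconfig.json` entry, on the assumption that this is the mechanism the doc refers to:

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.Globalization.UseNls": true
    }
  }
}
```

The same setting can reportedly be supplied through the `DOTNET_SYSTEM_GLOBALIZATION_USENLS` environment variable. Either way it changes collation for the whole process, which is one reason the ordinal approach is usually the better fix.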
I am closing the issue, but feel free to ask any questions and we'll be happy to help answer them.
@tarekgh, thanks for the explanation which makes a lot of sense. Thanks even more for the workaround which we will need to apply since we do not own the problematic source code (https://github.com/ClosedXML/ClosedXML/blob/78150efbbd4a36d65e95ef3c793f12feb12c1a9c/ClosedXML/Excel/XLWorkbook_Load.cs#L1249).
I realize that you've answered very similar questions already here: https://github.com/dotnet/runtime/issues/43772 and here https://developercommunity.visualstudio.com/t/stringstartswith-and-stringendwith-returns-wrong-v/1218489.
...and I suspect that tons of applications will stop functioning in Thailand (and probably elsewhere, too) as we speak due to this change...
In case anyone reading this cares, here's the Thai alphabet in UTF-8 (nota bene it does indeed lack the square bracket or any other special character...): https://www.utf8-chartable.de/unicode-utf8-table.pl?start=3584&number=128&utf8=0x
Thanks @dnickless for the feedback.
> we do not own the problematic source code (ClosedXML/ClosedXML@78150ef/ClosedXML/Excel/XLWorkbook_Load.cs#L1249).
I have opened an issue for that library to get this fixed on their side: https://github.com/ClosedXML/ClosedXML/issues/1862. If you see similar issues in other places, I suggest you open issues for those cases, or contact us and we can follow up.
> In case anyone reading this cares, here's the Thai alphabet in UTF-8 (nota bene it does indeed lack the square bracket or any other special character...): utf8-chartable.de/unicode-utf8-table.pl?start=3584&number=128&utf8=0x
Unicode lists the characters for different languages, and a language's character list does not need to include all the ASCII characters. The collation for the language still decides how characters from the ASCII range behave.
How to reproduce:
Create a net472 Console project and paste the following code:
Output (as expected): -1
Switch the .csproj to the net6.0 TFM and run again.
Output (not expected): 0