dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Bizarre behavior of string.StartsWith #72770

Closed zvrba closed 2 years ago

zvrba commented 2 years ago

Description

Now, I'm aware of https://docs.microsoft.com/en-us/dotnet/standard/base-types/string-comparison-net-5-plus

Please see the screenshot ("Immediate window" in VS debugger) and comments below.

[Screenshot: StartsWithDebug]

Reproduction Steps

Set locale to Norwegian Bokmål (NOB). "aa".StartsWith("a") returns false, which might be explainable with the breaking behavior I linked to above. However, "aa".StartsWith("å") returns false as well.

Expected behavior

At least "aa".StartsWith("å") should then return true as "å" is "linguistically the same" as "aa". Otherwise, you tell me. The observed behavior totally breaks the expectation of a "string being a sequence of characters". It almost makes me want to replace all string types with List<char>.

Actual behavior

Please see the screenshots. Totally crazy, I spent two hours diagnosing the issue.

Regression?

No response

Known Workarounds

Explicitly use StringComparison.Ordinal. Alternatively, set the program's culture to invariant, like this: System.Globalization.CultureInfo.CurrentCulture = System.Globalization.CultureInfo.InvariantCulture;

Configuration

Windows 11, .NET 6.0.5, x64. Mixed locale: English as display language, several keyboard layouts installed (ENG and NOB).

Other information

No response

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-globalization See info in area-owners.md if you want to be subscribed.

Author: zvrba
Assignees: -
Labels: `area-System.Globalization`
Milestone: -

GalaxiaGuy commented 2 years ago

A simple repro in dotnetfiddle: https://dotnetfiddle.net/Y3jLQJ

tarekgh commented 2 years ago

@zvrba thanks for your report.

Under the Unicode collation rules for the Norwegian culture, a is not a prefix of aa. If you disagree with this behavior, you may log a ticket with ICU at https://unicode-org.atlassian.net/jira/software/c/projects/ICU/issues/.

For the å case, you are right that this should be a prefix of aa. As part of the switch to ICU in .NET, you need to supply a compare option to make this work. You can do the following:

// Requires: using System; using System.Globalization;
CultureInfo ci = CultureInfo.GetCultureInfo("nb-NO");
Console.WriteLine(ci.CompareInfo.IsPrefix("aa", "å", CompareOptions.IgnoreNonSpace));

This should make everything work fine. Let me know if you have any more questions, I can help you with them.

ghost commented 2 years ago

This issue has been marked needs-author-action and may be missing some important information.

zvrba commented 2 years ago

> Let me know if you have any more questions, I can help you with them.

Hi. Thanks for the reply. I do have a question: I want string to behave as a sequence of "characters". What should I do? As an example, what should a program running under, say, Korean culture do to process French text, without knowing that the text is "French"? Two "visually same" strings should behave "sanely" wrt ==, StartsWith and such, on a char-by-char basis. I do not care about sort order, as long as it's consistent.

Also, I'm questioning the decision that StartsWith should use collation. I do not expect sorting rules (collation) to have an effect when a method that works on partial strings is invoked.

Obviously, I'm not a Unicode expert and really do not want to become one. The program I'm writing has to process Unicode strings but the processing should be neutral wrt user's OS locale. As another example, a person running the program under German locale should be able to "sanely" search (wrt == and StartsWith and Contains, etc.) for French names entered by a French person under French locale [1]. Data is exchanged through an NVARCHAR field in the database. What to do?

[1] Now, how does a German enter French characters under German locale/culture into the search box? Copy-paste!

EDIT: Another inconsistency. Look:

"aax".StartsWith("a")
false
"aax".Contains("a")
true
"aax".IndexOf("a")
-1
"aax"[0] == 'a'
true

No matter how hard I try, I cannot make sense of this. (Yes, I know, there is an explanation. But the rather involved explanation does not match the programmer's expectations about how these methods should behave wrt each other. When IndexOf returns -1, Contains should return false as well, no? When "aax"[0] == 'a' returns true, StartsWith("a") should as well.)
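
The mismatch above comes from differing defaults: StartsWith(string) and IndexOf(string) compare using the current culture, while Contains(string) and the char indexer compare ordinally. A minimal sketch, assuming .NET Core 2.1 or later (where the Contains overload taking a StringComparison exists), of how the same three calls line up once the comparison is passed explicitly:

```csharp
using System;

string text = "aax";

// Ordinal overloads compare UTF-16 code units one by one,
// so the three answers agree regardless of the current culture.
bool starts = text.StartsWith("a", StringComparison.Ordinal);
bool contains = text.Contains("a", StringComparison.Ordinal);
int index = text.IndexOf("a", StringComparison.Ordinal);

Console.WriteLine($"{starts} {contains} {index}"); // True True 0
```

With the comparison spelled out, the result no longer depends on whether the process runs under nb-NO, en-US, or any other culture.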

GalaxiaGuy commented 2 years ago

To treat the string like an array of chars, use overloads with StringComparison.Ordinal:

"aa".StartsWith("a", StringComparison.Ordinal)

To treat strings in a way that is reasonably logical for English, use StringComparison.InvariantCulture.

"aa".StartsWith("a", StringComparison.InvariantCulture)
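
One caveat for the "visually same" part of the question: ordinal comparison looks only at UTF-16 code units, so a precomposed é and an e followed by a combining accent are different strings to it. A hedged sketch of one possible approach, normalizing both sides to the same Unicode form before comparing ordinally (the sample strings are made up for illustration, and whether normalization suits a given data set is an assumption to check):

```csharp
using System;
using System.Text;

// Two spellings of the same visible text:
// U+00E9 (precomposed é) versus "e" + U+0301 (combining acute accent).
string composed = "caf\u00E9";
string decomposed = "cafe\u0301";

// Without normalization the code units differ, so ordinal equality fails.
bool rawEqual = string.Equals(composed, decomposed, StringComparison.Ordinal);

// Normalize both sides to form C, then compare ordinally.
string a = composed.Normalize(NormalizationForm.FormC);
string b = decomposed.Normalize(NormalizationForm.FormC);
bool normalizedEqual = string.Equals(a, b, StringComparison.Ordinal);
bool prefix = b.StartsWith("caf\u00E9", StringComparison.Ordinal);

Console.WriteLine($"{rawEqual} {normalizedEqual} {prefix}"); // False True True
```

Normalizing once at the point where text enters the program (for example before writing to the database) would keep later ordinal comparisons simple.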

Joe4evr commented 2 years ago

> When IndexOf returns -1, Contains should return false as well, no? When "aax"[0] == 'a' returns true, StartsWith("a") should as well.)

Sadly, that's not really the case. Here's a bit explained by Jon Skeet about how IndexOf can be problematic (and much of that extends to all string methods).

Really, the best advice when it comes to .NET string manipulation is and has always been: Never rely on a method's default behavior; Always supply a comparison type at your callsite (even if the supplied comparison matches that method's default) just so that you're clear and consistent and not getting surprising behavior like this. You can enable the code analysis rules CA1305 and CA1304 to help you catch those callsites and improve your code quality.
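
The same "always say which comparison you mean" advice extends to sorting and ordering, where string.Compare(string, string) is also culture-sensitive by default. A small sketch with made-up sample names, using ordinal comparers so the order is consistent on every machine (it is code-unit order, not a linguistically pleasing one):

```csharp
using System;
using System.Collections.Generic;

string[] names = { "Zoë", "aa", "Åse", "ab" };

// Sort by UTF-16 code unit value: same result under any current culture.
Array.Sort(names, StringComparer.Ordinal);

// HashSet<string> already defaults to ordinal equality;
// passing the comparer anyway documents the intent at the call site.
var seen = new HashSet<string>(names, StringComparer.Ordinal);

// Explicit comparison type instead of the culture-sensitive default.
int order = string.Compare("aa", "ab", StringComparison.Ordinal);

Console.WriteLine(string.Join(", ", names)); // Zoë, aa, ab, Åse
Console.WriteLine(seen.Contains("aa"));      // True
Console.WriteLine(order < 0);                // True
```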

skyoxZ commented 2 years ago

I would suggest that string.StartsWith(string) (and some other methods of string) use StringComparison.Ordinal by default. These are the most basic APIs, but their real behavior is now very complicated, especially for a beginner.

tarekgh commented 2 years ago

@skyoxZ please have a look at https://github.com/dotnet/designs/pull/207 for more info.