Closed: zvrba closed this issue 2 years ago
Tagging subscribers to this area: @dotnet/area-system-globalization See info in area-owners.md if you want to be subscribed.
| Author: | zvrba |
|---|---|
| Assignees: | - |
| Labels: | `area-System.Globalization` |
| Milestone: | - |
A simple repro in dotnetfiddle: https://dotnetfiddle.net/Y3jLQJ
@zvrba thanks for your report.
It is the Unicode collation behavior for the Norwegian culture that `a` is not a prefix of `aa`. If you disagree with this behavior, you may log a ticket with ICU at https://unicode-org.atlassian.net/jira/software/c/projects/ICU/issues/.
For the `å` case, you are right that it should be a prefix of `aa`. Part of the change when we switched to using ICU in .NET is that you need to supply the compare option to make this work. You can do the following:
CultureInfo ci = CultureInfo.GetCultureInfo("nb-NO");
Console.WriteLine(ci.CompareInfo.IsPrefix("aa", "å", CompareOptions.IgnoreNonSpace));
This should make everything work fine. Let me know if you have any more questions, I can help you with them.
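As a minimal sketch of the suggestion above, the culture-aware prefix test can be wrapped so that the culture and the compare options are always spelled out at the call site. The class and method names (`PrefixCheck`, `IsPrefix`) are hypothetical and chosen only for illustration; `CultureInfo.GetCultureInfo` and `CompareInfo.IsPrefix(string, string, CompareOptions)` are the real APIs used in the reply. What a linguistic comparison returns depends on the collation data (ICU or NLS) on the machine, so no output is claimed for that case here.

```csharp
using System;
using System.Globalization;

public static class PrefixCheck
{
    // Hypothetical helper: prefix test under a named culture with explicit options.
    public static bool IsPrefix(string cultureName, string source, string prefix, CompareOptions options)
    {
        CompareInfo compareInfo = CultureInfo.GetCultureInfo(cultureName).CompareInfo;
        return compareInfo.IsPrefix(source, prefix, options);
    }
}
```

A call such as `PrefixCheck.IsPrefix("nb-NO", "aa", "å", CompareOptions.IgnoreNonSpace)` mirrors the snippet above, while passing `CompareOptions.Ordinal` bypasses the culture's collation rules entirely.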
This issue has been marked `needs-author-action` and may be missing some important information.
Hi. Thanks for the reply. I do have a question: I want `string` to behave as a sequence of "characters". What should I do? As an example, what should a program running under, say, Korean culture do to process French text, without knowing that the text is "French"? Two "visually same" strings should behave "sanely" wrt `==`, `StartsWith` and such, on a char-by-char basis. I do not care about sort order, as long as it's consistent.
Also, I'm questioning the decision that `StartsWith` should use collation. I do not expect sorting rules (collation) to have an effect when a method that works on partial strings is invoked.
Obviously, I'm not a Unicode expert and really do not want to become one. The program I'm writing has to process Unicode strings, but the processing should be neutral wrt the user's OS locale. As another example, a person running the program under German locale should be able to "sanely" search (wrt `==`, `StartsWith`, `Contains`, etc.) for French names entered by a French person under French locale [1]. Data is exchanged through an `NVARCHAR` field in the database. What to do?
[1] Now, how does a German enter French characters under German locale/culture into the search box? Copy-paste!
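One possible way to approach the "visually same strings" requirement, offered as a sketch and not as something proposed in this thread: bring both strings to the same Unicode normalization form and then compare them ordinally. The helper name `EqualsNormalized` is hypothetical; `string.Normalize` and `NormalizationForm.FormC` are real .NET APIs. This only handles canonically equivalent sequences (for example, a precomposed "é" versus "e" followed by a combining accent); it does not make the Norwegian "aa" equal to "å".

```csharp
using System;
using System.Text;

public static class NormalizedCompare
{
    // Hypothetical helper: culture-independent equality after canonical composition (NFC).
    public static bool EqualsNormalized(string left, string right) =>
        string.Equals(
            left.Normalize(NormalizationForm.FormC),
            right.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);
}
```

With this helper, `NormalizedCompare.EqualsNormalized("\u00E9", "e\u0301")` treats the two spellings of "é" as equal, whereas a plain `==` on the same two strings compares the underlying chars and reports them as different.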
EDIT: Another inconsistency. Look
"aax".StartsWith("a")
false
"aax".Contains("a")
true
"aax".IndexOf("a")
-1
"aax"[0] == 'a'
true
No matter how hard I try, I cannot make sense of this. (Yes, I know, there is an explanation. But the rather involved explanation does not match the programmer's expectations about how these methods should behave wrt each other. When `IndexOf` returns -1, `Contains` should return false as well, no? When `"aax"[0] == 'a'` returns true, `StartsWith("a")` should as well.)
To treat the string like an array of chars, use overloads with `StringComparison.Ordinal`:
"aa".StartsWith("a", StringComparison.Ordinal)
To treat strings in a way that is reasonably logical for English, use `StringComparison.InvariantCulture`:
"aa".StartsWith("a", StringComparison.InvariantCulture)
> When `IndexOf` returns -1, `Contains` should return false as well, no? When `"aax"[0] == 'a'` returns true, `StartsWith("a")` should as well.)
Sadly, that's not really the case. Here's a bit explained by Jon Skeet about how `IndexOf` can be problematic (and much of that extends to all string methods).
Really, the best advice when it comes to .NET string manipulation is and has always been: Never rely on a method's default behavior; Always supply a comparison type at your callsite (even if the supplied comparison matches that method's default) just so that you're clear and consistent and not getting surprising behavior like this. You can enable the code analysis rules CA1305 and CA1304 to help you catch those callsites and improve your code quality.
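The advice above can be sketched as a few thin wrappers that always pass `StringComparison.Ordinal`, so the three operations from the earlier "aax" example agree with each other on any machine. The class and method names are hypothetical; the `StartsWith(string, StringComparison)` and `IndexOf(string, StringComparison)` overloads are the real ones. For background, the parameterless `Contains(string)` is documented as ordinal while `StartsWith(string)` and `IndexOf(string)` use the current culture, which is why the defaults can disagree.

```csharp
using System;

public static class OrdinalStringOps
{
    // Hypothetical helpers: each call names its comparison explicitly.
    public static bool StartsWithOrdinal(string text, string prefix) =>
        text.StartsWith(prefix, StringComparison.Ordinal);

    public static int IndexOfOrdinal(string text, string value) =>
        text.IndexOf(value, StringComparison.Ordinal);

    // Built on IndexOf so that the two can never disagree.
    public static bool ContainsOrdinal(string text, string value) =>
        IndexOfOrdinal(text, value) >= 0;
}
```

Because ordinal comparison works on the raw chars, `OrdinalStringOps.StartsWithOrdinal("aax", "a")` is true and `OrdinalStringOps.IndexOfOrdinal("aax", "a")` is 0 regardless of the current culture.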
I would suggest that `string.StartsWith(string)` (and some other methods of `string`) use `StringComparison.Ordinal` by default. These are the most basic APIs, but their real behavior is now very complicated, especially for a beginner.
@skyoxZ please have a look at https://github.com/dotnet/designs/pull/207 for more info.
Description
Now, I'm aware of https://docs.microsoft.com/en-us/dotnet/standard/base-types/string-comparison-net-5-plus
Please see the screenshot ("Immediate window" in VS debugger) and comments below.
Reproduction Steps
Set locale to Norwegian Bokmål (NOB).
`"aa".StartsWith("a")` returns false, which might be explainable with the breaking behavior I linked to above. However, `"aa".StartsWith("å")` returns false as well.
Expected behavior
At least `"aa".StartsWith("å")` should then return true, as "å" is "linguistically the same" as "aa". Otherwise, you tell me. The observed behavior totally breaks the expectation of a "string being a sequence of characters". It almost makes me want to replace all `string` types with `List<char>`.
Actual behavior
Please see the screenshots. Totally crazy, I spent two hours diagnosing the issue.
Regression?
No response
Known Workarounds
Explicitly use `StringComparison.Ordinal`. Alternatively, set the program's culture to invariant, like this:
System.Globalization.CultureInfo.CurrentCulture = System.Globalization.CultureInfo.InvariantCulture;
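A minimal sketch of the second workaround, assuming it is called once at program startup. The class and method names are hypothetical. `CultureInfo.CurrentCulture` applies to the current thread, so the sketch also sets the real `CultureInfo.DefaultThreadCurrentCulture` property so that threads created later start with the invariant culture too; whether that is wanted depends on the application.

```csharp
using System;
using System.Globalization;

public static class CultureSetup
{
    // Hypothetical helper: switch culture-sensitive defaults to the invariant culture.
    public static void UseInvariantCulture()
    {
        CultureInfo.CurrentCulture = CultureInfo.InvariantCulture;
        CultureInfo.DefaultThreadCurrentCulture = CultureInfo.InvariantCulture;
    }
}
```

Note that this also changes culture-sensitive formatting and parsing (dates, numbers), so passing an explicit `StringComparison` at each call site is the narrower fix.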
Configuration
Windows 11, .NET 6.0.5, x64. Mixed locale: English as display language, several keyboard layouts installed (ENG and NOB).
Other information
No response