Closed kabua closed 2 years ago
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
The earlier issue is about contents comparison, and as mentioned, will not be changed for span.
// How is a developer to know that s2 is of type Span and b is not without tracking down their type declarations? // The point is, they shouldn't have too.
They always have to. There are any number of cases where equality comparisons will unexpectedly fail; for instance, there are certain ranges at which int
won't convert to float
as you expect, and can cause errors (beyond some of the more usual float-specific shenanigans).
Note that reference vs value equality is something you have to examine per type anyways.
The two equalities are syntactically the same, and thus they should be semantically the same as well.
Sure, for this example, but that isn't true in the general case. Value equality for strings is complicated, especially outside of the standard ASCII range, and you are better served by being explicit about how you compare the contents. Even ignoring the common issue of case-insensitive comparisons, there are things like combining characters and odd whitespace that should sometimes be skipped. If you want a strict per-char/byte comparison, explicitly use that, but the results may not be what you expect...
Tagging subscribers to this area: @dotnet/area-system-memory See info in area-owners.md if you want to be subscribed.
Author: | kabua |
---|---|
Assignees: | - |
Labels: | `area-System.Memory`, `untriaged` |
Milestone: | - |
Spans can give you access to pieces of arrays, strings and unmanaged memory, but they are not arrays, strings and unmanaged memory.
var s2 = b2.AsSpan()[..^1]; var t2 = b == s2; // false.
// How is a developer to know that s2 is of type Span and b is not without tracking down their type declarations?
Same here:
string b = "bob";
string b2 = "boby";
var s = b2[..^1];
var t = b == s; // true
object s2 = b2[..^1];
var t2 = b == s2; // false.
How is a developer to know that s2 is of type object and b is not without tracking down their type declarations?
I think the crux of the problem comes down to this, starting with .NET version 1.0, .NET has defined System.String
to be defined as a primitive value type and not as an array of bytes, even though technically it is, like everything else which is stored in computer memory.
As most developers know:
char
takes up 2 bytes of memory.short
takes up 2 bytes of memory.int
takes up 4 bytes of memory.long
takes up 8 bytes of memory.float
take up 4 bytes of memory.double
takes up 8 bytes of memory.string
takes up a variable length of bytes.In each of these cases, we view these objects
not as an array of bytes
but as an atomic concept: int
, long
, string
. As such, we do not expect the equality operator (==
) to compare "their pointer address and its length" but instead as a single conceptual value and very rarely as a sequence of bytes or characters, and a string
is no exception.
For example:
long a = 50;
long b = 50;
var t = a == b; // True
No one would ever expect to have to write a statement like this var t = a.SequenceEqual(b);
just because a long
is a sequence of 8 bytes. What matters is its value, not its pointer location or its memory footprint.
Or this example:
byte a = 50; // an array of 1 byte
double b = 50; // an array of 8 bytes.
var t = a == b; // true
According to the current Span logic, this should be false since:
Thank goodness for value semantics and not memory semantics. Otherwise, all of our code would be littered with constructs like var t = a.SequenceEqual(b)
; ops that does not work either - now what?
This same logic applies to a sub-string; we are interested in its value
and not its memory semantics. As you have been reading this post, you most likely have been reading a sequence of words (a string of characters) and not a s-e-q-u-e-n-c-e o-f i-n-d-i-v-i-d-u-a-l c-h-a-r-a-c-t-e-r-s (which is: "a sequence of individual characters"). Otherwise, the semantic meaning would be lost (as you just experinced) for the same reason why a child can know their ABCs but still not be able to read.
As @Clockwork-Muse so graciously pointed out:
Value equality for strings is complicated, especially outside of the standard ASCII range, and you are better served by being explicit about how you compare the contents. Even ignoring the common issue of case-insensitive comparisons, there are things like combining characters and odd whitespace that should sometimes be skipped. If you want a strict per-char/byte comparison, explicitly use that, but the results may not be what you expect...
Let me repeat this:
If you want a strict per-char/byte comparison, explicitly use that
Exactly! We do not want a strict per-char/byte comparison when comparing a string
to a ReadOnlySpan<char>
. Otherwise, we would explicitly spell it out.
Yes, strings (a variable-length array of chars) are complicated, and yet .NET still deals with them 'correctly.' We are not looking for a "general case" here, we are explicitly talking about strings, nothing more.
I may be wrong, but I have a hard time believing that @Anders Hejlsbery, the senior language architect of C# and of TypeScript, would or has agreed with this current design choice.
I also found it very disheartening to see all the great work that has gone into recent versions of C#/.NET to support more value-based types, like Record
etc., only to blow it with ReadOnlySpan<char>
. 😞
Therefore, can we either:
string
to ReadOnlySpan<char>
operator ==
like public static bool operator ==(ReadOnlySpan<char> left, ReadOnlySpan<char> right)
which treats the span like a sub-string - a value type and not a reference type.
However, some developers may want the current functionality when processing a real array of char
s. Therefore, perhaps a better solution would be to:
2.1. Have the string implicitly converted to something like StringSpan
(which would be read-only, of course) or
2.2. Create two new sets of operators which take a (string left, ReadOnlySpan<char> right)
and (ReadOnlySpan<char> left, string right)
which DOES use value comparison and which follows the same rules as a string - thus no difference in usage or results.I may be wrong, but I have a hard time believing that anders Hejlsbery, the senior language architect of C# and of TypeScript, would or has agreed with this current design choice.
While Anders certainly helped start C#, it is not purely his language and he hasn't been the language architect for C# for many years now. Objectively speaking, the language has had much more overall evolution under the leadership of Mads Torgersen (lead designer for C#) and Jared Parsons (dev lead for C#) than anyone else.
Likewise, .NET extends far beyond the scope of just C# and the CLR does not standard for "C# Language Runtime". .NET exists as a "Common Language Runtime" for many languages including C#, VB, F#, C++/CLI, IronPython, COBOL.NET, etc...
These decisions are carefully considered within the scope of the entire .NET ecosystem. What works well, what doesn't work, patterns that developers should or shouldn't follow, etc. In the case of string, even using ==
is something you normally don't want. You should instead explicitly be using Equals(string left, string right, StringComparison comparisonType)
so you can correctly account for Ordinal vs CurrentCulture vs InvariantCulture
and CaseSensitive vs IgnoreCase
. The same applies to Compare
, CompareTo
, and most other string
APIs.
Therefore, can we either:
Neither 1 nor 2 are viable options. Span has shipped the way it has for several releases now and these would be a massively breaking change. An option that you didn't list is to update your code to call Equals
and to correctly pass in StringComparison
. We are looking at providing an analyzer (https://github.com/dotnet/runtime/issues/54794) to help flag potential misuse of an API where users potentially desired a string comparison instead.
The decision for ==
to behave the way it does was with taking into account that:
ROSpan<char>
is a string. It could also be a buffer over a char[]
ROSpan<T>
encompasses far more than just strings and character buffers. Having one behavior for char
and another for say byte
or float
would be massively confusing.string
is to explicitly pass in StringComparison
to ensure that the behavior you want is exactly what you get. We already have an unfortunate issue where not all APIs on string
behave the same and where sometimes you get ordinal
by default and other times you get culture-aware
, etcYes, strings (a variable-length array of chars) are complicated, and yet .NET still deals with them 'correctly.' We are not looking for a "general case" here, we are explicitly talking about strings, nothing more.
To be more explicit about @tannergooding 's comment - in the general case for strings, ==
will produce unwanted or unexpected results. It only produces "correct"results for certain simple cases (mostly just English). If you're in a server context you should normally avoid it, in preference for additionally being explicit about the culture being used, in addition to the comparison needed.
In our company all string comparisons with ==
or !=
operators are forbidden, commits with such code patterns are blocked. Because as already stated above, such comparisons are almost always wrong and just by accident correct. string.Compare
is the only correct way.
Because of this, our check-in policy has uncovered many lazy-used-string-instead-of-enum patterns. And other invalid localization usages. In general, the key focus should be on clean code patterns, not asking to "fix the compiler" when there is nothing wrong. If the type of a variable is not obvious, then var
should not be used, and on top the name of a variable should be changed to make it obvious. var Count = "123";
I don't expect the compiler in such cases to "warn" me. Or prevent me from doing this: Count += 1;
And I especially don't want any "fixes" to make Count
then be the string 124
instead of 1231
.
For me, the Span==
operator does exactly what I expect it to do: check if both spans are identical, and not if they are just similar. Identical means, if I change one of them, the other changes too.
(Note: what do we do when we need a simple string check? we use pattern matching with is/is not. That defines a clean exception, visually distinct, and problems with Span
Yes, strings (a variable-length array of chars) are complicated, and yet .NET still deals with them 'correctly.' We are not looking for a "general case" here, we are explicitly talking about strings, nothing more.
To be more explicit about @tannergooding 's comment - in the general case for strings,
==
will produce unwanted or unexpected results. It only produces "correct"results for certain simple cases (mostly just English). If you're in a server context you should normally avoid it, in preference for additionally being explicit about the culture being used, in addition to the comparison needed.
can add this code here so it doesn't happen again?
public static implicit operator ReadOnlySpan<T>(string str) => throw new Exception();
public static explicit operator ReadOnlySpan<T>(string str) => return ....
op_Equality()
is how the operator is known to IL/the runtime. I'm not sure what your point is?
Note that there are other characters in unicode that are considered digits, so under certain situations this isn't sufficient either...
Note that there are other characters in unicode that are considered digits, so under certain situations this isn't sufficient either...
Well, then it is necessary to compare by bytes :)
Well, then it is necessary to compare by bytes :)
No, what I'm getting at is that you might want to do a culture-aware comparison that considers culture-specific numeric digits, instead of only comparing for a specific single character. Or odder things, like accepting both the combining character and single-character versions of some glyphs.
Closing in favor of https://github.com/dotnet/runtime/issues/54794.
I ran into this problem which has cost my company a day's worth of work.
Digging deeper, I reviewed the implementation to
==
and was very shocked and horrified when I read thisWhat good is a span if its equality operator isn't transparent? Especially with strings, which I suspect will be the number 1 use-case for spans. In the case of strings, the current implementation does not pass the Duck Test but should.
Until this is fixed, we (all .NET loving developers around the world) will have to know explicitly if we are dealing with a span or not. As in, "To span or not to span, that is the question". Frankly, the answer should be, who cares? The two equalities are syntactically the same, and thus they should be semantically the same as well.
Whereas, comparing two arrays of ints, for example, doesn't currently have the same syntactical or semantic meaning. Thus as developers, we wouldn't expect element-wise but pointer comparison in this case.
This issue #10151, first addressed this problem back in 2018, which should have been fixed way back then.
@Microsoft: How many developers, and their hidden bugs, need to fall into this trap before this grievous issue is fixed? Or will this design decision go down in history as the next Billion Dollar Software Language Design Mistake?
Please fix this issue. Add a new project switch or some other backward-compatible flag, we don't care; we just want it fixed. PLEASE!
Call to all developers, if you agree or disagree, either way please vote on this issue.