dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.16k stars 4.71k forks source link

String == ReadOnlySpan<char> is FALSE; Really Bad Design and Needs to be FIX #76735

Closed kabua closed 2 years ago

kabua commented 2 years ago

I ran into this problem which has cost my company a day's worth of work.

string b = "bob";
string b2 = "boby";

var s = b2[..^1];
var t = b == s;     // true

var s2 = b2.AsSpan()[..^1];
var t2 = b == s2; // false. 

// How is a developer to know that s2 is of type Span and b is not without tracking down their type declarations? 
// The point is, they shouldn't have too.

Digging deeper, I reviewed the implementation to == and was very shocked and horrified when I read this

        /// <summary>
        /// Returns true if left and right point at the same memory and have the same length.  Note that
        /// this does *not* check to see if the *contents* are equal.
        /// </summary>
        public static bool operator ==(ReadOnlySpan<T> left, ReadOnlySpan<T> right) =>
            left._length == right._length &&
            Unsafe.AreSame<T>(ref left._pointer.Value, ref right._pointer.Value);

What good is a span if its equality operator isn't transparent? Especially with strings, which I suspect will be the number 1 use-case for spans. In the case of strings, the current implementation does not pass the Duck Test but should.

Until this is fixed, we (all .NET loving developers around the world) will have to know explicitly if we are dealing with a span or not. As in, "To span or not to span, that is the question". Frankly, the answer should be, who cares? The two equalities are syntactically the same, and thus they should be semantically the same as well.

Whereas, comparing two arrays of ints, for example, doesn't currently have the same syntactical or semantic meaning. Thus as developers, we wouldn't expect element-wise but pointer comparison in this case.

This issue #10151, first addressed this problem back in 2018, which should have been fixed way back then.

@Microsoft: How many developers, and their hidden bugs, need to fall into this trap before this grievous issue is fixed? Or will this design decision go down in history as the next Billion Dollar Software Language Design Mistake?

Please fix this issue. Add a new project switch or some other backward-compatible flag, we don't care; we just want it fixed. PLEASE!

Call to all developers, if you agree or disagree, either way please vote on this issue.

dotnet-issue-labeler[bot] commented 2 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

Clockwork-Muse commented 2 years ago

The earlier issue is about contents comparison, and as mentioned, will not be changed for span.

// How is a developer to know that s2 is of type Span and b is not without tracking down their type declarations? // The point is, they shouldn't have too.

They always have to. There are any number of cases where equality comparisons will unexpectedly fail; for instance, there are certain ranges at which int won't convert to float as you expect, and can cause errors (beyond some of the more usual float-specific shenanigans). Note that reference vs value equality is something you have to examine per type anyways.

The two equalities are syntactically the same, and thus they should be semantically the same as well.

Sure, for this example, but that isn't true in the general case. Value equality for strings is complicated, especially outside of the standard ASCII range, and you are better served by being explicit about how you compare the contents. Even ignoring the common issue of case-insensitive comparisons, there are things like combining characters and odd whitespace that should sometimes be skipped. If you want a strict per-char/byte comparison, explicitly use that, but the results may not be what you expect...

hrrrrustic commented 2 years ago

54794

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-memory See info in area-owners.md if you want to be subscribed.

Issue Details
I ran into this problem which has cost my company a day's worth of work. ``` string b = "bob"; string b2 = "boby"; var s = b2[..^1]; var t = b == s; // true var s2 = b2.AsSpan()[..^1]; var t2 = b == s2; // false. // How is a developer to know that s2 is of type Span and b is not without tracking down their type declarations? // The point is, they shouldn't have too. ``` Digging deeper, I reviewed the implementation to `==` and was very shocked and **horrified** when I read this ``` /// /// Returns true if left and right point at the same memory and have the same length. Note that /// this does *not* check to see if the *contents* are equal. /// public static bool operator ==(ReadOnlySpan left, ReadOnlySpan right) => left._length == right._length && Unsafe.AreSame(ref left._pointer.Value, ref right._pointer.Value); ``` What good is a span if its equality operator isn't transparent? Especially with strings, which I suspect will be the number 1 use-case for spans. In the case of strings, the current implementation does not pass the [Duck Test](https://en.wikipedia.org/wiki/Duck_test) but should. Until this is fixed, we (all .NET loving developers around the world) will have to know explicitly if we are dealing with a span or not. As in, "To span or not to span, that is the question". Frankly, the answer should be, who cares? The two equalities are _syntactically_ the same, and thus they should be _semantically_ the same as well. Whereas, comparing two arrays of ints, for example, doesn't currently have the same _syntactical_ or _semantic_ meaning. Thus as developers, we wouldn't expect element-wise comparison but pointer comparison in this case. This issue #10151, first addressed this problem with string, which should have been fixed way back then. @Microsoft: How many developers, and their hidden bugs, need to fall into this trap before this grievous issue is fixed? Or will this design decision go down in history as the next [Billion Dollar Software Language Design Mistake](https://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare/)? Please fix this issue. Add a new project switch or some other backward-compatible flag, we don't care; we just want it fixed. PLEASE! Call to all developers, if you agree or disagree, either way please vote on this issue.
Author: kabua
Assignees: -
Labels: `area-System.Memory`, `untriaged`
Milestone: -
omariom commented 2 years ago

Spans can give you access to pieces of arrays, strings and unmanaged memory, but they are not arrays, strings and unmanaged memory.

var s2 = b2.AsSpan()[..^1]; var t2 = b == s2; // false.

// How is a developer to know that s2 is of type Span and b is not without tracking down their type declarations?

Same here:

string b = "bob";
string b2 = "boby";

var s = b2[..^1];
var t = b == s;     // true

object s2 = b2[..^1];
var t2 = b == s2; // false.

How is a developer to know that s2 is of type object and b is not without tracking down their type declarations?

kabua commented 2 years ago

I think the crux of the problem comes down to this, starting with .NET version 1.0, .NET has defined System.String to be defined as a primitive value type and not as an array of bytes, even though technically it is, like everything else which is stored in computer memory.

As most developers know:

In each of these cases, we view these objects not as an array of bytes but as an atomic concept: int, long, string. As such, we do not expect the equality operator (==) to compare "their pointer address and its length" but instead as a single conceptual value and very rarely as a sequence of bytes or characters, and a string is no exception.

For example:

long a = 50;
long b = 50;

var t = a == b;  // True

No one would ever expect to have to write a statement like this var t = a.SequenceEqual(b); just because a long is a sequence of 8 bytes. What matters is its value, not its pointer location or its memory footprint.

Or this example:

byte a = 50;    // an array of 1 byte
double b = 50;  // an array of 8 bytes.

var t = a == b; // true

According to the current Span logic, this should be false since:

  1. The memory addresses are not the same and
  2. Their lengths are definitely not equal

Thank goodness for value semantics and not memory semantics. Otherwise, all of our code would be littered with constructs like var t = a.SequenceEqual(b); ops that does not work either - now what?

This same logic applies to a sub-string; we are interested in its value and not its memory semantics. As you have been reading this post, you most likely have been reading a sequence of words (a string of characters) and not a s-e-q-u-e-n-c-e o-f i-n-d-i-v-i-d-u-a-l c-h-a-r-a-c-t-e-r-s (which is: "a sequence of individual characters"). Otherwise, the semantic meaning would be lost (as you just experinced) for the same reason why a child can know their ABCs but still not be able to read.

As @Clockwork-Muse so graciously pointed out:

Value equality for strings is complicated, especially outside of the standard ASCII range, and you are better served by being explicit about how you compare the contents. Even ignoring the common issue of case-insensitive comparisons, there are things like combining characters and odd whitespace that should sometimes be skipped. If you want a strict per-char/byte comparison, explicitly use that, but the results may not be what you expect...

Let me repeat this:

If you want a strict per-char/byte comparison, explicitly use that

Exactly! We do not want a strict per-char/byte comparison when comparing a string to a ReadOnlySpan<char>. Otherwise, we would explicitly spell it out.

Yes, strings (a variable-length array of chars) are complicated, and yet .NET still deals with them 'correctly.' We are not looking for a "general case" here, we are explicitly talking about strings, nothing more.

I may be wrong, but I have a hard time believing that @Anders Hejlsbery, the senior language architect of C# and of TypeScript, would or has agreed with this current design choice.

I also found it very disheartening to see all the great work that has gone into recent versions of C#/.NET to support more value-based types, like Record etc., only to blow it with ReadOnlySpan<char>. 😞

Therefore, can we either:

  1. Remove the implicit conversion of string to ReadOnlySpan<char>
  2. Create a new operator == like public static bool operator ==(ReadOnlySpan<char> left, ReadOnlySpan<char> right) which treats the span like a sub-string - a value type and not a reference type. However, some developers may want the current functionality when processing a real array of chars. Therefore, perhaps a better solution would be to: 2.1. Have the string implicitly converted to something like StringSpan (which would be read-only, of course) or 2.2. Create two new sets of operators which take a (string left, ReadOnlySpan<char> right) and (ReadOnlySpan<char> left, string right) which DOES use value comparison and which follows the same rules as a string - thus no difference in usage or results.
  3. Or, at the very least, give us a Roselyn Analyser, as #54794 has suggested.
tannergooding commented 2 years ago

I may be wrong, but I have a hard time believing that anders Hejlsbery, the senior language architect of C# and of TypeScript, would or has agreed with this current design choice.

While Anders certainly helped start C#, it is not purely his language and he hasn't been the language architect for C# for many years now. Objectively speaking, the language has had much more overall evolution under the leadership of Mads Torgersen (lead designer for C#) and Jared Parsons (dev lead for C#) than anyone else.

Likewise, .NET extends far beyond the scope of just C# and the CLR does not standard for "C# Language Runtime". .NET exists as a "Common Language Runtime" for many languages including C#, VB, F#, C++/CLI, IronPython, COBOL.NET, etc...

These decisions are carefully considered within the scope of the entire .NET ecosystem. What works well, what doesn't work, patterns that developers should or shouldn't follow, etc. In the case of string, even using == is something you normally don't want. You should instead explicitly be using Equals(string left, string right, StringComparison comparisonType) so you can correctly account for Ordinal vs CurrentCulture vs InvariantCulture and CaseSensitive vs IgnoreCase. The same applies to Compare, CompareTo, and most other string APIs.

Therefore, can we either:

Neither 1 nor 2 are viable options. Span has shipped the way it has for several releases now and these would be a massively breaking change. An option that you didn't list is to update your code to call Equals and to correctly pass in StringComparison. We are looking at providing an analyzer (https://github.com/dotnet/runtime/issues/54794) to help flag potential misuse of an API where users potentially desired a string comparison instead.

The decision for == to behave the way it does was with taking into account that:

  1. Not every ROSpan<char> is a string. It could also be a buffer over a char[]
  2. ROSpan<T> encompasses far more than just strings and character buffers. Having one behavior for char and another for say byte or float would be massively confusing.
  3. The typical and recommended use-case for string is to explicitly pass in StringComparison to ensure that the behavior you want is exactly what you get. We already have an unfortunate issue where not all APIs on string behave the same and where sometimes you get ordinal by default and other times you get culture-aware, etc
  4. There were also many more considerations taken into account and it would take far too long to list them all.
Clockwork-Muse commented 2 years ago

Yes, strings (a variable-length array of chars) are complicated, and yet .NET still deals with them 'correctly.' We are not looking for a "general case" here, we are explicitly talking about strings, nothing more.

To be more explicit about @tannergooding 's comment - in the general case for strings, == will produce unwanted or unexpected results. It only produces "correct"results for certain simple cases (mostly just English). If you're in a server context you should normally avoid it, in preference for additionally being explicit about the culture being used, in addition to the comparison needed.

hopperpl commented 2 years ago

In our company all string comparisons with == or != operators are forbidden, commits with such code patterns are blocked. Because as already stated above, such comparisons are almost always wrong and just by accident correct. string.Compare is the only correct way.

Because of this, our check-in policy has uncovered many lazy-used-string-instead-of-enum patterns. And other invalid localization usages. In general, the key focus should be on clean code patterns, not asking to "fix the compiler" when there is nothing wrong. If the type of a variable is not obvious, then var should not be used, and on top the name of a variable should be changed to make it obvious. var Count = "123"; I don't expect the compiler in such cases to "warn" me. Or prevent me from doing this: Count += 1; And I especially don't want any "fixes" to make Count then be the string 124 instead of 1231.

For me, the Span == operator does exactly what I expect it to do: check if both spans are identical, and not if they are just similar. Identical means, if I change one of them, the other changes too.

(Note: what do we do when we need a simple string check? we use pattern matching with is/is not. That defines a clean exception, visually distinct, and problems with Span cannot occur.)

Mr0N commented 2 years ago

Yes, strings (a variable-length array of chars) are complicated, and yet .NET still deals with them 'correctly.' We are not looking for a "general case" here, we are explicitly talking about strings, nothing more.

To be more explicit about @tannergooding 's comment - in the general case for strings, == will produce unwanted or unexpected results. It only produces "correct"results for certain simple cases (mostly just English). If you're in a server context you should normally avoid it, in preference for additionally being explicit about the culture being used, in addition to the comparison needed.

image

Mr0N commented 2 years ago
  can add this code here so it doesn't happen again?

https://github.com/dotnet/corert/blob/master/src/System.Private.CoreLib/shared/System/ReadOnlySpan.cs

  public static implicit operator ReadOnlySpan<T>(string str) =>  throw new Exception();
        public static explicit operator ReadOnlySpan<T>(string str) =>  return ....
Clockwork-Muse commented 2 years ago

image

op_Equality() is how the operator is known to IL/the runtime. I'm not sure what your point is?

Note that there are other characters in unicode that are considered digits, so under certain situations this isn't sufficient either...

Mr0N commented 2 years ago

Note that there are other characters in unicode that are considered digits, so under certain situations this isn't sufficient either...

Well, then it is necessary to compare by bytes :)

Clockwork-Muse commented 2 years ago

Well, then it is necessary to compare by bytes :)

No, what I'm getting at is that you might want to do a culture-aware comparison that considers culture-specific numeric digits, instead of only comparing for a specific single character. Or odder things, like accepting both the combining character and single-character versions of some glyphs.

jozkee commented 2 years ago

Closing in favor of https://github.com/dotnet/runtime/issues/54794.