JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.4k stars 5.45k forks source link

AnnotatedString equality contradicts docs #55244

Open aplavin opened 1 month ago

aplavin commented 1 month ago

Docs:

  ==(x, y)

  Generic equality operator. Falls back to ===. Should be implemented for all types with a notion of equality, based on the abstract value
  that an instance represents. For example, all numeric types are compared by numeric value, ignoring type. Strings are compared as
  sequences of characters, ignoring encoding. 
<...>

  ==(a::AbstractString, b::AbstractString) -> Bool

  Test whether two strings are equal character by character (technically, Unicode code point by code point).

Actual behavior:

julia> s1 = Base.AnnotatedString("abc def", [(1:1, :x => :y)])
"abc def"

julia> s2 = "abc def"
"abc def"

julia> s1 == s2
false

Either the docs or the implementation are clearly wrong...

tecosaur commented 1 month ago

In your example s1 and s2 are not equal because the first character of s1 has an annotation. This lines up with "Test whether two strings are equal character by character", but not the "(technically... )" bit.

Initially, I didn't have annotations matter when comparing an AnnotatedString with an unannotated string, but that makes string equality non-transitive, which is a big no-no.

aplavin commented 1 month ago

Not sure how annotations affect code points (docs: "Unicode code point by code point"). Also, the straightforward interpretation of "Strings are compared as sequences of characters, ignoring encoding" (from the same excerpt) seems to imply comparing, well, characters themselves :) No matter how you encode it (with or without annotations) it still is the same character, right?

  The AbstractChar type is the supertype of all character implementations in
  Julia. A character represents a Unicode code point, and can be converted to
  an integer via the codepoint function in order to obtain the numerical value
  of the code point, or constructed from the same integer. These numerical
  values determine how characters are compared with < and ==, for example. New
  T <: AbstractChar types should define a codepoint(::T) method and a
  T(::UInt32) constructor, at minimum

that makes string equality non-transitive, which is a big no-no

Sure! I expected == to ignore annotations, following all these explanations in the docs.

tecosaur commented 1 month ago

Not sure how annotations affect code points

They don't, but the reasoning I put forwards is predicated on a character with annotations not being equal to a character without annotations.

Sure! I expected == to ignore annotations, following all these explanations in the docs.

We could do this, but I'd expect this behavior to be potentially surprising.

I think the situation can be summed up like so:

Current behavior

A string without annotations is equal to an unannotated string.

Annotated strings are equal when the underlying string is equal, as are annotations.

Option 1

Loosen the "technically" to something like "often" in the ==(::AbstractString, ::AbstractString) docstring.

Option 2

Ignore annotations entirely when comparing AnnotatedStrings and other strings with ==.

fatteneder commented 1 month ago

Option 3

Just document

==(a::AnnotatedString, b::AnnotatedString) -> Bool

  Test whether their strings and annotations are equal.
tecosaur commented 1 month ago

Ah yep, though maybe that would need to explicitly mention

==(a::AnnotatedString, b::AnnotatedString)
==(a::AbstractString, b::AnnotatedString)
==(a::AnnotatedString, b::AbstractString)
aplavin commented 1 month ago

Yeah the current situtation is definitely weird:

julia> a = Base.AnnotatedString("abc", [(2:3, :x => 123)])
"abc"

julia> a[1] == 'a' && a[2] == 'b' && a[3] == 'c'
true

julia> a == "abc"
false
tecosaur commented 3 weeks ago

Ahh, I think that the AnnotatedChar equality should be updated to match.

nhz2 commented 3 weeks ago

Option 4

Don't make AnnotatedString a subtype of AbstractString

This way AnnotatedString is free to define whatever equality methods it wants without being constrained by the existing AbstractString interface.

tecosaur commented 3 weeks ago

Half the point of AnnotatedString is that it can be passed to functions that are written for an AbstractString and don't know what to do with annotations and it "just works", automatically falling back to a plain un-annotated string :upside_down_face:

nhz2 commented 3 weeks ago

The issue is that as a Julia user, I have no idea which functions that use AbstractString are going to suddenly have different behavior when passed an AnnotatedString because of some strange overload or piracy. If you want a function to have fundamentally different behavior for AnnotatedString why not just give the function a new name, for example, instead of == define isequal_annotated(a::AbstractString, b::AbstractString) = a == b && annotations(a) == annotations(b)

And define a fall back: annotations(::AbstractString) = Tuple{UnitRange{Int64}, Pair{Symbol, Any}}[]

tecosaur commented 3 weeks ago

Mmmm, I think there's an argument to be made for abandoning the inclusion of annotations in the concept of string equality, and require them to be checked separately if you do care.

I'm not sure if this is the path we should take, but that is the alternative to the current situation.