Open eernstg opened 4 years ago
The phrasing of "string canonicalization" has always been vague. In fact, the specification does not require canonicalization, it only requires that identical(c1, c2)
behaves in a specific way if c1
and c2
are constant expressions (so the identical
invocation is a constant invocation). There is nothing technically prohibiting:
const a = "s";
const b = "s";
print(identical(a, b)); // must be true because `identical(a, b)` is a constant expression
var c = a;
print(identical(c, b)); // false. Not bound by the specification.
We wouldn't want that, but the specification allows it because it doesn't talk about canonicalization of strings at all.
If we read the specification as implicitly defining canonicalization of strings through their behavior wrt identical
, then it only states that compile-time constant string expressions must canonicalized their values. (At least if that value reaches a constant identical
invocation).
I believe the current VM behavior is that they canonicalize all constant string literals, and they canonicalize the value of constant string expressions in constant contexts. That is const s = "a" "b";
will be canonicalized and var s = "a" "b";
will not. A single literal like "ab"
is always canonicalized.
The canonicalization is part of constant evaluation, and the VM does not handle "constant expressions" like "a" + "b"
that are not required to be constant in the constant evaluation.
This does ensure that const ["a" "b"]
and const ["ab"]
become the same object. Any string which is part of a composite constant object will be canonicalized.
It just doesn't guarantee that the element of that list is identical to var x = "a" "b";
.
The JS compilation is limited by the underlying platform. A JS program cannot distinguish two strings with the same content in any way, so they effectively canonicalize all string values. We do not want the VM to do the same, so we will definitely need to allow more canonicalization than what the specification requires.
We should probably specify the exact behavior that we require.
If we go for the current VM behavior, then we will have to change the definition of compile-time constant expressions to only include those that need to be constant (it's based on context, not syntax), rather than promising that 1 + 2
is a compile-time constant expression even when it's not needed to be.
I think that's actually an improvement.
A compile-time constant expression is:
- The initializer expression of a constant variable declaration.
- A list, map, or set literal preceded by
const
.- A constructor invocation preceded by
const
or@
.- The expression of a
case
clause.- Any sub-expression of a compile-time constant expression.
It is a compile-time error if a compile-time constant expression is not one of .... (listing all the currently approved forms).
Then we can say that:
If two compile-time constant expressions evaluate to instances of
String
with the same content, they evaluate to the same instance. That is, constant strings are canonicalized. Strings created by non-constant expressions may be canonicalized, but they are not required to be so. (Compilation to JavaScript effectively canonicalize all strings that are equal.)
@eernstg
We haven't specified the behavior of operator == on strings
The API reference seems to show the behavior of operator ==
on String
https://api.dart.dev/stable/2.8.3/dart-core/String/operator_equals.html
We have indeed specified both String.==
and identical
in the library documentation.
The behavior of identical
interacts with the language specification which specifies certain things must be identical. If we assume both specifications are correct, then that effectively specifies canonicalization rules (library documentation for identical
says it's true if it's the same object, or if it's a num
with the same representation, language spec says that certainidentical
invocations must be true).
It's still better to specify canonicalization explicitly instead of implicitly. (There have been a lot of different ways to read that part of the spec over time).
So do we wish to proceed and change the specification to make
identical()
implementation dependent on strings, or do we wish to choose a specific behavior and get that implemented?@munificent, @lrhn, @leafpetersen, WDYT?
Neither? Why is this a priority right now?
It's definitely not a priority. I think it probably came up because we just migrated the language/identity tests which have some tests around strings and identical()
that are more precise than the language requires and that I think fail on some platforms.
Feb 2024: Note that the original example prints 'true' in many (perhaps all?) configurations including dart
, dart compile js
, and dart compile exe
. This could imply that the discrepancy is disappearing over time, and more constant expressions are actually evaluated using constant evaluation:
void main() {
var s1 = "ab";
var s2 = "a" "b"; // or `"a" + "b"`.
print(identical(s1, s2)); // 'true'.
}
That used to work, but the VM still does not canonicalize var s3 = "a" + "b";
, where it does canonicalize const s3 = "a" + "b";
.
It's not that "a" + "b"
is not a constant expression, it's allowed in void foo([String s = "a" + "b"]){}
, it's just a constant expression which is not canonicalized. (Possibly because it's not recognized and evaluated as constant unless necessary.)
The degree to which canonicalization occurs in Dart, in particular for strings, has always been somewhat unclear.
We have explicit rules requiring that the following forms of constant expressions must be evaluated to the same (canonical) object when evaluating any of the occurrences of such an expression:
#foo
or#+
) are canonicalized.<constObjectExpression>
(such asconst C(2)
) is canonicalized.The definition of
identical()
ensures that instances ofbool
,int
, anddouble
are canonicalized, in the sense that they are considered identical when they represent the same value (and it is not observable whether this happens because the representation is a tagged bit string, or becauseidentical
treats distinct boxed representations in a special way).Strings have traditionally been treated differently:
The specification of
identical
actually requires thatidentical(c1, c2)
must evaluate to true whenwhich implies that the behavior of
dart
is wrong, anddart2js
is correct:"ab"
is a constant expression of typeString
,"a" "b"
is a constant expression of typeString
, and the two strings are equal according tooperator ==
. (We haven't specified the behavior ofoperator ==
on strings, but we hardly want to make those two strings unequal).I believe that all non-string objects are canonicalized or not in a way which is well-defined: It is specified for the basic forms mentioned above that every constant expression of these forms has a canonicalized value, and constant expressions do not allow for creating composite objects from existing objects, canonicalized or not, other than constant collection literals (e.g., we can't have
const C(x)
). In particular, we do not have to specify any additional rules saying that the value of a constant variable is canonicalized.It is unfortunate that different tools behave differently, but it is also likely to be a serious breaking change to change the behavior of any of these tools. So do we wish to proceed and change the specification to make
identical()
implementation dependent on strings, or do we wish to choose a specific behavior and get that implemented?@munificent, @lrhn, @leafpetersen, WDYT?