dart-lang / language

Design of the Dart language
Other
2.66k stars 205 forks source link

How should identical() behave for constant records? #1289

Closed munificent closed 3 years ago

munificent commented 3 years ago

The proposal currently says:

Two constant records with the same shape and identical corresponding pairs of fields are identical. This is the usual rule that constants are canonicalized.

But also:

It is possible to create a single record, have two different references to it flow through the program and then have identical() on them return false because the compiler happened to optimize away one or the other's representation such that they are no longer references to the same object in memory.

These are in conflict. We don't track "constness" as references to a constant flow through a program, so as soon as you assign a constant record to a variable, return it from a function, etc. there's no way to tell how it should behave with regards to identical().

If we want to be maximally compiler-friendly, we should probably say that identical() does not guarantee that it returns true for two constant records with the same shape and identical fields. In other words, no canonicalization for const records. But that may be too big of a departure from how const works elsewhere.

Are there other options?

lrhn commented 3 years ago

I've been having the same thoughts. Unless we can promise which record value propagations preserve identity, we can't reasonably say anything useful. Passing records as arguments the identity function is itself a potential copy, so even if you could somehow know that a and b refer to the same record, a call like identical(a, b) is not guaranteed to preserve that.

We could say that any assignment of a record between static types which are not a record types, will preserve identity (boxed records have identity), and any assignment of a constant record to a non-record type gives identical (boxed) records each time. Assigning or casting such a value to a record type (unboxing) would not be guaranteed to preserve identity.

We can then also specialize identical(v1, v2) where either operand is statically typed as a record type, and simply say false (unboxed records have no identity).

If we don't promise anything for type variables (they are neither considered record types nor non-record types, just "potential record types"), we are still open to specializing generic functions and classes to record types if we want to.

It does mean that something like foo(Object o){ if (o is (int, int)) { ... identical(o, o) ...; }} could not just unbox o internally at the promotion and forget the original identity. It would need to keep the o boxed object around to retain its identity. (Unless we specify that promotion can change identity, but I think that's too weird).

That approach is actually just being explicit about when we do boxing and unboxing. It does restricts compilers from doing extra unboxing, or at least throwing away the identity, although a compiler can obviously unbox extra if it can see that identity cannot matter. (I doubt that's non-trivial to detect, though).

lrhn commented 3 years ago

Adding an extra "ID number" can definitely work, but it also causes an extra overhead of 50% for two-tuples. if you want to pass a tuple in registers, you need one extra register.

I think the reason I hadn't even considered it was that my goal with tuples was a zero-overhead unboxed tuple implementation. If we don't care about overhead, we can just box every tuple. That would add one extra word per tuple object, plus one per reference to it (which is what any object, and boxed tuples, would cost anyway). I still think the zero-overhead is a good goal to have. :smile:

mraleph commented 3 years ago

What about defining identity recursively based on contents? e.g. identical((a_i), (b_i)) = forall i . identical(a_i, b_i). This would be similar to how numbers work (their identity is the same as their bitwise equality).

I understand that it might not be fast, but it is more predictable than defining identical() to return what amounts to a result of a coin flip.

Note that due to how numbers identity is defined identical(a, b) is only a simple pointer comparison if static type of a or b is not a supertype of num. So I think extending identical to also have special handling of tuples is not going to make anything else worse.

lrhn commented 3 years ago

Requiring recursive identity checks and preserving identity means that any Object o1 = ..., o2 = ...; print(identical(o1, o2)); has to start out checking whether both objects are records.

That might not be a big problem since we already need to check if they're doubles. I guess we could start out checking whether both references point to the same object. If so, say treu. If not, then check whether both objects have the same type (map pointer), and if not, we can immediately say false. Then we check whether the map pointer is the double type or a record type, and if so bail out to slower code which does recursive checks on the elements (bit-eq for doubles, recursive identical on each entry for boxed records).

I'm a little worried about how this will affect web compilation. In JS compiled code, we simply use === for identical checks. (Which is wrong since -0.0 === 0.0 and NaN !== NaN, the latter has caused bugs already).

If we do identical(objectRef, (1, 2)) we can special-case that to avoid boxing the record, and just checking the object reference's map pointer against the record type (_,_).

(We probably have to special runtimeType to construct the type when asked since we don't record the type of the entries anywhere, only the structure).

eernstg commented 3 years ago

It would make sense to promise very little about identical with records. I think we should optimize records for allowing inline representations, and we should allow the execution to box and unbox wherever that is required or seems optimal.

We would then consistently recommend that equality among records is tested using operator ==, and give developers the warning that identical will not work: There could always be a false negative. There are no false positives, so identical can be used in the implementation of operator == of records (and by anyone in situations where the false negatives are taken into account).

With that, we could allow implementations to use whatever level of canonicalization they prefer.

Also, a compiler may generate specialized code for e1 == e2 in the case where both operands have a concrete record type, which means that it may in some cases compile down to a couple of register comparisons rather than a full fledged method invocation (because record types have no non-bottom subtypes).

munificent commented 3 years ago

Sorry for the naivete, but the obvious idea of adding a hidden integer "id" field to every record - won't it suffice?

That would work, yes, but it means the overhead of another field that needs to be stored and copied. It's a good idea and I wouldn't rule it out.

munificent commented 3 years ago

What about defining identity recursively based on contents? e.g. identical((a_i), (b_i)) = forall i . identical(a_i, b_i). This would be similar to how numbers work (their identity is the same as their bitwise equality).

Records aren't deeply immutable, so you could have cycles:

var list = [];
var tuple = (1, list);
list.add(tuple);
print(identical(tuple, tuple)); // ?

So not only would identical() have to recurse, it would also need cycle detection. At that point, the speed and complexity are such that it violates what I consider to be the core contract of identical(): it's fast.

I'd rather a fast identical() with false negatives than one whose performance I as a user can't easily predict.

I understand that it might not be fast, but it is more predictable than defining identical() to return what amounts to a result of a coin flip.

It's not quite that bad. It's basically "if I happen to be able to tell you that these are identical quickly, I will. If I can't, you have to check equality yourself." I think that's a reasonable contract. It might make sense to think of identical() more as returning one of two values: "yes, equal" and "don't know". It happens to encode the latter result as false.

mraleph commented 3 years ago

So not only would identical() have to recurse, it would also need cycle detection.

Note that I have only redefined identical for records. All other types keep their identical() semantics. This means cycles are irrelevant as they can only occur though non-Record objects and non-Record objects have non-recursive reference based identity (or non-recursive value based identity for numbers) which in both cases terminates recursion.

print(identical(tuple, tuple)); // => true
munificent commented 3 years ago

Ah, right, thanks for helping me understand. :)

I'm still not sold on structural identical() for records, but it's not as bad as I feared.

eernstg commented 3 years ago

identical is biased: Whenever it returns true, we can conclude equality (sane equality, ignoring operator ==(other) => false;) and it's assumed to be a fast operation. But when identical returns false we don't know much.

For instance, if we tear off an instance method, it is guaranteed to be == to another torn-off instance method with the same receiver and method name, but nothing is guaranteed about identical; two evaluations of the same function literal are not identical; etc.

So the simple rule would be "identical is a fast operation whose positives are true (identical(e1, e2) => e1 == e2), but negatives may always be false". If a type C has the contract that it uses identity for equality then we can trust the negatives, too, but this won't apply to very general types like Object.

This makes me doubt the value of making identical structural (hence slow) on records. If identical on records is strictly based on object identity then it can be used according to the above rule, and record comparison should otherwise use ==.

lrhn commented 3 years ago

Structural identity makes sense to me. We define equality structurally because tuples have no meaning other than containing its values in a particular order. The equality we define is logical equality (meaning the same thing, as defined by the == operator), where as identical is mathematical equality (being the same thing).

So, equivalent reasoning can be used to say that mathematical equality of a product is structurally defined as mathematical equality of the projections (because that's what mathematics would normally say). In the semantic model, if (1, 2, "a") does not have a separate identity, then there is only one semantic value for that product, and hence it's identical to ... itself.

We are the one deciding what the underlying semantic model is, and we can say that tuples can have identities at times, and they're just not preserved by data flow the way object identities normally are. We will have to specify it eventually, so we should be careful that whatever behavior we want is one that we can specify without constraining implementations too much.

Structural identity is one end of the spectrum. It completely specifies behavior, which means that implementations need to implement that behavior. It's simple and consistent, but may introduce run-time overhead.

Completely unspecified identity is the other end of the spectrum. Then identical(a, b) may answer false at any time if either of a or b is a tuple, but may answer true when both are tuples containing identical values. (We need identityHashCode to agree, on both sides, so identical should be symmetric). We run into problems with constants here, because we actually need to know when two tuples are "the same value" for default values. So, constant tuples might still need to have structural identity (mathematical "same value" rules).

In the middle we have something like "boxed tuples preserve identity, constant tuple expressions always box to the same value", where we try to delineate some cases that users can depend on, and leave the rest unspecified. It risks being arbitrary and fragile around the edges, because users might not completely understand where the borders of predictable behavior are.

I like structural identity. It's mathematically sound. It's understandable by users because it has no sharp edges. It's not particularly efficient if you have large nested tuples, but two out of three ain't bad. It might risk making all identical calls slower because it has to check whether the operands are tuple types (at least if the operands are of type Object or above), but it already has to do that for doubles and integers anyway, so we might be able to do just one "has special identical" check on the fast path.

eernstg commented 3 years ago

We need to converge on a view of identical.

I think we've treated identical as a way to determine that two expressions denote the same representation at run time. This amounts to pointer equality on two references to the heap, and it amounts to NaN values having the same underlying bit patterns, which is clearly a low-level representation oriented result, as opposed to a conceptual (say, mathematical) result. Similarly, we haven't tried to let identical(f1, f2) return true in the cases where f1 and f2 are torn-off extension methods, no matter how similar they are (except for the case where it is the same representation).

So identical will generally deliver true positives (NaN being the weird exception, as always) as a fast way to determine equality. A negative result can only be trusted if we can verify (statically or dynamically) that it is correct to consider "not same representation" as a guarantee for "not equal". This is not true for any class that defines operator == to use anything other than identity, but it may be safe for many static types in practice.

Structural equality, as proposed, is clearly the useful notion of equality for representations that model values, in particular tuples.

But extending this to let identical use structural comparison of tuple types seems inconsistent to me. We cannot view identical as a conceptual operation, it is a low-level shortcut that must be used with caution.

mraleph commented 3 years ago

One more point to make here: upcoming JavaScript records are going to have structural equality (e.g. == would look at contents rather than identity so you would not be able to discern their identity): https://github.com/tc39/proposal-record-tuple#overview

lrhn commented 3 years ago

I don't believe that identical is necessarily just a low-level operation. That's a choice. We could have always compared reference bit patterns if we wanted to. Then integers and doubles might not be identical, even if they have the same value. We chose not to do that because it makes identical more useful if it isn't unpredictable on numbers.

One thing we require for a identical is that it never answers true in for values which can be distinguished in any way. If two values are identical, there must be no expression where inserting one of the values yields a different result than using the other value. That technically means that we can always answer false, but then it would be useless.

Any mutable object necessarily has identity, otherwise "mutation" is meaningless. We want to detect that identity, so that we can predict when mutating one object reference will also mutate the other.

That leaves immutable objects where we have options.

For integers and doubles, we want to allow transparent boxing and unboxing, so we made the identical of those map to the representation of the value, not the reference. The reason we chose the bit-representation was the rule above: We can distinguish bits of integers trivially, and of doubles by writing the value into a Float64List and reading the underlying buffer. There are other approaches possible, for example the WebCore JS engine never exposes the bits of a NaN. There is only one NaN value in their JS Number implementation (the non-negative zero-payload QNaN), and reading a NaN with payload from a Float64Array and writing it back will canonicalize. We chose to expose the bits of NaNs and preserve them. That's a choice (and nobody would notice if we changed it).

We also don't allow Expandos on ints or doubles. That's not (just) because we fake identity, but because we can always create a new instance with the same identity, so we would never be able to GC an entry on an integer. We should probably also not allow expandos on tuples, no matter what we do about identity. It's safer that way.

We canonicalize constants. They are class instances, and we otherwise say that "creating a new instance" means creating something fresh with a unique identity distinguishable by identical. (That's why our int and double documentation never says "new instance'.) Values of constant expressions are deeply immutable, and we can detect them at compile-time, and we chose to make them identical. We then have to preserve that identity. (The VM even re-canonicalizes when you send a compile-time constant created value through a SendPort to an Isolate.spawn created isolate, or to the same isolate, because people were expecting it to work). That was a choice, but one we're stuck with now. People rely on canonicalization of constants. We then use that to our advantage when we say that instance methods must have "the same" default value in subclasses, meaning identical constant values. Otherwise we'd have to define "the same".

So, tuples.

They're immutable, so we are not bound to matching "sharing of mutatation". They're very likely going to be used as default values of tuple-typed parameters, so we need them to be able to be constant expressions. We then need a notion of "the same" value.

We're also going to support boxing and unboxing, and passing tuples around typed as Object (and passing identical around as bool Function(Object?, Object?)), so when we end up checking whether two tuple values are identical, we need to be able to do it at run-time with no hint that a tuple-comparison is going to happen.

We've listed the option already: Never identical, sometimes identical (predictably or unpredictably), structural identity. The question, as for int and double is which one is more useful?

I think structural identity has an upper hand in predictability and usability. It is more costly, though. We'd have to recognize boxed tuples, check they have the same structure, then recurse on each element. We already recognize numbers and do a specific comparison on their values (which happen to not recursively contain other values). It means that a new identical object can always be created from the parts, and we should not allow Expandos on tuples.

Always saying no is also possible. Again we have to recognize tuples, but then the answer is quick. It's predictable, but hardly usable. It doesn't answer when two tuple-typed default values are "the same value".

Allowing tuples to change identity at any time is unpredictable and unusable. Might as well just say false.

Allowing tuples to preserve identity while stored as non-tuple typed (when boxed), and letting constant tuple expressions always box to the same value, is somewhat predictable and somewhat usable. I fear the limits might have some sharp edges. For example, we'd likely preserve identity for tuples passed through generic types initially, and then type-specialization as an optimization might break code depending on that unspecified behavior.

So, long-term predictability and usefulness makes me actually prefer structural identity.

(When I use "mathematical equality" I really mean semantically indistinguishable. That is, the (semi-mathematical) semantic domains we use to describe the program semantics are incapable of distinguishing the equal entities. If that is the case, then the objects are obviously semantically identical. We can still claim that identical(o, o) is false, just as double.nan != double.nan. With structural tuple identity, our semantic model doesn't need to model boxing and unboxing, it can just consider a tuple value to be the product of the values and a representation of the structure).

munificent commented 3 years ago

I've updated the proposal based on our design discussion this morning. The current plan is to take @mraleph's suggested approach and define identical structurally. Thanks for the excellent suggestion!

Before we commit to it fully, I'll spend some time talking to the implementation teams to make sure we understand and are OK with the performance.