dart-lang / language

Design of the Dart language

classes with structural identity #2246

Open mraleph opened 2 years ago

mraleph commented 2 years ago

I would like to suggest splitting the changes to identical() rules out of the records proposal into a separate proposal, and allowing classes to be defined with deep structural identity.

My proposal would be as follows: any class can be marked as having structural identity. For the purposes of this proposal we can assume that this is done by annotating a class with @pragma('value-type'), though a more user-friendly syntax could also be considered.

If identical(x, y) is called with x and y being instances of the same class C and this class uses structural identity then identical(x, y) is defined recursively as

identical(x, y) := identical(x.runtimeType, y.runtimeType) ∧ ∀ f ∈ fieldsOf(C) . identical(x.f, y.f);
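For concreteness, here is a sketch of what this would mean in code. The @pragma('value-type') marker and the Point class are the hypothetical placeholder syntax from this proposal, not an existing Dart feature:

```dart
// Hypothetical marker from the proposal; not an existing Dart annotation.
@pragma('value-type')
class Point {
  final int x, y;
  Point(this.x, this.y);
}

void main() {
  final a = Point(1, 2);
  final b = Point(1, 2);
  // Today this prints false: two distinct heap objects.
  // Under the proposal it would print true, because the runtime types are
  // identical and identical(a.x, b.x) && identical(a.y, b.y) both hold.
  print(identical(a, b));
}
```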

Motivation

I would like to make this change to:

Current state

Currently, int and double values have no discernible identity in Dart (neither natively nor when compiled to JS).

String has distinct and preserved identity in native Dart, but no discernible identity when compiled to JS.

Dart VM does not preserve identity of SIMD values (in violation of the spec), but dart2js does.

Implementation Concerns

/cc @leafpetersen @lrhn @munificent

lrhn commented 2 years ago

The concept exists, and we keep coming up with types which deserve the treatment, so it's definitely worthy of consideration.

Having deep structural identity should preclude cycles, which likely means requiring the object to be immutable. That is almost a given, since mutability requires identity (otherwise, what are you mutating?).

Requiring no super-class or super-interface prevents an object with structural identity from flowing into a context where we need to check first before doing the identical check (other than Object and dynamic). I think that's too strict, and also not sufficient.

It's not sufficient because it still doesn't help with identical calls on Object or dynamic or any type variable with those as bounds. Might as well allow other types too. If it's necessary, it's necessary.

It's not realistic because we can't stop num and String from implementing Comparable, and I'd expect other such classes to implement precisely Comparable, but there will also be requests for JsonSerializable and other utility interfaces. We can require no superclasses other than Object or other structural-identity classes.

I don't think we need to seal the type (no classes implementing it, no classes extending it, no classes mixing it in - which might be possible given the Object superclass, but unlikely since it also requires having no fields). We do need to require that any subclass/implementation itself has structural identity, if we want to use static types to know when to use structural identity. You can subclass, but not out of having structural identity. (Which means that, as a marker, we could make all such classes extend a specific token class, Value, and require that all classes implementing Value must also extend Value and be immutable, and will then have structural identity, rather than inventing an entirely new kind of declaration, value Foo { ... }.)

The classes do not need to have const constructors, since we don't care about compile-time canonicalization, but they obviously can since all fields are final.

The next request we'll get is to let those objects have automatic deep equality and hashCode as well. Then toJson. The moment we admit that we know the fields of a class, people will want to use that for other things too :wink:.

eernstg commented 2 years ago

@mraleph, I think we need a primitive notion of value classes: They would be special in that instances do not have identity (so a compiler can copy them freely, allocate them in registers, etc), and the standard notion of identical would be an implementation detail (if you get true then you know that you're looking at the same copy of a value, but if you get false then it may or may not be the same value).

So they would require immutability, and they'd need to have a known size given their type, so value classes can't have subtypes of any kind (other than Never). So if the static type of an expression is a value class, then its value is guaranteed to have that value type as its dynamic type as well, so we will know for sure that it can be, say, allocated in 3 registers, and that we can freely copy it.

We might give identical a different semantics for these objects (e.g., based on a structural traversal as you propose), but I tend to prefer that identical keeps its low-level semantics (where we actually get to know that two objects are the same object, physically).

We could then use this very low-level identical as the fast path in an implementation of operator ==. The remaining cases would fall back to a structural comparison as you describe it (deeply, based on getters that are associated with final instance variables whose type is a value type, which is known because there are no subtypes, and comparing references for other getters). I would expect this operator == to be implemented by a macro, because there will always be special cases where any particular policy isn't the desired one, but I think this "deeply for values" traversal would be a rather useful special case.
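A hand-written sketch of that pattern, using low-level identical as the fast path before falling back to a structural comparison (MyValue is an illustrative class, not part of any proposal):

```dart
// Sketch: `identical` as a fast path in a structural `operator ==`.
// `MyValue` is a made-up example class.
class MyValue {
  final int a;
  final String b;
  const MyValue(this.a, this.b);

  @override
  bool operator ==(Object other) {
    if (identical(this, other)) return true; // fast path: same object in memory.
    return other is MyValue && a == other.a && b == other.b;
  }

  @override
  int get hashCode => Object.hash(a, b);
}
```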

Would that address the concerns you had in mind?

mraleph commented 2 years ago

I think we can choose to package things like immutability and structural identity together, but strictly speaking it is not required.

Things like immutability, excluding cycles, and not having subtypes are required for reliable unboxing, but not related to how we define the identical operation. Unboxing can only ever be observed through identical, and as such it can be relegated to an implementation detail - that's why I am primarily concerned about identical and less so about the additional restrictions which make unboxing guaranteeable.

I also want to highlight that naive unboxing of arbitrarily large structural types is likely to be detrimental to performance.

Requiring no super-class or super-interface prevents an object with structural identity from flowing into a context where we need to check first before doing the identical check (other than Object and dynamic). I think that's too strict, and also not sufficient.

It's not sufficient because it still doesn't help with identical calls on Object or dynamic or any type variable with those as bounds. Might as well allow other types too. If it's necessary, it's necessary.

My suggestion does not come out of nowhere - it is based on experience with optimising identical for the VM, where we see that an optimisation which simply checks that neither argument can be a subtype of num is actually good enough to catch and optimise identical to a pointer comparison on hot paths.

I tend to prefer that identical keeps its low-level semantics (where we actually get to know that two objects are the same object, physically).

That already does not work for double and int types. Similarly if we eventually want to use JS records/tuples to back Dart's records we will not be able to observe the identity.

eernstg commented 2 years ago

@mraleph wrote (about preserving the low-level semantics of identical):

That already does not work for double and int types.

I don't think this deviation from the "same object, physically" semantics would interfere with any of the applications that we've discussed here: int and double can continue to do exactly what they do today, and the structural comparison of value types would work just fine, using identical as a primitive (which will determine the "same object, physically" query for every instance of a value type).

int and double would presumably not be value types, because they already have their own very special take on instance creation and object identity, and we probably don't want to rock that boat.

The point is that a structural equality operation will embody a policy, not just a mechanism. That is, the implementation will reflect semantic decisions that aren't unquestionable. So it should not be language defined, it should be customizable. Hence the proposal to do it using macros to introduce an implementation of operator == that performs a suitable structural traversal.

We will need the low-level semantics ("same object, physically") in order to be able to write implementations of those structural equality operations with good performance.

Of course, we could just introduce yet another built-in function (reallyIdentical) providing the "same object, physically" semantics, but that's a slippery slope (when will we then need reallyReallyIdentical?).

Similarly if we eventually want to use JS records/tuples to back Dart's records we will not be able to observe the identity.

I don't see a problem in letting identical observe the physical identity of an object, including a tuple (by the way, tuples should probably be value types). As always, true means that it is really the same tuple, and false means that we don't know anything before we've done some more work.

It's actually the very same situation as that of a class that defines operator ==. The intention is that object equality should be determined by running a non-trivial algorithm; for any desired semantics that I can think of, identical would still serve as a fast path where the result is definitely true, but when identical returns false we need to do more work.

About the JS records/tuples: If we translate identical to === then we will get true in a number of cases where two tuples have different physical representations but === components. Given that value types in Dart would not have identity, this could be perfectly OK. It could be considered as a result of run-time canonicalization, and I don't see why we shouldn't allow run-time canonicalization for value typed objects, given that we will allow unrestricted copying.

lrhn commented 2 years ago

My suggestion does not come out of nowhere - it is based on experience with optimising identical for the VM, where we see that an optimisation which simply checks that neither argument can be a subtype of num is actually good enough to catch and optimise identical to a pointer comparison on hot paths.

But num has a super-interface other than Object (Comparable<num>), so the requirement to not have super-interfaces does not appear to be necessary? It means that you can't optimize identical on something only known to be Comparable as much, because you need to check if it's a num. If that's what you want to get away from having to do, it's probably not viable for num or String. We can make the restriction on other structural identity-types, but I fear it may be too restrictive because those classes may legitimately need/want to implement interfaces too.

You can optimize identical on something typed as num today because you know it doesn't have subclasses that might not have structural identity. That's why I think adding the restriction that subclasses also must have structural identity might be useful. Without that restriction, all you can do is de-optimize when something can potentially have structural identity, but not re-optimize and skip the check if it definitely has structural identity.

lrhn commented 2 years ago

As a side note, objects with structural identity should probably not work with Expando, WeakReference or Finalizer, which depend on identity and life-time of an instance.

If we do deep identity (an alternative is "no identity": always say false), the objects should work with Map.identity (because it's just written using identical, not anything smart). We also need identityHashCode to work structurally.
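For reference, Map.identity today really is just identical plus identityHashCode, so a class without an == override behaves like this (Box is a made-up example class):

```dart
class Box {
  final int x;
  Box(this.x);
}

void main() {
  final m = Map<Box, String>.identity();
  final b1 = Box(1), b2 = Box(1);
  m[b1] = 'stored';
  print(m[b1]); // stored
  print(m[b2]); // null: b2 is a distinct identity today.
  // Under deep structural identity, identical(b1, b2) and
  // identityHashCode(b2) would have to agree structurally
  // for the lookup via b2 to succeed.
}
```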

lrhn commented 2 years ago

I don't see introducing "value types" (whatever that covers) as a necessity to introduce structurally identical(/unboxable) types. We can do that without value types, and then let value types also have structural identity if we introduce them.

The proposal here says "deep structural identity". That's one choice which allows unboxing. Another option is "no identity", so identical(x, x) gives false on such a type.

That still prevents you from recognizing that an object changes memory location and representation over time. Deep identity does so by ignoring location and only looking at field value (recursive) identity. No identity does it by just ignoring everything but runtime type and saying "no".

Either can work with value types too. Value types would likely get structural equality, so they may not need identity. Both also work with records, but deep identity matches JavaScript and would allow web compilers to keep using === for identical, which doesn't matter because we should move to using Object.is anyway (which I just found out existed, and does exactly what identical would, and it's recursive on JS records too).

eernstg commented 2 years ago

The kind of type that I'm proposing is intended to allow copying to take place freely. That's the point, that's the primitive semantics that we can't express in the current language, so we need a new language feature in order to do it.

Here's the reason why that gives rise to "values":

Given that such objects have no conceptual identity (because a compiler/runtime is allowed to break the physical identity anytime it wants, so any assumption about a conceptual identity would be misguided), I consider shallow immutability to be a derived requirement. It makes no sense to allow such objects to have mutable state.

Deep immutability is not a requirement, there's no problem (conceptually or technically) in considering a reference to a (potentially) mutable object as a value (so the reference itself is a value, but the referenced object may have identity, i.e., it may be a non-value).

Deep immutability could certainly be useful in some cases (e.g., "passing" immutable object structures from one isolate to another by sharing a reference), but that's easy to express as an extra constraint that any given value type does or does not satisfy.

In any case, I'm proposing a low-level mechanism, because we need to express the basic primitives as simply and cleanly as possible, and then we can combine the low-level mechanisms to form a bunch of different policies.

Those policies would include a definition of operator ==, e.g., as a structural traversal (on deeply immutable object graphs, or stopping at non-value references).

The point I made was that we need primitives. If we change identical to be more and more high-level, then we won't be able to express those policies based on mechanisms, because identical is the low-level mechanism that we already have, and if we put a lot of policy into that then we can't change it.

lrhn commented 2 years ago

Shallow immutability (unmodifiability) is definitely a requirement for structural identity, and deep is not.

If we choose deep structural identity (stopping at references to objects without structural identity, and doing reference identity on those), then it will allow unboxing, reboxing and inlining of the structural-identity based types.

I'm not sure I see the leap to also enforcing structural ==. If such a type uses the default Object.operator==, it will get structural identity checks. If it chooses to override operator==, it can. Nothing prevents it from doing:

 bool operator ==(Object other) => other is MyOwnType && field1 == other.field1 && field2 == other.field2;
 int get hashCode => Object.hash(field1, field2);

Providing an automatic operator== which does deep structural equality (stopping at references to objects without structural equality and doing == on those), is not doing anything you can't write yourself by adding == on the classes. I don't see why we have to add it. We can choose to do so, because we think it's what users want, but we can say that about a lot of existing classes too.

That does not hold for records, because they are structural types, so you can't declare methods on them, and we'd give them deep structural equality for free.

The point I made was that we need primitives. If we change identical to be more and more high-level, then we won't be able to express those policies based on mechanisms, because identical is the low-level mechanism that we already have, and if we put a lot of policy into that then we can't change it.

I do not get the point. If we make identical more high-level, by encoding semantic distinctions separate from "reference identity" into it, we no longer have the ability to distinguish references to objects that it makes equal. That's the goal: to not be able to distinguish those references. I don't see how that is an argument for also making ==/hashCode higher-level, because you can still override those if you want to. You can't override identical.

(And again, the goal of allowing implicit unboxing can also be served by not having identity at all. It's a little more dangerous, because the language currently does not have a value where identical(x, x) is false. It does have objects that are canonicalized, which is what "structural identity" simulates, without actually requiring it to be implemented as canonicalization.)

eernstg commented 2 years ago

Providing an automatic operator== which does deep structural equality ... is not doing anything you can't write yourself by adding == on the classes. I don't see why we have to add it.

Right, which is exactly the reason why I characterize that as 'policy' and recommend that we handle it using user-chosen code (possibly generated by a macro, but maintaining that other choices are still available).

That does not hold for records, because they are structural types, so you can't declare methods on them,

(I don't see the connection, the interface of a structural type can certainly include signatures of methods, and it can be enforced that every actual entity with that type has an implementation. I tend to think that the syntax of records makes it natural to say "they can't have methods", and this also helps simplifying the subtype graph which is already pretty large when we have both positional and named components, and depth subtyping.)

and we'd give them deep structural equality for free.

Perhaps we'd want the "semi-deep" equality here, too, stopping at references where the referenced object isn't statically known to be a value? It seems error-prone if a record is unequal to itself, just because some tiny part of the program state was mutated in the graph of objects reachable from that record.

int _i = 0;

class C {
  int get i => _i++; // Mutates the top-level `_i` on every read.
}

void main() {
  var r = (1, C());
  print(r == r); // 'false'.
}

That's the goal, to not be able to distinguish those references.

That could be your goal, but I'd like to support performant implementations of multiple policies based on simple, fast, low-level mechanisms. And I don't think delineation of objects graphs ("where to stop") is sufficiently simple to be a mechanism.

Here's another example of a situation where we'd presumably want a well-defined primitive, rather than a higher-level policy that we can't change:

We could obtain a reified representation of the run-time type of an object by a primitive, say, a static method on Type like Type.runtimeTypeOf(Object? o), and we would then have this primitive available for the implementation of various policies. For example, it could be used to implement an instance method like Object.runtimeType (if we want to have that at all, but right now it's heavily breaking to just remove it).

The problem is that if we don't provide access to the primitive then we may be unable to ensure that we get the right policy. E.g., some classes could override runtimeType in ways that may seem smart and useful to the authors, but it might also break some code that they did not take into account. We can clearly see this conflict in a different language construct: The type test expression e is T relies on a primitive mechanism to obtain the run-time type of the value of e, it does not call runtimeType. This illustrates that we don't really mean it when we say that runtimeType is a normal instance method and you can override it, we actually want to express the type test in terms of the primitive.

My take on primitives is that they are important, and they should be well-defined and available, even in the case where they are only used in one or two locations in the code. That's the reason why I recommend that we do not change identical to implement some more or less perfectly chosen high-level policy.

lrhn commented 2 years ago

(I don't see the connection, the interface of a structural type can certainly include signatures of methods, and it can be enforced that every actual entity with that type has an implementation.

When I say "structural type", I expect that two separate declarations with the same structure (whatever that means) define the same type. Because of that, user-added members on structural type declarations do not make sense to me. Well, not unless they effectively work like extension methods and apply to anything of that structural type, and that still has to be scoped somehow (just like extension members, which is why extension members can apply to structural types like function types).

My take on primitives is that they are important, and they should be well-defined and available, even in the case where they are only used in one or two locations in the code. That's the reason why I recommend that we do not change identical to implement some more or less perfectly chosen high-level policy.

My take is that identical is a primitive that distinguishes distinguishable values. Objects with identity are distinguishable from anything else. Values without identity, if the language contains them, are not distinguishable if they have the same contents. There is no "more primitive" identical operation that can distinguish them if the language states that those values do not have identity at all. Giving them an identity anyway, one which might change during normal data flow, would be an implementation detail leaking through and not an actual identity. It's something other than identity.

I'm fine with introducing values without identity into the language, ones where identical doesn't work like it does for objects with identity. I'm fine with either letting identical return false, or letting it be based on structural recursive identical checks. I would be fine with having both. Is that what you are asking for - separate primitive operations for different behaviors?

eernstg commented 2 years ago

When I say "structural type", I expect that two separate declarations with the same structure (whatever that means) defines the same type.

I think the terms aren't very well standardized, so we might be able to define 'structural type' to mean anything we want.

Parenthesis-begin.

As an aside, my definition of a structural type was based on having a structural (rather than nominal) subtype relation. This would make Point3d a subtype of Point2d below, just because the interfaces have the right superset-and-override relationship:

class Point2d {
  final int x, y;
  Point2d(this.x, this.y);
}

class Point3d {
  final int x, y, z;
  Point3d(this.x, this.y, this.z);
}

Nothing prevents the addition of methods to those classes, and we can still check whether their interfaces satisfy the superset-and-override relationship (just like the tests that we would perform if Point3d had specified that it implements Point2d).

Function types are structural according to this perspective because they have a subtype relationship which is computed based on their parts (return type, parameter types, names of named parameters), not based on an explicitly declared subtype relationship. The fact that function types use a syntax where there is no space for method declarations is just a matter of syntax, we could allow for methods on those as well.

Two structural types are then the same type if and only if they are mutual subtypes.

(Note that this implies that the implementation isn't part of the type, because nothing requires those mutual subtypes to have the same implementation of any given member. We could allow nominal types to declare that they are equal to each other, too, and get the same kind of implementation abstraction, but we usually get that by using a subtype of some "public" interface.)

Parenthesis-end.

Values without identity, if the language contains them, are not distinguishable if they have the same contents.

That's a beautiful and principled rule. However, I think it's naive to assume that properties that potentially involve the entire set of objects reachable from a given object are going to be unquestionably defined in one particular way: That kind of feature is a policy, not a mechanism. So we should not bake it into the language.

As an example, if we have two values v1 and v2, and the same path from both of them yields a non-value o1 and o2 respectively, should we then define v1 and v2 to be "the same" if identical(o1, o2), or should we use o1 == o2 (plus, of course, a similar test for all other reachable objects, for some definition of 'reachable')? Surely we'd want the former in some situations and the latter in others.

So we need to be able to implement "sameness" properties in user-written code. This can only be done with good performance if we have access to the primitive which allows for a fast path based on being the same representation in memory, and identical does exactly that (except for numbers, but they have their own semantics and that's fine).

[physical identity] would be an implementation detail leaking through and not an actual identity

There is nothing new in having multiple distinct objects representing the same conceptual entity. For instance, a list containing a sequence of objects o1.. ok may very well be considered to be conceptually the same thing as a different list containing the same objects in the same order (especially if they are immutable). But those lists are going to be considered distinct according to identical, also if we introduce values as discussed here, because those lists won't become values. Similarly, any class that has a non-trivial operator == would define "sameness" in a way where identical is potentially a leak in this sense.
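The list case is observable in Dart today:

```dart
void main() {
  // Two distinct (non-const) lists with the same contents: conceptually
  // "the same" sequence, but distinct heap objects.
  final l1 = [1, 2, 3];
  final l2 = [1, 2, 3];
  print(identical(l1, l2)); // false: distinct objects.
  print(l1 == l2); // false: List's default == is identity-based.
}
```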

So the fact that values would leak exactly the same kind of information (that is: these two values have distinct representations in memory) is not new, and won't go away.

So we may as well allow for user-specified "sameness" with good performance based on identical, and maintain that identical is a low-level primitive, and it is a mistake to interpret it as a conceptual feature.

rakudrama commented 2 years ago

As a side note, objects with structural identity should probably not work with Expando, WeakReference or Finalizer, which depend on identity and life-time of an instance.

Can we make the type system prevent these uses? That would be nicer than a runtime check, or worse, undefined behaviour.

If we do deep identity (an alternative is "no identity", always say false), the objects should work with Map.identity (because it's just written using identical, not anything smart). We also need identityHashCode to work structurally.

'not anything smart' does not apply to the web. Currently, we go to great lengths to avoid any kind of hash code on strings, because it has to be recomputed every time - there is no reasonable way to cache it. Similarly, identityHashCode relies on an expando, so we have to do a map lookup in order to do a map lookup!

One silver lining is that, now that we no longer support IE11, we can lean on ES6 Maps, which pretty much do exactly what we want for Map.identity; that is on our backlog. I worry that making Map.identity work with structural identity will close the door on using ES6 Maps in a direct and efficient manner.

rakudrama commented 2 years ago

Similarly if we eventually want to use JS records/tuples to back Dart's records we will not be able to observe the identity.

We can't rely on a proposal, even when accepted, for several years. We only recently removed support for IE11. A major customer requires compatibility with Safari 12.

We would need a compelling intermediate strategy.

eernstg commented 2 years ago

@rakudrama responded to @lrhn:

As a side note, objects with structural identity should probably not work with Expando, WeakReference or Finalizer, which depend on identity and life-time of an instance.

Can we make the type system prevent these uses?

Surely only in the case where the code that creates the connection is typed sufficiently tightly, so we can't rely on that.

If an Expando sets an association based on one representation of a value object, and an attempt is later made to retrieve the stored value using a different representation, then we might not find anything (or we might find something else). So that's clearly a problem.

But isn't this essentially the same problem as using an Expando with any regular class where there is a definition of operator == and hashCode (which would presumably imply that two objects should be considered to be conceptually the same object based on ==)?

import 'dart:math';

// This class can't be a value class, because it caches the result of a computation.
class IntBox {
  final int x;

  IntBox(this.x);

  late final double squareRoot = sqrt(x); // Assume that `sqrt` is really expensive, so we cache it.

  @override
  bool operator ==(Object other) => other is IntBox && x == other.x;

  @override
  int get hashCode => x.hashCode;
}

void main() {
  var map = Map<IntBox, int>.identity();
  var b1 = IntBox(0), b2 = IntBox(0);
  map[b1] = 1;
  map[b2] = 2;
  // ...
}

Except for the fact that this class uses a cache to avoid repeating an expensive computation, this class is a perfect candidate to be a value class. So we probably want to allow it to behave very much like a value class, and in particular we should be able to create a new object with the same state as an existing one, and they should be treated as "the same thing" even though they are two distinct physical representations. For such objects, it's a bug to use the physical identity, they should be tested for "sameness" using ==.

So we have the same kind of problem with value objects and with these objects where "sameness" has been specified using == and hashCode.

It is worse in one way, though: With the regular class the developer could "simply" canonicalize each conceptual object manually, such that == will coincide with the physical identity. With value objects this gets harder because the compiler may create new copies (and there could be several heap allocated ones) during normal data flow like passing an actual argument to a function.
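Such manual canonicalization for the regular-class case could look like the following sketch (it reuses the IntBox name from the example above; the _cache is illustrative, and note that it keeps every instance alive for the program's lifetime):

```dart
// Sketch: canonicalize instances so that `identical` coincides with `==`.
final Map<int, IntBox> _cache = {};

class IntBox {
  final int x;
  IntBox._(this.x);

  // The factory returns the canonical instance for each `x`.
  factory IntBox(int x) => _cache.putIfAbsent(x, () => IntBox._(x));
}

void main() {
  print(identical(IntBox(0), IntBox(0))); // true: canonicalized.
}
```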

In both cases the cure is to stop using that kind of object with an Expando, so we should view this as an existing problem that gets a new facet, not a new problem.

An identity map would give rise to very similar considerations, and so does Finalizer.

WeakReference might well be a benign case: Even in the case where there is a weak reference to a heap allocated representation of a value object, it would just work like a reference to any other heap allocated object, and the heap allocated representation could be garbage collected as usual, resetting the weak references, if no other reference exists.