[vm] Records of primitives requires boxing

ds84182 commented 1 year ago

When passing a record of primitives between functions, the values are boxed. Since these are value types without identity, they could exist as an unboxed record shape until they escape via covariance, similar to how int and double primitives work today (int -> num or double -> num require boxing).

Records also lose static type information, resulting in a runtime call to _Double operator+ instead of unboxing.

navyakarna commented 1 year ago

can i work on this issue ?

mraleph commented 1 year ago

/cc @alexmarkov

alexmarkov commented 1 year ago

There are multiple optimizations missing here:

Unboxing record instances when passing them as parameters (remove record instance allocation and pass record fields as separate arguments).
Unboxing fields inside record instances (remove int/double boxing for record fields when they are passed or returned separately).
Global type inference should infer actual types through record instances and fields.
Local type inference should keep static types of parameters as actual type (_Record) could be less informative. That should allow the compiler to replace call to _Double.+.

Currently AOT compiler can unbox record instances (of 2 fields) in return values, as returning multiple values would be probably the most common use of records. At this point we are not sure how records would be used in practice and what other scenarios would be performance critical. Do you have a real use case where performance of passing record instances as arguments would be critical?

(1) and (2) is not that hard to implement. However, we should try to find a motivating example / real use case before adding these optimizations, as they would increase complexity of the implementation.

(3) may severely affect AOT compilation time, so I'd like to avoid adding it unless absolutely necessary.

(4) looks like a bug and I'd like to fix it.

ds84182 commented 1 year ago

My primary use-case for records is to support vector value types via records + inline classes. It's much nicer to deal with compared to vector_math since the underlying storage is immutable and const.

inline class Vec4 {
  final (double, double, double, double) xyzw;
  const Vec4.raw(this.xyzw);
  const Vec4(double x, double y, double z, double w) : super((x, y, z, w));

  double get x => xyzw.$1;
  double get y => xyzw.$2;
  double get z => xyzw.$3;
  double get w => xyzw.$4;

  Vec4 operator+(Vec4 rhs) {
    var (lx, ly, lz, lw) = xyzw;
    var (rx, ry, rz, rw) = rhs.xyzw;
    return Vec4(lx + rx, ly + ry, lz + rz, lw + rw);
  }

  // ...
}

Another example is 128-bit ints via inline class:

inline class Int128 {
  final (int, int) _lohi;

  Int128.from(int value) : _lohi = (value, 0);

  // ... implementations here
}

I'd still expect something as large as a 4x4 matrix to be heap allocated though. But these small records are the perfect candidate for unboxing.

ds84182 commented 1 year ago

There's some weird interactions with inlining.

These two programs are not equivalent:

class Box2<T> { final T x, y; const Box2(this.x, this.y); }

void main(List<String> args) {
    final record = Box2(1.0, 1.0);
    inline(record);
}

@pragma('vm:prefer-inline')
void inline(Box2<double> xy) {
    var Box2(:x, :y) = xy;
    print(x + y);
}

*** BEGIN CFG
After AllocateRegisters
==== file://<source>_::_main (RegularFunction)
  0: B0[graph]:0 {
      v0 <- Constant(#null) T{Null?}
      v26 <- UnboxedConstant(#2.0 double) T{_Double}
}
  2: B1[function entry]:2
ParallelMove xmm0 <- C {
      v2 <- Parameter(0) T{List<String>}
}
  4:     CheckStackOverflow:8(stack=0, loop=0)
  6:     MoveArgument(v26, SP+0)
  8:     StaticCall:48( print<0> v26)
  9:     ParallelMove rax <- C
 10:     Return:16(v0)
*** END CFG

void main(List<String> args) {
    final record = (1.0, 1.0);
    inline(record);
}

@pragma('vm:prefer-inline')
void inline((double, double) xy) {
    var (x, y) = xy;
    print(x + y);
}

*** BEGIN CFG
After AllocateRegisters
==== file://<source>_::_main (RegularFunction)
  0: B0[graph]:0 {
      v0 <- Constant(#null) T{Null?}
      v3 <- Constant(#1.0) T{_Double}
}
  2: B1[function entry]:2 {
      v2 <- Parameter(0) T{List<String>}
}
  4:     CheckStackOverflow:8(stack=0, loop=0)
  6:     MoveArgument(v3, SP+1)
  8:     MoveArgument(v3, SP+0)
 10:     v14 <- StaticCall:42( +<0> v3, v3, using unchecked entrypoint, recognized_kind = Double_add, result_type = T{_Double}) T{_Double}
 12:     v16 <- Unbox(v14) T{_Double}
 14:     MoveArgument(v16, SP+0)
 16:     StaticCall:44( print<0> v16)
 17:     ParallelMove rax <- C
 18:     Return:14(v0)
*** END CFG

Without inlining, it unboxes x and y. But with inlining + load-store forwarding, it loses type information. Guessing it trusts the static type of the parameter when not inlined, but when inlined it only has AllocateSmallRecord to go off of?

alexmarkov commented 1 year ago

Here is what was done so far:

Global type inference is improved to infer record shape (110b0f0eba0a483245898830a1603b8ae7907e08) and the inferred record shape is now used in the unboxing (295e2109756b474088c579af24c8609646c99302). This is a necessary groundwork for unboxing record parameters, but they are not unboxed yet.
Local type inference is improved to better propagate static types (8c7d47cd19ec52da56ce9b1d92bc6995c63313b7). This allows compiler to recognize int/double operations performed on record fields. The first example now uses BinaryDoubleOp instead of a call to Double.+.
Argument and parameter types are combined during inlining (5117c457700651a7566e0d6d935b1af2d226e02c and 9c3e294dd28fcf62b4779a65a6efa28914a9e4fe which further improves handling of record types). These changes fix the problem during inlining which was reported in the previous comment https://github.com/dart-lang/sdk/issues/51637#issuecomment-1484032933.

ds84182 commented 1 year ago

Thanks! I accidentally stumbled upon another place this comes up, looks like type propagation does not work for records that are allocated then immediately destructured: Godbolt

This causes the compiler to switch to doing table calls for Uint8List.[] which increases code size by a lot. Maybe an earlier CSE pass could help here, maybe directly after inlining?

dart-lang / sdk

[vm] Records of primitives requires boxing #51637