force inline small double/int methods, especially if parameters require boxing

mraleph commented 4 years ago

When looking at native code produced for scaleRadii from Flutter dart:ui, I see us emitting code like this:

      v3 <- Constant(#1.0) T{_Double}
...
      v420 <- Constant(#0) [0, 0] T{_Smi}
...
66: B155[target]:116
 68:     PushArgument(v3)
 70:     PushArgument(v420)
 72:     v400 <- StaticCall:126( ==<0> v3, v420, recognized_kind = Double_equal, result_type = T{bool}) T{bool}
 73:     ParallelMove r0 <- r0
 74:     Branch if StrictCompare:130(===, v400, v38) goto (156, 158)

and

78:     v603 <- Box(v220) T{_Double}
 80:     PushArgument(v603)
 82:     v404 <- StaticCall:134( get:isNegative<0> v603, recognized_kind = Double_getIsNegative, result_type = T{bool}) T{bool}
 83:     ParallelMove r0 <- r0
 84:     Branch if StrictCompare:138(===, v404, v38) goto (161, 159)

This generates more code then we would get by ensuring that methods isNegative and == are inlined.

We should investigate why and how often it happens and ensure that they are inlined.

/cc @mkustermann @alexmarkov @sjindel-google

mraleph commented 4 years ago

@sstrickl I think this might be a good place to start your inlining investigation project.

sstrickl commented 4 years ago

So looking at the example methods you mentioned, they're both intrinsics, and _Double.== is marked as never inline. Currently, our inliner bails out on intrinsics unless it's marked as always inline, and even if I added @pragma('vm:prefer-inline') to Double.getNegative, there's a InlineBailout in NativeCall in kernel_to_il.cc which will keep it from inlining (and the same would hold if I changed never to prefer in _Double.==).

So I'm investigating the whys of all of these decisions now, to decide whether there's anything that can be done to loosen them in a way that might allow us to inline these and similarly gatekept functions. I'd also appreciate any historical knowledge you and other long-time VM developers have about these decisions. (/cc @mkustermann )

mraleph commented 4 years ago

Yeah, I think this is another example of poor architecture in our compilation pipeline:

We have intrinsified functions which are never inlined by the compiler, even though if we were to inline them we would get a smaller code. We often handle this by providing hand written IL graphs in either call specializer or inliner or graph builder (sometimes we have duplicated code due to that).

I think it would be good to have some uniformity here - have a single piece of code that defines IL graph, which can be used either to produce normal function body (e.g. to use as intrinsic) and can be inlined / used by the call specialiser.

mraleph commented 2 years ago

We should definitely review possible approaches here. Currently we are unable to unbox method receivers which is causing us some code quality issues on methods of double and int. When I look at the graph of BoxConstraints.enforce for example I see the following:

Graph of package:flutter/src/rendering/box.dart_BoxConstraints_enforce

``` After AllocateRegisters ==== package:flutter/src/rendering/box.dart_BoxConstraints_enforce (RegularFunction) 0: B0[graph]:0 { } 2: B1[function entry]:2 { v2 <- Parameter(0) T{BoxConstraints} v3 <- Parameter(1) T{BoxConstraints} } 4: CheckStackOverflow:8(stack=0, loop=0) 5: ParallelMove r0 <- S+3 6: v6 <- LoadField(v2 . minWidth {final}) T{_Double} 7: ParallelMove r1 <- S+2 8: v8 <- LoadField(v3 . minWidth {final}) T{_Double} 10: v10 <- LoadField(v3 . maxWidth {final}) T{_Double} 12: v58 <- Box(v6) T{_Double} 14: v60 <- Box(v8) T{_Double} 15: ParallelMove S-2 <- r3 16: v62 <- Box(v10) T{_Double} 17: ParallelMove S-1 <- r4 18: PushArgument(v58) 20: PushArgument(v60) 22: PushArgument(v62) 24: v12 <- StaticCall:18( clamp<0> v58, v60, v62, using unchecked entrypoint, result_type = T{_Double}) T{_Double} 25: ParallelMove r1 <- r0, r0 <- S+3 26: ParallelMove S-3 <- r1 26: v14 <- LoadField(v2 . maxWidth {final}) T{_Double} 28: v66 <- Box(v14) T{_Double} 30: PushArgument(v66) 32: PushArgument(v60) 34: PushArgument(v62) 36: v20 <- StaticCall:26( clamp<0> v66, v60, v62, using unchecked entrypoint, result_type = T{_Double}) T{_Double} 37: ParallelMove r1 <- r0, r0 <- S+3 38: ParallelMove S-4 <- r1 38: v22 <- LoadField(v2 . minHeight {final}) T{_Double} 39: ParallelMove r2 <- S+2 40: v24 <- LoadField(v3 . minHeight {final}) T{_Double} 42: v26 <- LoadField(v3 . maxHeight {final}) T{_Double} 44: v74 <- Box(v22) T{_Double} 46: v76 <- Box(v24) T{_Double} 47: ParallelMove S-2 <- r3 48: v78 <- Box(v26) T{_Double} 49: ParallelMove S-1 <- r4 50: PushArgument(v74) 52: PushArgument(v76) 54: PushArgument(v78) 56: v28 <- StaticCall:34( clamp<0> v74, v76, v78, using unchecked entrypoint, result_type = T{_Double}) T{_Double} 57: ParallelMove r1 <- r0, r0 <- S+3 58: ParallelMove S-5 <- r1 58: v30 <- LoadField(v2 . maxHeight {final}) T{_Double} 60: v82 <- Box(v30) T{_Double} 62: PushArgument(v82) 64: PushArgument(v76) 66: PushArgument(v78) 68: v36 <- StaticCall:42( clamp<0> v82, v76, v78, using unchecked entrypoint, result_type = T{_Double}) T{_Double} 69: ParallelMove r1 <- r0, r0 <- S-3 70: ParallelMove S-1 <- r1 70: v64 <- Unbox(v12) T{_Double} 71: ParallelMove DS-6 <- v0 72: v4 <- AllocateObject:10(cls=BoxConstraints, ) T{BoxConstraints} 73: ParallelMove r0 <- r0, v0 <- DS-6 74: StoreInstanceField(v4 . minWidth = v64, NoStoreBarrier) 75: ParallelMove r1 <- S-4 76: v72 <- Unbox(v20) T{_Double} 78: StoreInstanceField(v4 . maxWidth = v72, NoStoreBarrier) 79: ParallelMove r1 <- S-5 80: v80 <- Unbox(v28) T{_Double} 82: StoreInstanceField(v4 . minHeight = v80, NoStoreBarrier) 83: ParallelMove r1 <- S-1 84: v88 <- Unbox(v36) T{_Double} 86: StoreInstanceField(v4 . maxHeight = v88, NoStoreBarrier) 87: ParallelMove r0 <- r0 88: Return:46(v4) *** END CFG ```

Note that all BoxConstraints fields are unboxed which causes them to be reboxed just to be passed down to statically invoked double.clamp. This is a lot of overhead (the implementation of double.clamp itself is potentially slow but that's a separate issue: https://github.com/dart-lang/sdk/issues/46879).

I have written a microbenchmark which tries to establish the cost, and the boxing is probably contributing 100 ps per invocation (on a slower Android ARM32 device):

I/flutter (18964): double.clamp: 318.80 ps
I/flutter (18964): naive clamp: 11.55 ps
I/flutter (18964): naive clamp (arg check): 20.36 ps
I/flutter (18964): naive clamp (boxed): 98.43 ps
I/flutter (18964): naive clamp (simd): 3.32 ps
I/flutter (18964): naive clamp (simd array): 6.13 ps

On a ARM64 device (faster, but downscaled to 1132800):

I/flutter (12151): double.clamp: 197.60 ps
I/flutter (12151): naive clamp: 7.38 ps
I/flutter (12151): naive clamp (arg check): 13.85 ps
I/flutter (12151): naive clamp (boxed): 94.33 ps
I/flutter (12151): naive clamp (simd): 3.28 ps
I/flutter (12151): naive clamp (simd array): 3.92 ps

We might want to apply some sort of worker-wrapper approach here, e.g. split methods on double and int into unboxed versions which do the work and wrappers which handle unboxing/boxing whenever possible.

Of course the original message of this issue still stands: all small functions need to be inlined.

gaaclarke commented 2 years ago

FYI I moved Flutter to its own implement of clamp: https://github.com/flutter/flutter/pull/103559

dart-lang / sdk

force inline small double/int methods, especially if parameters require boxing #40852