mraleph opened this issue 4 years ago
@sstrickl I think this might be a good place to start your inlining investigation project.
So looking at the example methods you mentioned, they're both intrinsics, and `_Double.==` is marked as never inline. Currently, our inliner bails out on intrinsics unless they're marked as always inline, and even if I added `@pragma('vm:prefer-inline')` to `Double.getNegative`, there's an `InlineBailout` in `NativeCall` in kernel_to_il.cc which would keep it from being inlined (and the same would hold if I changed never to prefer in `_Double.==`).
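For readers unfamiliar with the annotation mentioned above, here is a minimal sketch of how the pragma is applied. The helper name `isNegativeOf` is my own invention for illustration; only the `@pragma('vm:prefer-inline')` annotation itself is from the discussion.

```dart
// Hypothetical helper marked with the inlining pragma discussed above.
// The pragma only tweaks the VM's inlining heuristics; behavior is unchanged.
@pragma('vm:prefer-inline')
bool isNegativeOf(double x) => x.isNegative;

void main() {
  print(isNegativeOf(-1.0)); // true
  print(isNegativeOf(2.0)); // false
}
```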
So I'm investigating the whys of all of these decisions now, to decide whether there's anything that can be done to loosen them in a way that might allow us to inline these and similarly gatekept functions. I'd also appreciate any historical knowledge you and other long-time VM developers have about these decisions. (/cc @mkustermann )
Yeah, I think this is another example of poor architecture in our compilation pipeline:
We have intrinsified functions which are never inlined by the compiler, even though inlining them would give us smaller code. We often handle this by providing hand-written IL graphs in the call specializer, the inliner, or the graph builder (sometimes with duplicated code as a result).
I think it would be good to have some uniformity here: a single piece of code that defines the IL graph, which can be used both to produce a normal function body (e.g. as an intrinsic) and to be inlined or consumed by the call specializer.
We should definitely review possible approaches here. Currently we are unable to unbox method receivers, which is causing us some code-quality issues on methods of `double` and `int`. When I look at the graph of `BoxConstraints.enforce`, for example, I see the following:
Note that all `BoxConstraints` fields are unboxed, which causes them to be reboxed just to be passed down to the statically invoked `double.clamp`. This is a lot of overhead (the implementation of `double.clamp` itself is potentially slow, but that's a separate issue: https://github.com/dart-lang/sdk/issues/46879).
I have written a microbenchmark which tries to establish the cost; the boxing is probably contributing around 100 ns per invocation (on a slower Android ARM32 device):
```
I/flutter (18964): double.clamp: 318.80 ns
I/flutter (18964): naive clamp: 11.55 ns
I/flutter (18964): naive clamp (arg check): 20.36 ns
I/flutter (18964): naive clamp (boxed): 98.43 ns
I/flutter (18964): naive clamp (simd): 3.32 ns
I/flutter (18964): naive clamp (simd array): 6.13 ns
```
On an ARM64 device (faster, but downscaled to 1132800):
```
I/flutter (12151): double.clamp: 197.60 ns
I/flutter (12151): naive clamp: 7.38 ns
I/flutter (12151): naive clamp (arg check): 13.85 ns
I/flutter (12151): naive clamp (boxed): 94.33 ns
I/flutter (12151): naive clamp (simd): 3.28 ns
I/flutter (12151): naive clamp (simd array): 3.92 ns
```
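The "naive clamp" variants above presumably look something like the following sketch. This is my own reconstruction, not the actual benchmark source; the names `naiveClamp` and `naiveClampChecked` are hypothetical.

```dart
// Sketch of a hand-written clamp that avoids the num-typed double.clamp
// path: with unboxed double parameters, an inlined call needs no boxing.
@pragma('vm:prefer-inline')
double naiveClamp(double x, double lo, double hi) =>
    x < lo ? lo : (x > hi ? hi : x);

// Sketch of the "arg check" variant: validates the range up front, roughly
// mirroring the precondition that num.clamp imposes on its limits.
double naiveClampChecked(double x, double lo, double hi) {
  if (lo > hi) {
    throw ArgumentError('lo must be <= hi');
  }
  return naiveClamp(x, lo, hi);
}

void main() {
  print(naiveClamp(5.0, 0.0, 1.0)); // 1.0
  print(naiveClamp(-3.0, 0.0, 1.0)); // 0.0
  print(naiveClampChecked(0.5, 0.0, 1.0)); // 0.5
}
```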
We might want to apply some sort of worker-wrapper approach here, e.g. split methods on `double` and `int` into unboxed versions which do the work and wrappers which handle unboxing/boxing whenever possible.
Of course the original message of this issue still stands: all small functions need to be inlined.
FYI I moved Flutter to its own implementation of clamp: https://github.com/flutter/flutter/pull/103559
When looking at the native code produced for `scaleRadii` from Flutter's dart:ui, I see us emitting code like this:
This generates more code than we would get by ensuring that the methods `isNegative` and `==` are inlined. We should investigate why and how often this happens and ensure that they are inlined.
/cc @mkustermann @alexmarkov @sjindel-google