mraleph opened 1 year ago
I have (half baked / prototype) fixes for some of these issues, but some of these require more thought. I will continue branching out issues from this one (for individual items), if you want to jump on fixing things - just let me know :)
/cc @alexmarkov @aam @mkustermann
These are notes from my investigation into the performance of the binary search code written by @scheglov.
We are comparing three different implementations of the algorithm:

- `GenericBinarySearch`: searches for an `E` value in the given `List<E>` using configurable comparator and projection functions.
- `BinarySearch`: searches for an `int` value in the given `Int64List`.
- a Rust implementation: searches for `i64` values in the given `&[i64]`.

These implementations are benchmarked using the following loop:
Searching for a new value on each iteration allows us to minimise the effect of branch prediction and CPU caches on the benchmark results. We measure 3 different variants of this benchmark: with 1000, 100 and 10 different search values.
Full code of the benchmarks is available here.
Initial benchmark results look like this (Dart is measured in AOT mode):
## Understanding `GenericBinarySearch` slowness

Profiling with `perf` reveals the following hot code in `GenericBinarySearch.binarySearch`:

This corresponds to the following code in Dart:
This means that `keyOf` did not get inlined. Looking at the IL dump (`--print-flow-graph --print-flow-graph-filter=GenericBinarySearch_binarySearch`) reveals the following code:

This code corresponds to partial instantiation of `identity`, which is a generic function `T Function<T>(T value)` assigned to the parameter of type `E Function(E element) keyOf`. We can observe the following problems:
- `v22` is a constant, but we fail to evaluate `v22.function` and `v22.context`, which makes instantiation heavier than necessary (see `LoadFieldInstr::TryEvaluateLoad`).
- We call `_boundsCheckForPartialInstantiation` (which is a heavy runtime call), even though the underlying closure does not actually have any type parameter bounds.
- We do not inline `binarySearch` into the caller, which would allow us to specialise the code for `E == int`.
- We do not infer `E == int` in TFA, even though that's the only invocation and the type argument is fixed.

Fixing the first three problems and rerunning the benchmark shows a 1.6x improvement. The hot code changes to:
Now we successfully inline `keyOf` (which equals `identity`) and `compare` (which equals `defaultCompare`), but we fail to devirtualize `int.compareTo` even though we know that the receiver is an integer value (`v200 <- BoxInt64(v219)`).

This happens because the call specializer is unnecessarily scared by invocations in classes which are used as interfaces (i.e. `int` is actually an interface implemented by `_IntegerImplementation`). This currently completely disables the devirtualization logic in `AotCallSpecializer::VisitInstanceCall`. We should change `AotCallSpecializer::VisitInstanceCall` to apply devirtualization logic to calls through interfaces as well.

Fixing this converts `DispatchTableCall(Comparable.compareTo, ...)` into a `StaticCall(_IntegerImplementation.compareTo, ...)`, but this function is still not inlined. Looking at `--trace-inlining` output reveals the following about `compareTo`:

This function is polymorphic with respect to its argument:
The inliner should be able to specialise it for a context which passes `int` as an argument, but it does not. If we look at the IL we see that `other is double` became:

There are two problems here:

- `AssertAssignable` is not eliminated when constructing the callee graph. This happens because the inliner creates `ParameterInstr` with representation set to `kNoRepresentation`, and `Canonicalization` avoids canonicalising any instructions which have unmatched input representations and guarded inputs. This is a minor problem in the context of this microbenchmark, but it might mean that the inliner is looking at larger than necessary callee graphs.
- We do not fold away `EqualityCompare(LoadClassId(v171) == kDoubleCid)` even though we know that it would never succeed.

Fixing this last problem using @rmacnak-google's recently added range analysis for `LoadClassId` makes inlining of `_IntegerImplementation.compareTo` succeed:

This improves benchmark runtime:
Looking at the hot code reveals that the inner (hot) loop is reasonably specialised, but there is one obvious redundancy. `compareTo` followed by the `if` chain became:

Not only did these branches not get fused together, but we also introduced `BoxInt64` + `StrictCompare` for `a.compareTo(b) == 0`. There are a few improvements that can be made to the code:

- Avoid boxing for `StrictCompare` if the incoming argument is already unboxed.
- Simplify branching on a `Phi` with constant arguments. The existing `BranchSimplifier` pass can't handle this code because it only triggers on single-use `Phi`s.

When looking at this code I have also noticed two other things:
- There are `Redefinition(v)` instructions left in the graph where the compile type of `v` is clearly assignable to the constrained type of the redefinition (e.g. the constrained type is `int?` and the compile type of `v` is `_Smi`). Such redefinitions can be removed.
- We do not fold the `Branch Comparison(a, b) === true` pattern into `Branch Comparison(a, b)` eagerly enough, because at early stages of compilation `Comparison(a, b)` often has an unmatched representation (e.g. it requires unboxed integer values, but unboxing instructions are not yet inserted). This only makes sense when the comparison is speculative.

After I prototyped an improvement to the `BranchSimplify` pass which could fuse `a.compareTo(b)` with the subsequent `if-then-else` cascade, I got the following result:

This brings us to the same level as `BinarySearch` written in Dart. The hot loop now looks like this:

We could do a bit more here, e.g. fuse both `RelationalOp` comparisons into a single one, but at this point we are entering the territory of diminishing returns.

## Other observations
### `GenericCheckBound` issues

I noticed that the same version of `binarySearch` has different performance characteristics depending on how we measure its performance. Performance is worse if we write the measuring loop manually instead of using `benchmark_harness`:

This yields `64 us` per `benchmark.run` invocation compared to `33 us` reported by `benchmark_harness`. The difference might be caused by loop alignment, but while analyzing it I also noticed that `GenericCheckBound` has problematic fixed-register constraints on its inputs. Having fixed register constraints usually means:
- Additional parallel moves emitted, shuffling values around.
- If the length value got spilled (e.g. there are multiple nested loops and the length value got hoisted out), it will end up being used directly from the spill slot even if register pressure inside the innermost loop would allow keeping it in a register. This happens because the `rax <- P` move created by the register allocator when resolving fixed-register input constraints does not require the incoming value to be in a register, and the register allocator is eager to resolve all such uses to the spill slot if the value is already spilled and there are no requires-register uses left.

- [ ] Evaluate if `GenericCheckBound` could use non-fixed inputs.
- [ ] Evaluate if the register allocator should be less eager about using the spill slot for `R <- P` moves inside loops.

### Block Scheduling
When looking at the generated code for the non-generic `binarySearch`:

We can notice suboptimal code placement: the terminal block (the one corresponding to `return mid`) occurs in the middle of the hot loop instead of being emitted after the loop (see `B3`):

This is due to the very simplistic algorithm used by the `block_scheduler` in AOT: it only attempts to move always-throwing code to the end, but does not attempt to lay out loops tightly. The code is simply emitted in reverse postorder.