
Code optimization issues in generic binary search benchmark #53779


mraleph commented 1 year ago

These are notes from my investigation into the performance of the binary search code written by @scheglov.

We are comparing three different implementations of the algorithm:

- GenericBinarySearch: a generic Dart implementation parameterised by keyOf and compare callbacks (a sketch follows below);
- BinarySearch: a hand-specialised Dart implementation for Int64List (shown in full in the Block Scheduling section);
- a Rust implementation of the specialised search, serving as a baseline.
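For reference, here is a minimal sketch of what the generic variant looks like, reconstructed from the source and IL excerpts quoted later in this issue (the exact signature in @scheglov's code may differ):

int binarySearch<E>(
  List<E> sortedList,
  E key,
  E Function(E element) keyOf,
  int Function(E a, E b) compare,
) {
  var min = 0;
  var max = sortedList.length;
  while (min < max) {
    final mid = (min + max) >> 1;
    var element = sortedList[mid];
    var comp = compare(keyOf(element), key);
    if (comp == 0) {
      return mid;
    } else if (comp < 0) {
      min = mid + 1;
    } else {
      max = mid;
    }
  }
  return -1;
}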

These implementations are benchmarked using the following loop:

final Int64List searchIn;  // sorted array of length 1024 * 1024 * 128
final Int64List searchFor; // array of length 1000 containing data to search

{  // Benchmark iteration
  for (var i = 0; i < searchFor.length; i++) {
    binarySearch(searchIn, searchFor[i]);
  }
}

Searching for a new value on each iteration allows us to minimise the effect of branch prediction and CPU caches on the benchmark results. We measure four different variants of this benchmark: with 1, 10, 100 and 1000 distinct search values.

final searchForVariants = <int, Int64List>{
  for (var k in [1, 10, 100, 1000])
    k: Int64List.fromList(
      List.generate(searchFor.length, (i) => searchFor[i % k]),
    ),
};

Full code of the benchmarks is available here.

Initial benchmark results look like this (Dart is measured in AOT mode):

| Implementation | Variant | μs/iter |
| --- | --- | --- |
| Dart GenericBinarySearch | 1 | 423 |
| Dart GenericBinarySearch | 10 | 440 |
| Dart GenericBinarySearch | 100 | 548 |
| Dart GenericBinarySearch | 1000 | 757 |
| Dart BinarySearch | 1 | 60 |
| Dart BinarySearch | 10 | 56 |
| Dart BinarySearch | 100 | 58 |
| Dart BinarySearch | 1000 | 460 |
| Rust BinarySearch | 1 | 40 |
| Rust BinarySearch | 10 | 47 |
| Rust BinarySearch | 100 | 155 |
| Rust BinarySearch | 1000 | 504 |

[!NOTE]

Notice how memory / branch prediction effects completely dominate this benchmark. The rest of this issue concerns itself only with problems in the generated machine code caused by deficiencies in the Dart VM's backend. If you are interested in the topic of optimizing binary search for modern hardware, consider reading Algorithms for Modern Hardware - 12.1 Binary Search (by Sergey Slotin), which gives a well-rounded explanation of the various issues one must tackle. For deeper exposure check Array Layouts for Comparison-Based Searching (by Paul Khuong and Pat Morin). The following posts by Paul are also of interest:

Understanding GenericBinarySearch slowness

Profiling with perf reveals the following hot code in GenericBinarySearch.binarySearch:

       │     GenericCheckBound:50(v135 T{_Smi}, v61) int64
  2.00 │       cmp  rbx,rax
       │     ↓ jae  28c
       │     v156 <- LoadIndexed(v7 T{_Int64List}, v61 T{int}) int64
 42.14 │       mov  r10,QWORD PTR [rsi+r9*8+0x17]
       │     v139 <- BoxInt64(v156 T{int}) T{int}
  0.08 │       mov  rax,r10
  1.95 │       add  rax,rax
  2.78 │     ↓ jno  165
       │     → call stub _iso_stub_AllocateMintSharedWithoutFPURegsStub
       │       mov  QWORD PTR [rax+0x7],r10
       │     MoveArgument(v25 T{_Closure}, SP+1)
  0.00 │165:   mov  r11,QWORD PTR [rbp-0x10]
  0.58 │       mov  QWORD PTR [rsp+0x8],r11
       │     MoveArgument(v139 T{int}, SP+0)
  1.83 │       mov  QWORD PTR [rsp],rax
       │     ParallelMove rax <- S-2
  0.49 │       mov  rax,QWORD PTR [rbp-0x10]
       │     v63 <- ClosureCall:56( closure=v25<0>, v25, v139 T{int})
  0.01 │       mov  r10,QWORD PTR [r15+0xc7]
  0.32 │       mov  rcx,QWORD PTR [rax+0x37]
  2.40 │     → call rcx

[!NOTE]
You can get IL comments automatically interspersed with your perf profiles if you use --dwarf-stack-traces --code-comments --write-code-comments-as-synthetic-source-to=/tmp/comments.txt when compiling your AOT snapshot.

This corresponds to the following code in Dart:

var element = sortedList[mid];
var comp = compare(keyOf(element), key);

This means that keyOf did not get inlined. Looking at the IL dump (--print-flow-graph --print-flow-graph-filter=GenericBinarySearch_binarySearch) reveals the following code:

 v22 <- Constant(#Closure: <Y0>(Y0) => Y0 from Function 'identity': static.) 

 v23 <- LoadField(v22 . Closure.function {final}) T{Function}
 v24 <- LoadField(v22 . Closure.context {final}) T{Context}
 StaticCall:58( _boundsCheckForPartialInstantiation@9040228<0> v22, v18)

 v25 <- AllocateClosure:56(v23, v24) T{_Closure}

This code corresponds to the partial instantiation of identity, a generic function of type T Function<T>(T value) assigned to the parameter keyOf of type E Function(E element).
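For context, such a partial instantiation happens whenever a generic function is assigned to a non-generic function type. A minimal standalone example (illustrative, not the benchmark code):

T identity<T>(T value) => value;

void main() {
  // Assigning the generic `identity` to a non-generic function type
  // implicitly instantiates it with T = int; at runtime this allocates
  // a fresh closure (the AllocateClosure seen in the IL above).
  int Function(int) keyOf = identity;
  print(keyOf(42)); // 42
}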

We can observe the following problems:

Fixing the first three problems and rerunning the benchmark shows a 1.6x improvement. The hot code changes to:

       │     GenericCheckBound:50(v196 T{_Smi}, v64)
  0.23 │       cmp  rbx,rax
       │     ↓ jae  1ba
       │     v219 <- LoadIndexed(v3 T{_Int64List}, v64 T{int}) int64
 37.04 │       mov  rax,QWORD PTR [rdx+rcx*8+0x17]
       │     ParallelMove rdx <- rax
  0.07 │       mov  rdx,rax
       │     v200 <- BoxInt64(v219 T{int}) T{int}
  2.91 │       mov  rax,rdx
  3.32 │       add  rax,rax
  6.92 │     ↓ jno  e0
       │     → call stub _iso_stub_AllocateMintSharedWithoutFPURegsStub
       │       mov  QWORD PTR [rax+0x7],rdx
       │     v242 <- LoadClassId(v200 T{int}) int64
  0.02 │ e0:   test al,0x1
  0.37 │       mov  edx,0x3b
  3.88 │     ↓ je   ef
       │       mov  edx,DWORD PTR [rax-0x1]
       │       shr  edx,0xc
       │     MoveArgument(v200 T{int}, SP+1)
  0.09 │ ef:   mov  QWORD PTR [rsp+0x8],rax
       │     MoveArgument(v195 T{int}, SP+0)
  1.33 │       mov  QWORD PTR [rsp],r10
       │     ParallelMove rcx <- rdx, rax <- rcx
  0.19 │       mov  rax,rcx
  3.54 │       mov  rcx,rdx
       │     v156 <- DispatchTableCall( cid=v242 Comparable.compareTo<0>, v200 T{int}, v195 T{int}) int64
  0.05 │       mov  rax,QWORD PTR [r14+0x58]
  5.41 │     → call QWORD PTR [rax+rcx*8+0x480]

Now we successfully inline keyOf (which equals identity) and compare (which equals defaultCompare), but we fail to devirtualize int.compareTo even though we know that the receiver is an integer value (v200 <- BoxInt64(v219)).

This happens because the call specializer is unnecessarily scared by invocations on classes which are used as interfaces (i.e. int is actually an interface, implemented by _IntegerImplementation). This currently completely disables the devirtualization logic in AotCallSpecializer::VisitInstanceCall.
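A minimal example of the pattern (illustrative, not the benchmark code): the static receiver type is the interface int, yet every boxed integer is an instance of the single internal class _IntegerImplementation, so the call has exactly one possible target:

@pragma('vm:never-inline')
int compareInts(int a, int b) {
  // The receiver is statically typed as the interface `int`, but the only
  // concrete implementation is _IntegerImplementation, so this call could
  // be devirtualized to a static call (and subsequently inlined).
  return a.compareTo(b);
}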

Fixing this converts DispatchTableCall(Comparable.compareTo, ...) into a StaticCall(_IntegerImplementation.compareTo, ...), but this function is still not inlined. Looking at --trace-inlining output reveals the following about compareTo:

     Mark not inlinable
     Bailout: heuristics (default) with code size:  38, call sites: 3, inlining depth of callee: 0, const args: 0

This function is polymorphic with respect to its argument:

int compareTo(num other) {
  if (other is double) {
    // handle integer comparison with double
  }
  // handle integer comparison with integer
}

The inliner should be able to specialise it for a context which passes an int as the argument, but it does not. If we look at the IL we see that other is double became:

v199 <- UnboxedConstant(#61) int64
v169 <- Parameter(1) T{int}
v171 <- AssertAssignable:12(v169, v170, 'other', instantiator_type_args(v164), function_type_args(v164)) T{int}
v198 <- LoadClassId(v171) int64
Branch if EqualityCompare(v198 == v199) T{bool} goto (65, 77)

There are two problems here:

Fixing this last problem using @rmacnak-google's recently added range analysis for LoadClassId makes inlining of _IntegerImplementation.compareTo succeed:

Success
  with reason need to count first, code size 8, call sites: 0

This improves benchmark runtime:

GenericBinarySearch.1(RunTime): 105.96402224492113 us.
GenericBinarySearch.10(RunTime): 89.3001067995728 us.
GenericBinarySearch.100(RunTime): 184.5623735050598 us.
GenericBinarySearch.1000(RunTime): 503.161 us.

Looking at the hot code reveals that the inner (hot) loop is reasonably specialised; however, there is one obvious redundancy. compareTo followed by the if chain became:

         :   v290 <- LoadIndexed(v3 T{_Int64List}, v64 T{int}) int64
   52.04 :   aff047: mov    rbx,QWORD PTR [rdx+r13*8+0x17]
         :   Branch if RelationalOp(<, v290, v288) goto (79, 80)
    0.87 :   aff04c: cmp    rbx,r10
    3.93 :   aff04f: jge    aff061 <GenericBinarySearch.run+0x8d>
         :   B79
         :   ParallelMove rbx <- C goto:134 B86
    3.48 :   aff055: mov    rbx,0xffffffffffffffff
    0.16 :   aff05c: jmp    aff076 <GenericBinarySearch.run+0xa2>
         :   B80
         :   Branch if RelationalOp(>, v290, v288) goto (81, 82)
    2.86 :   aff061: cmp    rbx,r10
    0.81 :   aff064: jle    aff074 <GenericBinarySearch.run+0xa0>
         :   B81
         :   ParallelMove rbx <- C goto:150 B86
    0.60 :   aff06a: mov    ebx,0x1
    2.72 :   aff06f: jmp    aff076 <GenericBinarySearch.run+0xa2>
         :   B82
         :   ParallelMove rbx <- C goto:156 B86
    0.98 :   aff074: xor    ebx,ebx
         :   B86
         :   v271 <- BoxInt64(v221) [-1, 1] T{_Smi}
    0.00 :   aff076: mov    rax,rbx
    2.97 :   aff079: add    rax,rax
         :   Branch if StrictCompare(===, v271, v5) goto (21, 22)
    1.22 :   aff07c: test   rax,rax
    0.14 :   aff07f: je     aff0a6 <GenericBinarySearch.run+0xd2>
         :   B21
         :   B22
         :   Branch if RelationalOp(<, v221, v287) goto (23, 24)
    0.18 :   aff085: cmp    rbx,0x0
    3.08 :   aff089: jge    aff09e <GenericBinarySearch.run+0xca>

Not only did these branches not get fused together, but we also introduced a BoxInt64 + StrictCompare pair for a.compareTo(b) == 0. There are a few improvements that can be made to the code:

When looking at this code I also noticed two other things:

After prototyping an improvement to the BranchSimplify pass which can fuse a.compareTo(b) with the subsequent if-then-else cascade, I got the following result:

GenericBinarySearch.1(RunTime): 33.111468077175665 us.
GenericBinarySearch.10(RunTime): 43.18818799449605 us.
GenericBinarySearch.100(RunTime): 53.623792524149515 us.
GenericBinarySearch.1000(RunTime): 447.79553072625697 us.

This brings us to the same level as BinarySearch written in Dart. The hot loop now looks like this:

       │    B27
       │    CheckStackOverflow:108(stack=0, loop=1)
  0.35 │42:   cmp  rsp,QWORD PTR [r14+0x38]
       │    ↓ jbe  b6
       │    Branch if RelationalOp(<, v54, v55) T{bool} goto (40, 28)
  1.13 │4c:   cmp  r12,r9
       │    ↓ jge  9b
       │    B40
       │    ParallelMove r13 <- r9
  0.93 │      mov  r13,r9
       │    v62 <- BinaryInt64Op(- [tr], v55, v54) int64
 13.74 │      sub  r13,r12
       │    v63 <- ShiftInt64Op(>> [tr], v62, v238 T{_Smi}) int64
  0.10 │      sar  r13,1
       │    ParallelMove rbx <- r12
  1.11 │      mov  rbx,r12
       │    v64 <- BinaryInt64Op(+ [tr], v54, v63) int64
  0.91 │      add  rbx,r13
       │    ParallelMove rax <- rdi, rbx <- rbx, r13 <- rbx
 14.49 │      mov  rax,rdi
  0.19 │      mov  r13,rbx
       │    GenericCheckBound:50(v215 T{_Smi}, v64) int64
  1.27 │      cmp  rbx,rax
       │    ↓ jae  bf
       │    v239 <- LoadIndexed(v3 T{_Int64List}, v64 T{int}) int64
 33.00 │      mov  rbx,QWORD PTR [rdx+r13*8+0x17]
       │    Branch if RelationalOp(<, v239, v237 T{int}) goto (23, 80)
  3.10 │      cmp  rbx,r10
 11.10 │    ↓ jge  8d
       │    B23
       │    ParallelMove rax <- r13
  0.35 │      mov  rax,r13
       │    v73 <- BinaryInt64Op(+ [tr], v64 T{int}, v238 T{_Smi}) int64
  0.26 │      add  rax,0x1
       │    ParallelMove r12 <- rax goto:98 B25
  0.17 │      mov  r12,rax
  2.90 │    ↑ jmp  42
       │    B80
       │    Branch if RelationalOp(>, v239, v237 T{int}) goto (24, 21)
  0.98 │8d:   cmp  rbx,r10
  0.61 │    ↓ jle  9b
       │    B24
       │    ParallelMove r9 <- r13 goto:100 B25
  0.84 │      mov  r9,r13
 11.09 │    ↑ jmp  42

We could do a bit more here, e.g. fuse both RelationalOp comparisons into a single one, but at this point we are entering the territory of diminishing returns.
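For intuition, at the source level the fused code is morally equivalent to never materialising compareTo's -1/0/1 result and comparing the raw values directly, much like the hand-written BinarySearch shown in the Block Scheduling section below (a sketch, not actual compiler output):

import 'dart:typed_data';

int fusedBinarySearch(Int64List values, int key) {
  var min = 0;
  var max = values.length;
  while (min < max) {
    final mid = (min + max) >> 1;
    final element = values[mid];
    // The -1/0/1 comparison result never exists; the branches are fused
    // into a direct three-way comparison, matching the hot loop above.
    if (element < key) {
      min = mid + 1;
    } else if (element > key) {
      max = mid;
    } else {
      return mid;
    }
  }
  return -1;
}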

Other observations

GenericCheckBound issues

I noticed that the same version of binarySearch has different performance characteristics depending on how we measure it. Performance is worse if we write the measuring loop manually instead of using benchmark_harness:

final benchmark = BinarySearch(1, searchIn, searchForVariants[1]!);
for (var j = 0; j < 10; j++) {
  final sw = Stopwatch()..start();
  for (var i = 0; i < 10000; i++) {
    benchmark.run();
  }
  print(sw.elapsedMilliseconds);
}
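For reference, a small tweak to the loop above (not the original benchmark code) that prints the per-invocation time directly:

const runs = 10000;
for (var j = 0; j < 10; j++) {
  final sw = Stopwatch()..start();
  for (var i = 0; i < runs; i++) {
    benchmark.run();
  }
  sw.stop();
  // elapsedMicroseconds / runs gives microseconds per benchmark.run() call.
  print('${sw.elapsedMicroseconds / runs} us/run');
}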

This yields 64 us per benchmark.run invocation, compared to the 33 us reported by benchmark_harness. The difference might be caused by loop alignment, but while analysing it I also noticed that GenericCheckBound has problematic fixed-register constraints on its inputs:

       │    ParallelMove rax <- S-4, rbx <- r8
  0.67 │      mov  rax,QWORD PTR [rbp-0x20]
 11.24 │      mov  rbx,r8
       │    GenericCheckBound:28(v173 T{_Smi}, v108) int64
  0.76 │      cmp  rbx,rax
       │    ↓ jae  328
       │    v202 <- LoadIndexed(v61, v108 T{int}) int64
 37.42 │      mov  rax,QWORD PTR [rdx+r8*8+0x17]
       │    Branch if EqualityCompare(v202 T{int} == v172 T{int}) T{bool} goto (35, 36)
  8.95 │      cmp  rax,rcx
  0.45 │    ↓ je   23f

Having fixed-register constraints usually means:

Block Scheduling

When looking at the generated code for non-generic binarySearch:

@pragma('vm:never-inline')
static int binarySearch(Int64List values, int value) {
  var min = 0;
  var max = values.length;
  while (min < max) {
    final mid = (min + max) >> 1;
    final element = values[mid];
    if (element == value) {
      return mid;
    } else if (element < value) {
      min = mid + 1;
    } else {
      max = mid;
    }
  }
  return -1;
}

We can notice suboptimal code placement: the terminal block (the one corresponding to return mid) occurs in the middle of the hot loop instead of being emitted after the loop (see B3):

       │    B9
       │    CheckStackOverflow:74(stack=0, loop=1)
  0.52 │19:┌─→cmp  rsp,QWORD PTR [r14+0x38]
       │   │↓ jbe  80
       │   │Branch if RelationalOp(<, v6, v7) T{bool} goto (8, 10)
  0.00 │23:│  cmp  r8,rdi
       │   │↓ jge  74
       │   │B8
       │   │ParallelMove r9 <- r8
       │   │  mov  r9,r8
       │   │v14 <- BinaryInt64Op(+ [tr], v6, v7) int64
 12.28 │   │  add  r9,rdi
       │   │v15 <- ShiftInt64Op(>> [tr], v14, v40 T{_Smi}) int64
  0.58 │   │  sar  r9,1
       │   │ParallelMove rax <- rdx, rbx <- r9
  0.00 │   │  mov  rax,rdx
  0.99 │   │  mov  rbx,r9
       │   │GenericCheckBound:28(v34 T{_Smi}, v15) int64
 21.19 │   │  cmp  rbx,rax
       │   │↓ jae  89
       │   │v41 <- LoadIndexed(v2, v15 T{int}) int64
 26.32 │   │  mov  rbx,QWORD PTR [rcx+r9*8+0x17]
       │   │Branch if EqualityCompare(v41 T{int} == v3) T{bool} goto (3, 4)
  0.44 │   │  cmp  rbx,rsi
  8.93 │   │↓ jne  5a
       │   │B3
       │   │ParallelMove rax <- r9
       │   │  mov  rax,r9
       │   │Return:42(v15 T{int})
       │   │  mov  rsp,rbp
  0.00 │   │  pop  rbp
  0.51 │   │← ret
       │   │B4
       │   │Branch if RelationalOp(<, v41 T{int}, v3) T{bool} goto (5, 6)
       │5a:│  cmp  rbx,rsi
 14.38 │   │↓ jge  6f
       │   │B5
       │   │ParallelMove rbx <- r9
       │   │  mov  rbx,r9
       │   │v21 <- BinaryInt64Op(+ [tr], v15 T{int}, v40 T{_Smi}) int64
       │   │  add  rbx,0x1
       │   │ParallelMove r8 <- rbx, rdi <- rdi goto:64 B7
       │   │  mov  r8,rbx
       │   │↑ jmp  19
       │   │B6
       │   │ParallelMove r8 <- r8, rdi <- r9 goto:66 B7
       │6f:│  mov  rdi,r9
 12.68 │   └──jmp  19

This is due to the very simplistic algorithm used by the block scheduler in AOT: it only attempts to move always-throwing code to the end, and does not attempt to lay out loops tightly. Otherwise the code is simply emitted in reverse postorder.
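To make that strategy concrete, here is a toy model of it in Dart (illustrative only; the real block scheduler lives in the VM's C++ backend, and Block / scheduleBlocks are hypothetical names):

class Block {
  final int id;
  final List<Block> successors = [];
  final bool alwaysThrows;
  Block(this.id, {this.alwaysThrows = false});
}

List<Block> scheduleBlocks(Block entry) {
  final visited = <Block>{};
  final postorder = <Block>[];
  void dfs(Block b) {
    if (!visited.add(b)) return;
    b.successors.forEach(dfs);
    postorder.add(b);
  }

  dfs(entry);
  // Emit in reverse postorder; the only reordering is sinking
  // always-throwing blocks to the end. Nothing attempts to keep loop
  // bodies contiguous, which produces the layout seen above.
  final rpo = postorder.reversed.toList();
  return [
    ...rpo.where((b) => !b.alwaysThrows),
    ...rpo.where((b) => b.alwaysThrows),
  ];
}

void main() {
  final b0 = Block(0), b1 = Block(1), b2 = Block(2, alwaysThrows: true);
  b0.successors.addAll([b1, b2]);
  b1.successors.add(b0); // loop back edge
  print(scheduleBlocks(b0).map((b) => b.id).toList()); // [0, 1, 2]
}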

mraleph commented 1 year ago

I have (half-baked / prototype) fixes for some of these issues, but some of them require more thought. I will continue branching out issues from this one (for individual items); if you want to jump in and fix things, just let me know :)

/cc @alexmarkov @aam @mkustermann