
Code optimization issues in generic binary search benchmark #53779


mraleph commented 1 year ago

These are notes from my investigation into the performance of the binary search code written by @scheglov.

We are comparing three different implementations of the algorithm:

- GenericBinarySearch: a generic Dart implementation parameterised by keyOf and compare callbacks (a sketch follows below);
- BinarySearch: a hand-specialised Dart implementation for Int64List (shown in full in the Block Scheduling section);
- a Rust implementation of the specialised search, serving as a baseline.
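For reference, here is a minimal sketch of what the generic variant looks like, reconstructed from the source and IL excerpts quoted later in this issue (the exact signature in @scheglov's code may differ):

int binarySearch<E>(
  List<E> sortedList,
  E key,
  E Function(E element) keyOf,
  int Function(E a, E b) compare,
) {
  var min = 0;
  var max = sortedList.length;
  while (min < max) {
    final mid = (min + max) >> 1;
    var element = sortedList[mid];
    var comp = compare(keyOf(element), key);
    if (comp == 0) {
      return mid;
    } else if (comp < 0) {
      min = mid + 1;
    } else {
      max = mid;
    }
  }
  return -1;
}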

These implementations are benchmarked using the following loop:

final Int64List searchIn;  // sorted array of length 1024 * 1024 * 128
final Int64List searchFor; // array of length 1000 containing data to search

{  // Benchmark iteration
  for (var i = 0; i < searchFor.length; i++) {
    binarySearch(searchIn, searchFor[i]);
  }
}

Searching for a new value on each iteration allows us to minimise the effect of branch prediction and CPU caches on the benchmark results. We measure four different variants of this benchmark: with 1, 10, 100 and 1000 distinct search values.

final searchForVariants = <int, Int64List>{
  for (var k in [1, 10, 100, 1000])
    k: Int64List.fromList(
      List.generate(searchFor.length, (i) => searchFor[i % k]),
    ),
};

Full code of the benchmarks is available here.

Initial benchmark results look like this (Dart is measured in AOT mode):

| Implementation | Variant | μs/iter |
| --- | --- | --- |
| Dart GenericBinarySearch | 1 | 423 |
| Dart GenericBinarySearch | 10 | 440 |
| Dart GenericBinarySearch | 100 | 548 |
| Dart GenericBinarySearch | 1000 | 757 |
| Dart BinarySearch | 1 | 60 |
| Dart BinarySearch | 10 | 56 |
| Dart BinarySearch | 100 | 58 |
| Dart BinarySearch | 1000 | 460 |
| Rust BinarySearch | 1 | 40 |
| Rust BinarySearch | 10 | 47 |
| Rust BinarySearch | 100 | 155 |
| Rust BinarySearch | 1000 | 504 |

[!NOTE]

Notice how memory / branch prediction effects completely dominate this benchmark. The rest of this issue concerns itself only with problems in the generated machine code caused by deficiencies in the Dart VM's backend. If you are interested in the topic of optimizing binary search for modern hardware, consider reading Algorithms for Modern Hardware - 12.1 Binary Search (by Sergey Slotin), which gives a well-rounded explanation of the various issues one must tackle. For deeper exposure check Array Layouts for Comparison-Based Searching (by Paul Khuong and Pat Morin). The following posts by Paul are also of interest:

Understanding GenericBinarySearch slowness

Profiling with perf reveals the following hot code in GenericBinarySearch.binarySearch:

       │     GenericCheckBound:50(v135 T{_Smi}, v61) int64
  2.00 │       cmp  rbx,rax
       │     ↓ jae  28c
       │     v156 <- LoadIndexed(v7 T{_Int64List}, v61 T{int}) int64
 42.14 │       mov  r10,QWORD PTR [rsi+r9*8+0x17]
       │     v139 <- BoxInt64(v156 T{int}) T{int}
  0.08 │       mov  rax,r10
  1.95 │       add  rax,rax
  2.78 │     ↓ jno  165
       │     → call stub _iso_stub_AllocateMintSharedWithoutFPURegsStub
       │       mov  QWORD PTR [rax+0x7],r10
       │     MoveArgument(v25 T{_Closure}, SP+1)
  0.00 │165:   mov  r11,QWORD PTR [rbp-0x10]
  0.58 │       mov  QWORD PTR [rsp+0x8],r11
       │     MoveArgument(v139 T{int}, SP+0)
  1.83 │       mov  QWORD PTR [rsp],rax
       │     ParallelMove rax <- S-2
  0.49 │       mov  rax,QWORD PTR [rbp-0x10]
       │     v63 <- ClosureCall:56( closure=v25<0>, v25, v139 T{int})
  0.01 │       mov  r10,QWORD PTR [r15+0xc7]
  0.32 │       mov  rcx,QWORD PTR [rax+0x37]
  2.40 │     → call rcx

[!NOTE]
You can get IL comments automatically interspersed with your perf profiles if you use --dwarf-stack-traces --code-comments --write-code-comments-as-synthetic-source-to=/tmp/comments.txt when compiling your AOT snapshot.

This corresponds to the following code in Dart:

var element = sortedList[mid];
var comp = compare(keyOf(element), key);

This means that keyOf did not get inlined. Looking at the IL dump (--print-flow-graph --print-flow-graph-filter=GenericBinarySearch_binarySearch) reveals the following code:

 v22 <- Constant(#Closure: <Y0>(Y0) => Y0 from Function 'identity': static.) 

 v23 <- LoadField(v22 . Closure.function {final}) T{Function}
 v24 <- LoadField(v22 . Closure.context {final}) T{Context}
 StaticCall:58( _boundsCheckForPartialInstantiation@9040228<0> v22, v18)

 v25 <- AllocateClosure:56(v23, v24) T{_Closure}

This code corresponds to the partial instantiation of identity, a generic function of type T Function<T>(T value) assigned to the parameter keyOf of type E Function(E element).
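For context, such a partial instantiation happens whenever a generic function is assigned to a non-generic function type. A minimal standalone example (illustrative, not the benchmark code):

T identity<T>(T value) => value;

void main() {
  // Assigning the generic `identity` to a non-generic function type
  // implicitly instantiates it with T = int; at runtime this allocates
  // a fresh closure (the AllocateClosure seen in the IL above).
  int Function(int) keyOf = identity;
  print(keyOf(42)); // 42
}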

We can observe the following problems:

Fixing the first three problems and rerunning the benchmark shows a 1.6x improvement. The hot code changes to:

       │     GenericCheckBound:50(v196 T{_Smi}, v64)
  0.23 │       cmp  rbx,rax
       │     ↓ jae  1ba
       │     v219 <- LoadIndexed(v3 T{_Int64List}, v64 T{int}) int64
 37.04 │       mov  rax,QWORD PTR [rdx+rcx*8+0x17]
       │     ParallelMove rdx <- rax
  0.07 │       mov  rdx,rax
       │     v200 <- BoxInt64(v219 T{int}) T{int}
  2.91 │       mov  rax,rdx
  3.32 │       add  rax,rax
  6.92 │     ↓ jno  e0
       │     → call stub _iso_stub_AllocateMintSharedWithoutFPURegsStub
       │       mov  QWORD PTR [rax+0x7],rdx
       │     v242 <- LoadClassId(v200 T{int}) int64
  0.02 │ e0:   test al,0x1
  0.37 │       mov  edx,0x3b
  3.88 │     ↓ je   ef
       │       mov  edx,DWORD PTR [rax-0x1]
       │       shr  edx,0xc
       │     MoveArgument(v200 T{int}, SP+1)
  0.09 │ ef:   mov  QWORD PTR [rsp+0x8],rax
       │     MoveArgument(v195 T{int}, SP+0)
  1.33 │       mov  QWORD PTR [rsp],r10
       │     ParallelMove rcx <- rdx, rax <- rcx
  0.19 │       mov  rax,rcx
  3.54 │       mov  rcx,rdx
       │     v156 <- DispatchTableCall( cid=v242 Comparable.compareTo<0>, v200 T{int}, v195 T{int}) int64
  0.05 │       mov  rax,QWORD PTR [r14+0x58]
  5.41 │     → call QWORD PTR [rax+rcx*8+0x480]

Now we successfully inline keyOf (which equals identity) and compare (which equals defaultCompare), but we fail to devirtualize int.compareTo even though we know that the receiver is an integer value (v200 <- BoxInt64(v219)).

This happens because the call specializer is unnecessarily scared by invocations on classes which are used as interfaces (i.e. int is actually an interface, implemented by _IntegerImplementation). This currently completely disables the devirtualization logic in AotCallSpecializer::VisitInstanceCall.
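A minimal example of the pattern (illustrative, not the benchmark code): the static receiver type is the interface int, yet every boxed integer is an instance of the single internal class _IntegerImplementation, so the call has exactly one possible target:

@pragma('vm:never-inline')
int compareInts(int a, int b) {
  // The receiver is statically typed as the interface `int`, but the only
  // concrete implementation is _IntegerImplementation, so this call could
  // be devirtualized to a static call (and subsequently inlined).
  return a.compareTo(b);
}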

Fixing this converts DispatchTableCall(Comparable.compareTo, ...) into a StaticCall(_IntegerImplementation.compareTo, ...), but this function is still not inlined. Looking at --trace-inlining output reveals the following about compareTo:

     Mark not inlinable
     Bailout: heuristics (default) with code size:  38, call sites: 3, inlining depth of callee: 0, const args: 0

This function is polymorphic with respect to its argument:

int compareTo(num other) {
  if (other is double) {
    // handle integer comparison with double
  }
  // handle integer comparison with integer
}

The inliner should be able to specialise it for a context which passes an int as the argument, but it does not. If we look at the IL we see that other is double became:

v199 <- UnboxedConstant(#61) int64
v169 <- Parameter(1) T{int}
v171 <- AssertAssignable:12(v169, v170, 'other', instantiator_type_args(v164), function_type_args(v164)) T{int}
v198 <- LoadClassId(v171) int64
Branch if EqualityCompare(v198 == v199) T{bool} goto (65, 77)

There are two problems here:

Fixing this last problem using @rmacnak-google's recently added range analysis for LoadClassId makes inlining of _IntegerImplementation.compareTo succeed:

Success
  with reason need to count first, code size 8, call sites: 0

This improves benchmark runtime:

GenericBinarySearch.1(RunTime): 105.96402224492113 us.
GenericBinarySearch.10(RunTime): 89.3001067995728 us.
GenericBinarySearch.100(RunTime): 184.5623735050598 us.
GenericBinarySearch.1000(RunTime): 503.161 us.

Looking at the hot code reveals that the inner (hot) loop is reasonably specialised; however, there is one obvious redundancy. compareTo followed by the if chain became:

         :   v290 <- LoadIndexed(v3 T{_Int64List}, v64 T{int}) int64
   52.04 :   aff047: mov    rbx,QWORD PTR [rdx+r13*8+0x17]
         :   Branch if RelationalOp(<, v290, v288) goto (79, 80)
    0.87 :   aff04c: cmp    rbx,r10
    3.93 :   aff04f: jge    aff061 <GenericBinarySearch.run+0x8d>
         :   B79
         :   ParallelMove rbx <- C goto:134 B86
    3.48 :   aff055: mov    rbx,0xffffffffffffffff
    0.16 :   aff05c: jmp    aff076 <GenericBinarySearch.run+0xa2>
         :   B80
         :   Branch if RelationalOp(>, v290, v288) goto (81, 82)
    2.86 :   aff061: cmp    rbx,r10
    0.81 :   aff064: jle    aff074 <GenericBinarySearch.run+0xa0>
         :   B81
         :   ParallelMove rbx <- C goto:150 B86
    0.60 :   aff06a: mov    ebx,0x1
    2.72 :   aff06f: jmp    aff076 <GenericBinarySearch.run+0xa2>
         :   B82
         :   ParallelMove rbx <- C goto:156 B86
    0.98 :   aff074: xor    ebx,ebx
         :   B86
         :   v271 <- BoxInt64(v221) [-1, 1] T{_Smi}
    0.00 :   aff076: mov    rax,rbx
    2.97 :   aff079: add    rax,rax
         :   Branch if StrictCompare(===, v271, v5) goto (21, 22)
    1.22 :   aff07c: test   rax,rax
    0.14 :   aff07f: je     aff0a6 <GenericBinarySearch.run+0xd2>
         :   B21
         :   B22
         :   Branch if RelationalOp(<, v221, v287) goto (23, 24)
    0.18 :   aff085: cmp    rbx,0x0
    3.08 :   aff089: jge    aff09e <GenericBinarySearch.run+0xca>

Not only did these branches not get fused together, but we also introduced a BoxInt64 + StrictCompare pair for a.compareTo(b) == 0. There are a few improvements that can be made to the code:

When looking at this code I also noticed two other things:

After prototyping an improvement to the BranchSimplify pass which can fuse a.compareTo(b) with the subsequent if-then-else cascade, I got the following result:

GenericBinarySearch.1(RunTime): 33.111468077175665 us.
GenericBinarySearch.10(RunTime): 43.18818799449605 us.
GenericBinarySearch.100(RunTime): 53.623792524149515 us.
GenericBinarySearch.1000(RunTime): 447.79553072625697 us.

This brings us to the same level as BinarySearch written in Dart. The hot loop now looks like this:

       │    B27
       │    CheckStackOverflow:108(stack=0, loop=1)
  0.35 │42:   cmp  rsp,QWORD PTR [r14+0x38]
       │    ↓ jbe  b6
       │    Branch if RelationalOp(<, v54, v55) T{bool} goto (40, 28)
  1.13 │4c:   cmp  r12,r9
       │    ↓ jge  9b
       │    B40
       │    ParallelMove r13 <- r9
  0.93 │      mov  r13,r9
       │    v62 <- BinaryInt64Op(- [tr], v55, v54) int64
 13.74 │      sub  r13,r12
       │    v63 <- ShiftInt64Op(>> [tr], v62, v238 T{_Smi}) int64
  0.10 │      sar  r13,1
       │    ParallelMove rbx <- r12
  1.11 │      mov  rbx,r12
       │    v64 <- BinaryInt64Op(+ [tr], v54, v63) int64
  0.91 │      add  rbx,r13
       │    ParallelMove rax <- rdi, rbx <- rbx, r13 <- rbx
 14.49 │      mov  rax,rdi
  0.19 │      mov  r13,rbx
       │    GenericCheckBound:50(v215 T{_Smi}, v64) int64
  1.27 │      cmp  rbx,rax
       │    ↓ jae  bf
       │    v239 <- LoadIndexed(v3 T{_Int64List}, v64 T{int}) int64
 33.00 │      mov  rbx,QWORD PTR [rdx+r13*8+0x17]
       │    Branch if RelationalOp(<, v239, v237 T{int}) goto (23, 80)
  3.10 │      cmp  rbx,r10
 11.10 │    ↓ jge  8d
       │    B23
       │    ParallelMove rax <- r13
  0.35 │      mov  rax,r13
       │    v73 <- BinaryInt64Op(+ [tr], v64 T{int}, v238 T{_Smi}) int64
  0.26 │      add  rax,0x1
       │    ParallelMove r12 <- rax goto:98 B25
  0.17 │      mov  r12,rax
  2.90 │    ↑ jmp  42
       │    B80
       │    Branch if RelationalOp(>, v239, v237 T{int}) goto (24, 21)
  0.98 │8d:   cmp  rbx,r10
  0.61 │    ↓ jle  9b
       │    B24
       │    ParallelMove r9 <- r13 goto:100 B25
  0.84 │      mov  r9,r13
 11.09 │    ↑ jmp  42

We could do a bit more here, e.g. fuse both RelationalOp comparisons into a single one, but at this point we are entering the territory of diminishing returns.
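For intuition, at the source level the fused code is morally equivalent to never materialising compareTo's -1/0/1 result and comparing the raw values directly, much like the hand-written BinarySearch shown in the Block Scheduling section below (a sketch, not actual compiler output):

import 'dart:typed_data';

int fusedBinarySearch(Int64List values, int key) {
  var min = 0;
  var max = values.length;
  while (min < max) {
    final mid = (min + max) >> 1;
    final element = values[mid];
    // The -1/0/1 comparison result never exists; the branches are fused
    // into a direct three-way comparison, matching the hot loop above.
    if (element < key) {
      min = mid + 1;
    } else if (element > key) {
      max = mid;
    } else {
      return mid;
    }
  }
  return -1;
}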

Other observations

GenericCheckBound issues

I noticed that the same version of binarySearch has different performance characteristics depending on how we measure it. Performance is worse if we write the measuring loop manually instead of using benchmark_harness:

final benchmark = BinarySearch(1, searchIn, searchForVariants[1]!);
for (var j = 0; j < 10; j++) {
  final sw = Stopwatch()..start();
  for (var i = 0; i < 10000; i++) {
    benchmark.run();
  }
  print(sw.elapsedMilliseconds);
}
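For reference, a small tweak to the loop above (not the original benchmark code) that prints the per-invocation time directly:

const runs = 10000;
for (var j = 0; j < 10; j++) {
  final sw = Stopwatch()..start();
  for (var i = 0; i < runs; i++) {
    benchmark.run();
  }
  sw.stop();
  // elapsedMicroseconds / runs gives microseconds per benchmark.run() call.
  print('${sw.elapsedMicroseconds / runs} us/run');
}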

This yields 64 us per benchmark.run invocation, compared to the 33 us reported by benchmark_harness. The difference might be caused by loop alignment, but while analysing it I also noticed that GenericCheckBound has problematic fixed-register constraints on its inputs:

       │    ParallelMove rax <- S-4, rbx <- r8
  0.67 │      mov  rax,QWORD PTR [rbp-0x20]
 11.24 │      mov  rbx,r8
       │    GenericCheckBound:28(v173 T{_Smi}, v108) int64
  0.76 │      cmp  rbx,rax
       │    ↓ jae  328
       │    v202 <- LoadIndexed(v61, v108 T{int}) int64
 37.42 │      mov  rax,QWORD PTR [rdx+r8*8+0x17]
       │    Branch if EqualityCompare(v202 T{int} == v172 T{int}) T{bool} goto (35, 36)
  8.95 │      cmp  rax,rcx
  0.45 │    ↓ je   23f

Having fixed-register constraints usually means:

Block Scheduling

When looking at the generated code for non-generic binarySearch:

@pragma('vm:never-inline')
static int binarySearch(Int64List values, int value) {
  var min = 0;
  var max = values.length;
  while (min < max) {
    final mid = (min + max) >> 1;
    final element = values[mid];
    if (element == value) {
      return mid;
    } else if (element < value) {
      min = mid + 1;
    } else {
      max = mid;
    }
  }
  return -1;
}

We can notice suboptimal code placement: the terminal block (the one corresponding to return mid) occurs in the middle of the hot loop instead of being emitted after the loop (see B3):

       │    B9
       │    CheckStackOverflow:74(stack=0, loop=1)
  0.52 │19:┌─→cmp  rsp,QWORD PTR [r14+0x38]
       │   │↓ jbe  80
       │   │Branch if RelationalOp(<, v6, v7) T{bool} goto (8, 10)
  0.00 │23:│  cmp  r8,rdi
       │   │↓ jge  74
       │   │B8
       │   │ParallelMove r9 <- r8
       │   │  mov  r9,r8
       │   │v14 <- BinaryInt64Op(+ [tr], v6, v7) int64
 12.28 │   │  add  r9,rdi
       │   │v15 <- ShiftInt64Op(>> [tr], v14, v40 T{_Smi}) int64
  0.58 │   │  sar  r9,1
       │   │ParallelMove rax <- rdx, rbx <- r9
  0.00 │   │  mov  rax,rdx
  0.99 │   │  mov  rbx,r9
       │   │GenericCheckBound:28(v34 T{_Smi}, v15) int64
 21.19 │   │  cmp  rbx,rax
       │   │↓ jae  89
       │   │v41 <- LoadIndexed(v2, v15 T{int}) int64
 26.32 │   │  mov  rbx,QWORD PTR [rcx+r9*8+0x17]
       │   │Branch if EqualityCompare(v41 T{int} == v3) T{bool} goto (3, 4)
  0.44 │   │  cmp  rbx,rsi
  8.93 │   │↓ jne  5a
       │   │B3
       │   │ParallelMove rax <- r9
       │   │  mov  rax,r9
       │   │Return:42(v15 T{int})
       │   │  mov  rsp,rbp
  0.00 │   │  pop  rbp
  0.51 │   │← ret
       │   │B4
       │   │Branch if RelationalOp(<, v41 T{int}, v3) T{bool} goto (5, 6)
       │5a:│  cmp  rbx,rsi
 14.38 │   │↓ jge  6f
       │   │B5
       │   │ParallelMove rbx <- r9
       │   │  mov  rbx,r9
       │   │v21 <- BinaryInt64Op(+ [tr], v15 T{int}, v40 T{_Smi}) int64
       │   │  add  rbx,0x1
       │   │ParallelMove r8 <- rbx, rdi <- rdi goto:64 B7
       │   │  mov  r8,rbx
       │   │↑ jmp  19
       │   │B6
       │   │ParallelMove r8 <- r8, rdi <- r9 goto:66 B7
       │6f:│  mov  rdi,r9
 12.68 │   └──jmp  19

This is due to the very simplistic algorithm used by the block scheduler in AOT: it only attempts to move always-throwing code to the end, and does not attempt to lay out loops tightly. Otherwise the code is simply emitted in reverse postorder.
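To make that strategy concrete, here is a toy model of it in Dart (illustrative only; the real block scheduler lives in the VM's C++ backend, and Block / scheduleBlocks are hypothetical names):

class Block {
  final int id;
  final List<Block> successors = [];
  final bool alwaysThrows;
  Block(this.id, {this.alwaysThrows = false});
}

List<Block> scheduleBlocks(Block entry) {
  final visited = <Block>{};
  final postorder = <Block>[];
  void dfs(Block b) {
    if (!visited.add(b)) return;
    b.successors.forEach(dfs);
    postorder.add(b);
  }

  dfs(entry);
  // Emit in reverse postorder; the only reordering is sinking
  // always-throwing blocks to the end. Nothing attempts to keep loop
  // bodies contiguous, which produces the layout seen above.
  final rpo = postorder.reversed.toList();
  return [
    ...rpo.where((b) => !b.alwaysThrows),
    ...rpo.where((b) => b.alwaysThrows),
  ];
}

void main() {
  final b0 = Block(0), b1 = Block(1), b2 = Block(2, alwaysThrows: true);
  b0.successors.addAll([b1, b2]);
  b1.successors.add(b0); // loop back edge
  print(scheduleBlocks(b0).map((b) => b.id).toList()); // [0, 1, 2]
}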

mraleph commented 1 year ago

I have (half-baked / prototype) fixes for some of these issues, but some of them require more thought. I will continue branching out issues from this one (for individual items); if you want to jump in and fix things, just let me know :)

/cc @alexmarkov @aam @mkustermann