Move the split point to the first fixed use

elliottt commented 1 year ago

Address https://github.com/bytecodealliance/wasmtime/issues/7147 by moving the bundle split point to the first fixed use found after the suggested split point. This implementation only considers the first liverange that contains the split point, so there's room for improvement.

The intent with this change is to not artificially constrain the bundle produced after the split point by leaving unconstrained uses that could be kept above the split point. Splitting at the first fixed use after the suggested split point should produce a less constrained upper bundle, which will be easier to allocate.

One interesting effect from the example in the issue linked above is that the change of split point alone does not fix the problem, as we also trim the space around the split point into the spill bundle. Doing this creates the same splits that we would have arrived at before, as the second load doesn't use v1 all and thus that region is trimmed to the spill bundle. I addressed that by disabling trimming when the split point is advanced, under the assumption that the point moving means that you probably want to give the allocator another chance with both bundles.

I'm not sure that this implementation is the right path forward currently, as it might be better to perform the heuristic outside of the call to split_and_requeue_bundle, allowing us the flexibility to make the decision about trimming at that point.

Amanieu commented 1 year ago

I'm getting a 0.3% code size reduction from this in my compiler for libLLVM.so (~400KB saved out of a total code size of 142MB). Seems like a pretty significant win.

elliottt commented 1 year ago

Well I've fuzzed it for about 20 hours with 32 threads, and it hasn't found anything yet. Updating cranelift filetests with this branch shows that it fixes the case that motivated this branch in the riscv64 backend, and running those same tests on the aarch64 backend shows that it's fixed there as well.

If there's nothing bad in the runtime delta I'll merge this so that we can start fuzzing it on OSS-Fuzz, and start investigating whether or not it makes sense to look beyond the liverange that contains the split point when looking for a fixed constraint.

elliottt commented 1 year ago

I used hyperfine to test wasmtime compile for four of the sightglass benchmarks, and didn't see really consistent differences either way. Sometimes the branch with the new heuristic would win, and sometimes main would. Here are the results of one run over hex-simd, pulldown-cmark, spidermonkey, and bz2:

% ./hyperfine.sh ./wasmtime-fixed-heuristic ./wasmtime-main
[info] Running the hex-simd benchmark
Benchmark 1: ./wasmtime-fixed-heuristic compile -C parallel-compilation=n,cache=n sightglass/benchmarks/hex-simd/benchmark.wasm -o /dev/null
  Time (mean ± σ):      88.8 ms ±   0.4 ms    [User: 86.1 ms, System: 2.8 ms]
  Range (min … max):    88.2 ms …  89.9 ms    50 runs

Benchmark 2: ./wasmtime-main compile -C parallel-compilation=n,cache=n sightglass/benchmarks/hex-simd/benchmark.wasm -o /dev/null
  Time (mean ± σ):      89.0 ms ±   0.4 ms    [User: 85.9 ms, System: 3.1 ms]
  Range (min … max):    88.3 ms …  89.9 ms    50 runs

Summary
  ./wasmtime-fixed-heuristic compile -C parallel-compilation=n,cache=n sightglass/benchmarks/hex-simd/benchmark.wasm -o /dev/null ran
    1.00 ± 0.01 times faster than ./wasmtime-main compile -C parallel-compilation=n,cache=n sightglass/benchmarks/hex-simd/benchmark.wasm -o /dev/null
[info] Running the pulldown-cmark benchmark
Benchmark 1: ./wasmtime-fixed-heuristic compile -C parallel-compilation=n,cache=n sightglass/benchmarks/pulldown-cmark/benchmark.wasm -o /dev/null
  Time (mean ± σ):     201.8 ms ±   0.6 ms    [User: 197.9 ms, System: 3.8 ms]
  Range (min … max):   200.7 ms … 203.9 ms    50 runs

Benchmark 2: ./wasmtime-main compile -C parallel-compilation=n,cache=n sightglass/benchmarks/pulldown-cmark/benchmark.wasm -o /dev/null
  Time (mean ± σ):     201.5 ms ±   0.5 ms    [User: 196.3 ms, System: 5.1 ms]
  Range (min … max):   200.6 ms … 202.8 ms    50 runs

Summary
  ./wasmtime-main compile -C parallel-compilation=n,cache=n sightglass/benchmarks/pulldown-cmark/benchmark.wasm -o /dev/null ran
    1.00 ± 0.00 times faster than ./wasmtime-fixed-heuristic compile -C parallel-compilation=n,cache=n sightglass/benchmarks/pulldown-cmark/benchmark.wasm -o /dev/null
[info] Running the spidermonkey benchmark
Benchmark 1: ./wasmtime-fixed-heuristic compile -C parallel-compilation=n,cache=n sightglass/benchmarks/spidermonkey/benchmark.wasm -o /dev/null
  Time (mean ± σ):      4.850 s ±  0.011 s    [User: 4.787 s, System: 0.061 s]
  Range (min … max):    4.837 s …  4.864 s    4 runs

Benchmark 2: ./wasmtime-main compile -C parallel-compilation=n,cache=n sightglass/benchmarks/spidermonkey/benchmark.wasm -o /dev/null
  Time (mean ± σ):      4.853 s ±  0.007 s    [User: 4.794 s, System: 0.057 s]
  Range (min … max):    4.845 s …  4.861 s    4 runs

Summary
  ./wasmtime-fixed-heuristic compile -C parallel-compilation=n,cache=n sightglass/benchmarks/spidermonkey/benchmark.wasm -o /dev/null ran
    1.00 ± 0.00 times faster than ./wasmtime-main compile -C parallel-compilation=n,cache=n sightglass/benchmarks/spidermonkey/benchmark.wasm -o /dev/null
[info] Running the bz2 benchmark
Benchmark 1: ./wasmtime-fixed-heuristic compile -C parallel-compilation=n,cache=n sightglass/benchmarks/bz2/benchmark.wasm -o /dev/null
  Time (mean ± σ):      93.4 ms ±   0.3 ms    [User: 89.7 ms, System: 3.7 ms]
  Range (min … max):    92.7 ms …  94.7 ms    100 runs

Benchmark 2: ./wasmtime-main compile -C parallel-compilation=n,cache=n sightglass/benchmarks/bz2/benchmark.wasm -o /dev/null
  Time (mean ± σ):      93.4 ms ±   0.4 ms    [User: 89.6 ms, System: 3.7 ms]
  Range (min … max):    92.7 ms …  94.7 ms    100 runs

Summary
  ./wasmtime-main compile -C parallel-compilation=n,cache=n sightglass/benchmarks/bz2/benchmark.wasm -o /dev/null ran
    1.00 ± 0.01 times faster than ./wasmtime-fixed-heuristic compile -C parallel-compilation=n,cache=n sightglass/benchmarks/bz2/benchmark.wasm -o /dev/null

cfallin commented 1 year ago

Cool, it looks like it's basically in the noise on those benchmarks, which is perfectly fine -- the wins for small functions with multiple fixed constraints, and Amanieu's observed code-size reductions, are more than enough to motivate this!

bytecodealliance / regalloc2

Move the split point to the first fixed use #165