bytecodealliance / regalloc2

A new register allocator
Apache License 2.0
217 stars 38 forks

Adding new API that accepts reusable context. #196

Closed jakubDoka closed 1 month ago

jakubDoka commented 2 months ago

Sadly, due to how the code was structured, I needed to change the `Env` fields, so basically everything that used them had to change as well. I have not benchmarked anything yet (work in progress).

Context -> https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/Using.20context.20for.20.60TargetIsa.3A.3Acompile_function.60

jakubDoka commented 2 months ago

I profiled the allocator a bit and got rid of some hashmaps in favor of using the key as an index, which improved things a little.
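The idea can be sketched like this (a minimal illustration, not regalloc2's actual types: `VReg` and the field layout here are hypothetical). When keys are dense small integers, a `Vec` indexed by the key avoids hashing and pointer chasing entirely:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy)]
struct VReg(u32);

// Before: O(1) average, but pays for hashing on every access.
fn lookup_map(map: &HashMap<u32, u64>, key: VReg) -> Option<u64> {
    map.get(&key.0).copied()
}

// After: a plain indexed load; a `None` sentinel marks absent entries.
fn lookup_vec(table: &[Option<u64>], key: VReg) -> Option<u64> {
    table.get(key.0 as usize).copied().flatten()
}

fn main() {
    let mut map = HashMap::new();
    map.insert(3u32, 42u64);

    let mut table = vec![None; 8];
    table[3] = Some(42u64);

    // Both representations answer the same queries.
    assert_eq!(lookup_map(&map, VReg(3)), lookup_vec(&table, VReg(3)));
    assert_eq!(lookup_vec(&table, VReg(5)), None);
    println!("ok");
}
```

The trade-off is memory: the `Vec` must be sized for the largest key, which is fine when keys are dense (as virtual-register numbers typically are).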

cfallin commented 2 months ago

@jakubDoka could you summarize with a perf measurement (Sightglass or hyperfine of wasmtime compile is OK) the overall compile-time effect of this?

jakubDoka commented 2 months ago

I had to run the benchmarks separately because whatever I run last performs worse: if I run two identical programs, the first run always wins (my CPU throttles).

[I] ::<~/p/r/wasmtime> hyperfine --runs 100 "./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm"
Benchmark 1: ./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm
  Time (mean ± σ):     136.7 ms ±   4.5 ms    [User: 128.8 ms, System: 7.1 ms]
  Range (min … max):   132.5 ms … 157.8 ms    100 runs

[I] ::<~/p/r/wasmtime> hyperfine --runs 100 "./wasmtime.old compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm"
Benchmark 1: ./wasmtime.old compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm
  Time (mean ± σ):     147.1 ms ±   4.6 ms    [User: 138.2 ms, System: 7.6 ms]
  Range (min … max):   140.1 ms … 163.2 ms    100 runs

Let me know if the full perf stat output would be useful.

(spidermonkey)
15,471,712,064      cpu_core/cycles/u // wasmtime.old
14,390,147,910      cpu_core/cycles/u // wasmtime.new
(pulldown-cmark)
631,885,447      cpu_core/cycles/u // wasmtime.old
599,844,924      cpu_core/cycles/u // wasmtime.new

jakubDoka commented 2 months ago

> The replacement of the actual BTreeMap per physical reg with a sorted Vec makes me very nervous. It probably looks better on most benchmarks, but it has a quadratic worst-case (inserting in the middle), where a true btree does not; and we consider quadratic worst-cases bugs that we need to avoid. Is there an argument we can make about why this won't be seen in practice or...?

I did not realize Wasmtime needs to deal with malicious code; good point. All of the changes here are guided by a simple principle: make the code do less. Using a `Vec` instead of a `BTreeMap` has many properties that CPUs like. I suspect that replacing the `BTreeMap` with a `Vec` in `try_to_insert_bundle_to_reg` helps so drastically because we mostly traverse the entries linearly, and a `Vec` is far more cache-friendly for that than a `BTreeMap`.

Now that I think about it, the quadratic behavior is only a problem if the function we compile is maliciously large, which is not the common case. So what about a hybrid/adaptive approach? We could switch to a `BTreeMap` dynamically if we detect the `Vec` costing too much (i.e. it exceeds a certain size). What do you think @cfallin?
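A minimal sketch of the adaptive idea (the threshold, key/value types, and method names here are illustrative, not regalloc2's actual API): keep a sorted `Vec` while small, where linear scans and shifts are cheap and cache-friendly, and spill into a `BTreeMap` once the size crosses a threshold, bounding the worst case:

```rust
use std::collections::BTreeMap;

// Illustrative cutoff; a real implementation would tune this.
const SPILL_THRESHOLD: usize = 64;

enum Adaptive {
    Small(Vec<(u32, u64)>), // kept sorted by key
    Large(BTreeMap<u32, u64>),
}

impl Adaptive {
    fn new() -> Self {
        Adaptive::Small(Vec::new())
    }

    fn insert(&mut self, key: u32, val: u64) {
        match self {
            Adaptive::Small(v) => {
                match v.binary_search_by_key(&key, |&(k, _)| k) {
                    Ok(i) => v[i].1 = val,
                    Err(i) => v.insert(i, (key, val)),
                }
                if v.len() > SPILL_THRESHOLD {
                    // One O(n log n) conversion; O(log n) inserts after.
                    let map: BTreeMap<u32, u64> = v.drain(..).collect();
                    *self = Adaptive::Large(map);
                }
            }
            Adaptive::Large(m) => {
                m.insert(key, val);
            }
        }
    }

    fn get(&self, key: u32) -> Option<u64> {
        match self {
            Adaptive::Small(v) => v
                .binary_search_by_key(&key, |&(k, _)| k)
                .ok()
                .map(|i| v[i].1),
            Adaptive::Large(m) => m.get(&key).copied(),
        }
    }
}

fn main() {
    let mut m = Adaptive::new();
    for k in 0..200u32 {
        m.insert(k, u64::from(k));
    }
    // Past the threshold we are on the BTreeMap path.
    assert!(matches!(m, Adaptive::Large(_)));
    assert_eq!(m.get(199), Some(199));
    println!("ok");
}
```

Small inputs stay on the fast path; an adversarially large function pays one conversion and then gets the `BTreeMap`'s logarithmic bounds.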

jakubDoka commented 2 months ago

After reverting to `BTreeMap` I see a regression of ~500M cycles.

jakubDoka commented 2 months ago

After reverting this I see a regression of another ~250M cycles.

fitzgen commented 2 months ago

> what about doing a hybrid/adaptative approach? We could switch to BTreeMap dynamically if we detect Vec costing too much (it exceeds a certain size).

Another, perhaps a little more out-there, idea is to make the whole algorithm generic over this container type, and make two instantiations: one with Vec and one with BTreeSet. Then, we try running the Vec-instantiated version first, but bail out if the length gets too long, at which point we fall back to the BTreeSet. If the length is a fixed size, we could even use an array instead of a Vec.

The failure checks and propagation overhead might dwarf any speedup, though.
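The two-instantiations idea could look roughly like this (a hedged sketch; the trait name, `TooBig` error, and `run` driver are all hypothetical stand-ins for the allocator body): the algorithm is written once, generic over the container, and the `Vec`-backed instantiation bails out with an error once it grows past a bound, at which point the caller retries with the `BTreeMap`-backed one.

```rust
use std::collections::BTreeMap;

struct TooBig;

trait RangeMap {
    fn insert(&mut self, key: u32, val: u64) -> Result<(), TooBig>;
    fn get(&self, key: u32) -> Option<u64>;
}

struct VecMap {
    entries: Vec<(u32, u64)>, // kept sorted by key
    limit: usize,
}

impl RangeMap for VecMap {
    fn insert(&mut self, key: u32, val: u64) -> Result<(), TooBig> {
        match self.entries.binary_search_by_key(&key, |&(k, _)| k) {
            Ok(i) => self.entries[i].1 = val,
            Err(i) => {
                if self.entries.len() >= self.limit {
                    return Err(TooBig); // bail out; caller falls back
                }
                self.entries.insert(i, (key, val));
            }
        }
        Ok(())
    }
    fn get(&self, key: u32) -> Option<u64> {
        self.entries
            .binary_search_by_key(&key, |&(k, _)| k)
            .ok()
            .map(|i| self.entries[i].1)
    }
}

impl RangeMap for BTreeMap<u32, u64> {
    fn insert(&mut self, key: u32, val: u64) -> Result<(), TooBig> {
        BTreeMap::insert(self, key, val); // never bails
        Ok(())
    }
    fn get(&self, key: u32) -> Option<u64> {
        BTreeMap::get(self, &key).copied()
    }
}

// Stand-in for the allocator body: written once, generic over the map.
fn run<M: RangeMap>(m: &mut M, keys: &[u32]) -> Result<(), TooBig> {
    for &k in keys {
        m.insert(k, u64::from(k) * 2)?;
    }
    Ok(())
}

fn main() {
    let keys = [5u32, 1, 9, 3];
    let mut small = VecMap { entries: Vec::new(), limit: 16 };
    assert!(run(&mut small, &keys).is_ok());
    assert_eq!(small.get(9), Some(18));

    // A tiny limit forces the bail-out; retry with the BTreeMap version.
    let mut tiny = VecMap { entries: Vec::new(), limit: 2 };
    if run(&mut tiny, &keys).is_err() {
        let mut fallback: BTreeMap<u32, u64> = BTreeMap::new();
        assert!(run(&mut fallback, &keys).is_ok());
        assert_eq!(RangeMap::get(&fallback, 9), Some(18));
    }
    println!("ok");
}
```

Monomorphization gives each instantiation its own code, so the fast path pays only a length check per insert rather than dynamic dispatch; the open question raised above is whether the `Result` propagation on every call still costs too much.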

jakubDoka commented 1 month ago

@cfallin are there any extra modifications needed? (Considering I will add the other changes in separate PR(s).)

jakubDoka commented 1 month ago

I realized there are some things I missed, so never mind.

jakubDoka commented 1 month ago

Okay, @cfallin, I double-checked everything, hopefully nothing was missed (I found one thing in the diff).