Closed: jakubDoka closed this 1 month ago
I profiled the allocator a bit and got rid of some hashmaps in favor of using the key as an index, which improved things a little.
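For context, here is a minimal sketch of the "key as an index" pattern mentioned above; the struct and field names are illustrative, not the actual regalloc2 data structures. When the key is already a small dense integer, a Vec indexed by it can replace the HashMap and avoid hashing entirely.

```rust
use std::collections::HashMap;

// Before: hash-based lookup keyed by a dense index type.
struct Before {
    info: HashMap<u32, u64>,
}

// After: the key doubles as the slot index; `None` marks absent entries.
// Lookup becomes a bounds check plus a load, with no hashing.
struct After {
    info: Vec<Option<u64>>,
}

impl After {
    fn get(&self, key: u32) -> Option<u64> {
        self.info.get(key as usize).copied().flatten()
    }

    fn set(&mut self, key: u32, val: u64) {
        let idx = key as usize;
        if idx >= self.info.len() {
            self.info.resize(idx + 1, None);
        }
        self.info[idx] = Some(val);
    }
}
```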
@jakubDoka could you summarize the overall compile-time effect of this with a perf measurement (Sightglass or hyperfine of `wasmtime compile` is OK)?
I had to run the benchmarks separately because whatever I run last performs worse; if I run two identical programs back to back, the first run always wins (my CPU throttles).
[I] ::<~/p/r/wasmtime> hyperfine --runs 100 "./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm"
Benchmark 1: ./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm
Time (mean ± σ): 136.7 ms ± 4.5 ms [User: 128.8 ms, System: 7.1 ms]
Range (min … max): 132.5 ms … 157.8 ms 100 runs
[I] ::<~/p/r/wasmtime> hyperfine --runs 100 "./wasmtime.old compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm"
Benchmark 1: ./wasmtime.old compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm
Time (mean ± σ): 147.1 ms ± 4.6 ms [User: 138.2 ms, System: 7.6 ms]
Range (min … max): 140.1 ms … 163.2 ms 100 runs
Let me know if the full perf stat output would be useful.
(spidermonkey)
15,471,712,064 cpu_core/cycles/u // wasmtime.old
14,390,147,910 cpu_core/cycles/u // wasmtime.new
(pulldown-cmark)
631,885,447 cpu_core/cycles/u // wasmtime.old
599,844,924 cpu_core/cycles/u // wasmtime.new
The replacement of the actual BTreeMap per physical reg with a sorted Vec makes me very nervous. It probably looks better on most benchmarks, but it has a quadratic worst case (inserting in the middle) where a true B-tree does not, and we consider quadratic worst cases bugs that we need to avoid. Is there an argument we can make about why this won't be seen in practice, or...?
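To illustrate the concern (a standalone sketch, not code from regalloc2): a sorted Vec keeps lookups at O(log n) via binary search, but every insert in the middle has to shift the tail, so an adversarial insertion order costs O(n^2) in total, which a real B-tree avoids.

```rust
// Worst case: inserting keys in descending order into a sorted Vec means
// every insert lands at position 0 and shifts the whole tail, so n inserts
// do O(n^2) element moves overall. A B-tree keeps each insert at O(log n)
// regardless of insertion order.
fn insert_sorted(v: &mut Vec<(u32, u32)>, key: u32, val: u32) {
    match v.binary_search_by_key(&key, |&(k, _)| k) {
        Ok(pos) => v[pos] = (key, val),        // key already present: overwrite
        Err(pos) => v.insert(pos, (key, val)), // not present: O(n) shift
    }
}

fn main() {
    let mut v = Vec::new();
    for key in (0..50_000u32).rev() {
        insert_sorted(&mut v, key, 0); // adversarial (descending) order
    }
    assert_eq!(v.len(), 50_000);
}
```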
I did not realize Wasmtime needs to deal with malicious code; good point. All of the changes here are guided by a simple principle: make the code do less. Using a Vec instead of a BTree has many properties CPUs like. I suspect that in try_to_insert_bundle_to_reg, replacing the BTree with a Vec helps dramatically because we mostly traverse the entries linearly, for which a Vec is more cache-friendly than a BTree.
Now that I think about it, quadratic behavior is a problem if the function we compile is maliciously large, but that is not the common case, so what about a hybrid/adaptive approach? We could switch to a BTreeMap dynamically if we detect the Vec is costing too much (i.e. it exceeds a certain size). What do you think @cfallin?
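A rough sketch of what that hybrid could look like; the type name, threshold, and methods are made up here for illustration and are not proposed regalloc2 API.

```rust
use std::collections::BTreeMap;

// Illustrative threshold only; a real value would need tuning.
const SPILL_THRESHOLD: usize = 64;

// Start as a sorted Vec; switch to a BTreeMap once the entry count crosses
// the threshold, so the quadratic worst case stays bounded.
enum AdaptiveMap {
    Small(Vec<(u32, u32)>),
    Large(BTreeMap<u32, u32>),
}

impl AdaptiveMap {
    fn insert(&mut self, key: u32, val: u32) {
        // Spill to the tree representation if the Vec has grown too large.
        if let AdaptiveMap::Small(v) = self {
            if v.len() >= SPILL_THRESHOLD {
                let spilled: BTreeMap<u32, u32> = std::mem::take(v).into_iter().collect();
                *self = AdaptiveMap::Large(spilled);
            }
        }
        match self {
            AdaptiveMap::Small(v) => match v.binary_search_by_key(&key, |&(k, _)| k) {
                Ok(pos) => v[pos].1 = val,
                Err(pos) => v.insert(pos, (key, val)),
            },
            AdaptiveMap::Large(m) => {
                m.insert(key, val);
            }
        }
    }
}
```

With this shape the common case stays on the flat, cache-friendly representation and only pathologically large functions pay for the tree.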
After reverting to BTree I see a regression of ~500M cycles.
After reverting this I see a regression of another ~250M cycles.
what about a hybrid/adaptive approach? We could switch to a BTreeMap dynamically if we detect the Vec is costing too much (it exceeds a certain size).
Another, perhaps a little more out-there, idea is to make the whole algorithm generic over this container type, and make two instantiations: one with `Vec` and one with `BTreeSet`. Then we try running the `Vec`-instantiated version first, but bail out if the length gets too long, at which point we fall back to the `BTreeSet`. If the length is a fixed size, we could even use an array instead of a `Vec`.
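A rough sketch of that shape; the trait, its methods, and the bail-out signal are invented here for illustration and are not the regalloc2 API.

```rust
use std::collections::BTreeSet;

// Returned by the Vec-backed container when it grows past its cap, telling
// the caller to rerun allocation with the BTreeSet-backed instantiation.
struct TooLarge;

trait RangeSet {
    fn add(&mut self, key: u32) -> Result<(), TooLarge>;
    fn has(&self, key: u32) -> bool;
}

// Fast path: a sorted Vec with a hard length cap.
struct VecSet {
    entries: Vec<u32>, // kept sorted
    cap: usize,
}

impl RangeSet for VecSet {
    fn add(&mut self, key: u32) -> Result<(), TooLarge> {
        if self.entries.len() >= self.cap {
            return Err(TooLarge); // bail out; caller falls back to the tree
        }
        if let Err(pos) = self.entries.binary_search(&key) {
            self.entries.insert(pos, key);
        }
        Ok(())
    }
    fn has(&self, key: u32) -> bool {
        self.entries.binary_search(&key).is_ok()
    }
}

// Slow path: a real tree, which never bails out.
impl RangeSet for BTreeSet<u32> {
    fn add(&mut self, key: u32) -> Result<(), TooLarge> {
        self.insert(key);
        Ok(())
    }
    fn has(&self, key: u32) -> bool {
        self.contains(&key)
    }
}

// The allocator core would be compiled once per container type and rerun
// with the tree-backed set if the Vec-backed run bails out.
fn run<S: RangeSet>(set: &mut S, keys: &[u32]) -> Result<(), TooLarge> {
    for &k in keys {
        set.add(k)?;
    }
    Ok(())
}
```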
The failure checks and propagation overhead might dwarf any speed-up, though.
@cfallin are there any extra modifications needed? (Considering I will add the other changes in separate PR(s).)
I realized there are some things I missed, so never mind.
Okay, @cfallin, I double-checked everything, hopefully nothing was missed (I found one thing in the diff).
Sadly, due to how the code was structured, I needed to change the `Env` fields, so basically everything that used them changed as well. I have not benchmarked anything yet (work in progress).
Context -> https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/Using.20context.20for.20.60TargetIsa.3A.3Acompile_function.60