Closed: alexcrichton closed this issue 2 years ago
I did some investigation on this yesterday and today (not quite full-time, I'm still under the weather a bit, but regalloc hacking is still the best way to pass the time...). I found three distinct things I could improve:
Most importantly, some ugly quadratic behavior with liverange splitting. The heuristic has always been "split at first conflict", and a split is always a 2-for-1 deal, not N-for-1. The test program above has a single vreg that is passed as arg0, then arg1, then arg0, then arg1, ... through a long sequence of callsites. This means that it has to be split into N pieces, each of which can be put in the appropriate register. Unfortunately, splitting had cost O(|bundle|), i.e. proportional to the total length of the bundle, so N splits of the same bundle cost O(N * |bundle|). Bad news! My fix to this issue is to "bottom out" at a limit: if a single original bundle is split more than K times (10, for now), go ahead and do an N-for-1 split into minimal pieces.
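The "bottom out" heuristic can be sketched roughly as follows. This is an illustrative model, not regalloc2's actual types or API: `Bundle`, `split_on_conflict`, and the representation of liveranges as `(start, end)` pairs are all hypothetical.

```rust
// Hypothetical sketch of the split-limit heuristic; names and types are
// illustrative, not regalloc2's real API.
const MAX_SPLITS_PER_BUNDLE: u32 = 10; // "K" in the text above

struct Bundle {
    ranges: Vec<(u32, u32)>, // liveranges as (start, end) program points
    splits_so_far: u32,      // split count inherited from the original bundle
}

fn split_on_conflict(bundle: Bundle, conflict_point: u32) -> Vec<Bundle> {
    if bundle.splits_so_far >= MAX_SPLITS_PER_BUNDLE {
        // N-for-1: stop paying O(|bundle|) per split and shatter into
        // minimal pieces (here, one bundle per range) in a single pass.
        bundle
            .ranges
            .into_iter()
            .map(|r| Bundle { ranges: vec![r], splits_so_far: 0 })
            .collect()
    } else {
        // Usual 2-for-1 split at the first conflicting program point;
        // both halves remember how many times the original was split.
        let splits = bundle.splits_so_far + 1;
        let (first, second): (Vec<_>, Vec<_>) = bundle
            .ranges
            .into_iter()
            .partition(|&(start, _)| start < conflict_point);
        vec![
            Bundle { ranges: first, splits_so_far: splits },
            Bundle { ranges: second, splits_so_far: splits },
        ]
    }
}

fn main() {
    // Under the limit: a conflict produces the usual two halves.
    let b = Bundle {
        ranges: (0u32..20).map(|i| (i * 10, i * 10 + 5)).collect(),
        splits_so_far: 0,
    };
    assert_eq!(split_on_conflict(b, 100).len(), 2);

    // At the limit: the bundle shatters into minimal pieces at once.
    let b = Bundle {
        ranges: (0u32..20).map(|i| (i * 10, i * 10 + 5)).collect(),
        splits_so_far: MAX_SPLITS_PER_BUNDLE,
    };
    assert_eq!(split_on_conflict(b, 100).len(), 20);
}
```

The point of the limit is amortization: once a bundle has demonstrably fragmented K times, one O(|bundle|) pass producing all the pieces replaces the remaining O(N) passes.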
Also, during splitting, we were copying the `Use` list over to the new second half, and truncating it in the first, but not `shrink_to_fit`'ing. So we had O(n^2) memory at the end of the run too. D'oh.
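The memory behavior here is a standard `Vec` gotcha, demonstrable in isolation (the million-element `Vec<u32>` below just stands in for a long `Use` list):

```rust
fn main() {
    // Stand-in for a long `Use` list on the first half of a split bundle.
    let mut uses: Vec<u32> = (0..1_000_000).collect();

    // `truncate` drops the elements but keeps the entire backing
    // allocation alive, so repeated splits accumulate O(n^2) memory.
    uses.truncate(10);
    assert!(uses.capacity() >= 1_000_000);

    // `shrink_to_fit` releases the unused tail of the allocation.
    uses.shrink_to_fit();
    assert!(uses.capacity() < 1_000_000);
}
```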
Finally, handling of call-ABI clobbers had a bit too much overhead by treating them as normal defs; I went ahead and resolved an old TODO and used the proper clobbers API, and also adopted a bitmask-based clobbers representation rather than a list. On the Cranelift side, the clobbers list is now a `const` bitmask for an ABI rather than a dynamically built thing with allocations and all the rest.
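The shape of a `const` clobbers bitmask can be sketched like this. The register encodings and names below are hypothetical, not Cranelift's actual ABI tables; the point is that the whole set is computed at compile time with no allocation:

```rust
// Hypothetical caller-saved register encodings; illustrative only, not
// any real ABI's actual list.
const CALLER_SAVED: &[u8] = &[0, 1, 2, 6, 7, 8, 9, 10, 11];

// Build a one-bit-per-register mask at compile time (const fn, so the
// result is baked into the binary; no runtime allocation or list).
const fn build_mask(regs: &[u8]) -> u64 {
    let mut mask = 0u64;
    let mut i = 0;
    while i < regs.len() {
        mask |= 1u64 << (regs[i] as u32);
        i += 1;
    }
    mask
}

const CLOBBERS: u64 = build_mask(CALLER_SAVED);

// Membership test is a single AND instead of a list scan.
fn is_clobbered(reg: u8) -> bool {
    CLOBBERS & (1u64 << (reg as u32)) != 0
}

fn main() {
    assert!(is_clobbered(0));
    assert!(!is_clobbered(3)); // 3 is not in the hypothetical clobber set
}
```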
This moved the needle on compilation of the above significantly:
```
% perf stat ../wasmtime/target/release/wasmtime compile ~/testfile.wasm

 Performance counter stats for '../wasmtime/target/release/wasmtime compile /home/cfallin/testfile.wasm':

          4,206.10 msec task-clock                #    1.053 CPUs utilized
             4,340      context-switches          #    1.032 K/sec
               822      cpu-migrations            #  195.431 /sec
         1,163,585      page-faults               #  276.643 K/sec
    16,856,753,781      cycles                    #    4.008 GHz                      (83.36%)
     1,621,615,014      stalled-cycles-frontend   #    9.62% frontend cycles idle     (83.11%)
     3,111,090,359      stalled-cycles-backend    #   18.46% backend cycles idle      (83.35%)
    28,553,303,978      instructions              #    1.69  insn per cycle
                                                  #    0.11  stalled cycles per insn  (83.38%)
     6,475,239,780      branches                  #    1.539 G/sec                    (83.50%)
        16,905,250      branch-misses             #    0.26% of all branches          (83.33%)

       3.995578486 seconds time elapsed

       2.763566000 seconds user
       1.382605000 seconds sys
```
```
% perf stat target/release/wasmtime compile ~/testfile.wasm

 Performance counter stats for 'target/release/wasmtime compile /home/cfallin/testfile.wasm':

          1,006.23 msec task-clock                #    1.267 CPUs utilized
             3,825      context-switches          #    3.801 K/sec
               745      cpu-migrations            #  740.388 /sec
            46,823      page-faults               #   46.533 K/sec
     4,000,880,722      cycles                    #    3.976 GHz                      (83.93%)
       285,506,402      stalled-cycles-frontend   #    7.14% frontend cycles idle     (83.77%)
       302,458,733      stalled-cycles-backend    #    7.56% backend cycles idle      (82.24%)
     4,816,665,288      instructions              #    1.20  insn per cycle
                                                  #    0.06  stalled cycles per insn  (83.49%)
       869,534,746      branches                  #  864.151 M/sec                    (83.48%)
        11,265,004      branch-misses             #    1.30% of all branches          (83.27%)

       0.794473768 seconds time elapsed

       0.844001000 seconds user
       0.143025000 seconds sys
```
Or in other words, ~4x faster compilation and ~24x fewer page faults (~= 24x less anonymous memory used).
In comparison, Wasmtime v0.36 (pre-regalloc2) is:
```
% perf stat ~/Downloads/wasmtime-v0.36.0-x86_64-linux/wasmtime compile ~/testfile.wasm

 Performance counter stats for '/home/cfallin/Downloads/wasmtime-v0.36.0-x86_64-linux/wasmtime compile /home/cfallin/testfile.wasm':

            959.79 msec task-clock                #    1.233 CPUs utilized
             5,047      context-switches          #    5.258 K/sec
               697      cpu-migrations            #  726.199 /sec
            58,171      page-faults               #   60.608 K/sec
     3,792,924,189      cycles                    #    3.952 GHz                      (83.95%)
       234,549,074      stalled-cycles-frontend   #    6.18% frontend cycles idle     (82.94%)
       258,495,205      stalled-cycles-backend    #    6.82% backend cycles idle      (82.15%)
     5,110,076,091      instructions              #    1.35  insn per cycle
                                                  #    0.05  stalled cycles per insn  (83.41%)
     1,102,335,350      branches                  #    1.149 G/sec                    (83.58%)
        11,660,266      branch-misses             #    1.06% of all branches          (84.11%)

       0.778638937 seconds time elapsed

       0.772824000 seconds user
       0.166435000 seconds sys
```
So v0.36 is ever-so-slightly faster (by ~5%), but curiously, current `main`-with-fixes executes ~5% fewer instructions during compilation; it just gets a lower IPC. Current also has fewer page faults (== less memory). These numbers are close enough to within-noise that I'd want to measure more carefully before making strong claims here. I do feel comfortable saying "anomaly fixed and back to parity", though, given the above.
I suspect this may be the same issue we saw in #4045 as well but I haven't verified that.
I'll put up proper PRs next week, when I'm fully back; for now the branches are here (regalloc2) and here (Cranelift).
This WebAssembly file, which is reduced to a single function from this issue, compiles like this on `main`:

When compared to Wasmtime 0.36.0, which is pre-regalloc2, however, this yields:

I think this means that what previously took ~200M to compile is now taking upwards of 6.5G.