Closed: alexcrichton closed this issue 2 years ago
I did some investigation on this yesterday and today (not quite full-time, I'm still under the weather a bit, but regalloc hacking is still the best way to pass the time...). I found three distinct things I could improve:
Most importantly, some ugly quadratic behavior with liverange splitting. The heuristic has always been "split at first conflict", and a split is always a 2-for-1 deal, not N-for-1. The test program above has a single vreg that is passed as arg0, then arg1, then arg0, then arg1, ... through a long sequence of callsites. This means that it has to be split into N pieces, each of which can be put in the appropriate register. Unfortunately, splitting had cost O(|bundle|), i.e. proportional to the total length of the bundle, so N splits of the same bundle cost O(N * |bundle|). Bad news! My fix to this issue is to "bottom out" at a limit: if a single original bundle is split more than K times (10, for now), go ahead and do an N-for-1 split into minimal pieces.
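The "bottom out" heuristic can be sketched roughly as follows. This is an illustrative model, not regalloc2's actual types or API: `Bundle`, `split_on_conflict`, and the representation of liveranges as `(start, end)` pairs are all hypothetical.

```rust
// Hypothetical sketch of the split-limit heuristic; names and types are
// illustrative, not regalloc2's real API.
const MAX_SPLITS_PER_BUNDLE: u32 = 10; // "K" in the text above

struct Bundle {
    ranges: Vec<(u32, u32)>, // liveranges as (start, end) program points
    splits_so_far: u32,      // split count inherited from the original bundle
}

fn split_on_conflict(bundle: Bundle, conflict_point: u32) -> Vec<Bundle> {
    if bundle.splits_so_far >= MAX_SPLITS_PER_BUNDLE {
        // N-for-1: stop paying O(|bundle|) per split and shatter into
        // minimal pieces (here, one bundle per range) in a single pass.
        bundle
            .ranges
            .into_iter()
            .map(|r| Bundle { ranges: vec![r], splits_so_far: 0 })
            .collect()
    } else {
        // Usual 2-for-1 split at the first conflicting program point;
        // both halves remember how many times the original was split.
        let splits = bundle.splits_so_far + 1;
        let (first, second): (Vec<_>, Vec<_>) = bundle
            .ranges
            .into_iter()
            .partition(|&(start, _)| start < conflict_point);
        vec![
            Bundle { ranges: first, splits_so_far: splits },
            Bundle { ranges: second, splits_so_far: splits },
        ]
    }
}

fn main() {
    // Under the limit: a conflict produces the usual two halves.
    let b = Bundle {
        ranges: (0u32..20).map(|i| (i * 10, i * 10 + 5)).collect(),
        splits_so_far: 0,
    };
    assert_eq!(split_on_conflict(b, 100).len(), 2);

    // At the limit: the bundle shatters into minimal pieces at once.
    let b = Bundle {
        ranges: (0u32..20).map(|i| (i * 10, i * 10 + 5)).collect(),
        splits_so_far: MAX_SPLITS_PER_BUNDLE,
    };
    assert_eq!(split_on_conflict(b, 100).len(), 20);
}
```

The point of the limit is amortization: once a bundle has demonstrably fragmented K times, one O(|bundle|) pass producing all the pieces replaces the remaining O(N) passes.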
Also, during splitting, we were copying the `Use` list over to the new second half, and truncating it in the first, but not `shrink_to_fit`'ing. So we had O(n^2) memory at the end of the run too. D'oh.
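The memory behavior here is a standard `Vec` gotcha, demonstrable in isolation (the million-element `Vec<u32>` below just stands in for a long `Use` list):

```rust
fn main() {
    // Stand-in for a long `Use` list on the first half of a split bundle.
    let mut uses: Vec<u32> = (0..1_000_000).collect();

    // `truncate` drops the elements but keeps the entire backing
    // allocation alive, so repeated splits accumulate O(n^2) memory.
    uses.truncate(10);
    assert!(uses.capacity() >= 1_000_000);

    // `shrink_to_fit` releases the unused tail of the allocation.
    uses.shrink_to_fit();
    assert!(uses.capacity() < 1_000_000);
}
```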
Finally, handling of call-ABI clobbers had a bit too much overhead by treating them as normal defs; I went ahead and resolved an old TODO and used the proper clobbers API, and also adopted a bitmask-based clobbers representation rather than a list. On the Cranelift side, the clobbers list is now a `const` bitmask for an ABI rather than a dynamically built thing with allocations and all the rest.
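The shape of a `const` clobbers bitmask can be sketched like this. The register encodings and names below are hypothetical, not Cranelift's actual ABI tables; the point is that the whole set is computed at compile time with no allocation:

```rust
// Hypothetical caller-saved register encodings; illustrative only, not
// any real ABI's actual list.
const CALLER_SAVED: &[u8] = &[0, 1, 2, 6, 7, 8, 9, 10, 11];

// Build a one-bit-per-register mask at compile time (const fn, so the
// result is baked into the binary; no runtime allocation or list).
const fn build_mask(regs: &[u8]) -> u64 {
    let mut mask = 0u64;
    let mut i = 0;
    while i < regs.len() {
        mask |= 1u64 << (regs[i] as u32);
        i += 1;
    }
    mask
}

const CLOBBERS: u64 = build_mask(CALLER_SAVED);

// Membership test is a single AND instead of a list scan.
fn is_clobbered(reg: u8) -> bool {
    CLOBBERS & (1u64 << (reg as u32)) != 0
}

fn main() {
    assert!(is_clobbered(0));
    assert!(!is_clobbered(3)); // 3 is not in the hypothetical clobber set
}
```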
This moved the needle on compilation of the above significantly:
```
% perf stat ../wasmtime/target/release/wasmtime compile ~/testfile.wasm

 Performance counter stats for '../wasmtime/target/release/wasmtime compile /home/cfallin/testfile.wasm':

          4,206.10 msec task-clock                #    1.053 CPUs utilized
             4,340      context-switches          #    1.032 K/sec
               822      cpu-migrations            #  195.431 /sec
         1,163,585      page-faults               #  276.643 K/sec
    16,856,753,781      cycles                    #    4.008 GHz                      (83.36%)
     1,621,615,014      stalled-cycles-frontend   #    9.62% frontend cycles idle     (83.11%)
     3,111,090,359      stalled-cycles-backend    #   18.46% backend cycles idle      (83.35%)
    28,553,303,978      instructions              #    1.69  insn per cycle
                                                  #    0.11  stalled cycles per insn  (83.38%)
     6,475,239,780      branches                  #    1.539 G/sec                    (83.50%)
        16,905,250      branch-misses             #    0.26% of all branches          (83.33%)

       3.995578486 seconds time elapsed

       2.763566000 seconds user
       1.382605000 seconds sys
```
```
% perf stat target/release/wasmtime compile ~/testfile.wasm

 Performance counter stats for 'target/release/wasmtime compile /home/cfallin/testfile.wasm':

          1,006.23 msec task-clock                #    1.267 CPUs utilized
             3,825      context-switches          #    3.801 K/sec
               745      cpu-migrations            #  740.388 /sec
            46,823      page-faults               #   46.533 K/sec
     4,000,880,722      cycles                    #    3.976 GHz                      (83.93%)
       285,506,402      stalled-cycles-frontend   #    7.14% frontend cycles idle     (83.77%)
       302,458,733      stalled-cycles-backend    #    7.56% backend cycles idle      (82.24%)
     4,816,665,288      instructions              #    1.20  insn per cycle
                                                  #    0.06  stalled cycles per insn  (83.49%)
       869,534,746      branches                  #  864.151 M/sec                    (83.48%)
        11,265,004      branch-misses             #    1.30% of all branches          (83.27%)

       0.794473768 seconds time elapsed

       0.844001000 seconds user
       0.143025000 seconds sys
```
Or in other words, ~4x faster compilation and ~24x fewer page faults (~= 24x less anonymous memory used).
In comparison, Wasmtime v0.36 (pre-regalloc2) is:
```
% perf stat ~/Downloads/wasmtime-v0.36.0-x86_64-linux/wasmtime compile ~/testfile.wasm

 Performance counter stats for '/home/cfallin/Downloads/wasmtime-v0.36.0-x86_64-linux/wasmtime compile /home/cfallin/testfile.wasm':

            959.79 msec task-clock                #    1.233 CPUs utilized
             5,047      context-switches          #    5.258 K/sec
               697      cpu-migrations            #  726.199 /sec
            58,171      page-faults               #   60.608 K/sec
     3,792,924,189      cycles                    #    3.952 GHz                      (83.95%)
       234,549,074      stalled-cycles-frontend   #    6.18% frontend cycles idle     (82.94%)
       258,495,205      stalled-cycles-backend    #    6.82% backend cycles idle      (82.15%)
     5,110,076,091      instructions              #    1.35  insn per cycle
                                                  #    0.05  stalled cycles per insn  (83.41%)
     1,102,335,350      branches                  #    1.149 G/sec                    (83.58%)
        11,660,266      branch-misses             #    1.06% of all branches          (84.11%)

       0.778638937 seconds time elapsed

       0.772824000 seconds user
       0.166435000 seconds sys
```
So v0.36 is ever-so-slightly faster (by ~5%), but curiously, current `main`-with-fixes executes ~5% fewer instructions during compilation; it just gets a lower IPC. Current also has fewer page faults (== less memory). These numbers are close enough to within-noise that I'd want to measure more carefully before making strong claims here. I do feel comfortable saying "anomaly fixed and back to parity", though, given the above.
I suspect this may be the same issue we saw in #4045 as well but I haven't verified that.
I'll put up proper PRs next week, when I'm fully back; for now the branches are here (regalloc2) and here (Cranelift).
This WebAssembly file, which is reduced to a single function from this issue, compiles like this on `main`:

When compared to Wasmtime 0.36.0, which is pre-regalloc2, however, this yields:

I think this means that what previously took ~200M to compile is now taking upwards of 6.5G.