cranelift: Delete scalar `{u,s}loadNN` and `storeNN` instructions

afonso360 commented 3 months ago

👋 Hey,

This is a WIP PR to delete the scalar load+extend instructions. I'm still working on cleaning up the s390x backend, but I saw there was some discussion on today's meeting regarding this. (Unfortunately these now clash with some other stuff and I'm unable to attend)

I haven't done any benchmarking, and I just noticed there are some cases where we are not correctly fusing the load+extend. Once that is working correctly I'll try to run sightglass on it.

CC: https://github.com/bytecodealliance/wasmtime/issues/6056

alexcrichton commented 3 months ago

For context I brought this up in the Cranelift meeting today as I was curious about the genesis of the instructions. My assumption was that these were added under the assumption that the pattern of load+extend or ireduce+store would be so common that having fused ops in Cranelift would reduce the size of the resident IR during compilation and perhaps have other various memory/compile-time benefits. The conclusion though was that while this was probably the predicted purpose of the instructions no one was aware of any benchmarking one way or another to show the impact.

I think it'd be worthwhile to put this through sightglass to ensure there's not, for example, a 10% slowdown compiling spidermonkey, but otherwise I think we're all in the abstract all for cleanups that simplify the IR.

afonso360 commented 3 months ago

Ran some quick benchmarks on this, doesn't seem to make too much of a difference, except for spidermonkey where there is a slight compile time regression.

I should also note that this version suffers from the issue described in https://github.com/bytecodealliance/wasmtime/issues/8787, but it looks like that doesn't affect larger programs too much? It does affect the test cases which slightly bothers me.

Sightglass results

``` Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.10s Running `target/debug/sightglass-cli benchmark --engine ./main.so --engine ./delete-special-mem.so --iterations-per-process 10 --processes 2 -- benchmarks/spidermonkey/benchmark.wasm ./benchmarks/pulldown-cmark/benchmark.wasm benchmarks/bz2/benchmark.wasm benchmarks/regex/benchmark.wasm` . compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm Δ = 250476630.15 ± 157064794.89 (confidence = 99%) main.so is 1.01x to 1.03x faster than delete-special-mem.so! [14521885186 14938969886.45 15446520641] delete-special-mem.so [14558278944 14688493256.30 15396563040] main.so instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm No difference in performance. [528672 687515.20 1726784] delete-special-mem.so [528352 761440.00 2136608] main.so instantiation :: cycles :: benchmarks/bz2/benchmark.wasm No difference in performance. [200704 226676.80 266368] delete-special-mem.so [196128 242470.40 326496] main.so instantiation :: cycles :: ./benchmarks/pulldown-cmark/benchmark.wasm No difference in performance. [317088 364465.60 469152] delete-special-mem.so [314976 380755.20 560736] main.so compilation :: cycles :: ./benchmarks/pulldown-cmark/benchmark.wasm No difference in performance. [590328960 645099580.55 788561440] delete-special-mem.so [603675135 673634577.60 887706080] main.so execution :: cycles :: benchmarks/bz2/benchmark.wasm No difference in performance. [124808576 134569985.05 199907890] delete-special-mem.so [122697376 130622858.45 157331714] main.so instantiation :: cycles :: benchmarks/regex/benchmark.wasm No difference in performance. [513856 599899.20 1005408] delete-special-mem.so [459328 615880.00 1086016] main.so compilation :: cycles :: benchmarks/regex/benchmark.wasm No difference in performance. [1356884384 1438567822.70 1639913438] delete-special-mem.so [1371897536 1467483544.05 1656487360] main.so compilation :: cycles :: benchmarks/bz2/benchmark.wasm No difference in performance. [260681040 299165913.20 481644877] delete-special-mem.so [261796512 294465843.00 491744160] main.so execution :: cycles :: ./benchmarks/pulldown-cmark/benchmark.wasm No difference in performance. [9724864 10999916.80 13853504] delete-special-mem.so [10061056 10919562.45 12255328] main.so execution :: cycles :: benchmarks/regex/benchmark.wasm No difference in performance. [284849472 298366723.55 319585712] delete-special-mem.so [281981504 296739732.85 332067262] main.so execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm No difference in performance. [1437751471 1491664753.05 1615952144] delete-special-mem.so [1434106066 1487088603.50 1577413794] main.so ```

cfallin commented 3 months ago

Thanks for running those benchmarks! IMHO, a 1-3% compile time regression is unfortunately a bit too significant to take for a pure "cleanliness win" change (others may disagree, happy to discuss; in the CL meeting this morning I gave a clean 1% as an example of a number I'd personally be fine with, 10% as an example clearly not, 3% is somewhere in the middle).

I wonder if it gets any better if we do handle merging better, along the lines of #8787 -- the compile slowdown could arise because of larger VCode on average rather than larger CLIF, as well. Worth trying to resolve that first then coming back to benchmark this again perhaps?

afonso360 commented 3 months ago

IMHO, a 1-3% compile time regression is unfortunately a bit too significant to take for a pure "cleanliness win" change

Yeah, I don't think it's worth it if we have this regression.

I wonder if it gets any better if we do handle merging better, along the lines of https://github.com/bytecodealliance/wasmtime/issues/8787 -- the compile slowdown could arise because of larger VCode on average rather than larger CLIF, as well. Worth trying to resolve that first then coming back to benchmark this again perhaps?

Maybe, I'm not too familiar with how we do elaboration on egraphs so it might be slightly harder for me to pick that up.

Reading https://github.com/bytecodealliance/wasmtime/issues/6154 got me interested in the jump-threading pass idea, even though I think moving the extends next to the loads would be more effective in this case. I probably won't have time to look at it now but maybe later.

afonso360 commented 3 months ago

By the way, I was doing some unrelated benchmarking and found out that sightglass is not consistent enough on my machine to give me confidence that the result above is real / meaningful.

I'm going to try to run the benchmarks again on a more stable setup when I get some time, although I suspect it's probably going to amount to the same.

fitzgen commented 3 months ago

@afonso360 you could try measuring instructions retired instead of cycles, pass something like -m insts-retired to sightglass. It should be much more consistent, and this change isn't the kind of thing where we instructions retired is a poor proxy for performance (as opposed to when measuring changes that are intended to improve cache locality, for example).

afonso360 commented 3 months ago

I re-ran the benchmarks with -m insts-retired and some of the steps in cpu-isolation.md. It's still not 100% reproducible across runs, but it's a lot better.

However the results are now completely different from the results above.

Sightglass Results

``` afonso@fedora:~/git/sightglass$ taskset --cpu-list 3 cargo run -- benchmark --engine ./delete-special-mem-main.so --engine ./delete-special-mem.so --iterations-per-process 10 --processes 1 -m insts-retired -- benchmarks/spidermonkey/benchmark.wasm ./benchmarks/pulldown-cmark/benchmark.wasm benchmarks/bz2/benchmark.wasm benchmarks/regex/benchmark.wasm warning: virtual workspace defaulting to `resolver = "1"` despite one or more workspace members being on edition 2021 which implies `resolver = "2"` note: to keep the current resolver, specify `workspace.resolver = "1"` in the workspace root's manifest note: to use the edition 2021 resolver, specify `workspace.resolver = "2"` in the workspace root's manifest note: for more details see https://doc.rust-lang.org/cargo/reference/resolver.html#resolver-versions Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.37s Running `target/debug/sightglass-cli benchmark --engine ./delete-special-mem-main.so --engine ./delete-special-mem.so --iterations-per-process 10 --processes 1 -m insts-retired -- benchmarks/spidermonkey/benchmark.wasm ./benchmarks/pulldown-cmark/benchmark.wasm benchmarks/bz2/benchmark.wasm benchmarks/regex/benchmark.wasm` execution :: instructions-retired :: benchmarks/regex/benchmark.wasm Δ = 53721614.20 ± 3668.44 (confidence = 99%) -main.so is 1.07x to 1.07x faster than .so! [737583027 737586621.20 737590770] -main.so [791305114 791308235.40 791313975] .so execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm Δ = 15985756.30 ± 2.10 (confidence = 99%) -main.so is 1.07x to 1.07x faster than .so! [227672476 227672479.20 227672481] -main.so [243658234 243658235.50 243658239] .so instantiation :: instructions-retired :: benchmarks/bz2/benchmark.wasm Δ = 2971.00 ± 2002.10 (confidence = 99%) -main.so is 1.02x to 1.11x faster than .so! [43088 45173.90 49561] -main.so [46621 48144.90 49943] .so execution :: instructions-retired :: ./benchmarks/pulldown-cmark/benchmark.wasm Δ = 976271.20 ± 336.81 (confidence = 99%) -main.so is 1.05x to 1.05x faster than .so! [19771797 19771883.20 19772465] -main.so [20747941 20748154.40 20748599] .so execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm Δ = 29346100.10 ± 77173.30 (confidence = 99%) -main.so is 1.01x to 1.01x faster than .so! [2660295835 2660354378.00 2660420572] -main.so [2689612009 2689700478.10 2689780017] .so compilation :: instructions-retired :: benchmarks/regex/benchmark.wasm Δ = 80572.40 ± 75315.03 (confidence = 99%) -main.so is 1.00x to 1.00x faster than .so! [37695341 37800338.60 37888633] -main.so [37747142 37880911.00 37960870] .so instantiation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm No difference in performance. [46462 71324.50 168586] -main.so [46932 47315.60 47940] .so instantiation :: instructions-retired :: ./benchmarks/pulldown-cmark/benchmark.wasm No difference in performance. [37923 60559.60 69184] -main.so [63367 68729.70 72940] .so instantiation :: instructions-retired :: benchmarks/regex/benchmark.wasm No difference in performance. [35647 36528.00 37098] -main.so [35728 41068.30 60408] .so compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm No difference in performance. [4079658 4110764.30 4134315] -main.so [4079041 4097484.70 4118287] .so compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm No difference in performance. [374729187 375180105.70 375606520] -main.so [375041594 375590062.10 376019380] .so compilation :: instructions-retired :: ./benchmarks/pulldown-cmark/benchmark.wasm No difference in performance. [15772628 15798740.50 15873683] -main.so [15784192 15808475.30 15841131] .so ```

Weirdly there is now no longer any compilation difference, but there are execution differences.

I'm going to close this for now, however I'm planning on redoing these benchmarks when we have a jump threading pass, which I'm currently working on (slowly).

bytecodealliance / wasmtime

cranelift: Delete scalar `{u,s}loadNN` and `storeNN` instructions #8785