bytecodealliance / wasmtime

A fast and secure runtime for WebAssembly
https://wasmtime.dev/
Apache License 2.0
14.94k stars 1.25k forks source link

Cranelift: Consider rewriting small `br_table`s into conditional branch sequences #6051

Open fitzgen opened 1 year ago

fitzgen commented 1 year ago

I'm seeing a bunch of 3-5 target br_tables in spidermonkey.wasm, and I suspect that a series of conditional branches might perform better than these small br_tables and their indirect jumps. That they are showing up in spidermonkey.wasm means that LLVM's Wasm backend is regularly emitting them, so this wouldn't be too overly specific of an optimization.

Could potentially do this in the mid-end or during lowering. Haven't thought too much about it.

cfallin commented 1 year ago

We could definitely do it during lowering; we could have pseudoinstruction sequences for powers of two (2, 4, 8 targets at least) that are full binary trees, and if we have e.g. 7 targets instead then fill in the remainders with the default label.

One consideration here is to ensure that we don't lose the Spectre mitigation properties: we have a cmove to short-circuit the loaded target to default on out-of-bounds, and modern CPUs (i) guarantee that cmoves are not data-speculated, generally (or have a special incantation a la aarch64's csdb barrier to get this guarantee), and (ii) have thought specifically about the Spectre implications of indirect branch prediction and mitigated that. A tree of conditional branches may also be safe (I'm not aware of any CPUs that feed data from speculative execution back into the direction predictor) but we'd have to be sure.

afonso360 commented 1 year ago

We have similar logic in the the switch interface where it can emit trees of branches if it thinks is beneficial. We could do a quick test by replacing all br_tables with that, it also falls back to br_table when necessary.

jameysharp commented 1 year ago

If you give the switch helper in cranelift-frontend a single contiguous range of cases, it will always generate a single br_table for the whole range. The only exceptions are when there are no cases (which just jumps to the default case) or when there's exactly one case (which generates a brif to choose between the case and the default). So that isn't as helpful as it looks.

I think I'd weakly argue against doing this during lowering, because I think that will impede branch optimization. As long as we represent the search tree explicitly in CLIF, the block lowering order will be able to see that control flow and have the opportunity to fuse jump-target blocks into the tree. If we use pseudo-instructions then the branches will be emitted too late for that.