bytecodealliance / wasmtime

A fast and secure runtime for WebAssembly
https://wasmtime.dev/
Apache License 2.0
14.82k stars 1.24k forks source link

Cranelift: support >2M vregs (SSA values) with alternate `Operand` packing #8783

Open cfallin opened 3 weeks ago

cfallin commented 3 weeks ago

Currently, Cranelift with regalloc2 supports a maximum of 2^21 (2097152, a little over two million) virtual registers in VCode; in turn this limits the number of SSA values we support. This implementation limit is generally not an issue (we've lived with it for two years at least!) but sometimes does come up with giant functions produced by tooling -- for example, long initialization sequences, or giant state machines, or overeager inlining.

The origin of this limit is the dense packing of a regalloc2::Operand into 32 bits. The dense packing in turn is extremely important for performance: earlier performance experiments with regalloc2 showed in the neighborhood of 5-10% degradations from moving to a 64-bit Operand. It turns out to be very important for cache locality to have the vreg, its kind and location of mention (early/late use/def), and any constraint (physical reg, stack, reuse, ...) all in one place and in a dense list in each liverange.

That said, if the alternative is to fail allocation entirely, a little slowdown doesn't seem like such a bad alternative: so this issue proposes a hybrid solution where we instantiate regalloc2 twice, with a trait parameter that can plug in two implementations of Operand, and choose which of these to use at the Cranelift level. At the point that we are creating Operands, in a pass after lowering to build a linear array of them, we know how many vregs have been used; and earlier stages use full u32s to store vreg numbers, so we can support large counts up to that pass. The idea would be to include two Vecs in VCode -- Vec<Operand> and Vec<BigOperand> or similar -- and fill one or the other in the collection pass, then invoke a statically-monomorphized entry point to regalloc2 with a regalloc2::Function implementation on one or the other operand kind. (This may require a little trickery to "specialize" to reading from one vec or the other, maybe via an adapter struct, I'm not sure; but seems doable.)

This way, we would retain exactly the same behavior and performance as today for all functions that successfully compile today (except for an extra 24 bytes in VCode for that empty Vec); and we could support a full 32-bit range for vregs as well.