bytecodealliance / regalloc2

A new register allocator
Apache License 2.0

Investigate better bitpacking for Operand and Use #5

Open cfallin opened 3 years ago

cfallin commented 3 years ago

Two core data-structure elements, Operand and Use, are both designed to fit a relatively large amount of information in one u32. This is a performance optimization that we have found to be relatively impactful; expanding even to a u64 has a measurable impact (of at least a few percent) on compilation time.

Unfortunately, the scarcity of bits means that certain limits are lower than we would prefer. For example, we support only a 5-bit index for physical registers in each register class (so 32 integer registers and 32 float/vector registers), which may not be enough for some use-cases (though it can work for aarch64 and x64 at least). This also limits the VReg count to 1M (2^20).
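The limits above fall out of how the bits in a single u32 are budgeted. As a rough illustration (this is not regalloc2's actual field layout, and the names are invented here), a 20-bit vreg index and a 5-bit preg index per class consume 25 bits, leaving only a handful for class, kind, position, and constraint tags:

```rust
// Illustrative bit budget only; field layout and names are assumptions,
// not regalloc2's real encoding.
//
//   bits  0..20  vreg index   (2^20 = 1M virtual registers)
//   bits 20..25  preg index   (2^5  = 32 physical registers per class)
//   bits 25..32  class / kind / pos / constraint tags
const VREG_BITS: u32 = 20;
const PREG_BITS: u32 = 5;

fn pack(vreg: u32, preg: u32, tags: u32) -> u32 {
    debug_assert!(vreg < (1 << VREG_BITS));
    debug_assert!(preg < (1 << PREG_BITS));
    debug_assert!(tags < (1 << (32 - VREG_BITS - PREG_BITS)));
    vreg | (preg << VREG_BITS) | (tags << (VREG_BITS + PREG_BITS))
}

fn unpack_vreg(word: u32) -> u32 {
    word & ((1 << VREG_BITS) - 1)
}

fn unpack_preg(word: u32) -> u32 {
    (word >> VREG_BITS) & ((1 << PREG_BITS) - 1)
}

fn main() {
    let w = pack(12345, 7, 3);
    assert_eq!(unpack_vreg(w), 12345);
    assert_eq!(unpack_preg(w), 7);
}
```

Any widening of one field must come out of another, which is why the proposals below move rarely-needed fields out of the word entirely.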

We should investigate ways to move infrequently used information (such as fixed-PReg constraints) out of line, raising the limits on VRegs, PRegs, instruction count, and the like, and providing enough headroom for any reasonably imaginable use case.

Amanieu commented 3 years ago

When we remove non-SSA support (#4), we can shrink OperandKind to 1 bit instead of 2. We can also change the encoding of constraint + preg to take only 6 bits instead of 8.

This bumps the limit for the VReg index to 8M (2^23). However I don't see how we can fit explicit stack slots into this scheme.

To go even further, we would need to store auxiliary Operand information in a separate u8. By moving preg/index into the second word, we would leave 26 bits for the VReg index, which allows for up to 64M VRegs (the remaining 6 bits are: kind, pos, class and constraint:3). The size of Use wouldn't increase since we have a padding byte to spare. This also allows us to have up to 256 PRegs per class instead of 32, 256 explicit stack slots, etc.

struct OperandExtra(u8);

trait Function {
    // Both slices must have the same length: extras[i] carries the
    // out-of-line byte (preg index, stack slot, etc.) for operands[i].
    fn inst_operands(&self, insn: Inst) -> (&[Operand], &[OperandExtra]);
}
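A decode step for this split scheme might look like the following sketch. The names and exact tag layout are assumptions; the point is that the main u32 keeps a 26-bit vreg index while the side byte supplies the preg or stack-slot index:

```rust
// Hypothetical decoder for the split Operand/OperandExtra scheme
// described above. Field layout and names are illustrative only.
struct Operand(u32);
struct OperandExtra(u8);

const VREG_MASK: u32 = (1 << 26) - 1;

fn decode(op: &Operand, extra: &OperandExtra) -> (u32, u8) {
    let vreg = op.0 & VREG_MASK; // 26-bit vreg index => up to 64M vregs
    // Upper 6 bits would hold kind, pos, class, constraint:3.
    let preg_or_slot = extra.0;  // full u8 => up to 256 pregs/slots per class
    (vreg, preg_or_slot)
}

fn main() {
    let op = Operand(123 | (5 << 26));
    let extra = OperandExtra(200);
    let (vreg, preg) = decode(&op, &extra);
    assert_eq!(vreg, 123);
    assert_eq!(preg, 200);
}
```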
cfallin commented 3 years ago

Allowing even a second u32, tagged with an instruction index, seems like it could be a reasonable approach to providing more information in "rare" cases (stackslots, fixed pregs). I guess at this point we're really designing a variable-length compression scheme, so we should be somewhat data-driven on that question, of course.

The same approach could be used for Uses as for Operands, though any tagging of the extra word to associate with an index in the main array would have to be re-encoded to refer to use-indices per LiveRange.

As a separate thought: it could be worth measuring the runtime overhead of a fixed number of extra bits instead. It's certainly the preferable approach if cheap enough; split data structures will complicate everything else. IIRC, I did try a middle design point of 48 bits per Use -- a u32 and a u16 in a packed struct -- and that still had a measurable impact at one point, but possibly things have changed. (I don't have time to do this at the moment but figured I should note it for when this is considered next.)
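The 48-bit middle ground mentioned above can be sketched as follows. The field names are invented here; the key detail is that `#[repr(packed)]` keeps the struct at 6 bytes rather than letting alignment pad it back out to 8:

```rust
// Sketch of a 48-bit Use: a u32 plus a u16, packed so no padding is
// inserted. Field names are illustrative assumptions.
#[repr(C, packed)]
struct Use48 {
    word: u32,  // main bitpacked payload
    extra: u16, // overflow bits (e.g. wider preg/vreg indices)
}

fn main() {
    // Packed layout: 6 bytes, versus 8 with natural alignment.
    assert_eq!(std::mem::size_of::<Use48>(), 6);
}
```

The trade-off is that packed fields may be misaligned, so reads of `word` can be slower on some targets, which is one plausible source of the measurable impact noted above.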

cfallin commented 3 years ago

Ah, I found it -- I did some measurements here and found a 3% overhead when using a 64-bit Operand. That's not great but it does upper-bound things (it's the simplest solution here, everything else we've proposed is more compact). Maybe someone could measure what a 48-bit Operand would do...

Amanieu commented 1 year ago

I ran a benchmark to estimate the overhead of 64-bit Operand by adding a pad: u32 field to Operand. I was not able to find any measurable performance difference, although my results were a bit noisy.

Could you give this a try with Cranelift? If you also see no measurable difference, then I will prepare a PR to increase the Operand size to 64-bit, which will also resolve #131.

cfallin commented 1 year ago

@Amanieu I'm a bit swamped at the moment but you might be able to run this yourself? I would suggest tackling the noise issue also, at least for this decision: we've had false results before where a 1% shift hides within noise, or appears out of nowhere (i.e., false positives and negatives), on systems where we didn't explicitly isolate a core, pin frequency, and steer interrupts away. @jameysharp wrote up a good doc here on what we do. You don't have to figure out Sightglass necessarily; I'd consider a `hyperfine './wasmtime.base compile file.wasm' './wasmtime.new compile file.wasm'` adequate for this.

I will say that I did see a shift of several percent when I tried this... 1.5 years ago now?... and so while a lot has changed since then and it could be "free-ish" now, the onus of extra-careful proof is on the proposed change here.

One other experiment that would be useful, to evaluate the measurement setup, is to try to swing the needle further and see what the sensitivity is: add extra padding to Operand to make it 12 bytes, 16 bytes, etc., up to something absurd (64 bytes?). If the measurement setup does start to show, e.g., 1% at 16 bytes, then that gives better confidence of small impact at 8 bytes (64 bits).
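Before running that sensitivity sweep, it's worth sanity-checking that the padded layouts are actually the sizes the benchmark thinks it is measuring. A minimal sketch, with stand-in names for the padded variants:

```rust
// Stand-in padded variants for the sensitivity experiment described
// above. Names are illustrative; the real change would pad regalloc2's
// actual Operand type.
#[repr(C)]
struct Operand32(u32);
#[repr(C)]
struct Operand64 {
    word: u32,
    pad: u32,
}
#[repr(C)]
struct Operand128 {
    word: u32,
    pad: [u32; 3],
}

fn main() {
    // Confirm each variant has the intended size before benchmarking.
    assert_eq!(std::mem::size_of::<Operand32>(), 4);
    assert_eq!(std::mem::size_of::<Operand64>(), 8);
    assert_eq!(std::mem::size_of::<Operand128>(), 16);
}
```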