bytecodealliance / regalloc2

A new register allocator
Apache License 2.0

Support for cold blocks #22

Open Amanieu opened 2 years ago

Amanieu commented 2 years ago

It would be useful to be able to mark some blocks as "cold", meaning that they lie on rarely taken paths. The register allocator should prefer placing spills and moves in cold blocks where possible.

It turns out that very little needs to be done if we take advantage of the block ordering by requiring all cold blocks to be after normal blocks in terms of block index and instruction indices. This has the following consequences:

My only concern is that the block order will no longer be in RPO, which is the ordering recommended by the documentation (quoted below). While regalloc2 will still function properly, I am less sure of the impact this may have on the heuristics.

Note that there are no requirements related to the ordering of blocks, and there is no requirement that the control flow be reducible. Some heuristics used by the allocator will perform better if the code is reducible and ordered in reverse postorder (RPO), however: in particular, (1) this interacts better with the contiguous-range-of-instruction-indices live range representation that we use, and (2) the "approximate loop depth" metric will actually be exact if both these conditions are met.
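The layout constraint proposed in this comment (every cold block placed after every normal block in index order) can be checked mechanically. The following is a minimal sketch, assuming a hypothetical per-block `is_cold` flag slice indexed by block number; the helper name `cold_suffix_start` is made up for illustration and is not part of regalloc2's API.

```rust
/// Checks that cold blocks form a suffix of the linear block order.
///
/// `is_cold[b]` is a hypothetical flag saying whether block `b` is cold.
/// Returns `Ok(Some(i))` with the index of the first cold block,
/// `Ok(None)` if there are no cold blocks, or `Err(i)` with the index of
/// the first normal block found after a cold one.
fn cold_suffix_start(is_cold: &[bool]) -> Result<Option<usize>, usize> {
    // Index of the first cold block, if any.
    let first = is_cold.iter().position(|&c| c);
    if let Some(start) = first {
        // Every block from `start` onward must also be cold.
        if let Some(off) = is_cold[start..].iter().position(|&c| !c) {
            return Err(start + off);
        }
    }
    Ok(first)
}
```

A client could run a check like this before handing the function to the allocator, so that a misplaced cold block is reported rather than silently skewing the heuristics.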

bjorn3 commented 2 years ago

FYI there is already an issue on the wasmtime side: https://github.com/bytecodealliance/wasmtime/issues/2747

cfallin commented 2 years ago

@Amanieu I think that placing cold blocks at the end of the function in the linear block order should basically just work, as you say.

The two issues in the design doc could potentially have an impact on compile time (first issue -- because we'll have longer, discontiguous live ranges) and code quality (second issue -- because an out-of-lined block from an inner loop, sunk to the end, could cause the approximate metric to treat the entire remainder of the function as a hot inner loop body).

For the second, I think a simple enough answer is to stop the approximate-loop-depth scan before any cold blocks (and treat them as zero depth). That would also have the side effect of making the spill cost low in the cold paths, which is what we want.

So I can imagine adding a method to the Function trait, something like fn first_cold_block(&self) -> Option<Block>, and then using it to end this scan early; that should be it. Does that seem reasonable to you?
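To illustrate the idea of ending the scan early, here is a rough sketch of an approximate loop-depth computation with a cold-block cutoff. It is not regalloc2's actual implementation: the `Cfg` struct, its fields, and the `approx_loop_depth` function are hypothetical stand-ins, blocks are plain `usize` indices in linear order, and a backedge is assumed to be any edge whose target index is less than or equal to its source index.

```rust
/// Hypothetical, simplified CFG: `succs[b]` lists the successors of block
/// `b`, and block indices follow the linear block order.
struct Cfg {
    succs: Vec<Vec<usize>>,
    /// Index of the first cold block, if any (assumed to start a suffix).
    first_cold_block: Option<usize>,
}

/// Approximate loop depth per block. Blocks at or after the first cold
/// block are skipped entirely and keep depth 0.
fn approx_loop_depth(cfg: &Cfg) -> Vec<u32> {
    let n = cfg.succs.len();
    let end = cfg.first_cold_block.unwrap_or(n).min(n);

    // For each loop header, record the last hot block with a backedge to it.
    let mut loop_end: Vec<Option<usize>> = vec![None; n];
    for src in 0..end {
        for &dst in &cfg.succs[src] {
            if dst <= src {
                let e = loop_end[dst].get_or_insert(src);
                *e = (*e).max(src);
            }
        }
    }

    let mut depth = vec![0u32; n];
    // End indices of the loops that are open at the current block.
    let mut open: Vec<usize> = Vec::new();
    for b in 0..end {
        if let Some(e) = loop_end[b] {
            open.push(e);
        }
        depth[b] = open.len() as u32;
        // Close every loop whose last backedge source is this block.
        open.retain(|&e| e > b);
    }
    depth
}
```

In a small example with a loop over blocks 1 and 2 and a cold block 4 that jumps back into the loop, running the scan without the cutoff treats the edge from block 4 as a backedge and inflates the depth of everything up to block 4, which is the code-quality concern raised above. With the cutoff, only the real loop body gets a nonzero depth.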