dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.17k stars 4.72k forks source link

Revisit LSRA block sequence strategy #66994

Open kunalspathak opened 2 years ago

kunalspathak commented 2 years ago

LSRA comes up with a block order in which it would allocate register for the code inside it. It would sequence all the predecessors of a block before scheduling a block. It also compares the bbWeight of a block and gives preference to the block with higher weights. It is not uncommon to have multiple blocks of same weight (either because of lack of enough profile data or other reasons) and in such cases, we rely on the block number of the block as seen below.

https://github.com/dotnet/runtime/blob/96ef47b2c6d6591cebba4eb260763576621e8236/src/coreclr/jit/lsra.cpp#L1030-L1060

This is arguably not a good heuristic to arrange LSRA block sequencing. The bbNum are purely dependent on the how the blocks are laid out our data structure and might not represent the flow graph. The original thinking could be that the blocks with lower bbNum are executed first than the ones with higher numbers, but still, it is not the optimal strategy to have. If the blocks are renumbered, it can drastically change the allocation strategy because of different block sequencing coming out of it. This was seen in https://github.com/dotnet/runtime/pull/66967 where I had to skip renumbering the blocks so the impact of "unreachable block elimination" doesn't regress register allocation.

Just to give a sense, I tried following strategies and the impact it has on code size is huge.

Strategy Methods improved Methods regressed Code size impact (in bytes)
Prefer block1 over block2 if block1 has more predecessors than those of block2. 2292 3313 23028
Prefer block1 over block2 if block1 has less predecessors than those of block2. 2669 3141 3079
Prefer block1 over block2 if block1 has more bbReach nodes than those of block2. 3495 3923 14156
Prefer block1 over block2 if block1 has less bbReach nodes than those of block2. 2642 3425 4155

Some other strategies could be:

  1. Prefer blocks depending on if they have critical in/out edges.
  2. Prefer blocks depending on if they have EH boundary in/out edges.
  3. Prefer blocks if they are in loop (this could be mostly taken care by bbWeight but should double check).

A clever way is to have a mix of above listed strategies in choosing the block sequencing.

category:design theme:register-allocator skill-level:expert cost:medium impact:large

kunalspathak commented 2 years ago

@dotnet/jit-contrib

AndyAyersMS commented 2 years ago

See also https://github.com/dotnet/runtime/issues/9833#issuecomment-372847082 (a few years old now)

ghost commented 2 years ago

Tagging subscribers to this area: @JulieLeeMSFT See info in area-owners.md if you want to be subscribed.

Issue Details
LSRA comes up with a block order in which it would allocate register for the code inside it. It would sequence all the predecessors of a block before scheduling a block. It also compares the `bbWeight` of a block and gives preference to the block with higher weights. It is not uncommon to have multiple blocks of same weight (either because of lack of enough profile data or other reasons) and in such cases, we rely on the block number of the block as seen below. https://github.com/dotnet/runtime/blob/96ef47b2c6d6591cebba4eb260763576621e8236/src/coreclr/jit/lsra.cpp#L1030-L1060 This is arguably not a good heuristic to arrange LSRA block sequencing. The `bbNum` are purely dependent on the how the blocks are laid out our data structure and might not represent the flow graph. The original thinking could be that the blocks with lower `bbNum` are executed first than the ones with higher numbers, but still, it is not the optimal strategy to have. If the blocks are renumbered, it can drastically change the allocation strategy because of different block sequencing coming out of it. This was seen in https://github.com/dotnet/runtime/pull/66967 where I had to skip renumbering the blocks so the impact of "unreachable block elimination" doesn't regress register allocation. Just to give a sense, I tried following strategies and the impact it has on code size is huge. | Strategy | Methods improved | Methods regressed | Code size impact (in bytes) | |--------------------------------------------------------------------------------------------|------------------|-------------------|-----------------------------| | Prefer `block1` over `block2` if `block1` has **more predecessors** than those of `block2`. | 2292 | 3313 | 23028 | | Prefer `block1` over `block2` if `block1` has **less predecessors** than those of `block2`. | 2669 | 3141 | 3079 | | Prefer `block1` over `block2` if `block1` has **more `bbReach`** nodes than those of `block2`. | 3495 | 3923 | 14156 | | Prefer `block1` over `block2` if `block1` has **less `bbReach`** nodes than those of `block2`. | 2642 | 3425 | 4155 | Some other strategies could be: 1. Prefer blocks depending on if they have critical in/out edges. 2. Prefer blocks depending on if they have EH boundary in/out edges. 3. Prefer blocks if they are in loop (this could be mostly taken care by `bbWeight` but should double check). A clever way is to have a mix of above listed strategies in choosing the block sequencing.
Author: kunalspathak
Assignees: kunalspathak
Labels: `area-CodeGen-coreclr`
Milestone: -
kunalspathak commented 2 years ago

Some more ideas:

  1. Ideally, for a block, we would like to schedule a block that will most likely be the successor of it (pgo wise). We have that information (may be outdated), but currently, our requirements are to sequence all the predecessors before scheduling that block. It might be tricky to get both these heuristics simultaneously.
  2. Another idea would be to produce minimum spanning tree where blocks are connected with each other based upon the number of variables they share. Stronger the blocks connected, more variables they have in common. When time comes to scheduling a block for LSRA, we can look at which one is strongly connected to the current block and sequence that.