dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.14k stars 4.71k forks source link

JIT: Flow Graph Modernization and Improved Block Layout #93020

Closed AndyAyersMS closed 2 months ago

AndyAyersMS commented 1 year ago

Overview

The current block layout algorithm in the JIT is based on local permutations of block order. It is complicated and likely far from optimal. We would like to improve the overall block layout algorithm used by the JIT, in particular adopting a global cost-minimizing approach to layout—for instance, one in the style of Young et. al.'s Near-optimal intraprocedural branch alignment. Additional complexities arise in our case because of various EH reporting requirements, so the JIT cannot freely reorder all blocks, but we should be able to apply the global ordering techniques within EH regions.

Before we can tackle this problem there are several important (and sizeable) prerequisites, which we can lump together as "flow graph modernization." There are a lot of details here, but at a high level:

It is not yet clear how much progress we can make during .NET 9. The list of items below is preliminary and subject to change.

Motivation

Past studies have shown that the two phases that benefit most from block-level PGO data are inlining and block layout. In a previous compiler project, the net benefit from PGO was on the order of 15%, with about 12% attributable to inlining, and 2% to improved layout.

The current JIT is likely seeing a much smaller benefit from layout. The goal here is to ensure that we are using the accurate PGO data to make informed decisions about the ordering of blocks, with the hope of realizing perhaps a 1 or 2% net benefit across a wide range of applications (with some benefiting much more, and others, not at all).

Flow Graph Modernization

Block Layout

cc @amanasifkhalid @dotnet/jit-contrib

ghost commented 1 year ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

Issue Details
### Overview The current block layout algorithm in the JIT is based local permutations of block order. It is complicated and likely far from optimal. We would like to improve the overall block layout algorithm used by the JIT, in particular adopting a glboal cost-minimizing approach to layout—for instance, one in the style of Young et. al.'s [Near-optimal intraprocedural branch alignment](https://dl.acm.org/doi/abs/10.1145/258915.258932). Additional complexities arise in our case because of various EH reporting requirements, so the JIT cannot freely reorder all blocks, but we should be able to apply the global ordering techniques within EH regions. Before we can tackle this problem there are several important (and sizeable) prerequisites, which we can lump together as "flow graph modernization." There are a lot of details here, but at a high level: * Update the flow graph to make fall-through behavior explicit for most phases (until layout). This means disallowing BBJ_NONE and adding explicit fall through successors for BBJ_COND and BBJ_SWITCH. More on this below. * Defer most block reordering work until late in the JIT phase pipeline (ideally, perhaps, waiting until after LSRA has run, so we can properly position any new blocks it introduces). * Leverage the work done in .NET 8 to make flow edges persistent and ensure that those edges accurately describe successor likelihoods. Possibly make successor edge enumeration a first-class concept. It is not yet clear how much progress we can make during .NET 9. The list of items below is preliminary and subject to change. #### Flow Graph Modernization - [ ] No implicit fall through (until layout) - [ ] Hide various fields on `BasicBlock` behind setters/getters (`bbNext`, `bbJumpKind`, `bbJumpTarget`, ...) - https://github.com/dotnet/runtime/pull/92908 - [ ] Introduce explicit fall through successor field and make sure it is properly updated. Initially it will just mirror `bbNext` for blocks with fall-through jump kinds - [ ] Start allowing the fall through target to diverge from `bbNext`. Might be best to do this gradually, working front-to-back through the phases, and then restoring correspondence. Eventually we'll be restoring right at block layout. Main work here is root out places that implicitly rely on `bbNext` ordering having some important semantic. Note the `bbNext` order can still reflect likely layout order (eg it can agree with fall throughs after we renumber/etc). - [ ] Start removing / conditionalizing the logic that reverses branches and/or introduces new blocks to preserve implicit fall-through behavior, upstream of the changes above - [ ] Defer block reordering (until layout). - [ ] Disable any remaining bits of code in the jit that reorder blocks early. We likely will still want to run various flow graph simplification steps (combining blocks, removing empty blocks, etc). But we should not need to reverse conditionals. - [ ] Make edge profile data reliable throughout the phases - [ ] Ensure that edge likelihoods are set for all flow edges, then ensure that successor edge likelihoods sum to 1.0. - [ ] Make access to successor edges cheaper/more explicit. For instance, replace direct `BasicBlock` references for `bbJumpTarget` with the appropriate `FlowEdge`. Consider (perhaps) linking `FlowEdges` in a successor list as well as predecessor list. - [ ] Look into automatic maintenance of pred and successor edge lists - [ ] Audit updates to likelihoods to make sure they are locally reasonable - [ ] Introduce a profile repair phase that we run (likely before LSRA/layout) to re-establish global consistency of block weights - [ ] As a part of this, find a solution to recovering block weights in graphs with irreducible loops #### Block Layout - [ ] Look into running existing layout after LSRA. Note there may be some tricky interactions, if say reversing a conditional perturbs the allocation for some reason - [ ] Building on the work above, leverage block weights and successor edge likelihoods to build a good initial layout via something like a greedy RPO - [ ] Implement some sort of layout score describing costs of a layout - [ ] (Possibly) find a way of estimating the optimal layout score - [ ] Implement a scheme to improve layout based on lowering score; one such scheme is described in [Near-optimal intraprocedural branch alignment](https://dl.acm.org/doi/abs/10.1145/258915.258932) - [ ] Think about the interaction of layout and hot/cold splitting cc @amanasifkhalid @dotnet/jit-contrib
Author: AndyAyersMS
Assignees: -
Labels: `area-CodeGen-coreclr`
Milestone: 9.0.0
jakobbotsch commented 10 months ago

As a prerequisite for some of the block reordering work we'll likely need to change loop alignment to be more centralized. We currently identify the initial candidate blocks to place loop alignment instructions in during loop finding and apply some heuristics when computing loop side effects during VN. We will probably need to defer all of these decisions to happen after block reordering. I am also inclined to say that we should just recompute the loops at that point, instead of trying to maintain loop information -- we have a lot of code that works hard to maintain bbNatLoopNum all the way into the backend that we could remove. Since the block reordering is likely to need a DFS as well, the extra TP we'll end up paying is just for the loop identification, which is not that much (tpdiff).

amanasifkhalid commented 2 months ago

Potential .NET 10 items

Block layout:

Flowgraph Modernization:

cc @AndyAyersMS, feel free to add to this.

AndyAyersMS commented 2 months ago

I think that's a good start. I'm going to move this issue to .NET 10 for now, later we can decide if we want to split off a new issue or just revise this one.

AndyAyersMS commented 2 months ago

Let's split off a new issue for future work.