Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch. See info in area-owners.md if you want to be subscribed.
| Author: | jakobbotsch |
| --- | --- |
| Assignees: | - |
| Labels: | `area-CodeGen-coreclr` |
| Milestone: | - |
Since liveness runs backwards, we likely want just a plain postorder (successors before preds), not a reverse postorder (preds before successors).
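As a hedged illustration of why that ordering suits a backward problem (standalone hypothetical types, not the JIT's LiveVarAnalysis): sweeping in post order means each block sees already-updated live-in sets for all of its non-back-edge successors within the same sweep, so acyclic regions converge in one pass and only cycles force extra sweeps.

```cpp
// Hedged sketch (hypothetical standalone types): backward liveness solved by
// sweeping blocks in plain post order, i.e. successors before predecessors.
#include <cstdint>
#include <vector>

struct Block
{
    std::vector<int> succs;       // indices of successor blocks
    uint64_t         use     = 0; // vars read before any write in the block
    uint64_t         def     = 0; // vars written in the block
    uint64_t         liveIn  = 0;
    uint64_t         liveOut = 0;
};

// DFS from 'n', appending blocks in post order.
static void PostOrder(const std::vector<Block>& g, int n, std::vector<bool>& seen, std::vector<int>& order)
{
    seen[n] = true;
    for (int s : g[n].succs)
    {
        if (!seen[s])
        {
            PostOrder(g, s, seen, order);
        }
    }
    order.push_back(n);
}

static void SolveLiveness(std::vector<Block>& g, int entry)
{
    std::vector<bool> seen(g.size());
    std::vector<int>  order;
    PostOrder(g, entry, seen, order);

    // Sweep in post order: every successor reached by a non-back edge was
    // already updated earlier in the same sweep, so only cycles force
    // additional sweeps before the fixpoint is reached.
    bool changed = true;
    while (changed)
    {
        changed = false;
        for (int n : order)
        {
            Block&   b   = g[n];
            uint64_t out = 0;
            for (int s : b.succs)
            {
                out |= g[s].liveIn;
            }
            uint64_t in = b.use | (out & ~b.def);
            if ((in != b.liveIn) || (out != b.liveOut))
            {
                b.liveIn  = in;
                b.liveOut = out;
                changed   = true;
            }
        }
    }
}
```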
https://github.com/dotnet/runtime/compare/main...jakobbotsch:runtime:liveness-fixpoint-strategies?expand=1 has an experimental version of the SCC approach, computed with Tarjan's algorithm.
While finding the SCCs themselves is not too bad code-wise, actually iterating the successors that may affect liveness is very complicated and very expensive. I had to add a visitor-based successor function to avoid losing all of the throughput wins to the extra GetAllSuccs calls made while finding SCCs, and the successor code still shows up near the top of the profile.
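For reference, the rough shape of that approach, as a hedged sketch with hypothetical standalone types (not the code in the experimental branch): Tarjan's algorithm with successors enumerated through a callback, which is the role the visitor-based successor function plays.

```cpp
// Hedged sketch: Tarjan's SCC algorithm where successors are produced through
// a visitor callback instead of a materialized successor list. Recursive for
// brevity; a real JIT version would need to be iterative to bound stack usage.
#include <algorithm>
#include <functional>
#include <vector>

// visitSuccs(n, f) is expected to call f(succ) once per successor of block n;
// it stands in for the visitor-based successor function described above.
using SuccVisitor = std::function<void(int, const std::function<void(int)>&)>;

struct SccState
{
    std::vector<int>  index;
    std::vector<int>  lowlink;
    std::vector<int>  sccId;
    std::vector<bool> onStack;
    std::vector<int>  stack;
    int               nextIndex = 0;
    int               nextScc   = 0;
};

static void TarjanVisit(int n, const SuccVisitor& visitSuccs, SccState& st)
{
    st.index[n] = st.lowlink[n] = st.nextIndex++;
    st.stack.push_back(n);
    st.onStack[n] = true;

    visitSuccs(n, [&](int s) {
        if (st.index[s] < 0)
        {
            TarjanVisit(s, visitSuccs, st); // tree edge: recurse
            st.lowlink[n] = std::min(st.lowlink[n], st.lowlink[s]);
        }
        else if (st.onStack[s])
        {
            st.lowlink[n] = std::min(st.lowlink[n], st.index[s]); // edge within the current SCC
        }
    });

    if (st.lowlink[n] == st.index[n]) // n is the root of its SCC
    {
        int m;
        do
        {
            m = st.stack.back();
            st.stack.pop_back();
            st.onStack[m] = false;
            st.sccId[m]   = st.nextScc;
        } while (m != n);
        st.nextScc++;
    }
}

// Returns an SCC id per block; blocks on the same cycle share an id.
static std::vector<int> FindSccs(int numBlocks, const SuccVisitor& visitSuccs)
{
    SccState st;
    st.index.assign(numBlocks, -1);
    st.lowlink.assign(numBlocks, -1);
    st.sccId.assign(numBlocks, -1);
    st.onStack.assign(numBlocks, false);
    for (int n = 0; n < numBlocks; n++)
    {
        if (st.index[n] < 0)
        {
            TarjanVisit(n, visitSuccs, st);
        }
    }
    return st.sccId;
}
```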
The throughput improvement is pretty modest:
```
Base: 97346934422, Diff: 97032570982, -0.3229%

VisitAllSuccessors<`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1> > : 200204050 : NA : 8.64% : +0.2057%
?fgLocalVarLivenessFindSCC@Compiler@@QEAAXPEAUSCCNode@@AEAU2@AEAHAEAV?$ArrayStack@PEAUSCCNode@@@@@Z : 156037841 : NA : 6.73% : +0.1603%
?Push@?$ArrayStack@PEAUGenTree@@@@QEAAXPEAUGenTree@@@Z : 146341436 : +72.21% : 6.31% : +0.1503%
VisitLivenessSuccessors<`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1> > : 125829706 : NA : 5.43% : +0.1293%
`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1>::operator() : 121235726 : NA : 5.23% : +0.1245%
VisitEHSuccessors<`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1> > : 85531421 : NA : 3.69% : +0.0879%
?fgLocalVarLivenessFindSCCs@Compiler@@QEAAXXZ : 82035143 : NA : 3.54% : +0.0843%
VisitSuccessorsEHSuccessors<`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1> > : 70136763 : NA : 3.03% : +0.0720%
?allocateMemory@ArenaAllocator@@QEAAPEAX_K@Z : 8903876 : +0.43% : 0.38% : +0.0091%
memset : 2889512 : +0.40% : 0.12% : +0.0030%
?Assign@?$BitSetOps@PEA_K$00PEAVCompiler@@VTrackedVarBitSetTraits@@@@SAXPEAVCompiler@@AEAPEA_KPEA_K@Z : -4796448 : -1.50% : 0.21% : -0.0049%
?fgGetHandlerLiveVars@Compiler@@QEAAPEA_KPEAUBasicBlock@@@Z : -7906158 : -10.88% : 0.34% : -0.0081%
?LivenessDLong@?$BitSetOps@PEA_K$00PEAVCompiler@@VTrackedVarBitSetTraits@@@@CAXPEAVCompiler@@AEAPEA_KQEA_K22@Z : -23881701 : -13.93% : 1.03% : -0.0245%
?ehGetBlockExnFlowDsc@Compiler@@QEAAPEAUEHblkDsc@@PEAUBasicBlock@@@Z : -32483225 : -17.34% : 1.40% : -0.0334%
?ClearD@?$BitSetOps@PEA_K$00PEAVCompiler@@VTrackedVarBitSetTraits@@@@SAXPEAVCompiler@@AEAPEA_K@Z : -62872657 : -25.52% : 2.71% : -0.0646%
?Advance@AllSuccessorIterPosition@@QEAAXPEAVCompiler@@PEAUBasicBlock@@@Z : -78804203 : -16.77% : 3.40% : -0.0810%
?NumSucc@BasicBlock@@QEAAIPEAVCompiler@@@Z : -99404116 : -14.97% : 4.29% : -0.1021%
?FindNextRegSuccTry@EHSuccessorIterPosition@@AEAAXPEAVCompiler@@PEAUBasicBlock@@@Z : -109928534 : -16.83% : 4.74% : -0.1129%
??0AllSuccessorIterPosition@@QEAA@PEAVCompiler@@PEAUBasicBlock@@@Z : -128716399 : -18.45% : 5.55% : -0.1322%
?GetSucc@BasicBlock@@QEAAPEAU1@IPEAVCompiler@@@Z : -145512470 : -14.04% : 6.28% : -0.1495%
?PerBlockAnalysis@LiveVarAnalysis@@AEAA_NPEAUBasicBlock@@_N1@Z : -618437434 : -32.10% : 26.68% : -0.6353%
```
It would also need to be rewritten in an iterative form. It doesn't seem worth it, though it might be worth looking at switching the rest of the JIT away from GetAllSuccs to this visitor-based function instead.
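To illustrate the difference between the two styles (hypothetical standalone types and signatures, not the JIT's actual `GetAllSuccs`/`VisitAllSuccessors` declarations): the materializing style builds a successor collection per query, while the visitor style hands each successor, EH successors included, straight to a callback.

```cpp
// Hedged sketch of the two successor-enumeration styles (hypothetical types).
#include <vector>

struct Block
{
    std::vector<Block*> regularSuccs;
    std::vector<Block*> ehSuccs; // handlers reachable if this block throws
};

// Materializing style: every query builds a fresh successor collection.
static std::vector<Block*> GetAllSuccessors(const Block* b)
{
    std::vector<Block*> all(b->regularSuccs);
    all.insert(all.end(), b->ehSuccs.begin(), b->ehSuccs.end());
    return all;
}

// Visitor style: each successor, EH ones included, goes straight to the
// callback with no intermediate container or iterator state.
template <typename TFunc>
static void VisitAllSuccessors(const Block* b, TFunc func)
{
    for (Block* s : b->regularSuccs)
    {
        func(s);
    }
    for (Block* s : b->ehSuccs)
    {
        func(s);
    }
}
```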
Some idle thoughts:

- `hashBv` already implements a sparse vector, but it is not used much and its interface diverges from our other bit vectors.

I tried to switch liveness to use a PO traversal as part of #95085, to see if it would make up for some of the throughput cost. The code is here: https://github.com/jakobbotsch/runtime/pull/new/dead-block-elimination-liveness-post-order. Sadly I didn't keep/share the detailed throughput results anywhere, but if I remember right, it did not look like it gained us that much.
We might be able to be even more efficient by using the loop table, at least in the future if we end up with the more "general" loop finding in the middle end through #95251. If all cycles are natural loops, then we can iterate their SCCs to completion before proceeding with other blocks. However, iterating (only) the loop blocks repeatedly will be more expensive than iterating the full post order, since the loop blocks are stored as a bit vector while the full post order is much more efficient to access.
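A rough sketch of that idea, under the assumption that all cycles are natural loops (hypothetical types, not the JIT's loop table; nested loops are folded into their outermost loop for simplicity): run a local fixpoint over a loop's members when the post-order sweep reaches its header, so blocks outside of loops are visited only once.

```cpp
// Hedged sketch: per-loop local fixpoints inside a single post-order sweep,
// assuming all cycles are natural loops. Nested loops are folded into their
// outermost loop, so each block maps to at most one loop here.
#include <vector>

struct NaturalLoop
{
    int              header; // loop header block index
    std::vector<int> blocks; // all member block indices, header included
};

// 'update(n)' recomputes block n's live sets and returns true if they changed.
template <typename TUpdate>
void SolveWithLoops(const std::vector<int>&         postOrder,   // successors before preds
                    const std::vector<NaturalLoop>& loops,
                    const std::vector<int>&         loopOfBlock, // outermost loop index, or -1
                    TUpdate                         update)
{
    for (int n : postOrder)
    {
        int l = loopOfBlock[n];
        if (l < 0)
        {
            // Not part of any cycle: one visit suffices in post order.
            update(n);
        }
        else if (loops[l].header == n)
        {
            // When the sweep reaches the header, successors outside the loop
            // should already be final; iterate the members to a local fixpoint.
            bool changed = true;
            while (changed)
            {
                changed = false;
                for (int m : loops[l].blocks)
                {
                    changed |= update(m);
                }
            }
        }
        // Non-header loop members are handled by their loop's local fixpoint.
    }
}
```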
It would be nice to be able to efficiently use bit vectors as enumerable sets of blocks, but that requires keeping a mapping from bit vector indices back to blocks. Ideally that mapping would persist and not need to be rebuilt all the time, so we could amortize its cost.
Likely this means using `bbID` or something similar as a BV index that never (or rarely) changes; since that may be sparse and grow over time, it would probably also need some kind of sparse-ish representation for BVs (maybe a compact known-sized part with extensions?), and perhaps just a hash map for the id -> block part?
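Purely to illustrate the shape of that idea (hypothetical types and names, not a concrete proposal for the BV implementation): a set keyed by a stable block id with a compact known-sized part plus a sparse overflow, and a hash map on the side to get from id back to the block.

```cpp
// Hedged sketch: a block set keyed by a stable id, with a compact fixed-size
// dense part plus a sparse overflow, and a side map from id back to the block
// so the set can be enumerated.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Block
{
    unsigned id; // stands in for a never-reused identifier like bbID
};

class BlockIdSet
{
    std::vector<uint64_t>                  m_dense;    // ids known at creation time
    std::unordered_map<unsigned, uint64_t> m_overflow; // ids allocated later

public:
    explicit BlockIdSet(unsigned knownIds)
        : m_dense((knownIds + 63) / 64)
    {
    }

    void Add(unsigned id)
    {
        unsigned word = id / 64;
        uint64_t bit  = uint64_t(1) << (id % 64);
        if (word < m_dense.size())
            m_dense[word] |= bit;
        else
            m_overflow[word] |= bit;
    }

    bool Contains(unsigned id) const
    {
        unsigned word = id / 64;
        uint64_t bit  = uint64_t(1) << (id % 64);
        if (word < m_dense.size())
            return (m_dense[word] & bit) != 0;
        auto it = m_overflow.find(word);
        return (it != m_overflow.end()) && ((it->second & bit) != 0);
    }
};

// The id -> block side of the mapping; enumerating a BlockIdSet would go
// through this table to recover Block* values from set bits.
using BlockIdMap = std::unordered_map<unsigned, Block*>;
```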
I am using the post-order index for the bit vectors in #95251, coupled with storing the post order itself in a reachable place (the DFS tree contains it and is referenced by the loop class). It means you can add and remove blocks from the flow graph while still being able to map back using the stored post-order array. Of course, if you are adding and removing blocks from the flow graph then you are invalidating the information in the DFS tree and loop structures, but that is to be expected, and each individual pass needs to deal with that.

I think the DFS and loop identification is cheap enough that just rebuilding the structures at the end of the phase is something we might seriously consider. For example, I have a local change where I port loop cloning to the new representation; the throughput cost is around 0.4% for building the DFS and identifying loops, doing loop cloning, and then rebuilding the structures if any flowgraph modifications were made as part of loop cloning. I fully expect us to easily make up that 0.4% throughput by removing the old DFS, reachability, and dominator recomputations that we do all the time.
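The arrangement described above, as a minimal hedged sketch (standalone hypothetical types; in the JIT the post order lives in the DFS tree referenced by the loop structures): bits are indexed by post-order number and the stored post-order array maps an index back to its block.

```cpp
// Hedged sketch: post-order numbers as the bit-vector index, with the stored
// post-order array mapping an index back to its block. Flow graph edits leave
// existing entries usable but invalidate the DFS/loop information, so the
// structures would be rebuilt at the end of the phase if anything changed.
// Uses C++20 <bit> for the bit scan.
#include <bit>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Block
{
    unsigned postOrderNum; // assigned by the DFS
};

struct DfsTree
{
    std::vector<Block*> postOrder; // postOrder[i]->postOrderNum == i

    Block* GetBlock(unsigned postOrderNum) const
    {
        return postOrder[postOrderNum];
    }
};

// Enumerate the blocks whose bits are set in a post-order-indexed bit vector.
template <typename TFunc>
void ForEachBlockInSet(const DfsTree& dfs, const std::vector<uint64_t>& bv, TFunc func)
{
    for (std::size_t word = 0; word < bv.size(); word++)
    {
        uint64_t bits = bv[word];
        while (bits != 0)
        {
            unsigned bit = static_cast<unsigned>(std::countr_zero(bits));
            func(dfs.GetBlock(static_cast<unsigned>(word * 64) + bit));
            bits &= (bits - 1); // clear the lowest set bit
        }
    }
}
```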
If recomputing the graph "metadata" becomes cheap enough then that's certainly very appealing, since, as you say, the other downside of a persistent representation is keeping it accurate over time.
This was fixed by #103809; we're now using RPO during the dataflow.
We can use an RPO traversal in liveness to try to minimize the number of iterations we need, as suggested by @AndyAyersMS in https://github.com/dotnet/runtime/pull/86043#discussion_r1199099286.
Another interesting experiment would be to compute strongly connected components and compute the fixpoint within each SCC.