dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

JIT: Try more optimal fixpoint computation strategies for liveness #86527

Closed jakobbotsch closed 4 months ago

jakobbotsch commented 1 year ago

We can use an RPO traversal in liveness to try to minimize the number of iterations we need, as suggested by @AndyAyersMS in https://github.com/dotnet/runtime/pull/86043#discussion_r1199099286.

Another interesting experiment would be to compute strongly connected components and compute the fixpoint within each SCC.
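As a rough illustration of why visitation order matters for a backward problem like liveness, here is a toy fixpoint loop over a small CFG. The names (`Block`, `ComputeLiveness`) and the single-word bit masks are hypothetical, not JIT code: the point is that an order where successors are visited before their predecessors lets most liveness facts propagate in a single pass.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy backward liveness over a tiny CFG. Bit i of a mask represents
// variable i. Illustrative sketch only, not the JIT's implementation.
struct Block
{
    std::vector<int> succs;
    uint64_t         use     = 0; // vars read before any write in the block
    uint64_t         def     = 0; // vars written in the block
    uint64_t         liveIn  = 0;
    uint64_t         liveOut = 0;
};

// Iterate blocks in the given order until nothing changes.
// Returns the number of full passes needed to reach the fixpoint.
static int ComputeLiveness(std::vector<Block>& blocks, const std::vector<int>& order)
{
    int  passes  = 0;
    bool changed = true;
    while (changed)
    {
        changed = false;
        passes++;
        for (int i : order)
        {
            Block&   b   = blocks[i];
            uint64_t out = 0;
            for (int s : b.succs)
                out |= blocks[s].liveIn;
            uint64_t in = b.use | (out & ~b.def);
            if (in != b.liveIn || out != b.liveOut)
            {
                b.liveIn  = in;
                b.liveOut = out;
                changed   = true;
            }
        }
    }
    return passes;
}
```

For an acyclic CFG a successors-first order converges in one pass plus one verification pass; cycles add extra passes proportional to how far liveness has to flow around the back edges.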

ghost commented 1 year ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

Issue Details
We can use an RPO traversal in liveness to try to minimize the number of iterations we need, as suggested by @AndyAyersMS in https://github.com/dotnet/runtime/pull/86043#discussion_r1199099286. Another interesting experiment would be to compute strongly connected components and compute liveness within each SCC.
Author: jakobbotsch
Assignees: -
Labels: `area-CodeGen-coreclr`
Milestone: -
AndyAyersMS commented 1 year ago

Since liveness runs backwards, we likely want just a plain postorder (successors before preds), not a reverse postorder (preds before successors).

jakobbotsch commented 1 year ago

https://github.com/dotnet/runtime/compare/main...jakobbotsch:runtime:liveness-fixpoint-strategies?expand=1 has an experimental version of the SCC approach, computed with Tarjan's algorithm.

While finding the SCCs themselves is not too bad code-wise, actually iterating the successors that may affect liveness is very complicated and very expensive. I had to add a visitor-based successor function to avoid losing all the throughput wins to the extra GetAllSuccs calls made while finding SCCs, and the successor code still shows up at the top of the profile.

The throughput improvement is pretty modest:

Base: 97346934422, Diff: 97032570982, -0.3229%

VisitAllSuccessors<`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1> >                                    : 200204050  : NA      : 8.64%  : +0.2057%
?fgLocalVarLivenessFindSCC@Compiler@@QEAAXPEAUSCCNode@@AEAU2@AEAHAEAV?$ArrayStack@PEAUSCCNode@@@@@Z            : 156037841  : NA      : 6.73%  : +0.1603%
?Push@?$ArrayStack@PEAUGenTree@@@@QEAAXPEAUGenTree@@@Z                                                         : 146341436  : +72.21% : 6.31%  : +0.1503%
VisitLivenessSuccessors<`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1> >                               : 125829706  : NA      : 5.43%  : +0.1293%
`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1>::operator()                                             : 121235726  : NA      : 5.23%  : +0.1245%
VisitEHSuccessors<`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1> >                                     : 85531421   : NA      : 3.69%  : +0.0879%
?fgLocalVarLivenessFindSCCs@Compiler@@QEAAXXZ                                                                  : 82035143   : NA      : 3.54%  : +0.0843%
VisitSuccessorsEHSuccessors<`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1> >                           : 70136763   : NA      : 3.03%  : +0.0720%
?allocateMemory@ArenaAllocator@@QEAAPEAX_K@Z                                                                   : 8903876    : +0.43%  : 0.38%  : +0.0091%
memset                                                                                                         : 2889512    : +0.40%  : 0.12%  : +0.0030%
?Assign@?$BitSetOps@PEA_K$00PEAVCompiler@@VTrackedVarBitSetTraits@@@@SAXPEAVCompiler@@AEAPEA_KPEA_K@Z          : -4796448   : -1.50%  : 0.21%  : -0.0049%
?fgGetHandlerLiveVars@Compiler@@QEAAPEA_KPEAUBasicBlock@@@Z                                                    : -7906158   : -10.88% : 0.34%  : -0.0081%
?LivenessDLong@?$BitSetOps@PEA_K$00PEAVCompiler@@VTrackedVarBitSetTraits@@@@CAXPEAVCompiler@@AEAPEA_KQEA_K22@Z : -23881701  : -13.93% : 1.03%  : -0.0245%
?ehGetBlockExnFlowDsc@Compiler@@QEAAPEAUEHblkDsc@@PEAUBasicBlock@@@Z                                           : -32483225  : -17.34% : 1.40%  : -0.0334%
?ClearD@?$BitSetOps@PEA_K$00PEAVCompiler@@VTrackedVarBitSetTraits@@@@SAXPEAVCompiler@@AEAPEA_K@Z               : -62872657  : -25.52% : 2.71%  : -0.0646%
?Advance@AllSuccessorIterPosition@@QEAAXPEAVCompiler@@PEAUBasicBlock@@@Z                                       : -78804203  : -16.77% : 3.40%  : -0.0810%
?NumSucc@BasicBlock@@QEAAIPEAVCompiler@@@Z                                                                     : -99404116  : -14.97% : 4.29%  : -0.1021%
?FindNextRegSuccTry@EHSuccessorIterPosition@@AEAAXPEAVCompiler@@PEAUBasicBlock@@@Z                             : -109928534 : -16.83% : 4.74%  : -0.1129%
??0AllSuccessorIterPosition@@QEAA@PEAVCompiler@@PEAUBasicBlock@@@Z                                             : -128716399 : -18.45% : 5.55%  : -0.1322%
?GetSucc@BasicBlock@@QEAAPEAU1@IPEAVCompiler@@@Z                                                               : -145512470 : -14.04% : 6.28%  : -0.1495%
?PerBlockAnalysis@LiveVarAnalysis@@AEAA_NPEAUBasicBlock@@_N1@Z                                                 : -618437434 : -32.10% : 26.68% : -0.6353%

It would also need to be rewritten iteratively. It doesn't seem worth it, though it may be worth looking at switching the rest of the JIT away from GetAllSuccs to this visitor-based function instead.

AndyAyersMS commented 1 year ago

Some idle thoughts:

jakobbotsch commented 11 months ago

I tried switching liveness to a PO traversal as part of #95085, to see if it would make up for some of the throughput cost. The code is here: https://github.com/jakobbotsch/runtime/pull/new/dead-block-elimination-liveness-post-order Sadly I didn't keep or share the detailed throughput results anywhere, but if I remember right, it did not gain us that much.

We might be able to be even more efficient by using the loop table, at least in the future if we end up with the more "general" loop finding in the middle end through #95251. If all cycles are natural loops, then we can iterate each loop's SCC to completion before proceeding with other blocks. However, repeatedly iterating only the loop blocks will be more expensive per pass than iterating the full post-order, since the loop blocks are stored as a bit vector while the full post-order is much more efficient to access.

AndyAyersMS commented 11 months ago

It would be nice to be able to efficiently use bit vectors as enumerable sets of blocks, but that requires keeping a mapping from bit vector indices back to blocks. Ideally that mapping would persist and not need to be rebuilt all the time, so we could amortize its cost.

Likely this means using bbID or something similar as a BV index that never (or rarely) changes. Since that index space may be sparse and grow over time, we'd want some kind of sparse-ish representation for BVs (maybe a compact known-sized part with extensions?), and perhaps just a hash map for the id -> block part.
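A minimal sketch of that idea, assuming a dense word for low ids plus hash-mapped extension chunks for the sparse tail (names hypothetical, not the JIT's BitSetOps):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Sparse set of blocks keyed by a stable id (e.g. bbID): a compact
// fixed-size dense part, extended by a hash map of 64-bit chunks.
// Purely illustrative sketch of the idea above.
class SparseBlockSet
{
    static const unsigned DenseBits = 64;
    uint64_t dense = 0;                            // ids [0, 64)
    std::unordered_map<unsigned, uint64_t> sparse; // chunk index -> bits

public:
    void Add(unsigned id)
    {
        if (id < DenseBits)
            dense |= (uint64_t)1 << id;
        else
            sparse[id / 64] |= (uint64_t)1 << (id % 64);
    }

    bool Contains(unsigned id) const
    {
        if (id < DenseBits)
            return ((dense >> id) & 1) != 0;
        auto it = sparse.find(id / 64);
        return it != sparse.end() && ((it->second >> (id % 64)) & 1) != 0;
    }
};
```

Enumeration back to blocks would then go through the id -> block hash map mentioned above, which is the part that would need to be kept alive across phases.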

jakobbotsch commented 11 months ago

I am using the post-order index for the bit vectors in #95251, coupled with storing the post-order itself in a reachable place (the DFS tree contains it and is referenced by the loop class). That means you can add and remove blocks from the flow graph while still being able to map back using the stored post-order array. Of course, adding and removing blocks invalidates the information in the DFS tree and loop structures, but that is to be expected, and each individual pass needs to deal with it.

I think the DFS and loop identification are cheap enough that just rebuilding the structures at the end of the phase is something we might seriously consider. For example, I have a local change that ports loop cloning to the new representation; the throughput cost is around 0.4% for building the DFS and identifying loops, doing loop cloning, and then rebuilding the structures if loop cloning made any flowgraph modifications. I fully expect us to easily make up that 0.4% by removing the old DFS, reachability, and dominator recomputations that we do all the time.
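A sketch of the mapping described above: with the post-order array kept around, a bit vector indexed by post-order number enumerates directly back to blocks with a plain array scan (types and names hypothetical, not JIT code):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct Block
{
    std::string name;
};

// Given the stored post-order array, bit i of the vector means
// "the block at post-order index i is in the set"; enumerating the
// set is just a scan of the array, no per-block hash lookup needed.
static std::vector<Block*> EnumerateSet(const std::vector<Block*>&   postOrder,
                                        const std::vector<uint64_t>& bits)
{
    std::vector<Block*> result;
    for (size_t i = 0; i < postOrder.size(); i++)
        if ((bits[i / 64] >> (i % 64)) & 1)
            result.push_back(postOrder[i]);
    return result;
}
```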

AndyAyersMS commented 11 months ago

If recomputing the graph "metadata" becomes cheap enough, then that's certainly very appealing, since, as you say, the other downside of a persistent representation is keeping it accurate over time.

jakobbotsch commented 4 months ago

This was fixed by #103809; we now use an RPO traversal during the dataflow.