Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch. See info in area-owners.md if you want to be subscribed.
| Author: | jakobbotsch |
| --- | --- |
| Assignees: | - |
| Labels: | `area-CodeGen-coreclr` |
| Milestone: | - |
Since liveness runs backwards, we likely want just a plain postorder (successors before preds), not a reverse postorder (preds before successors).
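As a hedged illustration of why that ordering suits a backward problem (standalone hypothetical types, not the JIT's LiveVarAnalysis): sweeping in post order means each block sees already-updated live-in sets for all of its non-back-edge successors within the same sweep, so acyclic regions converge in one pass and only cycles force extra sweeps.

```cpp
// Hedged sketch (hypothetical standalone types): backward liveness solved by
// sweeping blocks in plain post order, i.e. successors before predecessors.
#include <cstdint>
#include <vector>

struct Block
{
    std::vector<int> succs;       // indices of successor blocks
    uint64_t         use     = 0; // vars read before any write in the block
    uint64_t         def     = 0; // vars written in the block
    uint64_t         liveIn  = 0;
    uint64_t         liveOut = 0;
};

// DFS from 'n', appending blocks in post order.
static void PostOrder(const std::vector<Block>& g, int n, std::vector<bool>& seen, std::vector<int>& order)
{
    seen[n] = true;
    for (int s : g[n].succs)
    {
        if (!seen[s])
        {
            PostOrder(g, s, seen, order);
        }
    }
    order.push_back(n);
}

static void SolveLiveness(std::vector<Block>& g, int entry)
{
    std::vector<bool> seen(g.size());
    std::vector<int>  order;
    PostOrder(g, entry, seen, order);

    // Sweep in post order: every successor reached by a non-back edge was
    // already updated earlier in the same sweep, so only cycles force
    // additional sweeps before the fixpoint is reached.
    bool changed = true;
    while (changed)
    {
        changed = false;
        for (int n : order)
        {
            Block&   b   = g[n];
            uint64_t out = 0;
            for (int s : b.succs)
            {
                out |= g[s].liveIn;
            }
            uint64_t in = b.use | (out & ~b.def);
            if ((in != b.liveIn) || (out != b.liveOut))
            {
                b.liveIn  = in;
                b.liveOut = out;
                changed   = true;
            }
        }
    }
}
```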
https://github.com/dotnet/runtime/compare/main...jakobbotsch:runtime:liveness-fixpoint-strategies?expand=1 has an experimental version of the SCC approach, computed with Tarjan's algorithm.
While finding the SCCs themselves is not too bad code-wise, actually iterating the successors that may affect liveness is very complicated and very expensive. I had to add a visitor-based successor function to avoid losing all of the throughput wins to the extra GetAllSuccs calls made while finding SCCs, and the successor code still shows up near the top of the profile.
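For reference, the rough shape of that approach, as a hedged sketch with hypothetical standalone types (not the code in the experimental branch): Tarjan's algorithm with successors enumerated through a callback, which is the role the visitor-based successor function plays.

```cpp
// Hedged sketch: Tarjan's SCC algorithm where successors are produced through
// a visitor callback instead of a materialized successor list. Recursive for
// brevity; a real JIT version would need to be iterative to bound stack usage.
#include <algorithm>
#include <functional>
#include <vector>

// visitSuccs(n, f) is expected to call f(succ) once per successor of block n;
// it stands in for the visitor-based successor function described above.
using SuccVisitor = std::function<void(int, const std::function<void(int)>&)>;

struct SccState
{
    std::vector<int>  index;
    std::vector<int>  lowlink;
    std::vector<int>  sccId;
    std::vector<bool> onStack;
    std::vector<int>  stack;
    int               nextIndex = 0;
    int               nextScc   = 0;
};

static void TarjanVisit(int n, const SuccVisitor& visitSuccs, SccState& st)
{
    st.index[n] = st.lowlink[n] = st.nextIndex++;
    st.stack.push_back(n);
    st.onStack[n] = true;

    visitSuccs(n, [&](int s) {
        if (st.index[s] < 0)
        {
            TarjanVisit(s, visitSuccs, st); // tree edge: recurse
            st.lowlink[n] = std::min(st.lowlink[n], st.lowlink[s]);
        }
        else if (st.onStack[s])
        {
            st.lowlink[n] = std::min(st.lowlink[n], st.index[s]); // edge within the current SCC
        }
    });

    if (st.lowlink[n] == st.index[n]) // n is the root of its SCC
    {
        int m;
        do
        {
            m = st.stack.back();
            st.stack.pop_back();
            st.onStack[m] = false;
            st.sccId[m]   = st.nextScc;
        } while (m != n);
        st.nextScc++;
    }
}

// Returns an SCC id per block; blocks on the same cycle share an id.
static std::vector<int> FindSccs(int numBlocks, const SuccVisitor& visitSuccs)
{
    SccState st;
    st.index.assign(numBlocks, -1);
    st.lowlink.assign(numBlocks, -1);
    st.sccId.assign(numBlocks, -1);
    st.onStack.assign(numBlocks, false);
    for (int n = 0; n < numBlocks; n++)
    {
        if (st.index[n] < 0)
        {
            TarjanVisit(n, visitSuccs, st);
        }
    }
    return st.sccId;
}
```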
The throughput improvement is pretty modest:
```
Base: 97346934422, Diff: 97032570982, -0.3229%

VisitAllSuccessors<`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1> > : 200204050 : NA : 8.64% : +0.2057%
?fgLocalVarLivenessFindSCC@Compiler@@QEAAXPEAUSCCNode@@AEAU2@AEAHAEAV?$ArrayStack@PEAUSCCNode@@@@@Z : 156037841 : NA : 6.73% : +0.1603%
?Push@?$ArrayStack@PEAUGenTree@@@@QEAAXPEAUGenTree@@@Z : 146341436 : +72.21% : 6.31% : +0.1503%
VisitLivenessSuccessors<`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1> > : 125829706 : NA : 5.43% : +0.1293%
`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1>::operator() : 121235726 : NA : 5.23% : +0.1245%
VisitEHSuccessors<`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1> > : 85531421 : NA : 3.69% : +0.0879%
?fgLocalVarLivenessFindSCCs@Compiler@@QEAAXXZ : 82035143 : NA : 3.54% : +0.0843%
VisitSuccessorsEHSuccessors<`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1> > : 70136763 : NA : 3.03% : +0.0720%
?allocateMemory@ArenaAllocator@@QEAAPEAX_K@Z : 8903876 : +0.43% : 0.38% : +0.0091%
memset : 2889512 : +0.40% : 0.12% : +0.0030%
?Assign@?$BitSetOps@PEA_K$00PEAVCompiler@@VTrackedVarBitSetTraits@@@@SAXPEAVCompiler@@AEAPEA_KPEA_K@Z : -4796448 : -1.50% : 0.21% : -0.0049%
?fgGetHandlerLiveVars@Compiler@@QEAAPEA_KPEAUBasicBlock@@@Z : -7906158 : -10.88% : 0.34% : -0.0081%
?LivenessDLong@?$BitSetOps@PEA_K$00PEAVCompiler@@VTrackedVarBitSetTraits@@@@CAXPEAVCompiler@@AEAPEA_KQEA_K22@Z : -23881701 : -13.93% : 1.03% : -0.0245%
?ehGetBlockExnFlowDsc@Compiler@@QEAAPEAUEHblkDsc@@PEAUBasicBlock@@@Z : -32483225 : -17.34% : 1.40% : -0.0334%
?ClearD@?$BitSetOps@PEA_K$00PEAVCompiler@@VTrackedVarBitSetTraits@@@@SAXPEAVCompiler@@AEAPEA_K@Z : -62872657 : -25.52% : 2.71% : -0.0646%
?Advance@AllSuccessorIterPosition@@QEAAXPEAVCompiler@@PEAUBasicBlock@@@Z : -78804203 : -16.77% : 3.40% : -0.0810%
?NumSucc@BasicBlock@@QEAAIPEAVCompiler@@@Z : -99404116 : -14.97% : 4.29% : -0.1021%
?FindNextRegSuccTry@EHSuccessorIterPosition@@AEAAXPEAVCompiler@@PEAUBasicBlock@@@Z : -109928534 : -16.83% : 4.74% : -0.1129%
??0AllSuccessorIterPosition@@QEAA@PEAVCompiler@@PEAUBasicBlock@@@Z : -128716399 : -18.45% : 5.55% : -0.1322%
?GetSucc@BasicBlock@@QEAAPEAU1@IPEAVCompiler@@@Z : -145512470 : -14.04% : 6.28% : -0.1495%
?PerBlockAnalysis@LiveVarAnalysis@@AEAA_NPEAUBasicBlock@@_N1@Z : -618437434 : -32.10% : 26.68% : -0.6353%
```
It would also need to be rewritten in an iterative form. It doesn't seem worth it, though it might be worth looking at switching the rest of the JIT away from GetAllSuccs to this visitor-based function instead.
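To illustrate the difference between the two styles (hypothetical standalone types and signatures, not the JIT's actual `GetAllSuccs`/`VisitAllSuccessors` declarations): the materializing style builds a successor collection per query, while the visitor style hands each successor, EH successors included, straight to a callback.

```cpp
// Hedged sketch of the two successor-enumeration styles (hypothetical types).
#include <vector>

struct Block
{
    std::vector<Block*> regularSuccs;
    std::vector<Block*> ehSuccs; // handlers reachable if this block throws
};

// Materializing style: every query builds a fresh successor collection.
static std::vector<Block*> GetAllSuccessors(const Block* b)
{
    std::vector<Block*> all(b->regularSuccs);
    all.insert(all.end(), b->ehSuccs.begin(), b->ehSuccs.end());
    return all;
}

// Visitor style: each successor, EH ones included, goes straight to the
// callback with no intermediate container or iterator state.
template <typename TFunc>
static void VisitAllSuccessors(const Block* b, TFunc func)
{
    for (Block* s : b->regularSuccs)
    {
        func(s);
    }
    for (Block* s : b->ehSuccs)
    {
        func(s);
    }
}
```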
Some idle thoughts:

- `hashBv` already implements a sparse vector, but it is not used much and its interface diverges from our other bit vectors.

I tried to switch liveness to use a PO traversal as part of #95085, to see if it would make up for some of the throughput cost. The code is here: https://github.com/jakobbotsch/runtime/pull/new/dead-block-elimination-liveness-post-order. Sadly I didn't keep/share the detailed throughput results anywhere, but if I remember right, it did not look like it gained us that much.
We might be able to be even more efficient by using the loop table, at least in the future if we end up with the more "general" loop finding in the middle end through #95251. If all cycles are natural loops, then we can iterate their SCCs to completion before proceeding with other blocks. However, iterating (only) the loop blocks repeatedly will be more expensive than iterating the full post order, since the loop blocks are stored as a bit vector while the full post order is much more efficient to access.
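A rough sketch of that idea, under the assumption that all cycles are natural loops (hypothetical types, not the JIT's loop table; nested loops are folded into their outermost loop for simplicity): run a local fixpoint over a loop's members when the post-order sweep reaches its header, so blocks outside of loops are visited only once.

```cpp
// Hedged sketch: per-loop local fixpoints inside a single post-order sweep,
// assuming all cycles are natural loops. Nested loops are folded into their
// outermost loop, so each block maps to at most one loop here.
#include <vector>

struct NaturalLoop
{
    int              header; // loop header block index
    std::vector<int> blocks; // all member block indices, header included
};

// 'update(n)' recomputes block n's live sets and returns true if they changed.
template <typename TUpdate>
void SolveWithLoops(const std::vector<int>&         postOrder,   // successors before preds
                    const std::vector<NaturalLoop>& loops,
                    const std::vector<int>&         loopOfBlock, // outermost loop index, or -1
                    TUpdate                         update)
{
    for (int n : postOrder)
    {
        int l = loopOfBlock[n];
        if (l < 0)
        {
            // Not part of any cycle: one visit suffices in post order.
            update(n);
        }
        else if (loops[l].header == n)
        {
            // When the sweep reaches the header, successors outside the loop
            // should already be final; iterate the members to a local fixpoint.
            bool changed = true;
            while (changed)
            {
                changed = false;
                for (int m : loops[l].blocks)
                {
                    changed |= update(m);
                }
            }
        }
        // Non-header loop members are handled by their loop's local fixpoint.
    }
}
```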
It would be nice to be able to efficiently use bit vectors as enumerable sets of blocks, but that requires keeping a mapping from bit vector indices back to blocks. Ideally that mapping would persist and not need to be rebuilt all the time, so we could amortize its cost.
Likely this means using `bbID` or something similar as a BV index that never (or rarely) changes; since that may be sparse and grow over time, it would probably also need some kind of sparse-ish representation for BVs (maybe a compact known-sized part with extensions?), and perhaps just a hash map for the id -> block part?
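Purely to illustrate the shape of that idea (hypothetical types and names, not a concrete proposal for the BV implementation): a set keyed by a stable block id with a compact known-sized part plus a sparse overflow, and a hash map on the side to get from id back to the block.

```cpp
// Hedged sketch: a block set keyed by a stable id, with a compact fixed-size
// dense part plus a sparse overflow, and a side map from id back to the block
// so the set can be enumerated.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Block
{
    unsigned id; // stands in for a never-reused identifier like bbID
};

class BlockIdSet
{
    std::vector<uint64_t>                  m_dense;    // ids known at creation time
    std::unordered_map<unsigned, uint64_t> m_overflow; // ids allocated later

public:
    explicit BlockIdSet(unsigned knownIds)
        : m_dense((knownIds + 63) / 64)
    {
    }

    void Add(unsigned id)
    {
        unsigned word = id / 64;
        uint64_t bit  = uint64_t(1) << (id % 64);
        if (word < m_dense.size())
            m_dense[word] |= bit;
        else
            m_overflow[word] |= bit;
    }

    bool Contains(unsigned id) const
    {
        unsigned word = id / 64;
        uint64_t bit  = uint64_t(1) << (id % 64);
        if (word < m_dense.size())
            return (m_dense[word] & bit) != 0;
        auto it = m_overflow.find(word);
        return (it != m_overflow.end()) && ((it->second & bit) != 0);
    }
};

// The id -> block side of the mapping; enumerating a BlockIdSet would go
// through this table to recover Block* values from set bits.
using BlockIdMap = std::unordered_map<unsigned, Block*>;
```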
I am using the post-order index for the bit vectors in #95251, coupled with storing the post order itself in a reachable place (the DFS tree contains it and is referenced by the loop class). It means you can add and remove blocks from the flow graph while still being able to map back using the stored post-order array. Of course, if you are adding and removing blocks from the flow graph then you are invalidating the information in the DFS tree and loop structures, but that is to be expected, and each individual pass needs to deal with that.

I think the DFS and loop identification is cheap enough that just rebuilding the structures at the end of the phase is something we might seriously consider. For example, I have a local change where I port loop cloning to the new representation; the throughput cost is around 0.4% for building the DFS and identifying loops, doing loop cloning, and then rebuilding the structures if any flowgraph modifications were made as part of loop cloning. I fully expect us to easily make up that 0.4% throughput by removing the old DFS, reachability, and dominator recomputations that we do all the time.
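The arrangement described above, as a minimal hedged sketch (standalone hypothetical types; in the JIT the post order lives in the DFS tree referenced by the loop structures): bits are indexed by post-order number and the stored post-order array maps an index back to its block.

```cpp
// Hedged sketch: post-order numbers as the bit-vector index, with the stored
// post-order array mapping an index back to its block. Flow graph edits leave
// existing entries usable but invalidate the DFS/loop information, so the
// structures would be rebuilt at the end of the phase if anything changed.
// Uses C++20 <bit> for the bit scan.
#include <bit>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Block
{
    unsigned postOrderNum; // assigned by the DFS
};

struct DfsTree
{
    std::vector<Block*> postOrder; // postOrder[i]->postOrderNum == i

    Block* GetBlock(unsigned postOrderNum) const
    {
        return postOrder[postOrderNum];
    }
};

// Enumerate the blocks whose bits are set in a post-order-indexed bit vector.
template <typename TFunc>
void ForEachBlockInSet(const DfsTree& dfs, const std::vector<uint64_t>& bv, TFunc func)
{
    for (std::size_t word = 0; word < bv.size(); word++)
    {
        uint64_t bits = bv[word];
        while (bits != 0)
        {
            unsigned bit = static_cast<unsigned>(std::countr_zero(bits));
            func(dfs.GetBlock(static_cast<unsigned>(word * 64) + bit));
            bits &= (bits - 1); // clear the lowest set bit
        }
    }
}
```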
If recomputing the graph "metadata" becomes cheap enough then that's certainly very appealing, since, as you say, the other downside of a persistent representation is keeping it accurate over time.
This was fixed by #103809; we're now using RPO during the dataflow.
We can use an RPO traversal in liveness to try to minimize the number of iterations we need, as suggested by @AndyAyersMS in https://github.com/dotnet/runtime/pull/86043#discussion_r1199099286.
Another interesting experiment would be to compute strongly connected components and compute the fixpoint within each SCC.