Closed EgorBo closed 5 months ago
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
Author: | EgorBo |
---|---|
Assignees: | - |
Labels: | `area-CodeGen-coreclr` |
Milestone: | - |
Assigning Jakob to triage it (as a struct promotion owner)
when we spill all stack entries to locals at the end of a block, we should not spill constants and local addresses
If the successor block is not a join this might be easier to enable -- the complicated case is one where a block's successor has other predecessors, since all predecessors must produce identical stacks of temps so that any one predecessor's exit state represents all predecessors' exit state.
The rough idea would be something like this: in impImportBlock
if all the block's successors have only this block as predecessor, then only spill the expensive trees to temps (and if there is just one successor, maybe don't spill anything). We already compute the right property via multRef
but don't use it for anything, so perhaps sometime in the past there was a similar optimization and proved unworkable or got removed.
Also it seems like we are always cloning the exit state trees when importing a block. Sems like we could optimize this to use the trees directly for one successor and only clone for the others (this happens in verInitBBEntryState
). If we did that it would save a bit of time and space for cases where we see lots of non-empty stacks at block ends.
If the successor block is not a join this might be easier to enable -- the complicated case is one where a block's successor has other predecessors, since all predecessors must produce identical stacks of temps so that any one predecessor's exit state represents all predecessors' exit state.
In this case we manage to fold the join away due to inlining, so it seems conceivable to handle this particular case in that way. However, I think in most natural cases produced by Roslyn (assuming this pattern mainly is produced for ternaries), we are going to see both a split and a join. E.g.:
[MethodImpl(MethodImplOptions.NoInlining)]
private static void Foo(bool b)
{
S s;
s.Opts = b ? 1 : 2;
}
public struct S
{
public int Opts;
}
also ends up with the unfortunate address exposure, and in this case there is both a split and join. Allowing the propagation when there is a unique predecessor would change the IR from:
***** BB01
STMT00001 ( ??? ... 0x003 )
N002 ( 7, 6) [000005] DA--------- ▌ STORE_LCL_VAR byref V03 tmp1 // store of IL stack entries for clique BB01 -> BB02, BB03
N001 ( 3, 3) [000000] ----------- └──▌ LCL_ADDR byref V01 loc0 [+0]
▌ int V01.Program+S:Opts (offs=0x00) -> V06 tmp4
***** BB01
STMT00000 ( 0x000[E-] ... 0x003 )
N005 ( 7, 8) [000004] ----------- ▌ JTRUE void
N004 ( 5, 6) [000003] J------N--- └──▌ NE int
N002 ( 3, 4) [000024] ----------- ├──▌ CAST int <- bool <- int
N001 ( 2, 2) [000001] ----------- │ └──▌ LCL_VAR int V00 arg0
N003 ( 1, 1) [000002] ----------- └──▌ CNS_INT int 0
------------ BB02 [005..008) -> BB04 (always), preds={BB01} succs={BB04}
***** BB02
STMT00006 ( ??? ... 0x006 )
N002 ( 7, 5) [000020] DA--------- ▌ STORE_LCL_VAR byref V04 tmp2 // store 1 of IL stack entries for clique BB02, BB03 -> BB04
N001 ( 3, 2) [000007] ----------- └──▌ LCL_VAR byref V03 tmp1
***** BB02
STMT00007 ( ??? ... ??? )
N002 ( 5, 4) [000022] DA--------- ▌ STORE_LCL_VAR int V05 tmp3 // store 2 of IL stack entries for clique BB02, BB03 -> BB04
N001 ( 1, 1) [000019] ----------- └──▌ CNS_INT int 2
------------ BB03 [008..009), preds={BB01} succs={BB04}
***** BB03
STMT00002 ( ??? ... 0x008 )
N002 ( 7, 5) [000010] DA--------- ▌ STORE_LCL_VAR byref V04 tmp2 // store 1 of IL stack entries for clique BB02, BB03 -> BB04
N001 ( 3, 2) [000008] ----------- └──▌ LCL_VAR byref V03 tmp1
***** BB03
STMT00003 ( ??? ... ??? )
N002 ( 5, 4) [000012] DA--------- ▌ STORE_LCL_VAR int V05 tmp3 // store 2 of IL stack entries for clique BB02, BB03 -> BB04
N001 ( 1, 1) [000009] ----------- └──▌ CNS_INT int 1
------------ BB04 [009..00F) (return), preds={BB02,BB03} succs={}
***** BB04
STMT00004 ( ??? ... 0x009 )
N003 ( 10, 7) [000017] -A-XG------ ▌ STOREIND int
N001 ( 3, 2) [000014] ----------- ├──▌ LCL_VAR byref V04 tmp2
N002 ( 3, 2) [000015] ----------- └──▌ LCL_VAR int V05 tmp3
to
***** BB01
STMT00000 ( 0x000[E-] ... 0x003 )
N005 ( 7, 8) [000004] ----------- ▌ JTRUE void
N004 ( 5, 6) [000003] J------N--- └──▌ NE int
N002 ( 3, 4) [000024] ----------- ├──▌ CAST int <- bool <- int
N001 ( 2, 2) [000001] ----------- │ └──▌ LCL_VAR int V00 arg0
N003 ( 1, 1) [000002] ----------- └──▌ CNS_INT int 0
------------ BB02 [005..008) -> BB04 (always), preds={BB01} succs={BB04}
***** BB02
STMT00006 ( ??? ... 0x006 )
N002 ( 7, 5) [000020] DA--------- ▌ STORE_LCL_VAR byref V04 tmp2 // new store of IL stack entries for clique BB02, BB03 -> BB04
N001 ( 3, 3) [000000] ----------- └──▌ LCL_ADDR byref V01 loc0 [+0]
▌ int V01.Program+S:Opts (offs=0x00) -> V06 tmp4
***** BB02
STMT00007 ( ??? ... ??? )
N002 ( 5, 4) [000022] DA--------- ▌ STORE_LCL_VAR int V05 tmp3
N001 ( 1, 1) [000019] ----------- └──▌ CNS_INT int 2
------------ BB03 [008..009), preds={BB01} succs={BB04}
***** BB03
STMT00002 ( ??? ... 0x008 )
N002 ( 7, 5) [000010] DA--------- ▌ STORE_LCL_VAR byref V04 tmp2 // new store of IL stack entries for clique BB02, BB03 -> BB04
N001 ( 3, 3) [000000] ----------- └──▌ LCL_ADDR byref V01 loc0 [+0]
▌ int V01.Program+S:Opts (offs=0x00) -> V06 tmp4
***** BB03
STMT00003 ( ??? ... ??? )
N002 ( 5, 4) [000012] DA--------- ▌ STORE_LCL_VAR int V05 tmp3
N001 ( 1, 1) [000009] ----------- └──▌ CNS_INT int 1
------------ BB04 [009..00F) (return), preds={BB02,BB03} succs={}
***** BB04
STMT00004 ( ??? ... 0x009 )
N003 ( 10, 7) [000017] -A-XG------ ▌ STOREIND int
N001 ( 3, 2) [000014] ----------- ├──▌ LCL_VAR byref V04 tmp2
N002 ( 3, 2) [000015] ----------- └──▌ LCL_VAR int V05 tmp3
We could potentially handle this when importing BB04 by checking the predecessor assignments into the spill temps. But we would need to try harder to import blocks in RPO as we don't import these in the right order to do this today (the import order is BB01 -> BB03 -> BB04 -> BB02).
Seems like it would be useful here and in morph (and maybe in more phases) to build a general worklist driven algorithm for processing in RPO. The rough idea would be to first run a DFS to identify a DAG (or adopt some other DAG-identifying convention like all lexically backward edges are back edges) and then set up a priority queue (or similar) on blocks with the priority being the number of unvisited preds.
Then process all the priority 0 nodes. As each node finishes decrement the priority of each successor reached along a non-backedge, until the queue is empty. If the phase trims away an edge because of optimization, then also decrement the successor count (if a non-backedge).
If new blocks are added or flow is altered things get tricker. Not a problem for the importer but morph can do various flow edits so we'd need to be careful.
We would need a priority queue implementation that is not to allocation-happy, but we can likely use a free list for the internal data structures so can keep the allocation proportional to the number of blocks and perhaps amortize that over multiple traversals. Or since we really only care about nodes of priority zero, just keep a block->count map and whenever a block's count gets to zero, move it to the worklist.
I have a desire to fix this in 9.0 by introducing some limited flow-sensitive propagation of LCL_ADDR
values as part of local morph. It would fix this issue and it would improve many cases where inlining can end up spilling LCL_ADDR
nodes (e.g. #69254, many of the Matrix4x4
and Matrix3x2
methods).
This could also be handled via tail duplication, basically move (and duplicate) the storeind backwards into its preds, based on the intersection of the variables it reads and the constants available for those locals in the preds. I am playing around with this idea in some other contexts...
Minimal repro for a CQ issue @stephentoub noticed in https://github.com/dotnet/runtime/pull/87067: a constant doesn't succesfully propagate through a struct:
Codegen for
Test
:Now replace that ternary operation with
if-else
(uncomment it and remove the ternary):JitDump diff: https://www.diffchecker.com/ld33PGnr (left -
if-esle
, right -ternary
). So, apparently, we don't enregister the Opts field due to address exposure, @SingleAccretion suggested that when we spill all stack entries to locals at the end of a block, we should not spill constants and local addresses. @AndyAyersMS noted that it might be a big amount of work to do so.@dotnet/jit-contrib @jakobbotsch