dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.33k stars 4.74k forks source link

JIT: Constant is not propagated through a struct #87072

Closed EgorBo closed 5 months ago

EgorBo commented 1 year ago

Minimal repro for a CQ issue @stephentoub noticed in https://github.com/dotnet/runtime/pull/87067: a constant doesn't succesfully propagate through a struct:

struct Awaitable
{
    int Opts;

    Awaitable(bool value)
    {
        Opts = value ? 1 : 2;

        //
        //if (value)
        //    Opts = 1;
        //else
        //    Opts = 2;
    }
}

public static int Test() => new Awaitable(false).Opts;

Codegen for Test:

; Assembly listing for method Awaitable:Test():int
       push     rax
       xor      eax, eax
       mov      dword ptr [rsp], eax
       mov      dword ptr [rsp], 2
       mov      eax, dword ptr [rsp]
       add      rsp, 8
       ret      
; Total bytes of code 21

Now replace that ternary operation with if-else (uncomment it and remove the ternary):

; Method Awaitable:Test():int
       mov      eax, 2
       ret
; Total bytes of code: 6

JitDump diff: https://www.diffchecker.com/ld33PGnr (left - if-esle, right - ternary). So, apparently, we don't enregister the Opts field due to address exposure, @SingleAccretion suggested that when we spill all stack entries to locals at the end of a block, we should not spill constants and local addresses. @AndyAyersMS noted that it might be a big amount of work to do so.

@dotnet/jit-contrib @jakobbotsch

ghost commented 1 year ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

Issue Details
Minimal repro for a CQ issue @stephentoub noticed in https://github.com/dotnet/runtime/pull/87067: a constant doesn't succesfully propagate through a struct: ```csharp struct Awaitable { int Opts; Awaitable(bool value) { Opts = value ? 1 : 2; // //if (value) // Opts = 1; //else // Opts = 2; } } public static int Test() => new Awaitable(false).Opts; ``` Codegen for `Test`: ```asm ; Assembly listing for method Awaitable:Test():int push rax xor eax, eax mov dword ptr [rsp], eax mov dword ptr [rsp], 2 mov eax, dword ptr [rsp] add rsp, 8 ret ; Total bytes of code 21 ``` Now replace that ternary operation with `if-else` (uncomment it and remove the ternary): ```asm ; Method Awaitable:Test():int mov eax, 2 ret ; Total bytes of code: 6 ``` JitDump diff: https://www.diffchecker.com/ld33PGnr (left - `if-esle`, right - `ternary`). So, apparently, we don't enregister the Opts field due to address exposure, @SingleAccretion suggested that when we spill all stack entries to locals at the end of a block, we should not spill constants and local addresses. @AndyAyersMS noted that it might be a big amount of work to do so. @dotnet/jit-contrib @jakobbotsch
Author: EgorBo
Assignees: -
Labels: `area-CodeGen-coreclr`
Milestone: -
EgorBo commented 1 year ago

Assigning Jakob to triage it (as a struct promotion owner)

AndyAyersMS commented 1 year ago

when we spill all stack entries to locals at the end of a block, we should not spill constants and local addresses

If the successor block is not a join this might be easier to enable -- the complicated case is one where a block's successor has other predecessors, since all predecessors must produce identical stacks of temps so that any one predecessor's exit state represents all predecessors' exit state.

The rough idea would be something like this: in impImportBlock if all the block's successors have only this block as predecessor, then only spill the expensive trees to temps (and if there is just one successor, maybe don't spill anything). We already compute the right property via multRef but don't use it for anything, so perhaps sometime in the past there was a similar optimization and proved unworkable or got removed.

Also it seems like we are always cloning the exit state trees when importing a block. Sems like we could optimize this to use the trees directly for one successor and only clone for the others (this happens in verInitBBEntryState). If we did that it would save a bit of time and space for cases where we see lots of non-empty stacks at block ends.

jakobbotsch commented 1 year ago

If the successor block is not a join this might be easier to enable -- the complicated case is one where a block's successor has other predecessors, since all predecessors must produce identical stacks of temps so that any one predecessor's exit state represents all predecessors' exit state.

In this case we manage to fold the join away due to inlining, so it seems conceivable to handle this particular case in that way. However, I think in most natural cases produced by Roslyn (assuming this pattern mainly is produced for ternaries), we are going to see both a split and a join. E.g.:

[MethodImpl(MethodImplOptions.NoInlining)]
private static void Foo(bool b)
{
    S s;
    s.Opts = b ? 1 : 2;
}

public struct S
{
    public int Opts;
}

also ends up with the unfortunate address exposure, and in this case there is both a split and join. Allowing the propagation when there is a unique predecessor would change the IR from:

***** BB01
STMT00001 ( ??? ... 0x003 )
N002 (  7,  6) [000005] DA---------                         ▌  STORE_LCL_VAR byref  V03 tmp1                        // store of IL stack entries for clique BB01 -> BB02, BB03
N001 (  3,  3) [000000] -----------                         └──▌  LCL_ADDR  byref  V01 loc0         [+0]
                                                               ▌    int    V01.Program+S:Opts (offs=0x00) -> V06 tmp4         

***** BB01
STMT00000 ( 0x000[E-] ... 0x003 )
N005 (  7,  8) [000004] -----------                         ▌  JTRUE     void  
N004 (  5,  6) [000003] J------N---                         └──▌  NE        int   
N002 (  3,  4) [000024] -----------                            ├──▌  CAST      int <- bool <- int
N001 (  2,  2) [000001] -----------                            │  └──▌  LCL_VAR   int    V00 arg0         
N003 (  1,  1) [000002] -----------                            └──▌  CNS_INT   int    0

------------ BB02 [005..008) -> BB04 (always), preds={BB01} succs={BB04}

***** BB02
STMT00006 ( ??? ... 0x006 )
N002 (  7,  5) [000020] DA---------                         ▌  STORE_LCL_VAR byref  V04 tmp2         // store 1 of IL stack entries for clique BB02, BB03 -> BB04
N001 (  3,  2) [000007] -----------                         └──▌  LCL_VAR   byref  V03 tmp1         

***** BB02
STMT00007 ( ??? ... ??? )
N002 (  5,  4) [000022] DA---------                         ▌  STORE_LCL_VAR int    V05 tmp3         // store 2 of IL stack entries for clique BB02, BB03 -> BB04
N001 (  1,  1) [000019] -----------                         └──▌  CNS_INT   int    2

------------ BB03 [008..009), preds={BB01} succs={BB04}

***** BB03
STMT00002 ( ??? ... 0x008 )
N002 (  7,  5) [000010] DA---------                         ▌  STORE_LCL_VAR byref  V04 tmp2         // store 1 of IL stack entries for clique BB02, BB03 -> BB04
N001 (  3,  2) [000008] -----------                         └──▌  LCL_VAR   byref  V03 tmp1         

***** BB03
STMT00003 ( ??? ... ??? )
N002 (  5,  4) [000012] DA---------                         ▌  STORE_LCL_VAR int    V05 tmp3         // store 2 of IL stack entries for clique BB02, BB03 -> BB04
N001 (  1,  1) [000009] -----------                         └──▌  CNS_INT   int    1

------------ BB04 [009..00F) (return), preds={BB02,BB03} succs={}

***** BB04
STMT00004 ( ??? ... 0x009 )
N003 ( 10,  7) [000017] -A-XG------                         ▌  STOREIND  int   
N001 (  3,  2) [000014] -----------                         ├──▌  LCL_VAR   byref  V04 tmp2         
N002 (  3,  2) [000015] -----------                         └──▌  LCL_VAR   int    V05 tmp3         

to

***** BB01
STMT00000 ( 0x000[E-] ... 0x003 )
N005 (  7,  8) [000004] -----------                         ▌  JTRUE     void  
N004 (  5,  6) [000003] J------N---                         └──▌  NE        int   
N002 (  3,  4) [000024] -----------                            ├──▌  CAST      int <- bool <- int
N001 (  2,  2) [000001] -----------                            │  └──▌  LCL_VAR   int    V00 arg0         
N003 (  1,  1) [000002] -----------                            └──▌  CNS_INT   int    0

------------ BB02 [005..008) -> BB04 (always), preds={BB01} succs={BB04}

***** BB02
STMT00006 ( ??? ... 0x006 )
N002 (  7,  5) [000020] DA---------                         ▌  STORE_LCL_VAR byref  V04 tmp2         // new store of IL stack entries for clique BB02, BB03 -> BB04
N001 (  3,  3) [000000] -----------                         └──▌  LCL_ADDR  byref  V01 loc0         [+0]
                                                               ▌    int    V01.Program+S:Opts (offs=0x00) -> V06 tmp4        

***** BB02
STMT00007 ( ??? ... ??? )
N002 (  5,  4) [000022] DA---------                         ▌  STORE_LCL_VAR int    V05 tmp3         
N001 (  1,  1) [000019] -----------                         └──▌  CNS_INT   int    2

------------ BB03 [008..009), preds={BB01} succs={BB04}

***** BB03
STMT00002 ( ??? ... 0x008 )
N002 (  7,  5) [000010] DA---------                         ▌  STORE_LCL_VAR byref  V04 tmp2         // new store of IL stack entries for clique BB02, BB03 -> BB04
N001 (  3,  3) [000000] -----------                         └──▌  LCL_ADDR  byref  V01 loc0         [+0]
                                                               ▌    int    V01.Program+S:Opts (offs=0x00) -> V06 tmp4      

***** BB03
STMT00003 ( ??? ... ??? )
N002 (  5,  4) [000012] DA---------                         ▌  STORE_LCL_VAR int    V05 tmp3         
N001 (  1,  1) [000009] -----------                         └──▌  CNS_INT   int    1

------------ BB04 [009..00F) (return), preds={BB02,BB03} succs={}

***** BB04
STMT00004 ( ??? ... 0x009 )
N003 ( 10,  7) [000017] -A-XG------                         ▌  STOREIND  int   
N001 (  3,  2) [000014] -----------                         ├──▌  LCL_VAR   byref  V04 tmp2         
N002 (  3,  2) [000015] -----------                         └──▌  LCL_VAR   int    V05 tmp3         

We could potentially handle this when importing BB04 by checking the predecessor assignments into the spill temps. But we would need to try harder to import blocks in RPO as we don't import these in the right order to do this today (the import order is BB01 -> BB03 -> BB04 -> BB02).

AndyAyersMS commented 1 year ago

Seems like it would be useful here and in morph (and maybe in more phases) to build a general worklist driven algorithm for processing in RPO. The rough idea would be to first run a DFS to identify a DAG (or adopt some other DAG-identifying convention like all lexically backward edges are back edges) and then set up a priority queue (or similar) on blocks with the priority being the number of unvisited preds.

Then process all the priority 0 nodes. As each node finishes decrement the priority of each successor reached along a non-backedge, until the queue is empty. If the phase trims away an edge because of optimization, then also decrement the successor count (if a non-backedge).

If new blocks are added or flow is altered things get tricker. Not a problem for the importer but morph can do various flow edits so we'd need to be careful.

We would need a priority queue implementation that is not to allocation-happy, but we can likely use a free list for the internal data structures so can keep the allocation proportional to the number of blocks and perhaps amortize that over multiple traversals. Or since we really only care about nodes of priority zero, just keep a block->count map and whenever a block's count gets to zero, move it to the worklist.

jakobbotsch commented 1 year ago

I have a desire to fix this in 9.0 by introducing some limited flow-sensitive propagation of LCL_ADDR values as part of local morph. It would fix this issue and it would improve many cases where inlining can end up spilling LCL_ADDR nodes (e.g. #69254, many of the Matrix4x4 and Matrix3x2 methods).

AndyAyersMS commented 1 year ago

This could also be handled via tail duplication, basically move (and duplicate) the storeind backwards into its preds, based on the intersection of the variables it reads and the constants available for those locals in the preds. I am playing around with this idea in some other contexts...