dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.27k stars 4.73k forks source link

Redundant Branch Opts Enhancements #48115

Open AndyAyersMS opened 3 years ago

AndyAyersMS commented 3 years ago

43811 introduced a redundant branch optimization, further enhanced in #46237 and #46257.

There are still more improvements possible. Here's a partial list of ideas, issues, and improvements:

category:cq theme:redundant-branches skill-level:expert cost:large impact:medium

AndyAyersMS commented 3 years ago

Phi-case mentioned above:

; Assembly listing for method System.Type:get_IsInterface():bool:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; fully interruptible
; PGO data available, but JitDisablePGO != 0
; Final local variable assignments
;
;  V00 this         [V00,T01] (  5,  4   )     ref  ->  rcx         this class-hnd
;  V01 loc0         [V01,T02] (  3,  2.50)     ref  ->  rax         class-hnd
;  V02 OutArgs      [V02    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V03 tmp1         [V03,T00] (  5,  6.75)     ref  ->  rax         class-hnd "spilling QMark2"
;
; Lcl frame size = 40

G_M8876_IG01:        ; gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, gcvars, byref, nogc <-- Prolog IG
       sub      rsp, 40
                        ;; bbWeight=1    PerfScore 0.25
G_M8876_IG02:        ; gcrefRegs=00000002 {rcx}, byrefRegs=00000000 {}, byref, isz
       ; gcrRegs +[rcx]
       mov      rax, rcx
       ; gcrRegs +[rax]
       test     rax, rax
       je       SHORT G_M8876_IG05
                        ;; bbWeight=1    PerfScore 1.50
G_M8876_IG03:        ; gcrefRegs=00000003 {rax rcx}, byrefRegs=00000000 {}, byref, isz
       mov      rdx, 0xD1FFAB1E
       cmp      qword ptr [rax], rdx
       je       SHORT G_M8876_IG05
                        ;; bbWeight=0.25 PerfScore 0.81
G_M8876_IG04:        ; gcrefRegs=00000002 {rcx}, byrefRegs=00000000 {}, byref
       ; gcrRegs -[rax]
       xor      rax, rax
       ; gcrRegs +[rax]
                        ;; bbWeight=0.12 PerfScore 0.03
G_M8876_IG05:        ; gcrefRegs=00000003 {rax rcx}, byrefRegs=00000000 {}, byref, isz
       test     rax, rax
       je       SHORT G_M8876_IG08
                        ;; bbWeight=1    PerfScore 1.25
G_M8876_IG06:        ; gcrefRegs=00000001 {rax}, byrefRegs=00000000 {}, byref
       ; gcrRegs -[rcx]
       mov      rcx, rax
       ; gcrRegs +[rcx]
                        ;; bbWeight=0.50 PerfScore 0.12
G_M8876_IG07:        ; , epilog, nogc, extend
       add      rsp, 40
       jmp      System.RuntimeTypeHandle:IsInterface()
       ; gcr arg pop 0
                        ;; bbWeight=0.50 PerfScore 1.12
G_M8876_IG08:        ; gcVars=0000000000000000 {}, gcrefRegs=00000002 {rcx}, byrefRegs=00000000 {}, gcvars, byref
       ; gcrRegs -[rax]
       mov      rax, qword ptr [rcx]
       mov      rax, qword ptr [rax+120]
       call     qword ptr [rax]hackishModuleName:hackishMethodName()
       ; gcrRegs -[rcx]
       ; gcr arg pop 0
       test     al, 32
       setne    al
       movzx    rax, al
                        ;; bbWeight=0.50 PerfScore 4.25
G_M8876_IG09:        ; , epilog, nogc, extend
       add      rsp, 40
       ret      
                        ;; bbWeight=0.50 PerfScore 0.62

The test in IG05 is redundant; there are 3 preds, two have EAX == null and the third EAX != null. But the local test VN is based on a PHI and so does not match any dominating VN.

If we were to chase the phi defs we'd find that substituting those defs the VNs into the EQ we'd find matching dominating compare VNs and would attempt jump threading, which would succeed and all preds could bypass IG05 (targeting IG06 or IG08 as appropriate).

N001 [000020]   LCL_VAR   V03 tmp1         u:2 (last use) => $1c0 {PhiDef($3, $2, $143)}
N002 [000021]   CNS_INT   null => $VN.Null
N003 [000022]   EQ        => $103 {EQ($1c0, $0)}

***** BB04, STMT00002(after)
N004 (  5,  5) [000023] ------------              *  JTRUE     void  
N003 (  3,  3) [000022] J------N----              \--*  EQ        int    $103
N001 (  1,  1) [000020] ------------                 +--*  LCL_VAR   ref    V03 tmp1         u:2 (last use) $1c0
N002 (  1,  1) [000021] ------------                 \--*  CNS_INT   ref    null $VN.Null

Logically we could also envision this as a a PHI-EQ interchange, that is instead of (EQ(PHI(...)) we equivalently have PHI(EQ(...)) and those inner EQs are the ones with VN matches.

AndyAyersMS commented 3 years ago

Looking at handling side effects in jump threading.

Here's the distribution of costs for blocks that have jump threading across SPMI.

image

A cost limit of 20 (in units of GetCostSz) would get most occurrences.

Recall for various reasons we want to be able to run this optimization without introducing new basic blocks, if possible.

Seemingly in most cases we could get by with making just one copy of the code. Assuming no ambiguous preds, then one set of preds could continue to target the current block and use the copy that's already there; for the other set we could put a copy at the start of the relevant successor block (if viable) or if not, if there's just one pred in that set, at the end of that pred (if viable). I haven't done this bit of the screening yet so it may prove overly limiting. For example in the below we duplicate side effect S and add a copy to the C.

We could remove the (if viable) by ensuring that we split all critical edges before running the optimizer; then we would always be able to place the copy before the appropriate successor.

EgorBo commented 2 years ago

Another example of "handle cases where the redundant compare block has side effects":

static bool Test(string s)
{
    if (s == null)
    {
        Console.WriteLine("11");
    }

    Console.WriteLine("22");

    if (s == null)
    {
        Console.WriteLine("33");
    }
    return true;
}

Console.WriteLine("22") is considered as a side-effect and brakes jump-threading