AndyAyersMS commented 3 years ago

#43811 introduced a redundant branch optimization, further enhanced in #46237 and #46257.

There are still more improvements possible. Here's a partial list of ideas, issues, and improvements:

handle cases where the dominating compare is a different relop (eg x > 0 dominating x == 0).
- Old prototype relop implies relop
- Also perhaps leverage this logic in fgOptimizeUncondBranchToSimpleCond
- https://github.com/dotnet/runtime/issues/72509
- https://github.com/dotnet/runtime/pull/72979
- https://github.com/dotnet/runtime/pull/75804
handle cases where the dominating compare has different operands (eg x > 5 dominating x > 3)
- #35348
- https://github.com/dotnet/runtime/pull/95234
handle cases where the redundant compare block has side effects. This still seems beyond our reach, as it requires code duplication and so may entail rather extensive revisions of the SSA graph. In particular, if the block has non-PHI SSA defs we need to be able to find and update all downstream uses, as well as introducing necessary PHIs. If the block has SSA uses they should be rewritten via PHI disambiguation (which as noted below can fail for some preds). Etc.
- #36649
- #44040
- #46887
- #47920
- Some prep work in #60884
- Some progress towards costing/screening the IR to duplicate in https://github.com/dotnet/runtime/compare/main...AndyAyersMS:RboSideEffect
- Seems like in some (perhaps many) cases the side effect is a simple assignment and we can get rid of it by making forward sub more aggressive, say handling QMARK. We see this in Type.IsInstance, for instance.
- https://github.com/dotnet/runtime/pull/76476
handle cases where the dominated or dominating compare is an internal relop (one not feeding a GT_JTRUE)
- #61023
- #61275
handle cases where the relop or the dominated branch is implicit (bounds check, say). Note bounds check optimizations use conservative VNs (so that we will still be memory safe in the presence of races), while the entire redundant branch optimizer currently uses liberal VNs.
- https://github.com/dotnet/runtime/issues/12571
- or perhaps consider materializing bounds checks branches before running the optimizer. This is a more radical proposal but means we might not need specialized bounds-check specific optimizations.
use VNs more cleverly (instead of requiring identical liberal VNs, we can check whether the dominating exception VN set covers the dominated exception VN set and the normal liberal VNs match).
- https://github.com/dotnet/runtime/pull/68447
handle cases where the local compare input VN is a phi that can be destructured to reveal a known VN or a matching VN in a dominating compare
- see note below
- #37986
- #75987
- preparatory steps in
- https://github.com/dotnet/runtime/pull/76108
- https://github.com/dotnet/runtime/pull/76207
- https://github.com/dotnet/runtime/pull/76283
- Note I am also seeing cases now where we could do this optimization, but it is not possible to map from a PHI operand to the corresponding predecessor block, because we only record one PHI operand per SSA def, not one per predecessor. So, if two predecessors bring in the same SSA def they "share" a phi operand, but it only has room for one block number, and we lose the ability to reason about the other predecessor. Not clear how much impact this has or what a plausible fix might be (other than the obvious one of having one phi input per predecessor, which can cause its own headaches if there are huge join points and lots of SSA vars).
- #85546
handle cases where dominating compare is AND, OR (would guess NOT is already handled by VN?).
- See notes on https://github.com/dotnet/runtime/pull/62689.
- https://github.com/dotnet/runtime/pull/69291
run "multiple passes" as removing one redundancy can expose more
- https://github.com/dotnet/runtime/pull/70907
- generalize the mechanism for deciding which blocks to revisit
remove dependence of optJumpThread on the basic block epoch
- see note https://github.com/dotnet/runtime/pull/72440#discussion_r924682252
remove dependence of optReachable on the visited BB flags
- https://github.com/dotnet/runtime/issues/44341#issuecomment-1211340107
- https://github.com/dotnet/runtime/pull/75990
https://github.com/dotnet/runtime/issues/48609
Look more critically at the modelling of exceptions in jump threading; it seems like we are perhaps not sufficiently rigorous in the screening we do. It could be that the various exclusions in the Check method and elsewhere eliminate most of the places where an exception could sneak past...?
handle predicates in returns (see next item for one case of this)
handle cases where a dominating relop could be rewritten to make a dominated relop unnecessary. See #81220 for an example and some notes. A version of this is prototyped in #83859. Also see
- https://github.com/dotnet/runtime/issues/93708
- https://github.com/dotnet/runtime/issues/98227
Fix issue with duplicating reads: https://github.com/dotnet/runtime/pull/89710
Fix issue with unreachable preds: https://github.com/dotnet/runtime/pull/95556
Generalize jump threading since fallthroughs are no longer a constraint: https://github.com/dotnet/runtime/pull/97722
Extend jump threading to allow skipping back through empty preds, transitively (see https://github.com/dotnet/runtime/pull/98096#issuecomment-1932282106).
Extend jump threading to follow back through linear flow (see https://github.com/dotnet/runtime/issues/4324#issuecomment-2143535219)
Figure out how to avoid introducing impossible results in racing programs (see https://github.com/dotnet/runtime/issues/102158)
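
The first two items above ("x > 0 dominating x == 0" and "x > 5 dominating x > 3") can be illustrated with a small interval check. This is only a sketch under simplifying assumptions: both compares are a signed 32-bit local against a constant, and every name (`Relop`, `Range`, `rangeWhereTrue`, `impliedValue`) is hypothetical rather than taken from the JIT, which reasons over value numbers instead.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// All names here are hypothetical; the JIT reasons over VNs, not this toy model.
enum class Relop { EQ, NE, LT, LE, GT, GE };

// Closed interval [lo, hi] of int32 values, stored as int64 so c - 1 and c + 1 cannot overflow.
struct Range { std::int64_t lo; std::int64_t hi; };

// The relop that holds exactly when `op` does not.
Relop reverse(Relop op) {
    switch (op) {
        case Relop::EQ: return Relop::NE;
        case Relop::NE: return Relop::EQ;
        case Relop::LT: return Relop::GE;
        case Relop::LE: return Relop::GT;
        case Relop::GT: return Relop::LE;
        case Relop::GE: return Relop::LT;
    }
    return op;
}

// Values of x for which `x op c` holds, when that set is one interval (NE is not).
std::optional<Range> rangeWhereTrue(Relop op, std::int32_t c) {
    const std::int64_t lo = std::numeric_limits<std::int32_t>::min();
    const std::int64_t hi = std::numeric_limits<std::int32_t>::max();
    const std::int64_t k = c;
    switch (op) {
        case Relop::EQ: return Range{k, k};
        case Relop::LT: return Range{lo, k - 1};
        case Relop::LE: return Range{lo, k};
        case Relop::GT: return Range{k + 1, hi};
        case Relop::GE: return Range{k, hi};
        case Relop::NE: return std::nullopt;
    }
    return std::nullopt;
}

// Given that the dominating compare `x domOp domC` evaluated to domResult,
// returns the known value of `x op c`, or nullopt if it is not determined.
std::optional<bool> impliedValue(Relop domOp, std::int32_t domC, bool domResult,
                                 Relop op, std::int32_t c) {
    const std::optional<Range> r = rangeWhereTrue(domResult ? domOp : reverse(domOp), domC);
    if (!r || r->lo > r->hi) return std::nullopt;  // not an interval, or empty range
    const std::int64_t k = c;
    switch (op) {
        case Relop::EQ:
            if (r->lo == k && r->hi == k) return true;
            if (k < r->lo || k > r->hi) return false;
            break;
        case Relop::NE:
            if (k < r->lo || k > r->hi) return true;
            if (r->lo == k && r->hi == k) return false;
            break;
        case Relop::LT:
            if (r->hi < k) return true;
            if (r->lo >= k) return false;
            break;
        case Relop::LE:
            if (r->hi <= k) return true;
            if (r->lo > k) return false;
            break;
        case Relop::GT:
            if (r->lo > k) return true;
            if (r->hi <= k) return false;
            break;
        case Relop::GE:
            if (r->lo >= k) return true;
            if (r->hi < k) return false;
            break;
    }
    return std::nullopt;
}
```

With this model, `impliedValue(Relop::GT, 0, true, Relop::EQ, 0)` yields false and `impliedValue(Relop::GT, 5, true, Relop::GT, 3)` yields true, matching the two examples. Unsigned compares, non-constant operands, and the exception and race concerns raised elsewhere in the list are deliberately left out.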

category:cq theme:redundant-branches skill-level:expert cost:large impact:medium

AndyAyersMS commented 3 years ago

Phi-case mentioned above:

; Assembly listing for method System.Type:get_IsInterface():bool:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; fully interruptible
; PGO data available, but JitDisablePGO != 0
; Final local variable assignments
;
;  V00 this         [V00,T01] (  5,  4   )     ref  ->  rcx         this class-hnd
;  V01 loc0         [V01,T02] (  3,  2.50)     ref  ->  rax         class-hnd
;  V02 OutArgs      [V02    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V03 tmp1         [V03,T00] (  5,  6.75)     ref  ->  rax         class-hnd "spilling QMark2"
;
; Lcl frame size = 40

G_M8876_IG01:        ; gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, gcvars, byref, nogc <-- Prolog IG
       sub      rsp, 40
                        ;; bbWeight=1    PerfScore 0.25
G_M8876_IG02:        ; gcrefRegs=00000002 {rcx}, byrefRegs=00000000 {}, byref, isz
       ; gcrRegs +[rcx]
       mov      rax, rcx
       ; gcrRegs +[rax]
       test     rax, rax
       je       SHORT G_M8876_IG05
                        ;; bbWeight=1    PerfScore 1.50
G_M8876_IG03:        ; gcrefRegs=00000003 {rax rcx}, byrefRegs=00000000 {}, byref, isz
       mov      rdx, 0xD1FFAB1E
       cmp      qword ptr [rax], rdx
       je       SHORT G_M8876_IG05
                        ;; bbWeight=0.25 PerfScore 0.81
G_M8876_IG04:        ; gcrefRegs=00000002 {rcx}, byrefRegs=00000000 {}, byref
       ; gcrRegs -[rax]
       xor      rax, rax
       ; gcrRegs +[rax]
                        ;; bbWeight=0.12 PerfScore 0.03
G_M8876_IG05:        ; gcrefRegs=00000003 {rax rcx}, byrefRegs=00000000 {}, byref, isz
       test     rax, rax
       je       SHORT G_M8876_IG08
                        ;; bbWeight=1    PerfScore 1.25
G_M8876_IG06:        ; gcrefRegs=00000001 {rax}, byrefRegs=00000000 {}, byref
       ; gcrRegs -[rcx]
       mov      rcx, rax
       ; gcrRegs +[rcx]
                        ;; bbWeight=0.50 PerfScore 0.12
G_M8876_IG07:        ; , epilog, nogc, extend
       add      rsp, 40
       jmp      System.RuntimeTypeHandle:IsInterface()
       ; gcr arg pop 0
                        ;; bbWeight=0.50 PerfScore 1.12
G_M8876_IG08:        ; gcVars=0000000000000000 {}, gcrefRegs=00000002 {rcx}, byrefRegs=00000000 {}, gcvars, byref
       ; gcrRegs -[rax]
       mov      rax, qword ptr [rcx]
       mov      rax, qword ptr [rax+120]
       call     qword ptr [rax]hackishModuleName:hackishMethodName()
       ; gcrRegs -[rcx]
       ; gcr arg pop 0
       test     al, 32
       setne    al
       movzx    rax, al
                        ;; bbWeight=0.50 PerfScore 4.25
G_M8876_IG09:        ; , epilog, nogc, extend
       add      rsp, 40
       ret      
                        ;; bbWeight=0.50 PerfScore 0.62

The test in IG05 is redundant; there are 3 preds, two have EAX == null and the third EAX != null. But the local test VN is based on a PHI and so does not match any dominating VN.

If we were to chase the phi defs and substitute their VNs into the EQ, we'd find matching dominating compare VNs and would attempt jump threading. That would succeed, and all preds could bypass IG05 (targeting IG06 or IG08 as appropriate).

N001 [000020]   LCL_VAR   V03 tmp1         u:2 (last use) => $1c0 {PhiDef($3, $2, $143)}
N002 [000021]   CNS_INT   null => $VN.Null
N003 [000022]   EQ        => $103 {EQ($1c0, $0)}

***** BB04, STMT00002(after)
N004 (  5,  5) [000023] ------------              *  JTRUE     void  
N003 (  3,  3) [000022] J------N----              \--*  EQ        int    $103
N001 (  1,  1) [000020] ------------                 +--*  LCL_VAR   ref    V03 tmp1         u:2 (last use) $1c0
N002 (  1,  1) [000021] ------------                 \--*  CNS_INT   ref    null $VN.Null

Logically we could also envision this as a PHI-EQ interchange: instead of EQ(PHI(...)) we equivalently have PHI(EQ(...)), and those inner EQs are the ones with VN matches.
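
A rough sketch of that interchange, in C++ with hypothetical types (`Nullness`, `PhiArg`, `ThreadDecision` are illustrative, not JIT data structures). It assumes each phi operand is already tagged with the predecessor it flows from and with what a dominating compare says about it, and it evaluates `EQ(arg, null)` once per predecessor:

```cpp
#include <optional>
#include <vector>

// Hypothetical model: what is known about a phi operand on the edge from one predecessor.
enum class Nullness { Null, NonNull, Unknown };

struct PhiArg {
    int predBlock;   // predecessor supplying this operand (assumes one operand per pred)
    Nullness value;  // fact assumed to come from a dominating compare
};

struct ThreadDecision {
    int predBlock;
    std::optional<bool> compareResult;  // value of EQ(arg, null) on this edge, if known
};

// Rewrites EQ(PHI(a, b, ...), null) as PHI(EQ(a, null), EQ(b, null), ...)
// and evaluates each inner EQ independently.
std::vector<ThreadDecision> threadEqNullThroughPhi(const std::vector<PhiArg>& phiArgs) {
    std::vector<ThreadDecision> out;
    for (const PhiArg& a : phiArgs) {
        std::optional<bool> r;
        if (a.value == Nullness::Null) r = true;
        else if (a.value == Nullness::NonNull) r = false;
        out.push_back({a.predBlock, r});
    }
    return out;
}

// True when every predecessor can bypass the block holding the compare.
bool allPredsThreadable(const std::vector<ThreadDecision>& decisions) {
    for (const ThreadDecision& d : decisions) {
        if (!d.compareResult.has_value()) return false;
    }
    return true;
}
```

For the listing above, the three preds of IG05 would be modeled as two `Null` operands and one `NonNull` operand, so every pred gets a known compare result and can be retargeted to IG08 or IG06 respectively. The one-operand-per-predecessor assumption is exactly what the issue notes the JIT does not currently record.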

AndyAyersMS commented 3 years ago

Looking at handling side effects in jump threading.

Here's the distribution of costs for blocks that have jump threading across SPMI.

A cost limit of 20 (in units of GetCostSz) would get most occurrences.

Recall for various reasons we want to be able to run this optimization without introducing new basic blocks, if possible.

Seemingly in most cases we could get by with making just one copy of the code. Assuming no ambiguous preds, one set of preds could continue to target the current block and use the copy that's already there. For the other set we could put a copy at the start of the relevant successor block (if viable) or, failing that and if there's just one pred in that set, at the end of that pred (if viable). I haven't done this bit of the screening yet, so it may prove overly limiting. For example, in the below we duplicate side effect S and add a copy to C.

We could remove the (if viable) by ensuring that we split all critical edges before running the optimizer; then we would always be able to place the copy before the appropriate successor.
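
For reference, critical edge splitting on a toy flow graph might look like the following. This is a sketch only: the `Cfg` adjacency-list representation and function names are assumptions for illustration and do not correspond to the JIT's flow graph API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CFG: succs[b] lists the successor block indices of block b.
using Cfg = std::vector<std::vector<int>>;

// Number of incoming edges for each block.
std::vector<int> predCounts(const Cfg& succs) {
    std::vector<int> counts(succs.size(), 0);
    for (const std::vector<int>& targets : succs) {
        for (int t : targets) ++counts[t];
    }
    return counts;
}

// An edge is critical when its source has several successors and its target
// has several predecessors. Each such edge gets a fresh empty block placed on
// it, so code can later be added "before the successor" for that edge alone.
// Returns the number of edges split.
int splitCriticalEdges(Cfg& succs) {
    const std::vector<int> preds = predCounts(succs);
    const std::size_t originalCount = succs.size();
    int split = 0;
    for (std::size_t b = 0; b < originalCount; ++b) {
        if (succs[b].size() < 2) continue;
        for (std::size_t i = 0; i < succs[b].size(); ++i) {
            const int target = succs[b][i];
            if (preds[target] < 2) continue;
            const int fresh = static_cast<int>(succs.size());
            succs.push_back({target});  // new block falls through to the old target
            succs[b][i] = fresh;        // indexed access stays valid after push_back
            ++split;
        }
    }
    return split;
}
```

After splitting, every successor of a multi-way branch has that branch as its only predecessor, which is the property that makes placing the copy at the start of the successor always possible. The cost is the new blocks themselves, which has to be weighed against the stated preference for running the optimization without adding blocks.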

EgorBo commented 2 years ago

Another example of "handle cases where the redundant compare block has side effects":

static bool Test(string s)
{
    if (s == null)
    {
        Console.WriteLine("11");
    }

    Console.WriteLine("22");

    if (s == null)
    {
        Console.WriteLine("33");
    }
    return true;
}

Console.WriteLine("22") is considered a side effect and breaks jump threading.

dotnet / runtime

Redundant Branch Opts Enhancements #48115
