Open AndyAyersMS opened 3 years ago
Phi-case mentioned above:
; Assembly listing for method System.Type:get_IsInterface():bool:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; fully interruptible
; PGO data available, but JitDisablePGO != 0
; Final local variable assignments
;
; V00 this [V00,T01] ( 5, 4 ) ref -> rcx this class-hnd
; V01 loc0 [V01,T02] ( 3, 2.50) ref -> rax class-hnd
; V02 OutArgs [V02 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
; V03 tmp1 [V03,T00] ( 5, 6.75) ref -> rax class-hnd "spilling QMark2"
;
; Lcl frame size = 40
G_M8876_IG01: ; gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, gcvars, byref, nogc <-- Prolog IG
sub rsp, 40
;; bbWeight=1 PerfScore 0.25
G_M8876_IG02: ; gcrefRegs=00000002 {rcx}, byrefRegs=00000000 {}, byref, isz
; gcrRegs +[rcx]
mov rax, rcx
; gcrRegs +[rax]
test rax, rax
je SHORT G_M8876_IG05
;; bbWeight=1 PerfScore 1.50
G_M8876_IG03: ; gcrefRegs=00000003 {rax rcx}, byrefRegs=00000000 {}, byref, isz
mov rdx, 0xD1FFAB1E
cmp qword ptr [rax], rdx
je SHORT G_M8876_IG05
;; bbWeight=0.25 PerfScore 0.81
G_M8876_IG04: ; gcrefRegs=00000002 {rcx}, byrefRegs=00000000 {}, byref
; gcrRegs -[rax]
xor rax, rax
; gcrRegs +[rax]
;; bbWeight=0.12 PerfScore 0.03
G_M8876_IG05: ; gcrefRegs=00000003 {rax rcx}, byrefRegs=00000000 {}, byref, isz
test rax, rax
je SHORT G_M8876_IG08
;; bbWeight=1 PerfScore 1.25
G_M8876_IG06: ; gcrefRegs=00000001 {rax}, byrefRegs=00000000 {}, byref
; gcrRegs -[rcx]
mov rcx, rax
; gcrRegs +[rcx]
;; bbWeight=0.50 PerfScore 0.12
G_M8876_IG07: ; , epilog, nogc, extend
add rsp, 40
jmp System.RuntimeTypeHandle:IsInterface()
; gcr arg pop 0
;; bbWeight=0.50 PerfScore 1.12
G_M8876_IG08: ; gcVars=0000000000000000 {}, gcrefRegs=00000002 {rcx}, byrefRegs=00000000 {}, gcvars, byref
; gcrRegs -[rax]
mov rax, qword ptr [rcx]
mov rax, qword ptr [rax+120]
call qword ptr [rax]hackishModuleName:hackishMethodName()
; gcrRegs -[rcx]
; gcr arg pop 0
test al, 32
setne al
movzx rax, al
;; bbWeight=0.50 PerfScore 4.25
G_M8876_IG09: ; , epilog, nogc, extend
add rsp, 40
ret
;; bbWeight=0.50 PerfScore 0.62
The test in IG05 is redundant; there are 3 preds, two have EAX == null
and the third EAX != null
. But the local test VN is based on a PHI and so does not match any dominating VN.
If we were to chase the phi defs we'd find that substituting those defs the VNs into the EQ
we'd find matching dominating compare VNs and would attempt jump threading, which would succeed and all preds could bypass IG05 (targeting IG06 or IG08 as appropriate).
N001 [000020] LCL_VAR V03 tmp1 u:2 (last use) => $1c0 {PhiDef($3, $2, $143)}
N002 [000021] CNS_INT null => $VN.Null
N003 [000022] EQ => $103 {EQ($1c0, $0)}
***** BB04, STMT00002(after)
N004 ( 5, 5) [000023] ------------ * JTRUE void
N003 ( 3, 3) [000022] J------N---- \--* EQ int $103
N001 ( 1, 1) [000020] ------------ +--* LCL_VAR ref V03 tmp1 u:2 (last use) $1c0
N002 ( 1, 1) [000021] ------------ \--* CNS_INT ref null $VN.Null
Logically we could also envision this as a a PHI-EQ interchange, that is instead of (EQ(PHI(...)) we equivalently have PHI(EQ(...)) and those inner EQs are the ones with VN matches.
Looking at handling side effects in jump threading.
Here's the distribution of costs for blocks that have jump threading across SPMI.
A cost limit of 20 (in units of GetCostSz
) would get most occurrences.
Recall for various reasons we want to be able to run this optimization without introducing new basic blocks, if possible.
Seemingly in most cases we could get by with making just one copy of the code. Assuming no ambiguous preds, then one set of preds could continue to target the current block and use the copy that's already there; for the other set we could put a copy at the start of the relevant successor block (if viable) or if not, if there's just one pred in that set, at the end of that pred (if viable). I haven't done this bit of the screening yet so it may prove overly limiting. For example in the below we duplicate side effect S
and add a copy to the C
.
We could remove the (if viable)
by ensuring that we split all critical edges before running the optimizer; then we would always be able to place the copy before the appropriate successor.
Another example of "handle cases where the redundant compare block has side effects":
static bool Test(string s)
{
if (s == null)
{
Console.WriteLine("11");
}
Console.WriteLine("22");
if (s == null)
{
Console.WriteLine("33");
}
return true;
}
Console.WriteLine("22")
is considered as a side-effect and brakes jump-threading
43811 introduced a redundant branch optimization, further enhanced in #46237 and #46257.
There are still more improvements possible. Here's a partial list of ideas, issues, and improvements:
x > 0
dominatingx == 0
).fgOptimizeUncondBranchToSimpleCond
x > 5
dominatingx > 3
)35348
36649
44040
46887
47920
QMARK
. We see this inType.IsInstance
, for instance.61023
61275
37986
75987
85546
AND
,OR
(would guessNOT
is already handled by VN?).optJumpThread
on the basic block epochoptReachable
on the visited BB flagsCheck
method and elsewhere eliminate most of the places where an exception could sneak past...?category:cq theme:redundant-branches skill-level:expert cost:large impact:medium