Open AndyAyersMS opened 6 years ago
For SumFor:
; Assembly listing for method P:SumFor(ref):int
G_M43955_IG01:
4883EC28 sub rsp, 40
G_M43955_IG02:
33C0 xor eax, eax // result
33D2 xor edx, edx // i
448B4118 mov r8d, dword ptr [rcx+24] // _size
4585C0 test r8d, r8d
7E1E jle SHORT G_M43955_IG05
G_M43955_IG03:
413BD0 cmp edx, r8d
731E jae SHORT G_M43955_IG06
G_M43955_IG04:
4C8B4908 mov r9, gword ptr [rcx+8] // _items
413B5108 cmp edx, dword ptr [r9+8] // array len
731A jae SHORT G_M43955_IG07
4C63D2 movsxd r10, edx
4303449110 add eax, dword ptr [r9+4*r10+16]
FFC2 inc edx
413BD0 cmp edx, r8d
7CE2 jl SHORT G_M43955_IG03
G_M43955_IG05:
4883C428 add rsp, 40
C3 ret
G_M43955_IG06:
E85758A35E call System.ThrowHelper:ThrowArgumentOutOfRange_IndexException()
CC int3
G_M43955_IG07:
E831D8365F call CORINFO_HELP_RNGCHKFAIL
CC int3
IG03 is a critical block (fork/join). The join limits the jit's ability to reason about what happens in this block.
When reaching IG03 from IG02: r8d > 0, edx = 0. So edx < r8d. So the branch to IG06 is not taken.
When reaching IG03 from IG04: edx < r8d. So the branch to IG06 is not taken.
It might be simpler for the jit to reason about IG03 if it simply cloned it to push the join down to IG04.
IG03's branch makes IG04 "conditionally executed" in the loop. So the jit won't consider hoisting invariants like "mov r9, gword ptr [rcx+8]".
There would be more value in inlining MoveNext; if localList._size could be compared to localList._items.Length to prove the loop would never go out of bounds (and for that check to then be hoisted).
However, I think range check elimination currently only works when comparing against .Length, not against a value tested to be <= .Length, e.g. _size.
https://github.com/dotnet/coreclr/pull/14030#issuecomment-331048497
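As a rough illustration of the two checks discussed above, here is a small Python model. It is not C# and says nothing about how the jit represents this code. IntList, sum_for_checked and sum_for_hoisted are hypothetical names, and the model assumes only what the annotated listing suggests: a backing array _items and a logical _size. The first function performs the indexer check against _size and the array bounds check on every iteration, as the listing does. The second shows the shape that becomes possible if _size <= _items.Length is established once, before the loop.

```python
class IntList:
    """Hypothetical stand-in for List<int>: backing array plus logical size."""

    def __init__(self, values, capacity=0):
        values = list(values)
        capacity = max(capacity, len(values))
        # The backing array may be longer than the logical size.
        self._items = values + [0] * (capacity - len(values))
        self._size = len(values)


def sum_for_checked(lst):
    """Both checks run on every iteration, mirroring IG03 and IG04."""
    result = 0
    size = lst._size          # read once, like r8d in the listing
    i = 0
    while i < size:
        # Indexer check against _size (IG03). Redundant: i < size holds here.
        if not (0 <= i < size):
            raise IndexError("index out of range")
        items = lst._items    # reloaded each iteration in the listing
        # Array bounds check against the array length (IG04).
        if not (0 <= i < len(items)):
            raise IndexError("array index out of bounds")
        result += items[i]
        i += 1
    return result


def sum_for_hoisted(lst):
    """Assumed ideal shape: one size-vs-length test before the loop."""
    size = lst._size
    items = lst._items
    if size > len(items):
        raise IndexError("array index out of bounds")
    result = 0
    for i in range(size):
        result += items[i]    # no per-iteration check needed in this model
    return result
```

Both functions return the same sum for a well-formed list. They differ only in where the failure is detected when _size exceeds the array length: the hoisted version fails before doing any work, which is safe here because the loop body has no other side effects.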
For SumForeach:
; method P:SumFor(ref):int
; MoveNext* with AggressiveInline
G_M41030_IG01:
4883EC28 sub rsp, 40
G_M41030_IG02:
33C0 xor eax, eax // i
8B11 mov edx, dword ptr [rcx]
488BD1 mov rdx, rcx
8B4A1C mov ecx, dword ptr [rdx+28] // _version
448BC1 mov r8d, ecx
448BC8 mov r9d, eax
G_M41030_IG03:
EB03 jmp SHORT G_M41030_IG05
G_M41030_IG04:
4103C2 add eax, r10d
G_M41030_IG05:
443BC1 cmp r8d, ecx
7523 jne SHORT G_M41030_IG06
443B4A18 cmp r9d, dword ptr [rdx+24] // _index
731D jae SHORT G_M41030_IG06
4C8B5208 mov r10, gword ptr [rdx+8] // _list
453B4A08 cmp r9d, dword ptr [r10+8]
7335 jae SHORT G_M41030_IG11
4D63D9 movsxd r11, r9d
478B549A10 mov r10d, dword ptr [r10+4*r11+16]
41FFC1 inc r9d
41BB01000000 mov r11d, 1
EB12 jmp SHORT G_M41030_IG08
G_M41030_IG06:
443BC1 cmp r8d, ecx
7517 jne SHORT G_M41030_IG10
G_M41030_IG07:
448B4A18 mov r9d, dword ptr [rdx+24]
41FFC1 inc r9d
4533D2 xor r10d, r10d
4533DB xor r11d, r11d
G_M41030_IG08:
4585DB test r11d, r11d
75BE jne SHORT G_M41030_IG04
G_M41030_IG09:
4883C428 add rsp, 40
C3 ret
G_M41030_IG10:
E87E5FA35E call System.ThrowHelper:ThrowInvalidOperationException_InvalidOperation_EnumFailedVersion()
CC int3
G_M41030_IG11:
E8A8D7365F call CORINFO_HELP_RNGCHKFAIL
CC int3
IG08 is also a critical block (testing the result of MoveNext):
When reached from IG07, r11d is 0, so the jump is never taken.
When reached from IG05, r11d is 1, so the jump is always taken.
By cloning this block we can streamline both branches.
The jit already realizes that the compare and branch at the top of IG05 and of IG06 have a known outcome but doesn't act on it for some reason (perhaps because we're in SSA and don't want to mess with control flow?):
optValnumCSE morphed tree:
N004 ( 5, 5) [000115] ----G------- * JTRUE void
N002 ( 1, 1) [000310] ------------ | /--* LCL_VAR int V17 cse0 <l:$281, c:$103>
N003 ( 3, 3) [000114] N---G--N-U-- \--* NE int <l:$40, c:$284>
N001 ( 1, 1) [000111] ------------ \--* LCL_VAR int V09 tmp5 u:3 <l:$281, c:$103>
optValnumCSE morphed tree:
N004 ( 5, 5) [000179] ----G------- * JTRUE void
N002 ( 1, 1) [000311] ------------ | /--* LCL_VAR int V17 cse0 <l:$281, c:$103>
N003 ( 3, 3) [000178] J---G--N---- \--* NE int <l:$40, c:$289>
N001 ( 1, 1) [000173] ------------ \--* LCL_VAR int V09 tmp5 u:3 <l:$281, c:$103>
Note that the two operands have the same value numbers.
Fixing all that would leave us with an inner loop something like:
G_M41030_IG04:
4103C2 add eax, r10d
G_M41030_IG05:
443B4A18 cmp r9d, dword ptr [rdx+24] // _index
731D jae SHORT G_M41030_IG10
4C8B5208 mov r10, gword ptr [rdx+8] // _list
453B4A08 cmp r9d, dword ptr [r10+8]
7335 jae SHORT G_M41030_IG11
4D63D9 movsxd r11, r9d
478B549A10 mov r10d, dword ptr [r10+4*r11+16]
41FFC1 inc r9d
EB12 jmp SHORT G_M41030_IG04
where we then might be able to hoist some of the loads off of rdx, and we'd have code that was close to the for version.
Codegen today:
; Assembly listing for method Program:SumFor(System.Collections.Generic.List`1[int]):int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 1 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T04] ( 4, 7 ) ref -> rcx class-hnd single-def <System.Collections.Generic.List`1[int]>
; V01 loc0 [V01,T02] ( 4, 10 ) int -> rax
;* V02 loc1 [V02,T05] ( 0, 0 ) int -> zero-ref
; V03 OutArgs [V03 ] ( 1, 1 ) struct (32) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
; V04 tmp1 [V04,T01] ( 3, 24 ) ref -> r10 "arr expr"
; V05 cse0 [V05,T03] ( 4, 10 ) int -> r8 "CSE #01: aggressive"
; V06 rat0 [V06,T00] ( 7, 25 ) long -> rdx "Widened IV V02"
;
; Lcl frame size = 40
G_M46421_IG01: ;; offset=0x0000
sub rsp, 40
;; size=4 bbWeight=1 PerfScore 0.25
G_M46421_IG02: ;; offset=0x0004
xor eax, eax
xor edx, edx
mov r8d, dword ptr [rcx+0x10]
test r8d, r8d
jle SHORT G_M46421_IG04
align [15 bytes for IG03]
;; size=28 bbWeight=1 PerfScore 4.00
G_M46421_IG03: ;; offset=0x0020
cmp edx, r8d
jae SHORT G_M46421_IG05
mov r10, gword ptr [rcx+0x08]
cmp edx, dword ptr [r10+0x08]
jae SHORT G_M46421_IG06
add eax, dword ptr [r10+4*rdx+0x10]
inc edx
cmp edx, r8d
jl SHORT G_M46421_IG03
;; size=27 bbWeight=4 PerfScore 47.00
G_M46421_IG04: ;; offset=0x003B
add rsp, 40
ret
;; size=5 bbWeight=1 PerfScore 1.25
G_M46421_IG05: ;; offset=0x0040
call [System.ThrowHelper:ThrowArgumentOutOfRange_IndexMustBeLessException()]
int3
;; size=7 bbWeight=0 PerfScore 0.00
G_M46421_IG06: ;; offset=0x0047
call CORINFO_HELP_RNGCHKFAIL
int3
;; size=6 bbWeight=0 PerfScore 0.00
; Total bytes of code 77, prolog size 4, PerfScore 52.50, instruction count 22, allocated bytes for code 77 (MethodHash=44f14aaa) for method Program:SumFor(System.Collections.Generic.List`1[int]):int (FullOpts)
; ============================================================
; Assembly listing for method Program:SumForeach(System.Collections.Generic.List`1[int]):int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 2 inlinees with PGO data; 4 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T07] ( 3, 3 ) ref -> rcx class-hnd single-def <System.Collections.Generic.List`1[int]>
; V01 loc0 [V01,T04] ( 4, 12.27) int -> rax
;* V02 loc1 [V02 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op <System.Collections.Generic.List`1+Enumerator[int]>
;* V03 loc2 [V03 ] ( 0, 0 ) int -> zero-ref
; V04 OutArgs [V04 ] ( 1, 1 ) struct (32) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V05 tmp1 [V05 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "NewObj constructor temp" <System.Collections.Generic.List`1+Enumerator[int]>
;* V06 tmp2 [V06,T06] ( 0, 0 ) ubyte -> zero-ref "Inline return value spill temp"
;* V07 tmp3 [V07 ] ( 0, 0 ) ref -> zero-ref class-hnd "Inline stloc first use temp" <System.Collections.Generic.List`1[int]>
; V08 tmp4 [V08,T03] ( 3, 12.63) ref -> rcx single-def "field V02._list (fldOffset=0x0)" P-INDEP
; V09 tmp5 [V09,T00] ( 6, 23.54) int -> rdx "field V02._index (fldOffset=0x8)" P-INDEP
;* V10 tmp6 [V10,T11] ( 0, 0 ) int -> zero-ref single-def "field V02._version (fldOffset=0xc)" P-INDEP
; V11 tmp7 [V11,T05] ( 2, 7.27) int -> r8 "field V02._current (fldOffset=0x10)" P-INDEP
; V12 tmp8 [V12,T08] ( 3, 3 ) ref -> rcx single-def "field V05._list (fldOffset=0x0)" P-INDEP
;* V13 tmp9 [V13,T12] ( 0, 0 ) int -> zero-ref single-def "field V05._index (fldOffset=0x8)" P-INDEP
;* V14 tmp10 [V14,T09] ( 0, 0 ) int -> zero-ref single-def "field V05._version (fldOffset=0xc)" P-INDEP
;* V15 tmp11 [V15 ] ( 0, 0 ) int -> zero-ref single-def "field V05._current (fldOffset=0x10)" P-INDEP
; V16 tmp12 [V16,T01] ( 3, 21.81) ref -> r8 "arr expr"
;* V17 cse0 [V17,T10] ( 0, 0 ) int -> zero-ref "CSE #01: aggressive"
; V18 cse1 [V18,T02] ( 2, 16 ) int -> r8 "CSE #02: aggressive"
;
; Lcl frame size = 40
G_M40154_IG01: ;; offset=0x0000
sub rsp, 40
;; size=4 bbWeight=1 PerfScore 0.25
G_M40154_IG02: ;; offset=0x0004
xor eax, eax
mov edx, dword ptr [rcx+0x14]
xor edx, edx
jmp SHORT G_M40154_IG04
align [0 bytes for IG03]
;; size=9 bbWeight=1 PerfScore 4.50
G_M40154_IG03: ;; offset=0x000D
add eax, r8d
;; size=3 bbWeight=3.63 PerfScore 0.91
G_M40154_IG04: ;; offset=0x0010
mov r8d, dword ptr [rcx+0x10]
cmp edx, r8d
jae SHORT G_M40154_IG06
;; size=9 bbWeight=8 PerfScore 26.00
G_M40154_IG05: ;; offset=0x0019
mov r8, gword ptr [rcx+0x08]
cmp edx, dword ptr [r8+0x08]
jae SHORT G_M40154_IG07
mov r10d, edx
mov r8d, dword ptr [r8+4*r10+0x10]
inc edx
jmp SHORT G_M40154_IG03
;; size=22 bbWeight=3.63 PerfScore 38.16
G_M40154_IG06: ;; offset=0x002F
add rsp, 40
ret
;; size=5 bbWeight=4 PerfScore 5.00
G_M40154_IG07: ;; offset=0x0034
call CORINFO_HELP_RNGCHKFAIL
int3
;; size=6 bbWeight=0 PerfScore 0.00
; Total bytes of code 58, prolog size 4, PerfScore 74.82, instruction count 21, allocated bytes for code 58 (MethodHash=ae626325) for method Program:SumForeach(System.Collections.Generic.List`1[int]):int (FullOpts)
; ============================================================
Looks much better today, but there still seems to be a redundant compare in SumFor and some quite suboptimal block layout in SumForeach (might be an interesting case to look at for block layout, cc @amanasifkhalid as well).
We also seemingly don't manage to widen the IV for the foreach version. I opened #101176 for this.
With DOTNET_JitDoReversePostOrderLayout=1, SumFor is unchanged, and SumForeach is marginally better:
; Assembly listing for method Program:SumForeach(System.Collections.Generic.List`1[int]):int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 2 inlinees with PGO data; 4 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T07] ( 3, 3 ) ref -> rcx class-hnd single-def <System.Collections.Generic.List`1[int]>
; V01 loc0 [V01,T04] ( 4, 12.27) int -> rax
;* V02 loc1 [V02 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op <System.Collections.Generic.List`1+Enumerator[int]>
;* V03 loc2 [V03 ] ( 0, 0 ) int -> zero-ref
; V04 OutArgs [V04 ] ( 1, 1 ) struct (32) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V05 tmp1 [V05 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "NewObj constructor temp" <System.Collections.Generic.List`1+Enumerator[int]>
;* V06 tmp2 [V06,T06] ( 0, 0 ) ubyte -> zero-ref "Inline return value spill temp"
;* V07 tmp3 [V07 ] ( 0, 0 ) ref -> zero-ref class-hnd "Inline stloc first use temp" <System.Collections.Generic.List`1[int]>
; V08 tmp4 [V08,T03] ( 3, 12.63) ref -> rcx single-def "field V02._list (fldOffset=0x0)" P-INDEP
; V09 tmp5 [V09,T00] ( 6, 23.54) int -> rdx "field V02._index (fldOffset=0x8)" P-INDEP
;* V10 tmp6 [V10,T11] ( 0, 0 ) int -> zero-ref single-def "field V02._version (fldOffset=0xc)" P-INDEP
; V11 tmp7 [V11,T05] ( 2, 7.27) int -> r8 "field V02._current (fldOffset=0x10)" P-INDEP
; V12 tmp8 [V12,T08] ( 3, 3 ) ref -> rcx single-def "field V05._list (fldOffset=0x0)" P-INDEP
;* V13 tmp9 [V13,T12] ( 0, 0 ) int -> zero-ref single-def "field V05._index (fldOffset=0x8)" P-INDEP
;* V14 tmp10 [V14,T09] ( 0, 0 ) int -> zero-ref single-def "field V05._version (fldOffset=0xc)" P-INDEP
;* V15 tmp11 [V15 ] ( 0, 0 ) int -> zero-ref single-def "field V05._current (fldOffset=0x10)" P-INDEP
; V16 tmp12 [V16,T01] ( 3, 21.81) ref -> r8 "arr expr"
;* V17 cse0 [V17,T10] ( 0, 0 ) int -> zero-ref "CSE #01: aggressive"
; V18 cse1 [V18,T02] ( 2, 16 ) int -> r8 "CSE #02: aggressive"
;
; Lcl frame size = 40
G_M40154_IG01: ;; offset=0x0000
sub rsp, 40
;; size=4 bbWeight=1 PerfScore 0.25
G_M40154_IG02: ;; offset=0x0004
xor eax, eax
mov edx, dword ptr [rcx+0x14]
xor edx, edx
align [0 bytes for IG03]
;; size=7 bbWeight=1 PerfScore 2.50
G_M40154_IG03: ;; offset=0x000B
mov r8d, dword ptr [rcx+0x10]
cmp edx, r8d
jae SHORT G_M40154_IG05
;; size=9 bbWeight=8 PerfScore 26.00
G_M40154_IG04: ;; offset=0x0014
mov r8, gword ptr [rcx+0x08]
cmp edx, dword ptr [r8+0x08]
jae SHORT G_M40154_IG06
mov r10d, edx
mov r8d, dword ptr [r8+4*r10+0x10]
inc edx
add eax, r8d
jmp SHORT G_M40154_IG03
;; size=25 bbWeight=3.63 PerfScore 39.07
G_M40154_IG05: ;; offset=0x002D
add rsp, 40
ret
;; size=5 bbWeight=4 PerfScore 5.00
G_M40154_IG06: ;; offset=0x0032
call CORINFO_HELP_RNGCHKFAIL
int3
;; size=6 bbWeight=0 PerfScore 0.00
; Total bytes of code 56, prolog size 4, PerfScore 72.82, instruction count 20, allocated bytes for code 56 (MethodHash=ae626325) for method Program:SumForeach(System.Collections.Generic.List`1[int]):int (FullOpts)
; ============================================================
Looking at the JIT dump, this looks like another case of the loop shape issue mentioned in #102343:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight IBC [IL range] [jump] [EH region] [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000] 1 1 [000..009)-> BB02(1) (always) i nullcheck
BB02 [0001] 1 BB01 1 [009..00B)-> BB04(1) (always) i keep
BB04 [0003] 2 BB02,BB03 8 800 [017..018)-> BB07(0.0914),BB06(0.909) ( cond ) i IBC bwd bwd-src
BB06 [0014] 1 BB04 3.63 363 [017..018)-> BB03(1) (always) i IBC idxlen bwd
BB03 [0002] 1 BB06 3.63 363 [00B..017)-> BB04(1) (always) i IBC loophead bwd bwd-target
BB07 [0015] 1 BB04 4 400 [017..032) (return) i IBC bwd
BB11 [0022] 0 0 [???..???) (throw ) i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
The loop rotation issue I noted above is fixed by #102343. Here's the block layout of SumForeach using the old layout implementation:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight IBC [IL range] [jump] [EH region] [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000] 1 1 [000..009)-> BB02(1) (always) i nullcheck
BB02 [0001] 1 BB01 1 [009..00B)-> BB04(1) (always) i keep
BB03 [0002] 1 BB06 3.63 363 [00B..017)-> BB04(1) (always) i IBC loophead bwd bwd-target
BB04 [0003] 2 BB02,BB03 8 800 [017..018)-> BB07(0.0914),BB06(0.909) ( cond ) i IBC bwd bwd-src
BB06 [0014] 1 BB04 3.63 363 [017..018)-> BB03(1) (always) i IBC idxlen bwd
BB07 [0015] 1 BB04 4 400 [017..032) (return) i IBC bwd
BB11 [0022] 0 0 [???..???) (throw ) i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
With the new implementation, here's the initial RPO layout from above:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight IBC [IL range] [jump] [EH region] [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000] 1 1 [000..009)-> BB02(1) (always) i nullcheck
BB02 [0001] 1 BB01 1 [009..00B)-> BB04(1) (always) i keep
BB04 [0003] 2 BB02,BB03 8 800 [017..018)-> BB07(0.0914),BB06(0.909) ( cond ) i IBC bwd bwd-src
BB06 [0014] 1 BB04 3.63 363 [017..018)-> BB03(1) (always) i IBC idxlen bwd
BB03 [0002] 1 BB06 3.63 363 [00B..017)-> BB04(1) (always) i IBC loophead bwd bwd-target
BB07 [0015] 1 BB04 4 400 [017..032) (return) i IBC bwd
BB11 [0022] 0 0 [???..???) (throw ) i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Then after we fix up the backward jumps, we end up with this layout:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight IBC [IL range] [jump] [EH region] [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000] 1 1 [000..009)-> BB02(1) (always) i nullcheck
BB02 [0001] 1 BB01 1 [009..00B)-> BB04(1) (always) i keep
BB03 [0002] 1 BB06 3.63 363 [00B..017)-> BB04(1) (always) i IBC loophead bwd bwd-target
BB04 [0003] 2 BB02,BB03 8 800 [017..018)-> BB07(0.0914),BB06(0.909) ( cond ) i IBC bwd bwd-src
BB06 [0014] 1 BB04 3.63 363 [017..018)-> BB03(1) (always) i IBC idxlen bwd
BB07 [0015] 1 BB04 4 400 [017..032) (return) i IBC bwd
BB11 [0022] 0 0 [???..???) (throw ) i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
So we're back to square one. We can't use fgOptimizeBranchToEmptyUnconditional to get rid of BB06 because it isn't empty. However, after setting up the RPO-based layout, there is an opportunity to compact BB06 and BB03, and then move the loop head to the top of the loop as usual. This would allow BB04 to fall through upon loop exit. That might seem like an odd transformation to make during layout instead of in fgUpdateFlowGraph, but it does unlock some branch removal opportunities, and we have TP to spare. If I add a pass to compact blocks after getting the RPO-based layout, here's how the new final layout looks:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight IBC [IL range] [jump] [EH region] [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000] 1 1 [000..009)-> BB02(1) (always) i nullcheck
BB02 [0001] 1 BB01 1 [009..00B)-> BB04(1) (always) i keep
BB06 [0014] 1 BB04 3.63 363 [00B..018)-> BB04(1) (always) i IBC idxlen bwd
BB04 [0003] 2 BB02,BB06 8 800 [017..018)-> BB07(0.0914),BB06(0.909) ( cond ) i IBC bwd bwd-src
BB07 [0015] 1 BB04 4 400 [017..032) (return) i IBC bwd
BB11 [0022] 0 0 [???..???) (throw ) i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
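The compaction step that merges BB03 into BB06 can be modeled on a bare successor map. The Python below is only a sketch of the general idea (merge a block into its sole predecessor when that predecessor has no other successor). It is not the jit's implementation, and compact_blocks, SUMFOREACH_FLOW and the keep parameter are hypothetical names. keep is there only so the model can leave BB01 and BB02 separate, as the table does; the sketch makes no claim about why the jit leaves them separate.

```python
# Flow graph of SumForeach as read from the tables (throw block omitted).
SUMFOREACH_FLOW = {
    "BB01": ["BB02"],
    "BB02": ["BB04"],
    "BB04": ["BB07", "BB06"],
    "BB06": ["BB03"],
    "BB03": ["BB04"],
    "BB07": [],
}


def compact_blocks(succs, entry, keep=()):
    """Merge B into A when A's only successor is B and B's only pred is A.

    Returns a new successor map; the input is not modified.
    """
    succs = {block: list(targets) for block, targets in succs.items()}
    changed = True
    while changed:
        changed = False
        # Recompute predecessors from the current successor map.
        preds = {block: [] for block in succs}
        for block, targets in succs.items():
            for target in targets:
                preds[target].append(block)
        for a, targets in list(succs.items()):
            if len(targets) != 1:
                continue
            b = targets[0]
            if b == a or b == entry or b in keep:
                continue
            if preds[b] == [a]:
                # A absorbs B: A now jumps wherever B used to jump.
                succs[a] = succs.pop(b)
                changed = True
                break  # predecessor lists are stale, start over
    return succs
```

Calling compact_blocks(SUMFOREACH_FLOW, entry="BB01", keep={"BB02"}) removes BB03 and leaves BB06 jumping to BB04, which matches the edges in the table above. The subsequent step of moving the merged block ahead of BB04, so that BB04 falls through on loop exit, is a layout decision and is not modeled here.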
And final codegen:
; Assembly listing for method Program:SumForeach(System.Collections.Generic.List`1[int]):int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 2 inlinees with PGO data; 4 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T07] ( 3, 3 ) ref -> rcx class-hnd single-def <System.Collections.Generic.List`1[int]>
; V01 loc0 [V01,T04] ( 4, 12.27) int -> rax
;* V02 loc1 [V02 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op <System.Collections.Generic.List`1+Enumerator[int]>
;* V03 loc2 [V03 ] ( 0, 0 ) int -> zero-ref
; V04 OutArgs [V04 ] ( 1, 1 ) struct (32) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V05 tmp1 [V05 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "NewObj constructor temp" <System.Collections.Generic.List`1+Enumerator[int]>
;* V06 tmp2 [V06,T06] ( 0, 0 ) ubyte -> zero-ref "Inline return value spill temp"
;* V07 tmp3 [V07 ] ( 0, 0 ) ref -> zero-ref class-hnd "Inline stloc first use temp" <System.Collections.Generic.List`1[int]>
; V08 tmp4 [V08,T03] ( 3, 12.63) ref -> rcx single-def "field V02._list (fldOffset=0x0)" P-INDEP
; V09 tmp5 [V09,T00] ( 6, 23.54) int -> rdx "field V02._index (fldOffset=0x8)" P-INDEP
;* V10 tmp6 [V10,T11] ( 0, 0 ) int -> zero-ref single-def "field V02._version (fldOffset=0xc)" P-INDEP
; V11 tmp7 [V11,T05] ( 2, 7.27) int -> r8 "field V02._current (fldOffset=0x10)" P-INDEP
; V12 tmp8 [V12,T08] ( 3, 3 ) ref -> rcx single-def "field V05._list (fldOffset=0x0)" P-INDEP
;* V13 tmp9 [V13,T12] ( 0, 0 ) int -> zero-ref single-def "field V05._index (fldOffset=0x8)" P-INDEP
;* V14 tmp10 [V14,T09] ( 0, 0 ) int -> zero-ref single-def "field V05._version (fldOffset=0xc)" P-INDEP
;* V15 tmp11 [V15 ] ( 0, 0 ) int -> zero-ref single-def "field V05._current (fldOffset=0x10)" P-INDEP
; V16 tmp12 [V16,T01] ( 3, 21.81) ref -> r8 "arr expr"
;* V17 cse0 [V17,T10] ( 0, 0 ) int -> zero-ref "CSE #01: aggressive"
; V18 cse1 [V18,T02] ( 2, 16 ) int -> r8 "CSE #02: aggressive"
;
; Lcl frame size = 40
G_M40154_IG01: ;; offset=0x0000
sub rsp, 40
;; size=4 bbWeight=1 PerfScore 0.25
G_M40154_IG02: ;; offset=0x0004
xor eax, eax
mov edx, dword ptr [rcx+0x14]
xor edx, edx
jmp SHORT G_M40154_IG04
align [3 bytes for IG03]
;; size=12 bbWeight=1 PerfScore 4.50
G_M40154_IG03: ;; offset=0x0010
mov r8, gword ptr [rcx+0x08]
cmp edx, dword ptr [r8+0x08]
jae SHORT G_M40154_IG06
mov r10d, edx
mov r8d, dword ptr [r8+4*r10+0x10]
inc edx
add eax, r8d
;; size=23 bbWeight=3.63 PerfScore 31.80
G_M40154_IG04: ;; offset=0x0027
mov r8d, dword ptr [rcx+0x10]
cmp edx, r8d
jb SHORT G_M40154_IG03
;; size=9 bbWeight=8 PerfScore 26.00
G_M40154_IG05: ;; offset=0x0030
add rsp, 40
ret
;; size=5 bbWeight=4 PerfScore 5.00
G_M40154_IG06: ;; offset=0x0035
call CORINFO_HELP_RNGCHKFAIL
int3
;; size=6 bbWeight=0 PerfScore 0.00
; Total bytes of code 59, prolog size 4, PerfScore 67.55, instruction count 20, allocated bytes for code 59 (MethodHash=ae626325) for method Program:SumForeach(System.Collections.Generic.List`1[int]):int (FullOpts)
; ============================================================
I think this warrants a follow-up to #102461.
While unrelated to this issue, we don't handle the BBJ_COND to empty BBJ_ALWAYS case with fgOptimizeBranchToEmptyUnconditional just yet, due to the old block layout's dependence on implicit fallthrough from conditional blocks into their false targets. Once #102343 is merged, I think we should pursue that in fgUpdateFlowGraph.
From some discussion over in dotnet/csharplang#1085. Both the for and foreach cases have suboptimal CQ.
In the for case there is a redundant branch in the inner loop.
In the foreach case the enumerator's MoveNext is not inlined. If we force this and MoveNextRare inline then the jit is not able to optimize away the version check overhead in the inner loop.
Without the inlines the foreach is roughly 2.9x slower than the for; with the inlines this drops to 1.3x or so.
category:cq theme:basic-cq skill-level:expert cost:medium impact:large