dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.09k stars 4.7k forks source link

JIT: poor CQ for list enumeration constructs #9304

Open AndyAyersMS opened 6 years ago

AndyAyersMS commented 6 years ago

From some discussion over in dotnet/csharplang#1085. Both the for and foreach cases have suboptimal CQ.

In the for case there is a redundant branch in the inner loop.

In the foreach case the enumerator's MoveNext is not inlined. If we force this and MoveNextRare inline then the jit is not able to optimize away the version check overhead in the inner loop.

Without the inlines the foreach is rougly 2.9x slower than the for; with the inlines this drops to 1.3x or so.

    static int SumFor(List<int> l)
    {
        int result = 0;

        for (int i = 0; i < l.Count; i++)
        {
            result += l[i];
        }

        return result;
    }

    static int SumForeach(List<int> l)
    {
        int result = 0;

        foreach (int i in l)
        {
            result += i;
        }

        return result;
    }

category:cq theme:basic-cq skill-level:expert cost:medium impact:large

AndyAyersMS commented 6 years ago

For SumFor:

; Assembly listing for method P:SumFor(ref):int

G_M43955_IG01:
       4883EC28             sub      rsp, 40

G_M43955_IG02:
       33C0                 xor      eax, eax                   // result
       33D2                 xor      edx, edx                   // i
       448B4118             mov      r8d, dword ptr [rcx+24]    // _size
       4585C0               test     r8d, r8d
       7E1E                 jle      SHORT G_M43955_IG05

G_M43955_IG03:
       413BD0               cmp      edx, r8d
       731E                 jae      SHORT G_M43955_IG06

G_M43955_IG04:
       4C8B4908             mov      r9, gword ptr [rcx+8]      // _items
       413B5108             cmp      edx, dword ptr [r9+8]      // array len
       731A                 jae      SHORT G_M43955_IG07
       4C63D2               movsxd   r10, edx
       4303449110           add      eax, dword ptr [r9+4*r10+16]
       FFC2                 inc      edx
       413BD0               cmp      edx, r8d
       7CE2                 jl       SHORT G_M43955_IG03

G_M43955_IG05:
       4883C428             add      rsp, 40
       C3                   ret      

G_M43955_IG06:
       E85758A35E           call     System.ThrowHelper:ThrowArgumentOutOfRange_IndexException()
       CC                   int3     

G_M43955_IG07:
       E831D8365F           call     CORINFO_HELP_RNGCHKFAIL
       CC                   

IG03 is a critical block (fork/join). The join limits the jit's ability to reason about what happens in this block.

When reaching IG03 from IG02: r8d > 0, edx = 0. So edx < r8d. So the branch to IG06 is not taken.

When reaching IG03 from IG04: edx < r8d. So the branch to IG06 is not taken.

It might be simpler for the jit to reason about IG03 if it simply cloned it to push the join down to IG04.

IG03's branch makes IG04 "conditionally executed" in the loop. So the jit won't consider hoisting invariants like "mov r9, gword ptr [rcx+8]".

benaadams commented 6 years ago

There would be more value in inlining MoveNext; if localList._size could be compared to localList._items.Length to prove the loop would never go out of bounds (and for that check to then be hoisted)

However I think the range check elimination currently only works evaluating to .Length not to a value tested to be <= .Length e.g. _size https://github.com/dotnet/coreclr/pull/14030#issuecomment-331048497

AndyAyersMS commented 6 years ago

For SumForeach:

; method P:SumFor(ref):int
; MoveNext* with AggressiveInline

G_M41030_IG01:
       4883EC28             sub      rsp, 40

G_M41030_IG02:
       33C0                 xor      eax, eax                  // i
       8B11                 mov      edx, dword ptr [rcx]
       488BD1               mov      rdx, rcx
       8B4A1C               mov      ecx, dword ptr [rdx+28]   // _version
       448BC1               mov      r8d, ecx
       448BC8               mov      r9d, eax

G_M41030_IG03:
       EB03                 jmp      SHORT G_M41030_IG05

G_M41030_IG04:
       4103C2               add      eax, r10d

G_M41030_IG05:
       443BC1               cmp      r8d, ecx
       7523                 jne      SHORT G_M41030_IG06
       443B4A18             cmp      r9d, dword ptr [rdx+24]    // _index
       731D                 jae      SHORT G_M41030_IG06
       4C8B5208             mov      r10, gword ptr [rdx+8]     // _list
       453B4A08             cmp      r9d, dword ptr [r10+8]     
       7335                 jae      SHORT G_M41030_IG11
       4D63D9               movsxd   r11, r9d
       478B549A10           mov      r10d, dword ptr [r10+4*r11+16]
       41FFC1               inc      r9d                         
       41BB01000000         mov      r11d, 1
       EB12                 jmp      SHORT G_M41030_IG08

G_M41030_IG06:
       443BC1               cmp      r8d, ecx
       7517                 jne      SHORT G_M41030_IG10

G_M41030_IG07:
       448B4A18             mov      r9d, dword ptr [rdx+24]
       41FFC1               inc      r9d
       4533D2               xor      r10d, r10d
       4533DB               xor      r11d, r11d

G_M41030_IG08:
       4585DB               test     r11d, r11d
       75BE                 jne      SHORT G_M41030_IG04

G_M41030_IG09:
       4883C428             add      rsp, 40
       C3                   ret      

G_M41030_IG10:
       E87E5FA35E           call     System.ThrowHelper:ThrowInvalidOperationException_InvalidOperation_EnumFailedVersion()
       CC                   int3     

G_M41030_IG11:
       E8A8D7365F           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     

IG08 is a also a critical block (testing the result of MoveNext):

When reached from IG07 r11d is 0 and so the jump is not taken. When reached from IG05, r11d is 1, so the jump is always taken.

By cloning this block we can streamline both branches.

The jit already realizes the compare and branch at the top of IG05 and IG06 has known outcome but doesn't act on it for some reason (perhaps because we're in SSA and don't want to mess with control flow?):

optValnumCSE morphed tree:
N004 (  5,  5)              [000115] ----G-------              *  JTRUE     void  
N002 (  1,  1)              [000310] ------------              |  /--*  LCL_VAR   int    V17 cse0          <l:$281, c:$103>
N003 (  3,  3)              [000114] N---G--N-U--              \--*  NE        int    <l:$40, c:$284>
N001 (  1,  1)              [000111] ------------                 \--*  LCL_VAR   int    V09 tmp5         u:3 <l:$281, c:$103>

optValnumCSE morphed tree:
N004 (  5,  5)              [000179] ----G-------              *  JTRUE     void  
N002 (  1,  1)              [000311] ------------              |  /--*  LCL_VAR   int    V17 cse0          <l:$281, c:$103>
N003 (  3,  3)              [000178] J---G--N----              \--*  NE        int    <l:$40, c:$289>
N001 (  1,  1)              [000173] ------------                 \--*  LCL_VAR   int    V09 tmp5         u:3 <l:$281, c:$103>

note the two operands have the same value numbers.

Fixing all that would leave us with an inner loop something like:

G_M41030_IG04:
       4103C2               add      eax, r10d

G_M41030_IG05:
       443B4A18             cmp      r9d, dword ptr [rdx+24]    // _index
       731D                 jae      SHORT G_M41030_IG10
       4C8B5208             mov      r10, gword ptr [rdx+8]     // _list
       453B4A08             cmp      r9d, dword ptr [r10+8]     
       7335                 jae      SHORT G_M41030_IG11
       4D63D9               movsxd   r11, r9d
       478B549A10           mov      r10d, dword ptr [r10+4*r11+16]
       41FFC1               inc      r9d                         
       EB12                 jmp      SHORT G_M41030_IG04

where we then might be able to host some of the loads off of rdx and we'd have code that was close to the for version.

jakobbotsch commented 5 months ago

Codegen today:

; Assembly listing for method Program:SumFor(System.Collections.Generic.List`1[int]):int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 1 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T04] (  4,  7   )     ref  ->  rcx         class-hnd single-def <System.Collections.Generic.List`1[int]>
;  V01 loc0         [V01,T02] (  4, 10   )     int  ->  rax        
;* V02 loc1         [V02,T05] (  0,  0   )     int  ->  zero-ref   
;  V03 OutArgs      [V03    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;  V04 tmp1         [V04,T01] (  3, 24   )     ref  ->  r10         "arr expr"
;  V05 cse0         [V05,T03] (  4, 10   )     int  ->   r8         "CSE #01: aggressive"
;  V06 rat0         [V06,T00] (  7, 25   )    long  ->  rdx         "Widened IV V02"
;
; Lcl frame size = 40

G_M46421_IG01:  ;; offset=0x0000
       sub      rsp, 40
                        ;; size=4 bbWeight=1 PerfScore 0.25
G_M46421_IG02:  ;; offset=0x0004
       xor      eax, eax
       xor      edx, edx
       mov      r8d, dword ptr [rcx+0x10]
       test     r8d, r8d
       jle      SHORT G_M46421_IG04
       align    [15 bytes for IG03]
                        ;; size=28 bbWeight=1 PerfScore 4.00
G_M46421_IG03:  ;; offset=0x0020
       cmp      edx, r8d
       jae      SHORT G_M46421_IG05
       mov      r10, gword ptr [rcx+0x08]
       cmp      edx, dword ptr [r10+0x08]
       jae      SHORT G_M46421_IG06
       add      eax, dword ptr [r10+4*rdx+0x10]
       inc      edx
       cmp      edx, r8d
       jl       SHORT G_M46421_IG03
                        ;; size=27 bbWeight=4 PerfScore 47.00
G_M46421_IG04:  ;; offset=0x003B
       add      rsp, 40
       ret      
                        ;; size=5 bbWeight=1 PerfScore 1.25
G_M46421_IG05:  ;; offset=0x0040
       call     [System.ThrowHelper:ThrowArgumentOutOfRange_IndexMustBeLessException()]
       int3     
                        ;; size=7 bbWeight=0 PerfScore 0.00
G_M46421_IG06:  ;; offset=0x0047
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
                        ;; size=6 bbWeight=0 PerfScore 0.00

; Total bytes of code 77, prolog size 4, PerfScore 52.50, instruction count 22, allocated bytes for code 77 (MethodHash=44f14aaa) for method Program:SumFor(System.Collections.Generic.List`1[int]):int (FullOpts)
; ============================================================
; Assembly listing for method Program:SumForeach(System.Collections.Generic.List`1[int]):int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 2 inlinees with PGO data; 4 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T07] (  3,  3   )     ref  ->  rcx         class-hnd single-def <System.Collections.Generic.List`1[int]>
;  V01 loc0         [V01,T04] (  4, 12.27)     int  ->  rax        
;* V02 loc1         [V02    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op <System.Collections.Generic.List`1+Enumerator[int]>
;* V03 loc2         [V03    ] (  0,  0   )     int  ->  zero-ref   
;  V04 OutArgs      [V04    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V05 tmp1         [V05    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op "NewObj constructor temp" <System.Collections.Generic.List`1+Enumerator[int]>
;* V06 tmp2         [V06,T06] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
;* V07 tmp3         [V07    ] (  0,  0   )     ref  ->  zero-ref    class-hnd "Inline stloc first use temp" <System.Collections.Generic.List`1[int]>
;  V08 tmp4         [V08,T03] (  3, 12.63)     ref  ->  rcx         single-def "field V02._list (fldOffset=0x0)" P-INDEP
;  V09 tmp5         [V09,T00] (  6, 23.54)     int  ->  rdx         "field V02._index (fldOffset=0x8)" P-INDEP
;* V10 tmp6         [V10,T11] (  0,  0   )     int  ->  zero-ref    single-def "field V02._version (fldOffset=0xc)" P-INDEP
;  V11 tmp7         [V11,T05] (  2,  7.27)     int  ->   r8         "field V02._current (fldOffset=0x10)" P-INDEP
;  V12 tmp8         [V12,T08] (  3,  3   )     ref  ->  rcx         single-def "field V05._list (fldOffset=0x0)" P-INDEP
;* V13 tmp9         [V13,T12] (  0,  0   )     int  ->  zero-ref    single-def "field V05._index (fldOffset=0x8)" P-INDEP
;* V14 tmp10        [V14,T09] (  0,  0   )     int  ->  zero-ref    single-def "field V05._version (fldOffset=0xc)" P-INDEP
;* V15 tmp11        [V15    ] (  0,  0   )     int  ->  zero-ref    single-def "field V05._current (fldOffset=0x10)" P-INDEP
;  V16 tmp12        [V16,T01] (  3, 21.81)     ref  ->   r8         "arr expr"
;* V17 cse0         [V17,T10] (  0,  0   )     int  ->  zero-ref    "CSE #01: aggressive"
;  V18 cse1         [V18,T02] (  2, 16   )     int  ->   r8         "CSE #02: aggressive"
;
; Lcl frame size = 40

G_M40154_IG01:  ;; offset=0x0000
       sub      rsp, 40
                        ;; size=4 bbWeight=1 PerfScore 0.25
G_M40154_IG02:  ;; offset=0x0004
       xor      eax, eax
       mov      edx, dword ptr [rcx+0x14]
       xor      edx, edx
       jmp      SHORT G_M40154_IG04
       align    [0 bytes for IG03]
                        ;; size=9 bbWeight=1 PerfScore 4.50
G_M40154_IG03:  ;; offset=0x000D
       add      eax, r8d
                        ;; size=3 bbWeight=3.63 PerfScore 0.91
G_M40154_IG04:  ;; offset=0x0010
       mov      r8d, dword ptr [rcx+0x10]
       cmp      edx, r8d
       jae      SHORT G_M40154_IG06
                        ;; size=9 bbWeight=8 PerfScore 26.00
G_M40154_IG05:  ;; offset=0x0019
       mov      r8, gword ptr [rcx+0x08]
       cmp      edx, dword ptr [r8+0x08]
       jae      SHORT G_M40154_IG07
       mov      r10d, edx
       mov      r8d, dword ptr [r8+4*r10+0x10]
       inc      edx
       jmp      SHORT G_M40154_IG03
                        ;; size=22 bbWeight=3.63 PerfScore 38.16
G_M40154_IG06:  ;; offset=0x002F
       add      rsp, 40
       ret      
                        ;; size=5 bbWeight=4 PerfScore 5.00
G_M40154_IG07:  ;; offset=0x0034
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
                        ;; size=6 bbWeight=0 PerfScore 0.00

; Total bytes of code 58, prolog size 4, PerfScore 74.82, instruction count 21, allocated bytes for code 58 (MethodHash=ae626325) for method Program:SumForeach(System.Collections.Generic.List`1[int]):int (FullOpts)
; ============================================================

Looks much better today, but there still seems to be a redundant compare in SumFor and some quite suboptimal block layout in SumForeach (might be an interesting case to look at for block layout, cc @amanasifkhalid as well).

We also seemingly don't manage to widen the IV for the foreach version. I opened #101176 for this.

amanasifkhalid commented 4 months ago

With DOTNET_JitDoReversePostOrderLayout=1, SumFor is unchanged, and SumForeach is marginally better:

; Assembly listing for method Program:SumForeach(System.Collections.Generic.List`1[int]):int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 2 inlinees with PGO data; 4 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T07] (  3,  3   )     ref  ->  rcx         class-hnd single-def <System.Collections.Generic.List`1[int]>
;  V01 loc0         [V01,T04] (  4, 12.27)     int  ->  rax
;* V02 loc1         [V02    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op <System.Collections.Generic.List`1+Enumerator[int]>
;* V03 loc2         [V03    ] (  0,  0   )     int  ->  zero-ref
;  V04 OutArgs      [V04    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V05 tmp1         [V05    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op "NewObj constructor temp" <System.Collections.Generic.List`1+Enumerator[int]>
;* V06 tmp2         [V06,T06] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
;* V07 tmp3         [V07    ] (  0,  0   )     ref  ->  zero-ref    class-hnd "Inline stloc first use temp" <System.Collections.Generic.List`1[int]>
;  V08 tmp4         [V08,T03] (  3, 12.63)     ref  ->  rcx         single-def "field V02._list (fldOffset=0x0)" P-INDEP
;  V09 tmp5         [V09,T00] (  6, 23.54)     int  ->  rdx         "field V02._index (fldOffset=0x8)" P-INDEP
;* V10 tmp6         [V10,T11] (  0,  0   )     int  ->  zero-ref    single-def "field V02._version (fldOffset=0xc)" P-INDEP
;  V11 tmp7         [V11,T05] (  2,  7.27)     int  ->   r8         "field V02._current (fldOffset=0x10)" P-INDEP
;  V12 tmp8         [V12,T08] (  3,  3   )     ref  ->  rcx         single-def "field V05._list (fldOffset=0x0)" P-INDEP
;* V13 tmp9         [V13,T12] (  0,  0   )     int  ->  zero-ref    single-def "field V05._index (fldOffset=0x8)" P-INDEP
;* V14 tmp10        [V14,T09] (  0,  0   )     int  ->  zero-ref    single-def "field V05._version (fldOffset=0xc)" P-INDEP
;* V15 tmp11        [V15    ] (  0,  0   )     int  ->  zero-ref    single-def "field V05._current (fldOffset=0x10)" P-INDEP
;  V16 tmp12        [V16,T01] (  3, 21.81)     ref  ->   r8         "arr expr"
;* V17 cse0         [V17,T10] (  0,  0   )     int  ->  zero-ref    "CSE #01: aggressive"
;  V18 cse1         [V18,T02] (  2, 16   )     int  ->   r8         "CSE #02: aggressive"
;
; Lcl frame size = 40

G_M40154_IG01:  ;; offset=0x0000
       sub      rsp, 40
                                                ;; size=4 bbWeight=1 PerfScore 0.25
G_M40154_IG02:  ;; offset=0x0004
       xor      eax, eax
       mov      edx, dword ptr [rcx+0x14]
       xor      edx, edx
       align    [0 bytes for IG03]
                                                ;; size=7 bbWeight=1 PerfScore 2.50
G_M40154_IG03:  ;; offset=0x000B
       mov      r8d, dword ptr [rcx+0x10]
       cmp      edx, r8d
       jae      SHORT G_M40154_IG05
                                                ;; size=9 bbWeight=8 PerfScore 26.00
G_M40154_IG04:  ;; offset=0x0014
       mov      r8, gword ptr [rcx+0x08]
       cmp      edx, dword ptr [r8+0x08]
       jae      SHORT G_M40154_IG06
       mov      r10d, edx
       mov      r8d, dword ptr [r8+4*r10+0x10]
       inc      edx
       add      eax, r8d
       jmp      SHORT G_M40154_IG03
                                                ;; size=25 bbWeight=3.63 PerfScore 39.07
G_M40154_IG05:  ;; offset=0x002D
       add      rsp, 40
       ret
                                                ;; size=5 bbWeight=4 PerfScore 5.00
G_M40154_IG06:  ;; offset=0x0032
       call     CORINFO_HELP_RNGCHKFAIL
       int3
                                                ;; size=6 bbWeight=0 PerfScore 0.00

; Total bytes of code 56, prolog size 4, PerfScore 72.82, instruction count 20, allocated bytes for code 56 (MethodHash=ae626325) for method Program:SumForeach(System.Collections.Generic.List`1[int]):int (FullOpts)
; ============================================================

Looking at the JIT dump, this looks like another case of the loop shape issue mentioned in #102343:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1        [000..009)-> BB02(1)                 (always)                     i nullcheck
BB02 [0001]  1       BB01                  1        [009..00B)-> BB04(1)                 (always)                     i keep
BB04 [0003]  2       BB02,BB03             8    800 [017..018)-> BB07(0.0914),BB06(0.909)  ( cond )                     i IBC bwd bwd-src
BB06 [0014]  1       BB04                  3.63 363 [017..018)-> BB03(1)                 (always)                     i IBC idxlen bwd
BB03 [0002]  1       BB06                  3.63 363 [00B..017)-> BB04(1)                 (always)                     i IBC loophead bwd bwd-target
BB07 [0015]  1       BB04                  4    400 [017..032)                           (return)                     i IBC bwd
BB11 [0022]  0                             0        [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
amanasifkhalid commented 4 months ago

The loop rotation issue I noted above is fixed by #102343. Here's the block layout of SumForeach using the old layout implementation:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1        [000..009)-> BB02(1)                 (always)                     i nullcheck
BB02 [0001]  1       BB01                  1        [009..00B)-> BB04(1)                 (always)                     i keep
BB03 [0002]  1       BB06                  3.63 363 [00B..017)-> BB04(1)                 (always)                     i IBC loophead bwd bwd-target
BB04 [0003]  2       BB02,BB03             8    800 [017..018)-> BB07(0.0914),BB06(0.909)  ( cond )                     i IBC bwd bwd-src
BB06 [0014]  1       BB04                  3.63 363 [017..018)-> BB03(1)                 (always)                     i IBC idxlen bwd
BB07 [0015]  1       BB04                  4    400 [017..032)                           (return)                     i IBC bwd
BB11 [0022]  0                             0        [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

With the new implementation, here's the initial RPO layout from above:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1        [000..009)-> BB02(1)                 (always)                     i nullcheck
BB02 [0001]  1       BB01                  1        [009..00B)-> BB04(1)                 (always)                     i keep
BB04 [0003]  2       BB02,BB03             8    800 [017..018)-> BB07(0.0914),BB06(0.909)  ( cond )                     i IBC bwd bwd-src
BB06 [0014]  1       BB04                  3.63 363 [017..018)-> BB03(1)                 (always)                     i IBC idxlen bwd
BB03 [0002]  1       BB06                  3.63 363 [00B..017)-> BB04(1)                 (always)                     i IBC loophead bwd bwd-target
BB07 [0015]  1       BB04                  4    400 [017..032)                           (return)                     i IBC bwd
BB11 [0022]  0                             0        [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Then after we fix up the backward jumps, we end up with this layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1        [000..009)-> BB02(1)                 (always)                     i nullcheck
BB02 [0001]  1       BB01                  1        [009..00B)-> BB04(1)                 (always)                     i keep
BB03 [0002]  1       BB06                  3.63 363 [00B..017)-> BB04(1)                 (always)                     i IBC loophead bwd bwd-target
BB04 [0003]  2       BB02,BB03             8    800 [017..018)-> BB07(0.0914),BB06(0.909)  ( cond )                     i IBC bwd bwd-src
BB06 [0014]  1       BB04                  3.63 363 [017..018)-> BB03(1)                 (always)                     i IBC idxlen bwd
BB07 [0015]  1       BB04                  4    400 [017..032)                           (return)                     i IBC bwd
BB11 [0022]  0                             0        [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

So we're back to square one. We can't use fgOptimizeBranchToEmptyUnconditional to get rid of BB06 because it isn't empty. However, after setting up the RPO-based layout, there is an opportunity to compact BB06 and BB03, and then move the loop head to the top of the loop as usual. This would allow BB04 to fall through upon loop exit. That might seem like an odd transformation to make during layout instead of in fgUpdateFlowGraph, but it does unlock some branch removal opportunities, and we have TP to spare. If I add a pass to compact blocks after getting the RPO-based layout, here's how the new final layout looks:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1        [000..009)-> BB02(1)                 (always)                     i nullcheck
BB02 [0001]  1       BB01                  1        [009..00B)-> BB04(1)                 (always)                     i keep
BB06 [0014]  1       BB04                  3.63 363 [00B..018)-> BB04(1)                 (always)                     i IBC idxlen bwd
BB04 [0003]  2       BB02,BB06             8    800 [017..018)-> BB07(0.0914),BB06(0.909)  ( cond )                     i IBC bwd bwd-src
BB07 [0015]  1       BB04                  4    400 [017..032)                           (return)                     i IBC bwd
BB11 [0022]  0                             0        [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

And final codegen:

; Assembly listing for method Program:SumForeach(System.Collections.Generic.List`1[int]):int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 2 inlinees with PGO data; 4 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T07] (  3,  3   )     ref  ->  rcx         class-hnd single-def <System.Collections.Generic.List`1[int]>
;  V01 loc0         [V01,T04] (  4, 12.27)     int  ->  rax
;* V02 loc1         [V02    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op <System.Collections.Generic.List`1+Enumerator[int]>
;* V03 loc2         [V03    ] (  0,  0   )     int  ->  zero-ref
;  V04 OutArgs      [V04    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V05 tmp1         [V05    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op "NewObj constructor temp" <System.Collections.Generic.List`1+Enumerator[int]>
;* V06 tmp2         [V06,T06] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
;* V07 tmp3         [V07    ] (  0,  0   )     ref  ->  zero-ref    class-hnd "Inline stloc first use temp" <System.Collections.Generic.List`1[int]>
;  V08 tmp4         [V08,T03] (  3, 12.63)     ref  ->  rcx         single-def "field V02._list (fldOffset=0x0)" P-INDEP
;  V09 tmp5         [V09,T00] (  6, 23.54)     int  ->  rdx         "field V02._index (fldOffset=0x8)" P-INDEP
;* V10 tmp6         [V10,T11] (  0,  0   )     int  ->  zero-ref    single-def "field V02._version (fldOffset=0xc)" P-INDEP
;  V11 tmp7         [V11,T05] (  2,  7.27)     int  ->   r8         "field V02._current (fldOffset=0x10)" P-INDEP
;  V12 tmp8         [V12,T08] (  3,  3   )     ref  ->  rcx         single-def "field V05._list (fldOffset=0x0)" P-INDEP
;* V13 tmp9         [V13,T12] (  0,  0   )     int  ->  zero-ref    single-def "field V05._index (fldOffset=0x8)" P-INDEP
;* V14 tmp10        [V14,T09] (  0,  0   )     int  ->  zero-ref    single-def "field V05._version (fldOffset=0xc)" P-INDEP
;* V15 tmp11        [V15    ] (  0,  0   )     int  ->  zero-ref    single-def "field V05._current (fldOffset=0x10)" P-INDEP
;  V16 tmp12        [V16,T01] (  3, 21.81)     ref  ->   r8         "arr expr"
;* V17 cse0         [V17,T10] (  0,  0   )     int  ->  zero-ref    "CSE #01: aggressive"
;  V18 cse1         [V18,T02] (  2, 16   )     int  ->   r8         "CSE #02: aggressive"
;
; Lcl frame size = 40

G_M40154_IG01:  ;; offset=0x0000
       sub      rsp, 40
                                                ;; size=4 bbWeight=1 PerfScore 0.25
G_M40154_IG02:  ;; offset=0x0004
       xor      eax, eax
       mov      edx, dword ptr [rcx+0x14]
       xor      edx, edx
       jmp      SHORT G_M40154_IG04
       align    [3 bytes for IG03]
                                                ;; size=12 bbWeight=1 PerfScore 4.50
G_M40154_IG03:  ;; offset=0x0010
       mov      r8, gword ptr [rcx+0x08]
       cmp      edx, dword ptr [r8+0x08]
       jae      SHORT G_M40154_IG06
       mov      r10d, edx
       mov      r8d, dword ptr [r8+4*r10+0x10]
       inc      edx
       add      eax, r8d
                                                ;; size=23 bbWeight=3.63 PerfScore 31.80
G_M40154_IG04:  ;; offset=0x0027
       mov      r8d, dword ptr [rcx+0x10]
       cmp      edx, r8d
       jb       SHORT G_M40154_IG03
                                                ;; size=9 bbWeight=8 PerfScore 26.00
G_M40154_IG05:  ;; offset=0x0030
       add      rsp, 40
       ret
                                                ;; size=5 bbWeight=4 PerfScore 5.00
G_M40154_IG06:  ;; offset=0x0035
       call     CORINFO_HELP_RNGCHKFAIL
       int3
                                                ;; size=6 bbWeight=0 PerfScore 0.00

; Total bytes of code 59, prolog size 4, PerfScore 67.55, instruction count 20, allocated bytes for code 59 (MethodHash=ae626325) for method Program:SumForeach(System.Collections.Generic.List`1[int]):int (FullOpts)
; ============================================================

I think this warrants a follow-up to #102461.

While unrelated to this issue, we don't handle the BBJ_COND to empty BBJ_ALWAYS case with fgOptimizeBranchToEmptyUnconditional just yet, due to the old block layout's dependence on implicit fallthrough from conditional blocks into their false targets. Once #102343 is merged, I think we should pursue that in fgUpdateFlowGraph.