dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[Perf] Regressions on System.Collections.IterateForEach<String> #106342

Open performanceautofiler[bot] opened 1 month ago

performanceautofiler[bot] commented 1 month ago

Run Information

| Name | Value |
| --- | --- |
| Architecture | x64 |
| OS | Windows 10.0.22631 |
| Queue | ViperWindows |
| Baseline | d3cd5b6510871b97056ed421ae84be4805f547e6 |
| Compare | 96eda8dd3e75c7e10b3b603a7cf01c9688fe9aa2 |
| Diff | Diff |
| Configs | CompilationMode:tiered, RunKind:micro |

Regressions in System.Collections.IterateForEach<String>

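| Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | 827.40 ns | 926.42 ns | 1.12 | 0.03 | False | | | |
| | 830.92 ns | 960.67 ns | 1.16 | 0.03 | False | | | |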


Repro

General Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net8.0 --filter 'System.Collections.IterateForEach<String>*'

### System.Collections.IterateForEach<String>.HashSet(Size: 512)

#### ETL Files

#### Histogram

#### JIT Disasms

### System.Collections.IterateForEach<String>.Dictionary(Size: 512)

#### ETL Files

#### Histogram

#### JIT Disasms

### Docs

[Profiling workflow for dotnet/runtime repository](https://github.com/dotnet/performance/blob/master/docs/profiling-workflow-dotnet-runtime.md)
[Benchmarking workflow for dotnet/runtime repository](https://github.com/dotnet/performance/blob/master/docs/benchmarking-workflow-dotnet-runtime.md)

LoopedBard3 commented 1 month ago

Related regressions:

- Linux x64: https://github.com/dotnet/perf-autofiling-issues/issues/39775, https://github.com/dotnet/perf-autofiling-issues/issues/39746
- Windows x64: https://github.com/dotnet/perf-autofiling-issues/issues/39770
- Linux Arm64: https://github.com/dotnet/perf-autofiling-issues/issues/39883 (Only the IterateForEach test), https://github.com/dotnet/perf-autofiling-issues/issues/40387
- Windows Arm64: https://github.com/dotnet/perf-autofiling-issues/issues/41043

dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @dotnet/area-system-collections See info in area-owners.md if you want to be subscribed.

LoopedBard3 commented 1 month ago

Likely caused by: https://github.com/dotnet/runtime/pull/106185 @jakobbotsch. Seems like this may be a necessary bug fix.

jakobbotsch commented 1 month ago

Not exactly... I will investigate for .NET 9.

dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

jakobbotsch commented 1 month ago

System.Collections.IterateForEach<String>.HashSet(Size: 512)

Hot functions:

Diffs

### ``[MicroBenchmarks]System.Collections.IterateForEach`1[System.__Canon].HashSet()``

```diff
 ; optimized using Dynamic PGO
 ; rbp based frame
 ; fully interruptible
-; with Dynamic PGO: fgCalledCount is 61496
+; with Dynamic PGO: fgCalledCount is 59816
 ; 1 inlinees with PGO data; 3 single block inlinees; 0 inlinees without PGO data
 ; Final local variable assignments
 ;
-; V00 this [V00,T07] ( 5, 4.00) ref -> [rbp+0x10] this class-hnd EH-live single-def
-; V01 loc0 [V01,T08] ( 3, 504.54) ref -> rbx ld-addr-op class-hnd
-; V02 loc1 [V02,T14] ( 2, 2 ) ref -> rdx class-hnd single-def
-; V03 loc2 [V03 ] ( 18,4534.88) struct (24) [rbp-0x28] do-not-enreg[XS] must-init addr-exposed ld-addr-op
-; V04 loc3 [V04,T06] ( 2,1005.08) ref -> rbx class-hnd
+; V00 this [V00,T04] ( 5, 514.70) ref -> [rbp+0x10] this class-hnd EH-live single-def
+; V01 loc0 [V01,T06] ( 3, 512.70) ref -> rbx ld-addr-op class-hnd
+; V02 loc1 [V02,T10] ( 2, 2 ) ref -> rdx class-hnd single-def
+; V03 loc2 [V03 ] ( 18,4608.29) struct (24) [rbp-0x28] do-not-enreg[XS] must-init addr-exposed ld-addr-op
+; V04 loc3 [V04,T03] ( 2,1021.40) ref -> rbx class-hnd
 ; V05 OutArgs [V05 ] ( 1, 1 ) struct (32) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;* V06 tmp1 [V06 ] ( 0, 0 ) long -> zero-ref "spilling helperCall"
 ;* V07 tmp2 [V07 ] ( 0, 0 ) long -> zero-ref "spilling helperCall"
 ;* V08 tmp3 [V08 ] ( 0, 0 ) long -> zero-ref "spilling helperCall"
-; V09 tmp4 [V09,T12] ( 6, 4.00) long -> rax "Indirect call through function pointer"
+; V09 tmp4 [V09,T08] ( 6, 4.00) long -> r8 "Indirect call through function pointer"
 ;* V10 tmp5 [V10 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "NewObj constructor temp"
 ;* V11 tmp6 [V11 ] ( 0, 0 ) long -> zero-ref "spilling helperCall"
 ;* V12 tmp7 [V12 ] ( 0, 0 ) ubyte -> zero-ref "Inline return value spill temp"
-; V13 tmp8 [V13,T04] ( 4,2010.27) int -> rax "Inline stloc first use temp"
-; V14 tmp9 [V14,T00] ( 3,3015.40) ref -> rdx class-hnd exact "impAppendStmt" <>
-; V15 tmp10 [V15,T05] ( 3,1507.67) byref -> rdx "Inline stloc first use temp"
-; V16 tmp11 [V16 ] ( 7,1512.63) ref -> [rbp-0x28] do-not-enreg[X] addr-exposed "field V03._hashSet (fldOffset=0x0)" P-DEP
-; V17 tmp12 [V17 ] ( 6,1008.06) ref -> [rbp-0x20] do-not-enreg[X] addr-exposed "field V03._current (fldOffset=0x8)" P-DEP
-; V18 tmp13 [V18 ] ( 4, 505.54) int -> [rbp-0x18] do-not-enreg[X] addr-exposed "field V03._version (fldOffset=0x10)" P-DEP
-; V19 tmp14 [V19 ] ( 7,1511.65) int -> [rbp-0x14] do-not-enreg[X] addr-exposed "field V03._index (fldOffset=0x14)" P-DEP
-; V20 tmp15 [V20,T13] ( 3, 3 ) ref -> rdx single-def "field V10._hashSet (fldOffset=0x0)" P-INDEP
-;* V21 tmp16 [V21,T16] ( 0, 0 ) ref -> zero-ref single-def "field V10._current (fldOffset=0x8)" P-INDEP
-; V22 tmp17 [V22,T15] ( 2, 2 ) int -> rax single-def "field V10._version (fldOffset=0x10)" P-INDEP
-;* V23 tmp18 [V23,T17] ( 0, 0 ) int -> zero-ref single-def "field V10._index (fldOffset=0x14)" P-INDEP
-; V24 PSPSym [V24,T18] ( 1, 1 ) long -> [rbp-0x30] do-not-enreg[V] "PSPSym"
-;* V25 cse0 [V25,T09] ( 0, 0 ) long -> zero-ref "CSE #03: aggressive"
-;* V26 rat0 [V26,T02] ( 0, 0 ) long -> zero-ref "Spilling to split statement for tree"
-;* V27 rat1 [V27,T03] ( 0, 0 ) long -> zero-ref "runtime lookup"
-;* V28 rat2 [V28,T01] ( 0, 0 ) long -> zero-ref "fgMakeTemp is creating a new local variable"
-; V29 rat3 [V29,T11] ( 3, 4.40) long -> rdx "Spilling to split statement for tree"
-; V30 rat4 [V30,T10] ( 3, 5.60) long -> rax "fgMakeTemp is creating a new local variable"
-; V31 rat5 [V31,T19] ( 3, 0 ) long -> rdx "Spilling to split statement for tree"
-; V32 rat6 [V32,T20] ( 3, 0 ) long -> rax "fgMakeTemp is creating a new local variable"
+; V13 tmp8 [V13,T01] ( 4,2042.80) int -> r8 "Inline stloc first use temp"
+; V14 tmp9 [V14,T00] ( 3,3064.20) ref -> rdx class-hnd exact "impAppendStmt" <>
+; V15 tmp10 [V15,T02] ( 3,1532.10) byref -> rdx "Inline stloc first use temp"
+; V16 tmp11 [V16 ] ( 7,1537.10) ref -> [rbp-0x28] do-not-enreg[X] addr-exposed "field V03._hashSet (fldOffset=0x0)" P-DEP
+; V17 tmp12 [V17 ] ( 6,1024.40) ref -> [rbp-0x20] do-not-enreg[X] addr-exposed "field V03._current (fldOffset=0x8)" P-DEP
+; V18 tmp13 [V18 ] ( 4, 513.70) int -> [rbp-0x18] do-not-enreg[X] addr-exposed "field V03._version (fldOffset=0x10)" P-DEP
+; V19 tmp14 [V19 ] ( 7,1536.10) int -> [rbp-0x14] do-not-enreg[X] addr-exposed "field V03._index (fldOffset=0x14)" P-DEP
+; V20 tmp15 [V20,T09] ( 3, 3 ) ref -> rdx single-def "field V10._hashSet (fldOffset=0x0)" P-INDEP
+;* V21 tmp16 [V21,T12] ( 0, 0 ) ref -> zero-ref single-def "field V10._current (fldOffset=0x8)" P-INDEP
+; V22 tmp17 [V22,T11] ( 2, 2 ) int -> rax single-def "field V10._version (fldOffset=0x10)" P-INDEP
+;* V23 tmp18 [V23,T13] ( 0, 0 ) int -> zero-ref single-def "field V10._index (fldOffset=0x14)" P-INDEP
+; V24 PSPSym [V24,T14] ( 1, 1 ) long -> [rbp-0x30] do-not-enreg[V] "PSPSym"
+; V25 cse0 [V25,T05] ( 5, 512.90) long -> rax multi-def "CSE #02: aggressive"
+; V26 rat0 [V26,T07] ( 3, 5.60) long -> r8 "fgMakeTemp is creating a new local variable"
+; V27 rat1 [V27,T15] ( 3, 0 ) long -> rax "Spilling to split statement for tree"
+; V28 rat2 [V28,T16] ( 3, 0 ) long -> r8 "fgMakeTemp is creating a new local variable"
 ;
 ; Lcl frame size = 72
@@ -564,37 +560,39 @@ G_M28910_IG03:
        mov dword ptr [rbp-0x14], edx
        ;; size=3 bbWeight=1 PerfScore 1.00
 G_M28910_IG04:
+       mov rax, qword ptr [rcx]
        mov edx, dword ptr [rbp-0x18]
-       mov rax, gword ptr [rbp-0x28]
-       cmp edx, dword ptr [rax+0x34]
+       mov r8, gword ptr [rbp-0x28]
+       cmp edx, dword ptr [r8+0x34]
        jne SHORT G_M28910_IG10
-       align [0 bytes for IG05]
-       ;; size=12 bbWeight=503.54 PerfScore 3021.25
+       align [1 bytes for IG05]
+       ;; size=17 bbWeight=511.70 PerfScore 4221.52
 G_M28910_IG05:
        mov edx, dword ptr [rbp-0x14]
-       mov rax, gword ptr [rbp-0x28]
-       cmp edx, dword ptr [rax+0x28]
+       mov r8, gword ptr [rbp-0x28]
+       cmp edx, dword ptr [r8+0x28]
        jae SHORT G_M28910_IG08
-       ;; size=12 bbWeight=503.54 PerfScore 3021.25
+       ;; size=13 bbWeight=511.70 PerfScore 3070.19
 G_M28910_IG06:
        mov rdx, gword ptr [rbp-0x28]
        mov rdx, gword ptr [rdx+0x10]
-       mov eax, dword ptr [rbp-0x14]
-       lea r8d, [rax+0x01]
-       mov dword ptr [rbp-0x14], r8d
-       cmp eax, dword ptr [rdx+0x08]
+       mov r8d, dword ptr [rbp-0x14]
+       lea r10d, [r8+0x01]
+       mov dword ptr [rbp-0x14], r10d
+       cmp r8d, dword ptr [rdx+0x08]
        jae SHORT G_M28910_IG09
-       shl rax, 4
-       lea rdx, bword ptr [rdx+rax+0x10]
+       shl r8, 4
+       lea rdx, bword ptr [rdx+r8+0x10]
        cmp dword ptr [rdx+0x0C], -1
        jl SHORT G_M28910_IG05
-       ;; size=39 bbWeight=502.57 PerfScore 7538.50
+       ;; size=41 bbWeight=510.70 PerfScore 7660.51
 G_M28910_IG07:
-       mov rdx, gword ptr [rdx]
-       mov gword ptr [rbp-0x20], rdx
+       mov rax, gword ptr [rdx]
+       mov gword ptr [rbp-0x20], rax
        mov rbx, gword ptr [rbp-0x20]
+       mov rcx, gword ptr [rbp+0x10]
        jmp SHORT G_M28910_IG04
-       ;; size=13 bbWeight=502.54 PerfScore 3015.25
+       ;; size=17 bbWeight=510.70 PerfScore 3574.89
 G_M28910_IG08:
        mov rdx, gword ptr [rbp-0x28]
        mov edx, dword ptr [rdx+0x28]
@@ -603,7 +601,7 @@ G_M28910_IG08:
        xor rdx, rdx
        mov gword ptr [rbp-0x20], rdx
        jmp SHORT G_M28910_IG11
-       ;; size=20 bbWeight=0.98 PerfScore 7.32
+       ;; size=20 bbWeight=1.00 PerfScore 7.49
 G_M28910_IG09:
        call CORINFO_HELP_RNGCHKFAIL
        int3
@@ -613,18 +611,17 @@ G_M28910_IG10:
        int3
        ;; size=7 bbWeight=0 PerfScore 0.00
 G_M28910_IG11:
-       mov rdx, qword ptr [rcx]
-       mov rax, qword ptr [rdx+0x30]
-       mov rax, qword ptr [rax]
-       mov rax, qword ptr [rax+0x28]
-       test rax, rax
+       mov rdx, qword ptr [rax+0x30]
+       mov rdx, qword ptr [rdx]
+       mov r8, qword ptr [rdx+0x28]
+       test r8, r8
        je SHORT G_M28910_IG14
-       ;; size=19 bbWeight=1.00 PerfScore 9.25
+       ;; size=16 bbWeight=1.00 PerfScore 7.25
 G_M28910_IG12:
        lea rcx, [rbp-0x28]
-       call rax
+       call r8
        mov rax, rbx
-       ;; size=9 bbWeight=1.00 PerfScore 3.75
+       ;; size=10 bbWeight=1.00 PerfScore 3.75
 G_M28910_IG13:
        add rsp, 72
        pop rbx
@@ -632,11 +629,12 @@ G_M28910_IG13:
        ret
        ;; size=7 bbWeight=1.00 PerfScore 2.25
 G_M28910_IG14:
-       mov rcx, rdx
+       mov rcx, rax
        mov rdx, 0xD1FFAB1E ; global ptr
        call CORINFO_HELP_RUNTIMEHANDLE_CLASS
+       mov r8, rax
        jmp SHORT G_M28910_IG12
-       ;; size=20 bbWeight=0.20 PerfScore 0.70
+       ;; size=23 bbWeight=0.20 PerfScore 0.75
 G_M28910_IG15:
        push rbp
        push rbx
@@ -647,24 +645,25 @@ G_M28910_IG15:
        ;; size=19 bbWeight=0 PerfScore 0.00
 G_M28910_IG16:
        mov rcx, gword ptr [rbp+0x10]
-       mov rdx, qword ptr [rcx]
-       mov rax, qword ptr [rdx+0x30]
-       mov rax, qword ptr [rax]
-       mov rax, qword ptr [rax+0x28]
-       test rax, rax
+       mov rax, qword ptr [rcx]
+       mov rdx, qword ptr [rax+0x30]
+       mov rdx, qword ptr [rdx]
+       mov r8, qword ptr [rdx+0x28]
+       test r8, r8
        je SHORT G_M28910_IG17
        jmp SHORT G_M28910_IG18
        ;; size=25 bbWeight=0 PerfScore 0.00
 G_M28910_IG17:
-       mov rcx, rdx
+       mov rcx, rax
        mov rdx, 0xD1FFAB1E ; global ptr
        call CORINFO_HELP_RUNTIMEHANDLE_CLASS
-       ;; size=18 bbWeight=0 PerfScore 0.00
+       mov r8, rax
+       ;; size=21 bbWeight=0 PerfScore 0.00
 G_M28910_IG18:
        lea rcx, [rbp-0x28]
-       call rax
+       call r8
        nop
-       ;; size=7 bbWeight=0 PerfScore 0.00
+       ;; size=8 bbWeight=0 PerfScore 0.00
 G_M28910_IG19:
        add rsp, 40
        pop rbx
@@ -672,6 +671,6 @@ G_M28910_IG19:
        ret
        ;; size=7 bbWeight=0 PerfScore 0.00
-; Total bytes of code 303, prolog size 38, PerfScore 16637.35, instruction count 95, allocated bytes for code 303 (MethodHash=c31c8f11) for method System.Collections.IterateForEach`1[System.__Canon]:HashSet():System.__Canon:this (Tier1)
+; Total bytes of code 320, prolog size 38, PerfScore 18566.43, instruction count 98, allocated bytes for code 320 (MethodHash=c31c8f11) for method System.Collections.IterateForEach`1[System.__Canon]:HashSet():System.__Canon:this (Tier1)
 ; ============================================================
```

System.Collections.IterateForEach<String>.Dictionary(Size: 512)

Hot functions:

Diffs

### ``[MicroBenchmarks]System.Collections.IterateForEach`1[System.__Canon].Dictionary()``

```diff
 ; optimized using Dynamic PGO
 ; rbp based frame
 ; fully interruptible
-; with Dynamic PGO: fgCalledCount is 15716
+; with Dynamic PGO: fgCalledCount is 14830
 ; 1 inlinees with PGO data; 5 single block inlinees; 0 inlinees without PGO data
 ; Final local variable assignments
 ;
-; V00 this [V00,T13] ( 5, 4.00) ref -> [rbp+0x10] this class-hnd EH-live single-def
-; V01 loc0 [V01,T14] ( 3, 515.69) ref -> rbx ld-addr-op class-hnd
-; V02 loc1 [V02,T19] ( 3, 3 ) ref -> rdx class-hnd single-def
-; V03 loc2 [V03 ] ( 17,4121.55) struct (40) [rbp-0x38] do-not-enreg[XSF] must-init addr-exposed ld-addr-op
+; V00 this [V00,T10] ( 5, 517.73) ref -> [rbp+0x10] this class-hnd EH-live single-def
+; V01 loc0 [V01,T12] ( 3, 515.73) ref -> rbx ld-addr-op class-hnd
+; V02 loc1 [V02,T15] ( 3, 3 ) ref -> rdx class-hnd single-def
+; V03 loc2 [V03 ] ( 17,4121.81) struct (40) [rbp-0x38] do-not-enreg[XSF] must-init addr-exposed ld-addr-op
 ;* V04 loc3 [V04 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op
 ; V05 OutArgs [V05 ] ( 1, 1 ) struct (32) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;* V06 tmp1 [V06 ] ( 0, 0 ) long -> zero-ref "spilling helperCall"
 ;* V07 tmp2 [V07 ] ( 0, 0 ) long -> zero-ref "spilling helperCall"
 ;* V08 tmp3 [V08 ] ( 0, 0 ) long -> zero-ref "spilling helperCall"
 ;* V09 tmp4 [V09 ] ( 0, 0 ) long -> zero-ref "spilling helperCall"
-; V10 tmp5 [V10,T18] ( 6, 4.00) long -> rax "Indirect call through function pointer"
-; V11 tmp6 [V11,T21] ( 1, 2 ) struct (40) [rbp-0x60] do-not-enreg[SF] must-init ld-addr-op "NewObj constructor temp"
+; V10 tmp5 [V10,T14] ( 6, 4.00) long -> r8 "Indirect call through function pointer"
+; V11 tmp6 [V11,T17] ( 1, 2 ) struct (40) [rbp-0x60] do-not-enreg[SF] must-init ld-addr-op "NewObj constructor temp"
 ;* V12 tmp7 [V12 ] ( 0, 0 ) long -> zero-ref "spilling helperCall"
 ;* V13 tmp8 [V13 ] ( 0, 0 ) ubyte -> zero-ref "Inline return value spill temp"
-; V14 tmp9 [V14,T04] ( 4,2054.79) int -> rdx "Inline stloc first use temp"
-; V15 tmp10 [V15,T00] ( 3,3082.18) ref -> rax class-hnd exact "impAppendStmt" <>
-; V16 tmp11 [V16,T05] ( 4,2054.78) byref -> rdx "Inline stloc first use temp"
+; V14 tmp9 [V14,T04] ( 4,2054.89) int -> rdx "Inline stloc first use temp"
+; V15 tmp10 [V15,T00] ( 3,3082.33) ref -> r8 class-hnd exact "impAppendStmt" <>
+; V16 tmp11 [V16,T03] ( 4,2054.90) byref -> rdx "Inline stloc first use temp"
 ;* V17 tmp12 [V17 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op "NewObj constructor temp"
 ;* V18 tmp13 [V18 ] ( 0, 0 ) long -> zero-ref "Inlining Arg"
 ;* V19 tmp14 [V19 ] ( 0, 0 ) long -> zero-ref "spilling helperCall"
-; V20 tmp15 [V20,T06] ( 2,2054.78) ref -> rax class-hnd "Inlining Arg"
-; V21 tmp16 [V21,T07] ( 2,2054.78) ref -> rdx class-hnd "Inlining Arg"
+; V20 tmp15 [V20,T01] ( 2,2054.90) ref -> rax class-hnd "Inlining Arg"
+; V21 tmp16 [V21,T02] ( 2,2054.90) ref -> rdx class-hnd "Inlining Arg"
 ;* V22 tmp17 [V22 ] ( 0, 0 ) ref -> zero-ref "field V04.key (fldOffset=0x0)" P-INDEP
-; V23 tmp18 [V23,T10] ( 2,1027.39) ref -> rbx "field V04.value (fldOffset=0x8)" P-INDEP
-; V24 tmp19 [V24,T11] ( 2,1027.39) ref -> rax "field V17.key (fldOffset=0x0)" P-INDEP
-; V25 tmp20 [V25,T12] ( 2,1027.39) ref -> rdx "field V17.value (fldOffset=0x8)" P-INDEP
+; V23 tmp18 [V23,T07] ( 2,1027.45) ref -> rbx "field V04.value (fldOffset=0x8)" P-INDEP
+; V24 tmp19 [V24,T08] ( 2,1027.45) ref -> rax "field V17.key (fldOffset=0x0)" P-INDEP
+; V25 tmp20 [V25,T09] ( 2,1027.45) ref -> rdx "field V17.value (fldOffset=0x8)" P-INDEP
 ;* V26 tmp21 [V26 ] ( 0, 0 ) ref -> zero-ref single-def "V11.[000..008)"
-; V27 tmp22 [V27,T20] ( 2, 2 ) int -> rax single-def "V11.[008..012)"
+; V27 tmp22 [V27,T16] ( 2, 2 ) int -> rax single-def "V11.[008..012)"
 ;* V28 tmp23 [V28 ] ( 0, 0 ) int -> zero-ref single-def "V11.[012..016)"
 ;* V29 tmp24 [V29 ] ( 0, 0 ) int -> zero-ref single-def "V11.[016..020)"
-; V30 PSPSym [V30,T22] ( 1, 1 ) long -> [rbp-0x70] do-not-enreg[V] "PSPSym"
-;* V31 cse0 [V31,T15] ( 0, 0 ) long -> zero-ref "CSE #05: aggressive"
-; V32 cse1 [V32,T08] ( 4,1544.08) ref -> rax "CSE #02: aggressive"
-; V33 cse2 [V33,T09] ( 3,1543.09) int -> rdx "CSE #01: aggressive"
-;* V34 rat0 [V34,T02] ( 0, 0 ) long -> zero-ref "Spilling to split statement for tree"
-;* V35 rat1 [V35,T03] ( 0, 0 ) long -> zero-ref "runtime lookup"
-;* V36 rat2 [V36,T01] ( 0, 0 ) long -> zero-ref "fgMakeTemp is creating a new local variable"
-; V37 rat3 [V37,T17] ( 3, 4.40) long -> rdx "Spilling to split statement for tree"
-; V38 rat4 [V38,T16] ( 3, 5.60) long -> rax "fgMakeTemp is creating a new local variable"
-; V39 rat5 [V39,T23] ( 3, 0 ) long -> rdx "Spilling to split statement for tree"
-; V40 rat6 [V40,T24] ( 3, 0 ) long -> rax "fgMakeTemp is creating a new local variable"
+; V30 PSPSym [V30,T18] ( 1, 1 ) long -> [rbp-0x70] do-not-enreg[V] "PSPSym"
+; V31 cse0 [V31,T05] ( 4,1544.18) ref -> r8 "CSE #02: aggressive"
+; V32 cse1 [V32,T11] ( 5, 515.93) long -> rax multi-def "CSE #04: aggressive"
+; V33 cse2 [V33,T06] ( 3,1543.17) int -> rdx "CSE #01: aggressive"
+; V34 rat0 [V34,T13] ( 3, 5.60) long -> r8 "fgMakeTemp is creating a new local variable"
+; V35 rat1 [V35,T19] ( 3, 0 ) long -> rax "Spilling to split statement for tree"
+; V36 rat2 [V36,T20] ( 3, 0 ) long -> r8 "fgMakeTemp is creating a new local variable"
 ;
 ; Lcl frame size = 136
@@ -638,46 +634,48 @@ G_M26406_IG04:
        mov dword ptr [rbp-0x28], 2
        ;; size=19 bbWeight=1 PerfScore 4.25
 G_M26406_IG05:
+       mov rax, qword ptr [rcx]
        mov edx, dword ptr [rbp-0x30]
-       mov rax, gword ptr [rbp-0x38]
-       cmp edx, dword ptr [rax+0x44]
+       mov r8, gword ptr [rbp-0x38]
+       cmp edx, dword ptr [r8+0x44]
        jne SHORT G_M26406_IG11
        align [0 bytes for IG06]
-       ;; size=12 bbWeight=514.69 PerfScore 3088.16
+       ;; size=16 bbWeight=514.73 PerfScore 4117.81
 G_M26406_IG06:
        mov edx, dword ptr [rbp-0x2C]
-       mov rax, gword ptr [rbp-0x38]
-       cmp edx, dword ptr [rax+0x38]
+       mov r8, gword ptr [rbp-0x38]
+       cmp edx, dword ptr [r8+0x38]
        jae SHORT G_M26406_IG09
-       ;; size=12 bbWeight=514.69 PerfScore 3088.16
+       ;; size=13 bbWeight=514.73 PerfScore 3088.36
 G_M26406_IG07:
-       mov rax, gword ptr [rax+0x10]
-       lea r8d, [rdx+0x01]
-       mov dword ptr [rbp-0x2C], r8d
-       cmp edx, dword ptr [rax+0x08]
+       mov r8, gword ptr [r8+0x10]
+       lea r10d, [rdx+0x01]
+       mov dword ptr [rbp-0x2C], r10d
+       cmp edx, dword ptr [r8+0x08]
        jae SHORT G_M26406_IG10
        mov edx, edx
        lea rdx, [rdx+2*rdx]
-       lea rdx, bword ptr [rax+8*rdx+0x10]
+       lea rdx, bword ptr [r8+8*rdx+0x10]
        cmp dword ptr [rdx+0x14], -1
        jl SHORT G_M26406_IG06
-       ;; size=34 bbWeight=513.70 PerfScore 6806.49
+       ;; size=35 bbWeight=513.72 PerfScore 6806.82
 G_M26406_IG08:
        mov rax, gword ptr [rdx]
        mov rdx, gword ptr [rdx+0x08]
        mov gword ptr [rbp-0x20], rax
        mov gword ptr [rbp-0x18], rdx
        mov rbx, gword ptr [rbp-0x18]
+       mov rcx, gword ptr [rbp+0x10]
        jmp SHORT G_M26406_IG05
-       ;; size=21 bbWeight=513.69 PerfScore 4623.25
+       ;; size=25 bbWeight=513.73 PerfScore 5137.26
 G_M26406_IG09:
-       mov edx, dword ptr [rax+0x38]
+       mov edx, dword ptr [r8+0x38]
        inc edx
        mov dword ptr [rbp-0x2C], edx
        vxorps xmm0, xmm0, xmm0
        vmovdqu xmmword ptr [rbp-0x20], xmm0
        jmp SHORT G_M26406_IG12
-       ;; size=19 bbWeight=1.00 PerfScore 6.56
+       ;; size=20 bbWeight=1.00 PerfScore 6.61
 G_M26406_IG10:
        call CORINFO_HELP_RNGCHKFAIL
        int3
@@ -687,18 +685,17 @@ G_M26406_IG11:
        int3
        ;; size=7 bbWeight=0 PerfScore 0.00
 G_M26406_IG12:
-       mov rdx, qword ptr [rcx]
-       mov rax, qword ptr [rdx+0x30]
-       mov rax, qword ptr [rax]
-       mov rax, qword ptr [rax+0x30]
-       test rax, rax
+       mov rdx, qword ptr [rax+0x30]
+       mov rdx, qword ptr [rdx]
+       mov r8, qword ptr [rdx+0x30]
+       test r8, r8
        je SHORT G_M26406_IG15
-       ;; size=19 bbWeight=1.00 PerfScore 9.25
+       ;; size=16 bbWeight=1.00 PerfScore 7.25
 G_M26406_IG13:
        lea rcx, [rbp-0x38]
-       call rax
+       call r8
        mov rax, rbx
-       ;; size=9 bbWeight=1.00 PerfScore 3.75
+       ;; size=10 bbWeight=1.00 PerfScore 3.75
 G_M26406_IG14:
        vzeroupper
        add rsp, 136
@@ -707,11 +704,12 @@ G_M26406_IG14:
        ret
        ;; size=13 bbWeight=1.00 PerfScore 3.25
 G_M26406_IG15:
-       mov rcx, rdx
+       mov rcx, rax
        mov rdx, 0xD1FFAB1E ; global ptr
        call CORINFO_HELP_RUNTIMEHANDLE_CLASS
+       mov r8, rax
        jmp SHORT G_M26406_IG13
-       ;; size=20 bbWeight=0.20 PerfScore 0.70
+       ;; size=23 bbWeight=0.20 PerfScore 0.75
 G_M26406_IG16:
        push rbp
        push rbx
@@ -722,24 +720,25 @@ G_M26406_IG16:
        ;; size=22 bbWeight=0 PerfScore 0.00
 G_M26406_IG17:
        mov rcx, gword ptr [rbp+0x10]
-       mov rdx, qword ptr [rcx]
-       mov rax, qword ptr [rdx+0x30]
-       mov rax, qword ptr [rax]
-       mov rax, qword ptr [rax+0x30]
-       test rax, rax
+       mov rax, qword ptr [rcx]
+       mov rdx, qword ptr [rax+0x30]
+       mov rdx, qword ptr [rdx]
+       mov r8, qword ptr [rdx+0x30]
+       test r8, r8
        je SHORT G_M26406_IG18
        jmp SHORT G_M26406_IG19
        ;; size=25 bbWeight=0 PerfScore 0.00
 G_M26406_IG18:
-       mov rcx, rdx
+       mov rcx, rax
        mov rdx, 0xD1FFAB1E ; global ptr
        call CORINFO_HELP_RUNTIMEHANDLE_CLASS
-       ;; size=18 bbWeight=0 PerfScore 0.00
+       mov r8, rax
+       ;; size=21 bbWeight=0 PerfScore 0.00
 G_M26406_IG19:
        lea rcx, [rbp-0x38]
-       call rax
+       call r8
        nop
-       ;; size=7 bbWeight=0 PerfScore 0.00
+       ;; size=8 bbWeight=0 PerfScore 0.00
 G_M26406_IG20:
        vzeroupper
        add rsp, 40
@@ -748,6 +747,6 @@ G_M26406_IG20:
        ret
        ;; size=10 bbWeight=0 PerfScore 0.00
-; Total bytes of code 348, prolog size 48, PerfScore 17657.16, instruction count 101, allocated bytes for code 348 (MethodHash=a02998d9) for method System.Collections.IterateForEach`1[System.__Canon]:Dictionary():System.__Canon:this (Tier1)
+; Total bytes of code 364, prolog size 48, PerfScore 19199.44, instruction count 104, allocated bytes for code 364 (MethodHash=a02998d9) for method System.Collections.IterateForEach`1[System.__Canon]:Dictionary():System.__Canon:this (Tier1)
 ; ============================================================
```

This looks like we do a new CSE that leads to different register allocation. In the baseline we have

***** BB02 [0001]
STMT00045 ( ??? ... ??? )
N004 ( 20, 19) CSE #05 (def)[000017] --CXG------                         ▌  CALL help long   CORINFO_HELP_RUNTIMEHANDLE_CLASS $402
N002 (  3,  2) CSE #04 (def)[000015] #--X------- arg0 in rcx             ├──▌  IND       long   $400
N001 (  1,  1)              [000014] !----------                         │  └──▌  LCL_VAR   ref    V00 this         u:1 $100
N003 (  3, 10)              [000016] H------N--- arg1 in rdx             └──▌  CNS_INT(h) long   0x7ffe78ac4250 global ptr $43

while in the diff we have removed this call earlier, so we end up with

STMT00045 ( ??? ... ??? )
N002 (  3,  2) CSE #04 (def)[000015] #--X-------                         ▌  IND       long   $400
N001 (  1,  1)              [000014] !----------                         └──▌  LCL_VAR   ref    V00 this         u:1 $100

Then we see CSE kicking in for the diff, but not for the base:

-Considering CSE #04 {$143, $341} [def=51469.406974, use=100.000000, cost=  3, call]
-CSE Expression : 
-N002 (  3,  2) CSE #04 (def)[000015] #--X-------                         ▌  IND       long   $400
-N001 (  1,  1)              [000014] !----------                         └──▌  LCL_VAR   ref    V00 this         u:1 $100
-
-Aggressive CSE Promotion (103038.813948 >= 600.000000)
-cseRefCnt=103038.813948, aggressiveRefCnt=600.000000, moderateRefCnt=100.000000
-defCnt=51469.406974, useCnt=100.000000, cost=3, size=2, LiveAcrossCall
-def_cost=1, use_cost=1, extra_no_cost=2, extra_yes_cost=0
-CSE cost savings check (302.000000 >= 51569.406974) fails
-Did Not promote this CSE

+Considering CSE #04 {$143, $341} [def=51472.623061, use=51472.623061, cost=  3      ]
+CSE Expression : 
+N002 (  3,  2) CSE #04 (def)[000015] #--X-------                         ▌  IND       long   $400
+N001 (  1,  1)              [000014] !----------                         └──▌  LCL_VAR   ref    V00 this         u:1 $100
+
+Aggressive CSE Promotion (154417.869184 >= 600.000000)
+cseRefCnt=154417.869184, aggressiveRefCnt=600.000000, moderateRefCnt=100.000000
+defCnt=51472.623061, useCnt=51472.623061, cost=3, size=2
+def_cost=1, use_cost=1, extra_no_cost=4, extra_yes_cost=0
+CSE cost savings check (154421.869184 >= 102945.246123) passes
+
+Promoting CSE:
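
For readers following the numbers, here is a minimal Python sketch that reproduces the two "CSE cost savings check" lines from the logged inputs. The formula (left side `use * cost + extra_no_cost`, right side `def * def_cost + use * use_cost + extra_yes_cost`) is an assumption inferred only from the values printed in the log above, not taken from the JIT source, and the function name is hypothetical.

```python
def cse_cost_check(def_cnt, use_cnt, cost, def_cost, use_cost,
                   extra_no_cost, extra_yes_cost):
    """Return (no_cse_cost, yes_cse_cost, promote) for one CSE candidate.

    Assumed formula, reverse-engineered from the logged numbers only.
    """
    no_cse_cost = use_cnt * cost + extra_no_cost
    yes_cse_cost = def_cnt * def_cost + use_cnt * use_cost + extra_yes_cost
    return no_cse_cost, yes_cse_cost, no_cse_cost >= yes_cse_cost


# Inputs copied from the "Considering CSE #04" entries for base and diff.
baseline = cse_cost_check(51469.406974, 100.0, 3, 1, 1, 2, 0)
diff = cse_cost_check(51472.623061, 51472.623061, 3, 1, 1, 4, 0)

for name, (no_cost, yes_cost, promote) in (("baseline", baseline), ("diff", diff)):
    verdict = "passes" if promote else "fails"
    print(f"{name}: ({no_cost:.6f} >= {yes_cost:.6f}) {verdict}")
```

Under this reading, the only input that changes materially between base and diff is the use count (100 vs. about 51473), and that alone flips the check from failing to passing.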

These are the defs/uses of the CSE in the diff:

Labeling the CSEs with Use/Def information
BB02 [000015] Def of CSE #04 [weight=514.73]
BB07 [000028] Use of CSE #04 [weight=513.73]
BB09 [000223] Use of CSE #04 [weight=1.00]
BB10 [000052] Def of CSE #04 [weight=0]
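
As a quick cross-check, the per-block weights above can be summed per kind and compared with the `def=`/`use=` counts in the diff log. This is a small sketch; the scale factor of 100 is an assumption suggested by the numbers (514.73 vs. 51472.62), and `sum_weights` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Def/use labeling lines copied from the dump above.
LOG = """\
BB02 [000015] Def of CSE #04 [weight=514.73]
BB07 [000028] Use of CSE #04 [weight=513.73]
BB09 [000223] Use of CSE #04 [weight=1.00]
BB10 [000052] Def of CSE #04 [weight=0]
"""

LINE = re.compile(r"^BB\d+ \[\d+\] (Def|Use) of CSE #\d+ \[weight=([\d.]+)\]$")


def sum_weights(log, scale=100.0):
    """Sum block weights per kind ("Def"/"Use"), scaled by an assumed factor."""
    totals = defaultdict(float)
    for line in log.splitlines():
        match = LINE.match(line.strip())
        if match:
            kind, weight = match.group(1), float(match.group(2))
            totals[kind] += weight * scale
    return dict(totals)


totals = sum_weights(LOG)
print(totals)
```

Both totals come out at roughly 51473, which agrees with the logged `def=51472.623061, use=51472.623061` up to the rounding of the printed weights. Almost all of the use weight comes from the single use in BB07.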

It turns out that both the high-weight def and the high-weight use can later be optimized out (their values have no uses, and we end up proving they are non-faulting). That is what happens in the baseline. In the diff, however, since we introduced the CSE, the high-weight def ends up being live due to the low-weight use.

Not really sure what we can do here, if anything. This change is a perf fix around InlineArray, but it's quite unfortunate to take a perf regression for enumeration over dictionaries in shared generic methods as part of it... Any ideas @AndyAyersMS?

jakobbotsch commented 1 month ago

Perhaps one possibility could be to have CSE not count the weight of IND uses if the values of those uses are not used, essentially anticipating that not doing the CSE will lead to assertion prop removing that indirection anyway. Let me try that out...

AndyAyersMS commented 1 month ago

> Perhaps one possibility could be to have CSE not count the weight of IND uses if the values of those uses are not used, essentially anticipating that not doing the CSE will lead to assertion prop removing that indirection anyway. Let me try that out...

Seems plausible, though you probably want to count the weight of the GTF_MAKE_CSE instances (they will be unused).
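
To make the idea concrete, here is a rough Python sketch of how ignoring value-unused uses would change the outcome for CSE #04. Everything here is illustrative: the `value_used` flags are my reading of the analysis earlier in the thread (the BB02 def and BB07 use are the ones that can be optimized out), the cost formula and the factor of 100 are inferred from the logs, `extra_no_cost=4` is copied from the diff log, and the GTF_MAKE_CSE adjustment mentioned above is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Occurrence:
    block: str
    kind: str          # "def" or "use"
    weight: float      # block weight as printed in the dump
    value_used: bool   # assumption: whether anything consumes the loaded value


# Weights from the def/use dump; value_used flags are assumptions.
OCCURRENCES = [
    Occurrence("BB02", "def", 514.73, False),
    Occurrence("BB07", "use", 513.73, False),
    Occurrence("BB09", "use", 1.00, True),
    Occurrence("BB10", "def", 0.0, True),
]


def would_promote(occurrences, skip_unused_uses, cost=3, def_cost=1,
                  use_cost=1, extra_no_cost=4, extra_yes_cost=0, scale=100.0):
    """Apply the cost check inferred from the log, optionally ignoring unused uses."""
    def_cnt = sum(o.weight for o in occurrences if o.kind == "def") * scale
    use_cnt = sum(
        o.weight for o in occurrences
        if o.kind == "use" and (o.value_used or not skip_unused_uses)
    ) * scale
    no_cse_cost = use_cnt * cost + extra_no_cost
    yes_cse_cost = def_cnt * def_cost + use_cnt * use_cost + extra_yes_cost
    return no_cse_cost >= yes_cse_cost


current = would_promote(OCCURRENCES, skip_unused_uses=False)
proposed = would_promote(OCCURRENCES, skip_unused_uses=True)
print(f"current heuristic promotes: {current}")
print(f"ignoring unused uses promotes: {proposed}")
```

With the unused BB07 use excluded, the use count drops to about 100, the same value the baseline log shows, and the check fails as it did in the baseline.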

jakobbotsch commented 1 month ago

That change has a bit larger diffs than I'd like to take at this stage of .NET 9. And on perfscore it seems regressions outweigh improvements.

I also tried another change where we directly bash the uses to nops as part of availability, but it has basically the same diffs.

I think we should just accept this regression. I would prefer that we keep #106185 in since it is fixing questionable logic. The way it leads to the regression here is certainly not because the baseline was actively trying to do something clever; rather, it just happened to avoid this CSE by chance. I will move this to 10.0, since we can perhaps experiment there with my change to account for the unused nodes in the CSE heuristics.

EgorBo commented 3 weeks ago

@EgorBot -intel -commit 15e96faf4558b017ea8df1dc28d9b2169f0badc0 vs previous --filter System.Collections.IterateForEach.Dictionary

EgorBot commented 3 weeks ago

Benchmark results on Intel

```
BenchmarkDotNet v0.13.13-nightly.20240311.145, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores
  Job-CJBWBN : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-QYGCQS : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1
```

| Method     | Toolchain | Size | Mean     | Error     | Ratio | Allocated | Alloc Ratio |
|----------- |---------- |----- |---------:|----------:|------:|----------:|------------:|
| Dictionary | Before    | 512  | 1.242 μs | 0.0006 μs |  1.00 |         - |          NA |
| Dictionary | After     | 512  | 1.140 μs | 0.0009 μs |  0.92 |         - |          NA |

[BDN_Artifacts.zip](https://telegafiles.blob.core.windows.net/telega/BDN_Artifacts_ebc4ec2a.zip)