Closed adamsitnik closed 2 years ago
Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.
Author: | adamsitnik |
---|---|
Assignees: | - |
Labels: | `area-System.Collections`, `tenet-performance` |
Milestone: | - |
https://github.com/dotnet/runtime/pull/59287 is locked so doesn't get cross linked. That seems unfortunate.
That change should have purely impacted jit diagnostics, so it's unlikely to have caused regressions.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.
Author: | adamsitnik |
---|---|
Assignees: | - |
Labels: | `tenet-performance`, `area-CodeGen-coreclr` |
Milestone: | 7.0.0 |
Digging through, it looks like we expected this to be resolved -- see https://github.com/dotnet/perf-autofiling-issues/issues/1501#issuecomment-926027832
But that only fixed the issue on Windows; Ubuntu did not benefit. So we still have a regression.
(Windows is slightly worse off too.)
Looks like this is still unassigned. I'll take it for now.
Can reproduce running locally (via wsl2)
BenchmarkDotNet=v0.13.1.1823-nightly, OS=ubuntu 20.04
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22408.1
[Host] : .NET 7.0.0 (7.0.22.40308), X64 RyuJIT
Job-CFAJOE : .NET 5.0.1 (5.0.120.57516), X64 RyuJIT
Job-JPHJBC : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
Job-KPSCOL : .NET 7.0.0 (7.0.22.40308), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 InvocationCount=5000 IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 MinWarmupIterationCount=6
UnrollFactor=1 WarmupCount=-1
Method | Job | Runtime | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Gen 1 | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LinqQuery | Job-CFAJOE | .NET 5.0 | net5.0 | 512 | 56.80 us | 0.729 us | 0.682 us | 56.79 us | 55.69 us | 58.34 us | 1.00 | 0.00 | 5.4000 | 0.4000 | 34.33 KB | 1.00 |
LinqQuery | Job-JPHJBC | .NET 6.0 | net6.0 | 512 | 58.25 us | 0.707 us | 0.662 us | 57.97 us | 57.41 us | 59.44 us | 1.03 | 0.02 | 5.4000 | 0.4000 | 34.33 KB | 1.00 |
LinqQuery | Job-KPSCOL | .NET 7.0 | net7.0 | 512 | 72.44 us | 1.321 us | 1.235 us | 72.28 us | 70.45 us | 74.83 us | 1.28 | 0.03 | 5.6000 | 0.6000 | 34.33 KB | 1.00 |
@adamsitnik is it expected that with `-p EP` I won't get cpu sample events? If so, any way to enable these via the command line?
Hmm, I guess there are sample events but not ones that perfview recognizes?
From the above I can get a crude profile of sorts. But not sure it is helping me spot which method(s) have regressed.
> I guess there are sample events but not ones that perfview recognizes?

In the case of EventPipe we just get different CPU samples (events emitted by the .NET Runtime, not the OS). In PerfView you need to open the "Thread Time" view (not "CPU Stacks" like usual).
Or you can take the `.speedscope` file generated by BDN:
Exported 1 trace file(s). Example:
D:\projects\performance\artifacts\bin\MicroBenchmarks\Release\net7.0\BenchmarkDotNet.Artifacts\System.Collections.Sort_BigStruct_.LinqQuery(Size_ 512)-20220809-091754.speedscope.json
and open it with speedscope
Still didn't find that very helpful. But here's perf (via WSL2) on the two:
If this is credible then the issue is in this bit of code.
;; 6.0
; Assembly listing for method GenericComparer`1:Compare(BigStruct,BigStruct):int:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; Tier-1 compilation
; optimized code
; rbp based frame
; partially interruptible
; No PGO data
; 1 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;* V00 this [V00 ] ( 0, 0 ) ref -> zero-ref this class-hnd single-def
; V01 arg1 [V01,T03] ( 2, 1.36) struct (32) [rbp+10H] do-not-enreg[SF] ld-addr-op single-def
; V02 arg2 [V02,T04] ( 1, 1 ) struct (32) [rbp+30H] do-not-enreg[SB] single-def
;# V03 OutArgs [V03 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
; V04 tmp1 [V04,T01] ( 2, 4 ) struct (32) [rbp-20H] do-not-enreg[SFB] "Inlining Arg"
; V05 tmp2 [V05,T02] ( 4, 1.50) int -> rax "Inline return value spill temp"
; V06 tmp3 [V06,T00] ( 3, 4.71) int -> rax "Inlining Arg"
;
; Lcl frame size = 32
G_M25642_IG01: ;; offset=0000H
55 push rbp
4883EC20 sub rsp, 32
C5F877 vzeroupper
488D6C2420 lea rbp, [rsp+20H]
;; bbWeight=1 PerfScore 2.75
G_M25642_IG02: ;; offset=000DH
C5FA6F4530 vmovdqu xmm0, xmmword ptr [rbp+30H]
C5FA7F45E0 vmovdqu xmmword ptr [rbp-20H], xmm0
C5FA6F4540 vmovdqu xmm0, xmmword ptr [rbp+40H]
C5FA7F45F0 vmovdqu xmmword ptr [rbp-10H], xmm0
8B45EC mov eax, dword ptr [rbp-14H]
39451C cmp dword ptr [rbp+1CH], eax
7C14 jl SHORT G_M25642_IG07
;; bbWeight=1 PerfScore 7.00
G_M25642_IG03: ;; offset=0029H
39451C cmp dword ptr [rbp+1CH], eax
7F08 jg SHORT G_M25642_IG06
;; bbWeight=0.36 PerfScore 0.71
G_M25642_IG04: ;; offset=002EH
33C0 xor eax, eax
;; bbWeight=0.26 PerfScore 0.06
G_M25642_IG05: ;; offset=0030H
4883C420 add rsp, 32
5D pop rbp
C3 ret
;; bbWeight=1 PerfScore 1.75
G_M25642_IG06: ;; offset=0036H
B801000000 mov eax, 1
EBF3 jmp SHORT G_M25642_IG05
;; bbWeight=0.10 PerfScore 0.22
G_M25642_IG07: ;; offset=003DH
B8FFFFFFFF mov eax, -1
EBEC jmp SHORT G_M25642_IG05
;; bbWeight=0.14 PerfScore 0.32
versus
;; 7.0
; Assembly listing for method GenericComparer`1:Compare(BigStruct,BigStruct):int:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; Tier-1 compilation
; optimized code
; rbp based frame
; partially interruptible
; No PGO data
; 1 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;* V00 this [V00 ] ( 0, 0 ) ref -> zero-ref this class-hnd single-def
; V01 arg1 [V01,T03] ( 2, 1.35) struct (32) [rbp+10H] do-not-enreg[SF] ld-addr-op single-def
; V02 arg2 [V02,T04] ( 1, 1 ) struct (32) [rbp+30H] do-not-enreg[S] single-def
;# V03 OutArgs [V03 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
; V04 tmp1 [V04,T01] ( 2, 4 ) struct (32) [rbp-20H] do-not-enreg[SF] "Inlining Arg"
; V05 tmp2 [V05,T02] ( 4, 1.50) int -> rax "Inline return value spill temp"
; V06 tmp3 [V06,T00] ( 3, 4.70) int -> rax "Inlining Arg"
;
; Lcl frame size = 32
G_M25642_IG01: ;; offset=0000H
55 push rbp
4883EC20 sub rsp, 32
C5F877 vzeroupper
488D6C2420 lea rbp, [rsp+20H]
;; size=13 bbWeight=1 PerfScore 2.75
G_M25642_IG02: ;; offset=000DH
C5FE6F4530 vmovdqu ymm0, ymmword ptr[rbp+30H]
C5FE7F45E0 vmovdqu ymmword ptr[rbp-20H], ymm0
8B45EC mov eax, dword ptr [rbp-14H]
39451C cmp dword ptr [rbp+1CH], eax
7C17 jl SHORT G_M25642_IG07
;; size=18 bbWeight=1 PerfScore 9.00
G_M25642_IG03: ;; offset=001FH
39451C cmp dword ptr [rbp+1CH], eax
7F0B jg SHORT G_M25642_IG06
;; size=5 bbWeight=0.35 PerfScore 1.06
G_M25642_IG04: ;; offset=0024H
33C0 xor eax, eax
;; size=2 bbWeight=0.25 PerfScore 0.06
G_M25642_IG05: ;; offset=0026H
C5F877 vzeroupper
4883C420 add rsp, 32
5D pop rbp
C3 ret
;; size=9 bbWeight=1 PerfScore 2.75
G_M25642_IG06: ;; offset=002FH
B801000000 mov eax, 1
EBF0 jmp SHORT G_M25642_IG05
;; size=7 bbWeight=0.10 PerfScore 0.22
G_M25642_IG07: ;; offset=0036H
B8FFFFFFFF mov eax, -1
EBE9 jmp SHORT G_M25642_IG05
;; size=7 bbWeight=0.15 PerfScore 0.33
Note that with AVX/AVX2 disabled, 6.0 and 7.0 perform about the same (and both match 6.0 with AVX enabled).
BenchmarkDotNet=v0.13.1.1823-nightly, OS=ubuntu 20.04
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22408.1
[Host] : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
Job-KAQRRV : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
Job-SXOIEW : .NET 7.0.0 (7.0.22.40308), X64 RyuJIT
EnvironmentVariables=COMPlus_EnableAVX2=0,COMPlus_EnableAVX=0 PowerPlanMode=00000000-0000-0000-0000-000000000000 InvocationCount=5000 IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 MinWarmupIterationCount=6
UnrollFactor=1 WarmupCount=-1
Method | Job | Runtime | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | Gen 0 | Gen 1 | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LinqQuery | Job-KAQRRV | .NET 6.0 | net6.0 | 512 | 55.18 us | 0.762 us | 0.675 us | 55.05 us | 54.16 us | 56.73 us | 1.00 | 5.4000 | 0.4000 | 34.33 KB | 1.00 |
LinqQuery | Job-SXOIEW | .NET 7.0 | net7.0 | 512 | 57.40 us | 0.461 us | 0.409 us | 57.28 us | 56.93 us | 58.04 us | 1.04 | 5.6000 | 0.6000 | 34.33 KB | 1.00 |
Going to modify the jit so I can do this per-method and see if just disabling AVX for the comparer explains the perf loss.
Looks like the regression comes from the use of YMM registers in the two hottest methods above:
``System.Linq.EnumerableSorter`2[BigStruct,BigStruct][System.Collections.BigStruct,System.Collections.BigStruct]:CompareAnyKeys(int,int)``
``System.Collections.Generic.GenericComparer`1[BigStruct][System.Collections.BigStruct]::Compare``
In both cases there is a YMM store closely followed by a narrower load:
;; Compare
C5FE7F45E0 vmovdqu ymmword ptr[rbp-20H], ymm0
8B45EC mov eax, dword ptr [rbp-14H]
;; CompareAnyKeys
C5FE7F45C8 vmovdqu ymmword ptr[rbp-38H], ymm0
C5FA6F45C8 vmovdqu xmm0, qword ptr [rbp-38H]
On Windows, there is similar codegen in `Compare` but not in `CompareAnyKeys` -- the latter because of ABI differences.
;; (windows) Compare
C5FE7F442408 vmovdqu ymmword ptr[rsp+08H], ymm0
8B442414 mov eax, dword ptr [rsp+14H]
Despite this, perf on Windows generally seems better (around 53us). Note the store above is misaligned (as is the store in Linux's `CompareAnyKeys`), if that matters.
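As a rough aid for reading the snippets above, the following Python sketch checks whether each narrower load falls entirely inside the bytes written by the preceding YMM store, and at what offset. The `Access` type and `offset_within` helper are hypothetical names made up for this illustration. The displacements and widths are read off the disassembly, with one assumption: the `CompareAnyKeys` load is taken as 16 bytes (the XMM register width) even though the listing prints `qword ptr`. The sketch only does address arithmetic; it makes no claim about how any particular CPU handles such a store/load pair.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Access:
    """A memory access relative to a frame register (hypothetical helper type)."""
    base: str   # frame register name, e.g. "rbp" or "rsp"
    disp: int   # signed displacement from the base register
    size: int   # access width in bytes


def offset_within(store: Access, load: Access) -> Optional[int]:
    """Return the load's byte offset inside the store if fully contained, else None."""
    if store.base != load.base:
        return None
    start = load.disp - store.disp
    if start >= 0 and start + load.size <= store.size:
        return start
    return None


# Store/load pairs taken from the listings above.
pairs = {
    # vmovdqu ymmword ptr [rbp-20H], ymm0 ; mov eax, dword ptr [rbp-14H]
    "Compare (linux)": (Access("rbp", -0x20, 32), Access("rbp", -0x14, 4)),
    # vmovdqu ymmword ptr [rbp-38H], ymm0 ; vmovdqu xmm0, [rbp-38H] (assumed 16 bytes)
    "CompareAnyKeys (linux)": (Access("rbp", -0x38, 32), Access("rbp", -0x38, 16)),
    # vmovdqu ymmword ptr [rsp+08H], ymm0 ; mov eax, dword ptr [rsp+14H]
    "Compare (windows)": (Access("rsp", 0x08, 32), Access("rsp", 0x14, 4)),
}

for name, (store, load) in pairs.items():
    print(name, "-> load offset inside store:", offset_within(store, load))
```

In both `Compare` variants the 4-byte load lands at offset 12 of the 32-byte store, which lines up with the `LCL_FLD int V04 tmp1 [+12]` node in the JIT dump further down.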
Also note that in `Compare` the struct copy is really not needed. Seems like forward sub (or morph's copy prop) should get this case, but neither one sees the use:
;; tmp1 is single use
***** BB03
STMT00003 ( 0x010[E-] ... ??? )
[000027] -A--------- * ASG struct (copy)
[000025] D------N--- +--* LCL_VAR struct<System.Collections.BigStruct, 32> V04 tmp1
[000013] n---------- \--* OBJ struct<System.Collections.BigStruct, 32>
[000012] ----------- \--* ADDR byref
[000010] -------N--- \--* LCL_VAR struct<System.Collections.BigStruct, 32> V02 arg2
***** BB03
STMT00009 ( INL01 @ 0x000[E-] ... ??? ) <- INLRT @ 0x010[E-]
[000058] -A--------- * ASG int
[000057] D------N--- +--* LCL_VAR int V06 tmp3
[000022] ----------- \--* FIELD int _int1
[000021] ----------- \--* ADDR byref
[000020] -------N--- \--* LCL_VAR struct<System.Collections.BigStruct, 32> V04 tmp1
;; fwd sub
[000027]: no next stmt use
;; morph
In BB01 New Local Copy Assertion: V04 == V02, index = #01
fgMorphTree BB01, STMT00009 (before)
[000058] -A--------- * ASG int
[000057] D------N--- +--* LCL_VAR int V06 tmp3
[000022] ----------- \--* LCL_FLD int V04 tmp1 [+12]
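To make the "copy is not needed" point concrete, here is a small Python `ctypes` sketch of the shape of the comparison. The struct layout is hypothetical: only the 32-byte size and the `_int1` field at offset 12 come from the dumps above, and the `_head` / `_tail` padding fields and both function names are invented for illustration. The sketch only shows that reading `_int1` through a 32-byte temporary gives the same answer as reading it from the argument directly; it does not model the JIT's actual transformations.

```python
import ctypes


class BigStruct(ctypes.Structure):
    # Hypothetical layout: only sizeof == 32 and _int1 at +12 are taken from the dumps.
    _fields_ = [
        ("_head", ctypes.c_uint8 * 12),   # placeholder for whatever precedes _int1
        ("_int1", ctypes.c_int32),        # the only field the comparison reads
        ("_tail", ctypes.c_uint8 * 16),   # placeholder for the remaining bytes
    ]


def compare_via_copy(x: BigStruct, y: BigStruct) -> int:
    """Mirror the emitted code: copy all 32 bytes of y to a temp, then read one int."""
    tmp = BigStruct.from_buffer_copy(y)   # analogous to tmp1 = arg2
    return (x._int1 > tmp._int1) - (x._int1 < tmp._int1)


def compare_direct(x: BigStruct, y: BigStruct) -> int:
    """Same comparison, reading the field straight from y with no temporary."""
    return (x._int1 > y._int1) - (x._int1 < y._int1)


print("sizeof(BigStruct) =", ctypes.sizeof(BigStruct))
print("offset of _int1   =", BigStruct._int1.offset)

a, b = BigStruct(_int1=3), BigStruct(_int1=7)
print(compare_via_copy(a, b), compare_direct(a, b))
```

Since `tmp1` has a single use that only reads one field, forwarding that read to `arg2` would remove both the wide store and the narrow load that follows it.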
Verified this is mitigated with the preliminary changes from #73719.
This is beyond the scope of what we can fix for .net7, so I think we're going to have to live with this regression.
Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Gen 1 | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LinqQuery | Job-XBERYB | .net7 | 512 | 70.09 us | 0.513 us | 0.455 us | 70.11 us | 69.06 us | 70.89 us | 1.21 | 0.02 | 5.6000 | 0.6000 | 34.33 KB | 1.00 |
LinqQuery | Job-WMUOPH | #73719 | 512 | 56.01 us | 0.644 us | 0.602 us | 55.81 us | 55.26 us | 57.21 us | 0.97 | 0.02 | 5.6000 | 0.6000 | 34.33 KB | 1.00 |
LinqQuery | Job-NGYOUF | .net6 | 512 | 57.87 us | 1.093 us | 1.023 us | 57.66 us | 56.53 us | 59.62 us | 1.00 | 0.00 | 5.4000 | 0.4000 | 34.33 KB | 1.00 |
This should be fixed by https://github.com/dotnet/runtime/pull/74384.
(ubuntu x64)
This regression seems to affect all configs except Windows 64-bit.
Repro:
Ubuntu Historical results
The diff points to https://github.com/dotnet/runtime/pull/55604 (cc @alexcovington) and https://github.com/dotnet/runtime/pull/59287 (cc @AndyAyersMS)
Windows Historical results