Closed performanceautofiler[bot] closed 2 years ago
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to this area: @JulieLeeMSFT See info in area-owners.md if you want to be subscribed.
Author: | performanceautofiler[bot] |
---|---|
Assignees: | - |
Labels: | `tenet-performance`, `tenet-performance-benchmarks`, `area-CodeGen-coreclr`, `refs/heads/main`, `x64`, `ubuntu 18.04`, `Regression`, `RunKind=micro`, `CoreClr` |
Milestone: | - |
A few detail charts from this issue:
And subset of similar charts from the other regressions tagged here.
https://github.com/dotnet/perf-autofiling-issues/issues/3354 (ubuntu x64)
https://github.com/dotnet/perf-autofiling-issues/issues/3364 (windows x64; 7 tests)
https://github.com/dotnet/perf-autofiling-issues/issues/3378 (windows x86; 11tests)
https://github.com/dotnet/perf-autofiling-issues/issues/3379 (windows x86; 3 tests)
https://github.com/dotnet/perf-autofiling-issues/issues/3392 (windows x64)
https://github.com/dotnet/perf-autofiling-issues/issues/3406 (windows x64; 2 tests; full pgo)
Started looking at some of the windows x64 issues.
For System.IO.Tests.Perf_FileStream.SeekForward
one oddity is that it runs much faster when run as admin then under my user account. Profiling the "fast" admin variant shows a lot of time spent in the OS, I assume this overhead is even is worse for user mode.
@DrewScoggins do we run the perf tests on Windows as admin?
Local runs (as admin) don't show the regression Here after is 3fc8b099443a8b84bd158b8e49e38be6dd941180 and before is 0c3a6ee42276d9ba68954eb00f55c46ddf27fb39 (just before forward sub).
Method | Toolchain | fileSize | options | Mean | Error | StdDev | Median |
---|---|---|---|---|---|---|---|
SeekForward | before\corerun.exe | 1024 | None | 31.08 us | 0.407 us | 0.381 us | 31.03 us |
SeekForward | after\corerun.exe | 1024 | None | 31.22 us | 0.384 us | 0.359 us | 31.21 us |
SeekForward | before\corerun.exe | 1024 | Asynchronous | 35.27 us | 0.469 us | 0.438 us | 35.26 us |
SeekForward | after\corerun.exe | 1024 | Asynchronous | 34.98 us | 0.348 us | 0.308 us | 35.06 us |
Running under my user account times are more like 55 us.
A 15%ish regression like the perf lab sees would mean a huge regression in some of the jitted methods (given jitted code is running for perhaps 30% of the overall benchmark interval). This kind of regression seems unlikely.
A wider view of the perf history shows this benchmark seems to be bimodal, but with long stretches of stability.
Similar things seem to apply for the other IO benchmarks reported above.
The x86 regression from https://github.com/dotnet/perf-autofiling-issues/issues/3378 in System.Tests.Perf_Random.NextBytes_unseeded
repros locally:
Method | Toolchain | Mean | Error | StdDev | Median |
---|---|---|---|---|---|
NextBytes_unseeded | before\corerun.exe | 1.023 us | 0.0084 us | 0.0070 us | 1.025 us |
NextBytes_unseeded | after\corerun.exe | 1.260 us | 0.0024 us | 0.0020 us | 1.259 us |
Looks like all the time here is in one method:
Diff is an extra spilled CSE. Kind of pointless to CSE a load and then spill the value just loaded and reload the spill, of the original locationt hasn't been modified.
Note the other CSEs are also spilled. Would guess we are somewhat too aggressive with CSEs on x86, where there are so few allocatable registers.
;; BEFORE
; V26 cse0 [V26,T14] ( 3, 1.50) int -> [ebp-2CH] spill-single-def "CSE - conservative"
; V27 cse1 [V27,T06] ( 4, 6.50) int -> [ebp-30H] spill-single-def "CSE - aggressive"
837D0C04 cmp dword ptr [ebp+0CH], 4
0F82B8000000 jb G_M41150_IG10
8B4508 mov eax, bword ptr [ebp+08H]
8945CC mov bword ptr [ebp-34H], eax
8B450C mov eax, dword ptr [ebp+0CH]
;; AFTER
; V26 cse0 [V26,T13] ( 3, 1.50) int -> [ebp-2CH] spill-single-def "CSE - moderate"
; V27 cse1 [V27,T06] ( 4, 6.50) int -> [ebp-30H] spill-single-def "CSE - aggressive"
; V28 cse2 [V28,T14] ( 3, 1.50) int -> [ebp-34H] spill-single-def "CSE - moderate"
8B450C mov eax, dword ptr [ebp+0CH]
8945CC mov dword ptr [ebp-34H], eax
83F804 cmp eax, 4
0F82B3000000 jb G_M41150_IG10
8B4508 mov eax, bword ptr [ebp+08H]
8945C8 mov bword ptr [ebp-38H], eax
8B45CC mov eax, dword ptr [ebp-34H]
Also somewhat oddly the checked runtime runs this test about 25% faster than release does. Guessing some of the checked corelib goo actually blocks us from these CSEs. Will look when I have time.
The diff here is just one fwd sub; V14
is removed in the after picture:
;; BEFORE
***** BB02
STMT00046 ( ??? ... ??? )
[000218] -A---+------ * ASG int
[000217] D----+-N---- +--* LCL_VAR int V14 tmp3
[000038] -----+------ \--* MUL int
[000210] -----+------ +--* ROL int
[000198] -----+------ | +--* LCL_VAR int V13 tmp2
[000201] -----+------ | \--* CNS_INT int 7
[000037] -----+------ \--* CNS_INT int 9
***** BB02
STMT00045 ( INL03 @ 0x000[E-] ... ??? ) <- INLRT @ ???
[000216] -A-XG+------ * ASG int
[000215] U--X-+-N---- +--* IND int
[000035] -----+------ | \--* LCL_VAR byref V21 tmp10
[000214] -----+------ \--* LCL_VAR int V14 tmp3
;; AFTER
***** BB02
STMT00045 ( INL03 @ 0x000[E-] ... ??? ) <- INLRT @ ???
[000216] -A-XG+------ * ASG int
[000215] U--X-+-N---- +--* IND int
[000035] -----+------ | \--* LCL_VAR byref V21 tmp10
[000038] -----+------ \--* MUL int
[000210] -----+------ +--* ROL int
[000198] -----+------ | +--* LCL_VAR int V13 tmp2
[000201] -----+------ | \--* CNS_INT int 7
[000037] -----+------ \--* CNS_INT int 9
This ends up making CSE more aggressive, because it reduces the CSE-computed enregCount
and hence reduces the moderate promotion cutoff:
;; BEFORE
Aggressive CSE Promotion cutoff is 800.000000
Moderate CSE Promotion cutoff is 200.000000
enregCount is 20
Framesize estimate is 0x002C
We have a small frame
Considering CSE #01 {$21a, $3 } [def=50.000000, use=50.000000, cost= 3 ]
CSE Expression :
N001 ( 3, 4) CSE #01 (def)[000236] ------------ * LCL_FLD int V01 arg1 u:3[+4] Fseq[_length] $21a
Conservative CSE Promotion (not enregisterable) (150.000000 < 200.000000) ;; * should print "enregisterable"
cseRefCnt=150.000000, aggressiveRefCnt=800.000000, moderateRefCnt=200.000000
defCnt=50.000000, useCnt=50.000000, cost=3, size=4
def_cost=2, use_cost=2, extra_no_cost=4, extra_yes_cost=0
CSE cost savings check (154.000000 >= 200.000000) fails
;; AFTER
Aggressive CSE Promotion cutoff is 800.000000
Moderate CSE Promotion cutoff is 150.000000
enregCount is 19
Framesize estimate is 0x0028
We have a small frame
Moderate CSE Promotion (CSE never live at call) (150.000000 >= 150.000000)
cseRefCnt=150.000000, aggressiveRefCnt=800.000000, moderateRefCnt=150.000000
defCnt=50.000000, useCnt=50.000000, cost=3, size=4
def_cost=2, use_cost=1, extra_no_cost=6, extra_yes_cost=0
CSE cost savings check (156.000000 >= 150.000000) passes
Promoting CSE:
lvaGrabTemp returning 28 (V28 rat0) (a long lifetime temp) called for CSE - moderate.
CSE #01 is single-def, so associated CSE temp V28 will be in SSA
New refCnts for V28: refCnt = 2, refCntWtd = 1
New refCnts for V28: refCnt = 3, refCntWtd = 1.50
In reality nothing has changed, and we should not have altered our CSE choices.
So arguably this is a blemish in the CSE accounting; one can't simply add up the total number of tracked vars and get anything that's meaningful as far as register pressure goes -- you need to at least make some accounting for the patterns of def and use and interference.
Not clear there's any easy fix here. Possible remediations:
The other interesting x86 regression from https://github.com/dotnet/perf-autofiling-issues/issues/3378 is System.Tests.Perf_DateTimeOffset.op_Subtraction
This also repros locally
Method | Toolchain | Mean | Error | StdDev | Median |
---|---|---|---|---|---|
op_Subtraction | after\corerun.exe | 4.207 ns | 0.0813 ns | 0.0679 ns | 4.236 ns |
op_Subtraction | before\corerun.exe | 3.348 ns | 0.0487 ns | 0.0432 ns | 3.341 ns |
This test spends most of its time in the test method
Looking at the codegen:
; Perf_DateTimeOffset:op_Subtraction():TimeSpan:this
;; BEFORE
8D74241C lea esi, bword ptr [esp+1CH]
8B3E mov edi, dword ptr [esi]
8B7604 mov esi, dword ptr [esi+4]
897C2408 mov dword ptr [esp+08H], edi
8974240C mov dword ptr [esp+0CH], esi
8B742408 mov esi, dword ptr [esp+08H]
8B7C240C mov edi, dword ptr [esp+0CH]
83E6FF and esi, -1
81E7FFFFFF3F and edi, 0x3FFFFFFF
83CE00 or esi, 0
81CF00000040 or edi, 0x40000000
83E0FF and eax, -1
81E1FFFFFF3F and ecx, 0x3FFFFFFF
;; AFTER
89442418 mov dword ptr [esp+18H], eax
894C241C mov dword ptr [esp+1CH], ecx
8D44242C lea eax, bword ptr [esp+2CH]
8B08 mov ecx, dword ptr [eax]
8B4004 mov eax, dword ptr [eax+4]
894C2410 mov dword ptr [esp+10H], ecx
89442414 mov dword ptr [esp+14H], eax
8B442410 mov eax, dword ptr [esp+10H]
8B4C2414 mov ecx, dword ptr [esp+14H]
83E0FF and eax, -1
81E1FFFFFF3F and ecx, 0x3FFFFFFF
83C800 or eax, 0
81C900000040 or ecx, 0x40000000
89442408 mov dword ptr [esp+08H], eax
894C240C mov dword ptr [esp+0CH], ecx
8B442418 mov eax, dword ptr [esp+18H]
8B4C241C mov ecx, dword ptr [esp+1CH]
83E0FF and eax, -1
81E1FFFFFF3F and ecx, 0x3FFFFFFF
8B742408 mov esi, dword ptr [esp+08H]
8B7C240C mov edi, dword ptr [esp+0CH]
Suspect forward sub is leading to sub-optimal long decomposition.
Finding it hard to keep track of what's been reported and what's been investigated, so here's a scoreboard of sorts.
We have not seen the arm64 autofiling results yet, so expect this table to grow.
From @MattGal
linux/mac it's a user that has password-less sudo permissions but is not inherently root On Windows, UAC is turned down to the point where no dialogs can pop and the user is in the admin group
Looks like forward sub alters behavior of System.Collections.Sort<IntStruct>.List(Size: 512)
-- it was stable and is now bimodal; the bimodal "best" is an improvement, the "worst" a regression.
The first group of tests on the scoreboard (ubuntu x64) all look like real regressions, so will prioritize looking at those over the x86 regressions.
I'm not able to repro these on my venerable i7-2720QM (non-AVX2) -- generally the forward sub versions run faster.
System.Memory.Span
Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|
IndexOfValue | Job-SZAEDM | after/corerun | 512 | 17.01 ns | 0.044 ns | 0.039 ns | 17.00 ns | 16.98 ns | 17.11 ns | 1.00 | - | NA |
IndexOfValue | Job-YTWKTE | before/corerun | 512 | 22.01 ns | 0.049 ns | 0.043 ns | 22.00 ns | 21.95 ns | 22.09 ns | 1.29 | - | NA |
System.Memory.Span
Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|
StartsWith | Job-CFHQCN | after/corerun | 512 | 18.67 ns | 0.042 ns | 0.037 ns | 18.66 ns | 18.62 ns | 18.74 ns | 1.00 | - | NA |
StartsWith | Job-WCMVZL | before/corerun | 512 | 19.49 ns | 0.042 ns | 0.037 ns | 19.48 ns | 19.43 ns | 19.57 ns | 1.04 | - | NA |
System.Buffers.Text.Tests.Utf8ParserTests.TryParseDateTimeOffset
Method | Job | Toolchain | value | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|
TryParseDateTimeOffset | Job-WQABYK | after/corerun | 12/30/2017 3:45:22 AM -08:00 | 15.94 ns | 0.053 ns | 0.050 ns | 15.91 ns | 15.87 ns | 16.03 ns | 1.00 | - | NA |
TryParseDateTimeOffset | Job-WGPIQB | before/corerun | 12/30/2017 3:45:22 AM -08:00 | 15.58 ns | 0.083 ns | 0.070 ns | 15.55 ns | 15.51 ns | 15.71 ns | 0.98 | - | NA |
System.Collections.AddGivenSize
Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Dictionary | Job-LALXSC | after/corerun | 512 | 19.80 us | 0.312 us | 0.292 us | 19.67 us | 19.53 us | 20.30 us | 1.00 | 0.00 | 4.6209 | 14.38 KB | 1.00 |
Dictionary | Job-IGUESQ | before/corerun | 512 | 20.23 us | 0.082 us | 0.077 us | 20.22 us | 20.13 us | 20.35 us | 1.02 | 0.02 | 4.6206 | 14.38 KB | 1.00 |
System.Text.Json.Tests.Perf_Basic.WriteBasicUtf16(Formatted: False, SkipValidation: False, DataSize: 100000)
Method | Job | Toolchain | Formatted | SkipValidation | DataSize | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
WriteBasicUtf16 | Job-XSSZUU | after/corerun | False | False | 100000 | 2,777,998.8 ns | 7,072.00 ns | 6,269.14 ns | 2,775,991.4 ns | 2,769,001.1 ns | 2,791,127.0 ns | 1.00 | 0.00 | - | 129 B | 1.00 |
WriteBasicUtf16 | Job-XFHXFF | before/corerun | False | False | 100000 | 2,829,192.7 ns | 8,179.81 ns | 7,651.40 ns | 2,830,676.5 ns | 2,820,123.0 ns | 2,844,626.1 ns | 1.02 | 0.00 | - | 129 B | 1.00 |
System.Collections.CreateAddAndClear
Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | Gen 0 | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
IDictionary | Job-AOAJQO | after/corerun | 512 | 29.98 us | 0.085 us | 0.071 us | 29.97 us | 29.90 us | 30.11 us | 1.00 | 15.2059 | 46.97 KB | 1.00 |
IDictionary | Job-NLSTVE | before/corerun | 512 | 30.09 us | 0.200 us | 0.167 us | 30.06 us | 29.93 us | 30.50 us | 1.00 | 15.2672 | 46.97 KB | 1.00 |
Will look around for some more modern HW to use.
Now using WSL2 on an Core i7-8700. I can repro the WriteBasicUtf16
slowdown:
Method | Toolchain | Formatted | SkipValidation | DataSize | Mean | Median | Ratio |
---|---|---|---|---|---|---|---|
WriteBasicUtf16 | after/corerun | False | False | 100000 | 1.854 ms | 1.853 ms | 1.07 |
WriteBasicUtf16 | before/corerun | False | False | 100000 | 1.731 ms | 1.727 ms | 1.00 |
Profiling shows this should be dominated by the codegen in Utf8JsonWriter.WriteNumberValueMinimized(long)
but this method has just one minimal diff on ubuntu x64:
;; BEFORE
G_M57270_IG02: ;; offset=0019H
488D7B38 lea rdi, bword ptr [rbx+56]
8B7F0C mov edi, dword ptr [rdi+12]
;; AFTER
G_M57270_IG02: ;; offset=0019H
8B7B44 mov edi, dword ptr [rbx+68]
Nominally this diff is an improvement, however it has the unfortunate effect of mis-aligning the first two branches in this method, so that they are impacted by the JCC erratum:
;; BEFORE
G_M57270_IG02: ;; offset=0019H
488D7B38 lea rdi, bword ptr [rbx+56]
8B7F0C mov edi, dword ptr [rdi+12]
; ............................... 32B boundary ...............................
2B7B2C sub edi, dword ptr [rbx+44]
83FF15 cmp edi, 21
7D0D jge SHORT G_M57270_IG04
;; bbWeight=1 PerfScore 6.75
G_M57270_IG03: ;; offset=0028H
488BFB mov rdi, rbx
BE15000000 mov esi, 21
E84B5AFFFF call Utf8JsonWriter:Grow(int):this
;; bbWeight=0.50 PerfScore 0.75
G_M57270_IG04: ;; offset=0035H
4C8D7B38 lea r15, bword ptr [rbx+56]
4533E4 xor r12d, r12d
4533ED xor r13d, r13d
498B3F mov rdi, gword ptr [r15]
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mov: 2) 32B boundary ...............................
4885FF test rdi, rdi
743D je SHORT G_M57270_IG07
;; AFTER
G_M57270_IG02: ;; offset=0019H
8B7B44 mov edi, dword ptr [rbx+68]
2B7B2C sub edi, dword ptr [rbx+44]
83FF15 cmp edi, 21
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (cmp: 2 ; jcc erratum) 32B boundary ...............................
7D0D jge SHORT G_M57270_IG04
;; bbWeight=1 PerfScore 6.25
G_M57270_IG03: ;; offset=0024H
488BFB mov rdi, rbx
BE15000000 mov esi, 21
E8475BFFFF call Utf8JsonWriter:Grow(int):this
;; bbWeight=0.50 PerfScore 0.75
G_M57270_IG04: ;; offset=0031H
4C8D7B38 lea r15, bword ptr [rbx+56]
4533E4 xor r12d, r12d
4533ED xor r13d, r13d
498B3F mov rdi, gword ptr [r15]
4885FF test rdi, rdi
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (test: 1 ; jcc erratum) 32B boundary ...............................
743D je SHORT G_M57270_IG07
;; bbWeight=1 PerfScore 4.25
G_M57270_IG05: ;; offset=0043H
and my best guess is that this is what leads to the regression. Note the very similar code path on Windows did not regress, and I did not observe this regression (with likely very similar code; haven't verified) on older hardware.
Also note lower in this method there is a loop that we don't detect and don't align (either before or after) perhaps because it is profile cold:
;; AFTER
G_M57270_IG24: ;; offset=01FAH
488D4A30 lea rcx, [rdx+48]
49B9CDCCCCCCCCCCCCCC mov r9, 0xCCCCCCCCCCCCCCCD
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mov: 8) 32B boundary ...............................
488BC2 mov rax, rdx
49F7E1 mul rdx:rax, r9
48C1EA03 shr rdx, 3
3BFE cmp edi, esi
0F8320030000 jae G_M57270_IG59
8BC7 mov eax, edi
4C8D0C92 lea r9, [rdx+4*rdx]
; ............................... 32B boundary ...............................
4D03C9 add r9, r9
492BC9 sub rcx, r9
41880C00 mov byte ptr [r8+rax], cl
FFCF dec edi
85FF test edi, edi
7FCA jg SHORT G_M57270_IG24
;; bbWeight=0 PerfScore 0.00
(4/29) this test is still showing regression
System.Memory.Span<Byte>.IndexOfValue(Size: 512)
has partially recovered some of the losses.
I don't see any codegen diffs in the key method SpanHelpers:IndexOfValueType(byref,int,int):int
. So perhaps data alignment?
(4/29) has improved a bit more.
I get erratic results locally for System.Memory.Span<Int32>.StartsWith
-- here the key method is SpanHelpers:SequenceEqual
. There are "good" diffs from forward sub that lead to shifting where the JCC errata hits.... suspect this is what leads to regressions. Not sure why performance fluctuates locally, though.
;; Assembly listing for method SpanHelpers:SequenceEqual(byref,byref,long):bool
;; BEFORE
G_M37173_IG12: ;; offset=0082H
C5FD100407 vmovupd ymm0, ymmword ptr[rdi+rax]
C5FD100C06 vmovupd ymm1, ymmword ptr[rsi+rax]
C5FD74C1 vpcmpeqb ymm0, ymm0, ymm1
C5FDD7C8 vpmovmskb ecx, ymm0
83F9FF cmp ecx, -1
0F85A2000000 jne G_M37173_IG18
4883C020 add rax, 32
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (add: 1) 32B boundary ...............................
483BD0 cmp rdx, rax
77DC ja SHORT G_M37173_IG12
;; AFTER
G_M37173_IG12: ;; offset=0082H
C5FD100407 vmovupd ymm0, ymmword ptr[rdi+rax]
C5FD740406 vpcmpeqb ymm0, ymm0, ymmword ptr[rsi+rax]
C5FDD7C8 vpmovmskb ecx, ymm0
83F9FF cmp ecx, -1
0F8593000000 jne G_M37173_IG18
4883C020 add rax, 32
483BD0 cmp rdx, rax
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (cmp: 0 ; jcc erratum) 32B boundary ...............................
77E0 ja SHORT G_M37173_IG12
Perf lab shows some fluctuations too:
(4/29) perf is back to where it was
I suspect System.Buffers.Text.Tests.Utf8ParserTests.TryParseDateTimeOffset(value: 12/30/2017 3:45:22 AM -08:00)
does not run properly on WSL2 as the numbers I get are 15x higher than the lab (and don't show any regressions)
(update 4/29) this seems to have recovered all this missing perf
Given the small number of residual regressions (just one on x64 that I can see) and the large number of improvements, I'm going to close this one.
Improvements:
Status
See scoreboard (below) for details on all the regressions reported here and in linked issues.
Run Information
Regressions in System.Memory.Span<Byte>
Test Report
Repro
Run Information
Regressions in System.Memory.Span<Int32>
Test Report
Repro
Run Information
Regressions in System.Buffers.Text.Tests.Utf8ParserTests
Test Report
Repro
Run Information
Regressions in System.Collections.AddGivenSize<String>
Test Report
Repro
Run Information
Regressions in System.Globalization.Tests.StringSearch
Test Report
Repro
Run Information
Regressions in System.Text.Json.Tests.Perf_Basic
Test Report
Repro
Run Information
Regressions in System.Collections.CreateAddAndClear<String>
Test Report
Repro