Closed AndyAyersMS closed 1 year ago
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
Author: | AndyAyersMS |
---|---|
Assignees: | AndyAyersMS |
Labels: | `area-CodeGen-coreclr` |
Milestone: | - |
These benchmarks were analyzed before PGO was enabled: https://github.com/dotnet/runtime/issues/84264#issuecomment-1521994085
BDN's strategy doesn't run the benchmark enough, because each iteration is long running, and so (since the key benchmark methods are R2R'd) the test ends up measuring tier1-instr code.
Likely the analysis from https://github.com/dotnet/runtime/issues/84264#issuecomment-1501985127 is still relevant and explains the related test regressions as well: we run out of inlining budget, in part because the benchmark method is small and there are quite a few large aggressive inline methods, and so we're unable to do some key inlines.
Tracking issue for this is https://github.com/dotnet/runtime/issues/85531.
Some of these improved with https://github.com/dotnet/runtime/pull/86551.
This regresses across the board, so not sure why we don't have more autofiling for it.
I can't repro on win-x64
Method | Job | Toolchain | value | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TryParseHex | Job-EEXSZV | \base\corerun.exe | 0 | 7.860 ns | 0.1366 ns | 0.1573 ns | 7.814 ns | 7.640 ns | 8.281 ns | 1.00 | 0.00 | - | NA |
TryParseHex | Job-EZIDSX | \diff\corerun.exe | 0 | 7.834 ns | 0.1461 ns | 0.1683 ns | 7.783 ns | 7.634 ns | 8.344 ns | 1.00 | 0.02 | - | NA |
But I can (perhaps) on win-arm64 (volterra)
Method | Job | Toolchain | value | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TryParseHex | Job-DFURZJ | \base\corerun.exe | 0 | 3.265 ns | 0.0267 ns | 0.0237 ns | 3.257 ns | 3.235 ns | 3.316 ns | 1.00 | 0.00 | - | NA |
TryParseHex | Job-URZCKJ | \diff\corerun.exe | 0 | 3.876 ns | 0.0899 ns | 0.0751 ns | 3.849 ns | 3.794 ns | 4.034 ns | 1.19 | 0.02 | - | NA |
base
00.81% 4.1E+05 ? Unknown
48.82% 2.458E+07 Tier-1 [491fed18-8a14-46f6-b30c-6320dc770919]Runnable_0.WorkloadActionUnroll(int64)
23.52% 1.184E+07 Tier-1 [System.Private.CoreLib]Number.TryParseBinaryIntegerHexOrBinaryNumberStyle(value class System.ReadOnlySpan`1<!!0>,value class System.Globalization.NumberStyles,!!1&)
12.69% 6.39E+06 Tier-1 [MicroBenchmarks]Perf_UInt32.TryParseHex(class System.String)
08.92% 4.49E+06 Tier-1 [System.Private.CoreLib]NumberFormatInfo.<GetInstance>g__GetProviderNonNull|58_0(class System.IFormatProvider)
04.37% 2.2E+06 Tier-1 [System.Private.CoreLib]CastHelpers.IsInstanceOfClass(void*,class System.Object)
diff
50.62% 2.666E+07 Tier-1 [d735f3a6-719a-4885-9631-16a7aeff132c]Runnable_0.WorkloadActionUnroll(int64)
23.94% 1.261E+07 Tier-1 [System.Private.CoreLib]Number.TryParseBinaryIntegerHexOrBinaryNumberStyle(value class System.ReadOnlySpan`1<!!0>,value class System.Globalization.NumberStyles,!!1&)
10.73% 5.65E+06 Tier-1 [MicroBenchmarks]Perf_UInt32.TryParseHex(class System.String)
08.62% 4.54E+06 Tier-1 [System.Private.CoreLib]NumberFormatInfo.<GetInstance>g__GetProviderNonNull|58_0(class System.IFormatProvider)
04.08% 2.15E+06 Tier-1 [System.Private.CoreLib]CastHelpers.IsInstanceOfClass(void*,class System.Object)
Note the very high overhead. `WorkloadActionUnroll` should have the `AggressiveOptimization` attribute so its codegen does not vary.
PGO codegen for `TryParseBinaryIntegerHexOrBinaryNumberStyle` looks good: all the hot code is adjacent and the method is straight-line code. So it is not clear why it is ~7% or so slower.
Path lengths are similar but PGO does one extra STP/LDP and a few more register arg moves. So perhaps that's the explanation?
These two appear to be arm64 specific.
Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
EndsWith | Job-INARIQ | \base-rel\corerun.exe | 4 | 2.712 ns | 0.0209 ns | 0.0163 ns | 2.712 ns | 2.680 ns | 2.734 ns | 1.00 | 0.00 | - | NA |
EndsWith | Job-YXKXJS | \diff-rel\corerun.exe | 4 | 3.357 ns | 0.2569 ns | 0.2959 ns | 3.175 ns | 3.079 ns | 3.928 ns | 1.23 | 0.10 | - | NA |
base
01.52% 5.7E+05 ? Unknown
35.26% 1.319E+07 Tier-1 [18850c97-dca6-4aad-8393-0715e0b95a4a]Runnable_0.WorkloadActionUnroll(int64)
34.51% 1.291E+07 Tier-1 [MicroBenchmarks]System.Memory.Span`1[System.Int32].EndsWith()
27.72% 1.037E+07 Tier-1 [System.Private.CoreLib]SpanHelpers.SequenceEqual(unsigned int8&,unsigned int8&,unsigned int)
diff
01.29% 5.1E+05 ? Unknown
35.20% 1.392E+07 Tier-1 [MicroBenchmarks]System.Memory.Span`1[System.Int32].EndsWith()
31.40% 1.242E+07 Tier-1 [72707307-0596-4e84-8515-1a5c07b80d3a]Runnable_0.WorkloadActionUnroll(int64)
31.10% 1.23E+07 Tier-1 [System.Private.CoreLib]SpanHelpers.SequenceEqual(unsigned int8&,unsigned int8&,unsigned int)
So the issue appears to be in `SequenceEqual`?
Seems like PGO and non-PGO codegen is the same. Method is R2R'd and with PGO we do a tier1 instr, but do not add any probes. Explanation: this method is an intrinsic and not on the whitelist.
Fixing that (hack) gives:
Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|
EndsWith | Job-FBJVVQ | \base-rel\corerun.exe | 4 | 2.732 ns | 0.0330 ns | 0.0275 ns | 2.729 ns | 2.692 ns | 2.793 ns | 1.00 | - | NA |
EndsWith | Job-UKVWAD | \diff-rel\corerun.exe | 4 | 3.146 ns | 0.0472 ns | 0.0419 ns | 3.153 ns | 3.084 ns | 3.208 ns | 1.15 | - | NA |
EndsWith | Job-BJUBBV | \hack-rel\corerun.exe | 4 | 2.660 ns | 0.0364 ns | 0.0322 ns | 2.659 ns | 2.610 ns | 2.731 ns | 0.97 | - | NA |
and more broadly
Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
EndsWith | Job-SDVJLL | \base-rel\corerun.exe | 4 | 2.685 ns | 0.0479 ns | 0.0448 ns | 2.679 ns | 2.627 ns | 2.766 ns | 1.00 | 0.00 | - | NA |
EndsWith | Job-FTSDAY | \diff-rel\corerun.exe | 4 | 3.151 ns | 0.0178 ns | 0.0158 ns | 3.151 ns | 3.126 ns | 3.181 ns | 1.17 | 0.02 | - | NA |
EndsWith | Job-EGIDWC | \hack-rel\corerun.exe | 4 | 3.125 ns | 0.0394 ns | 0.0368 ns | 3.125 ns | 3.080 ns | 3.180 ns | 1.16 | 0.02 | - | NA |
EndsWith | Job-SDVJLL | \base-rel\corerun.exe | 33 | 4.739 ns | 0.0558 ns | 0.0495 ns | 4.728 ns | 4.686 ns | 4.840 ns | 1.00 | 0.00 | - | NA |
EndsWith | Job-FTSDAY | \diff-rel\corerun.exe | 33 | 5.322 ns | 0.0576 ns | 0.0539 ns | 5.299 ns | 5.257 ns | 5.446 ns | 1.12 | 0.01 | - | NA |
EndsWith | Job-EGIDWC | \hack-rel\corerun.exe | 33 | 5.369 ns | 0.0265 ns | 0.0235 ns | 5.369 ns | 5.333 ns | 5.419 ns | 1.13 | 0.01 | - | NA |
EndsWith | Job-SDVJLL | \base-rel\corerun.exe | 512 | 32.020 ns | 0.0209 ns | 0.0175 ns | 32.023 ns | 31.999 ns | 32.059 ns | 1.00 | 0.00 | - | NA |
EndsWith | Job-FTSDAY | \diff-rel\corerun.exe | 512 | 32.422 ns | 0.0351 ns | 0.0328 ns | 32.417 ns | 32.383 ns | 32.495 ns | 1.01 | 0.00 | - | NA |
EndsWith | Job-EGIDWC | \hack-rel\corerun.exe | 512 | 31.819 ns | 0.0370 ns | 0.0346 ns | 31.815 ns | 31.776 ns | 31.893 ns | 0.99 | 0.00 | - | NA |
But does not explain why there is a PGO regression. And as you can see the Size=4 results are not very stable.
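To put a number on that instability, here is a small sketch that recomputes the Size=4 ratios for the two runs above from the reported means. The `ratio` helper and the dictionary layout are illustrative only, and the ratio is approximated as mean divided by baseline mean, which may not be exactly how BenchmarkDotNet derives the column.

```python
# Means (ns) for EndsWith, Size=4, copied from the two tables above.
# Each run has its own baseline, so ratios are computed per run.
runs = {
    "first run":  {"base": 2.732, "diff": 3.146, "hack": 2.660},
    "second run": {"base": 2.685, "diff": 3.151, "hack": 3.125},
}

def ratio(mean_ns, base_ns):
    """Approximate ratio: mean divided by the baseline mean of the same run."""
    return round(mean_ns / base_ns, 2)

ratios = {
    name: {toolchain: ratio(mean_ns, means["base"]) for toolchain, mean_ns in means.items()}
    for name, means in runs.items()
}

for name, per_toolchain in ratios.items():
    print(name, per_toolchain)
```

The diff build stays near 1.15 to 1.17 in both runs, while the hack build moves from 0.97 to 1.16, so a single Size=4 run is not enough to judge the hack.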
Similar diffs for `EndsWith` (better layout, slightly higher prolog/epilog costs).
These spiked up but then recovered and match their longer-term behavior
Fixed by physical promotion
This one is more substantially regressed on amd64 HW...
This doesn't repro on my local Zen3 box
BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22000.2176/21H2/SunValley) AMD Ryzen 7 5800H with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores .NET SDK 8.0.100-preview.6.23330.14 [Host] : .NET 6.0.20 (6.0.2023.32017), X64 RyuJIT AVX2 Job-EGKMFB : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2 Job-KOAKXG : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | Gen0 | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|
InvMt | Job-EGKMFB | \base-rel\corerun.exe | 1.540 ms | 0.0146 ms | 0.0137 ms | 1.540 ms | 1.521 ms | 1.568 ms | 1.00 | 12.5000 | 105.07 KB | 1.00 |
InvMt | Job-KOAKXG | \diff-rel\corerun.exe | 1.513 ms | 0.0077 ms | 0.0072 ms | 1.512 ms | 1.504 ms | 1.527 ms | 0.98 | 12.5000 | 105.07 KB | 1.00 |
Perf lab is running Ryzen 7 3700 PRO.
Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|
NextSingle | Job-IRALDH | \base-rel\corerun.exe | 4.697 ns | 0.0866 ns | 0.0810 ns | 4.696 ns | 4.528 ns | 4.796 ns | 1.00 | 0.00 | - | NA |
NextSingle | Job-VNWIBR | \diff-rel\corerun.exe | 8.382 ns | 0.1170 ns | 0.1094 ns | 8.394 ns | 8.229 ns | 8.574 ns | 1.79 | 0.04 | - | NA |
Profiling
base
00.95% 4.8E+05 ? Unknown
36.70% 1.859E+07 Tier-1 [System.Private.CoreLib]Random+CompatPrng.InternalSample()
34.94% 1.77E+07 Tier-1 [System.Private.CoreLib]Random+Net5CompatSeedImpl.NextSingle()
12.38% 6.27E+06 Tier-1 [System.Private.CoreLib]Random.NextSingle()
07.54% 3.82E+06 Tier-1 [cb5bf117-1cbd-438a-bbce-a71239c42b3d]Runnable_0.WorkloadActionUnroll(int64)
06.85% 3.47E+06 Tier-1 [MicroBenchmarks]Perf_Random.NextSingle()
00.26% 1.3E+05 native clrjit.dll
00.12% 6E+04 native coreclr.dll
00.12% 6E+04 native ntoskrnl.exe
00.10% 5E+04 native ntdll.dll
diff
02.33% 2.07E+06 ? Unknown
79.57% 7.061E+07 Tier-1 [System.Private.CoreLib]Random+Net5CompatSeedImpl.NextSingle()
14.97% 1.328E+07 Tier-1 [MicroBenchmarks]Perf_Random.NextSingle()
02.56% 2.27E+06 Tier-1 [c48f5bab-b8a5-4a49-b7a3-55a58e477b40]Runnable_0.WorkloadActionUnroll(int64)
00.21% 1.9E+05 native clrjit.dll
00.17% 1.5E+05 native coreclr.dll
00.14% 1.2E+05 native ntoskrnl.exe
Looks like with PGO we inline InternalSample, and this hurts perf. Why?
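As a rough cross-check, the sketch below adds up the exclusive samples of the two methods that are merged by the inline and compares them with the single merged frame in diff. The sample counts are copied from the profiles above; counts from separate runs are not strictly comparable, so treat the result as indicative only.

```python
# Exclusive sample counts copied from the profiles above.
base = {
    "CompatPrng.InternalSample": 1.859e7,
    "Net5CompatSeedImpl.NextSingle": 1.77e7,
}
diff = {
    # InternalSample no longer appears as a separate frame in diff.
    "Net5CompatSeedImpl.NextSingle": 7.061e7,
}

base_total = sum(base.values())
diff_total = sum(diff.values())
growth = diff_total / base_total

print(f"base: {base_total:.3e} samples")
print(f"diff: {diff_total:.3e} samples")
print(f"growth: {growth:.2f}x")
```

The merged method collects close to twice the samples of the two separate methods, which points in the same direction as the 1.79 ratio in the benchmark table.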
Root cause is lack of if conversion in InternalSample, once PGO has inlined it into Random+Net5CompatSeedImpl.NextSingle.
(with DOTNET_JitDoIfConversion=0)
Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|
NextSingle | Job-UYNTQT | \base\corerun.exe | 10.451 ns | 0.1370 ns | 0.1578 ns | 10.424 ns | 10.252 ns | 10.94 ns | 1.00 | 0.00 | - | NA |
NextSingle | Job-FNVAZN | \diff\corerun.exe | 9.709 ns | 0.3340 ns | 0.3846 ns | 9.751 ns | 8.154 ns | 10.10 ns | 0.93 | 0.04 | - | NA |
@jakobbotsch here's an example where not if converting in loops is painful. PGO undoes the improvements that came in with https://github.com/dotnet/runtime/pull/81267.
ARM64 does not seem to be affected for some reason.
https://github.com/dotnet/runtime/issues/79101 may show a similar problem
Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|
FirstSingleSegment | Job-MECUZL | \base-rel\corerun.exe | 3.166 ns | 0.0275 ns | 0.0317 ns | 3.156 ns | 3.125 ns | 3.248 ns | 1.00 | 0.00 | - | NA |
FirstSingleSegment | Job-IYGZNT | \diff-rel\corerun.exe | 5.107 ns | 0.0487 ns | 0.0560 ns | 5.089 ns | 5.040 ns | 5.237 ns | 1.61 | 0.03 | - | NA |
base
00.57% 4.51E+06 ? Unknown
64.12% 5.094E+08 Tier-1 [MicroBenchmarks]System.Buffers.Tests.ReadOnlySequenceTests`1[System.Char].First(value class System.Buffers.ReadOnlySequence`1<!0>)
16.92% 1.344E+08 Tier-1 [System.Private.CoreLib]CastHelpers.ChkCastClassSpecial(void*,class System.Object)
09.23% 7.33E+07 Tier-1 [MicroBenchmarks]System.Buffers.Tests.ReadOnlySequenceTests`1[System.Char].FirstSingleSegment()
05.37% 4.269E+07 Tier-1 [System.Memory]System.Buffers.ReadOnlySequence`1[System.Char]..ctor(class System.Buffers.ReadOnlySequenceSegment`1<!0>,int32,class System.Buffers.ReadOnlySequenceSegment`1<!0>,int32)
02.28% 1.81E+07 Tier-1 [3e900f5f-7929-4719-a376-9aa860256700]Runnable_0.WorkloadActionUnroll(int64)
01.44% 1.142E+07 native coreclr.dll
diff
00.45% 5.24E+06 ? Unknown
57.87% 6.67E+08 Tier-1 [MicroBenchmarks]System.Buffers.Tests.ReadOnlySequenceTests`1[System.Char].First(value class System.Buffers.ReadOnlySequence`1<!0>)
29.19% 3.364E+08 Tier-1 [System.Private.CoreLib]CastHelpers.ChkCastClassSpecial(void*,class System.Object)
06.48% 7.467E+07 Tier-1 [MicroBenchmarks]System.Buffers.Tests.ReadOnlySequenceTests`1[System.Char].FirstSingleSegment()
03.18% 3.665E+07 Tier-1 [System.Memory]System.Buffers.ReadOnlySequence`1[System.Char]..ctor(class System.Buffers.ReadOnlySequenceSegment`1<!0>,int32,class System.Buffers.ReadOnlySequenceSegment`1<!0>,int32)
01.84% 2.116E+07 Tier-1 [60fe30c0-c2f0-4348-9bd8-9b632dabcde3]Runnable_0.WorkloadActionUnroll(int64)
00.91% 1.049E+07 native coreclr.dll
Same set of inlines, similar optimizations.
Main delta is code layout, in both First and in ChkCastClassSpecial. Not clear why this causes such a big perf diff as the profile data should be accurate.
The issue is that by default we don't profile casts, and without profile data, we assume casts will fall back to the helper 25% of the time. This leads the jit to move all the cast calls to the end of the method, and so each call site must jump to the call and then jump back into the regular flow:
;; base
;; size=41 bbWeight=0.50 PerfScore 7.12
G_M11072_IG110: ;; offset=0C55H
mov rdx, r8
call [CORINFO_HELP_CHKCASTCLASS_SPECIAL]
;; size=9 bbWeight=0.12 PerfScore 0.41
G_M11072_IG111: ;; offset=0C5EH
mov rdx, gword ptr [rax+18H]
;; diff
jne SHORT G_M11072_IG52
;; size=45 bbWeight=0.50 PerfScore 7.12
G_M11072_IG49: ;; offset=0863H
mov rdx, gword ptr [rax+18H]
...
G_M11072_IG52: ;; offset=08B8H
mov rdx, r8
call [CORINFO_HELP_CHKCASTCLASS_SPECIAL]
jmp SHORT G_M11072_IG49
The bbWeight of IG52 comes from QMARK expansion; we assume a 25% chance?
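The arithmetic behind that guess can be sketched as follows. This is not the JIT's code: `helper_block_weight` and the constant name are hypothetical, and the only inputs are the 25% figure and the bbWeight values printed in the listing above.

```python
# Assumed likelihood that a cast without profile data falls back to the helper
# (the 25% figure mentioned above).
HELPER_FALLBACK_LIKELIHOOD = 0.25

def helper_block_weight(pred_weight, likelihood=HELPER_FALLBACK_LIKELIHOOD):
    """Weight assigned to the helper-call block reached from a block of weight pred_weight."""
    return pred_weight * likelihood

# The block ahead of the call in the listing above has bbWeight=0.50.
weight = helper_block_weight(0.50)
print(f"helper call block weight: {weight}")
```

That gives 0.125, which lines up with the bbWeight=0.12 printed for the call block in the base listing, assuming the listing shortens the value to two decimals.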
With `DOTNET_JitProfileCasts=1`:
Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|
FirstSingleSegment | Job-BCGWCL | \base\corerun.exe | 3.217 ns | 0.0506 ns | 0.0583 ns | 3.245 ns | 3.130 ns | 3.276 ns | 1.00 | - | NA |
FirstSingleSegment | Job-BXIQEJ | \diff\corerun.exe | 2.705 ns | 0.0115 ns | 0.0133 ns | 2.704 ns | 2.669 ns | 2.736 ns | 0.84 | - | NA |
@EgorBo where did we end up in the evaluation of cast profiling?
Looks like this one is bimodal, especially on amd64
This one is pretty stable except on linux x64:
BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22621.1992/22H2/2022Update/SunValley2) Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores .NET SDK 8.0.100-preview.4.23260.5 [Host] : .NET 6.0.20 (6.0.2023.32017), X64 RyuJIT AVX2 Job-AJUYHP : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2 Job-IYXLDU : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Burgers_1 | Job-PTZPWJ | /base-rel/corerun | 181.0 ms | 1.61 ms | 1.85 ms | 180.9 ms | 178.4 ms | 187.2 ms | 1.00 | 0.00 | 157.03 KB | 1.00 |
Burgers_1 | Job-ZGFEKO | /diff-rel/corerun | 185.4 ms | 2.27 ms | 2.62 ms | 184.8 ms | 181.5 ms | 191.4 ms | 1.02 | 0.02 | 157.03 KB | 1.00 |
Can't repro this one locally. I think I have the same HW as the lab (i7-8700) but not sure.
Also does not repro on my old Sandy Bridge
BenchmarkDotNet=v0.13.2.2052-nightly, OS=ubuntu 20.04 Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge), 1 CPU, 8 logical and 4 physical cores .NET SDK=8.0.100-preview.6.23330.14 [Host] : .NET 7.0.9 (7.0.923.32018), X64 RyuJIT AVX Job-CWOHEC : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX Job-ERFHVR : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|
Burgers_1 | Job-CWOHEC | /base-rel/corerun | 703.3 ms | 1.05 ms | 0.93 ms | 703.4 ms | 701.8 ms | 704.9 ms | 1.00 | 157.03 KB | 1.00 |
Burgers_1 | Job-ERFHVR | /diff-rel/corerun | 704.8 ms | 2.09 ms | 1.74 ms | 704.2 ms | 703.1 ms | 709.1 ms | 1.00 | 157.03 KB | 1.00 |
> @EgorBo where did we end up in the evaluation of cast profiling?
I think it should be good to enable it for complex types of casts, but not in .NET 8.0 (afair, we don't profile all types of casts + never checked actual codegen size overhead from it)
Seems like it is bimodal and just happened to flip around the time we enabled Dynamic PGO.
Similar for Max
Recent regressions are https://github.com/dotnet/runtime/issues/88482
This is not a pure regression but rather an increase in variance (ala https://github.com/dotnet/runtime/issues/87324), with lower lows and higher highs -- but note this is only the case for windows x64 intel (and arm64); Linux is relatively stable and faster with PGO, and windows on amd64 seems ok as well.
Generally looks to have recovered.
Regression is linux-x64 only; win-x64 benefits from PGO, arm64 is more or less unchanged.
I can repro the regression on WSL2, however there aren't great profiling tools available there.
Method | Job | Toolchain | TestCase | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|
EnumerateUsingIndexer | Job-OAGCJY | /base-rel/corerun | ArrayOfNumbers | 984.7 ns | 7.21 ns | 6.02 ns | 985.8 ns | 976.5 ns | 992.7 ns | 1.00 | - | NA |
EnumerateUsingIndexer | Job-AJUODA | /diff-rel/corerun | ArrayOfNumbers | 1,123.7 ns | 7.15 ns | 6.69 ns | 1,121.5 ns | 1,111.4 ns | 1,134.2 ns | 1.14 | - | NA |
Nominal profile (from win-x64) is
74.36% 3.889E+07 Tier-1 [System.Text.Json]JsonDocument.GetArrayIndexElement(int32,int32)
18.53% 9.69E+06 Tier-1 [MicroBenchmarks]Perf_EnumerateArray.EnumerateUsingIndexer()
06.08% 3.18E+06 native coreclr.dll
00.34% 1.8E+05 native clrjit.dll
00.19% 1E+05 Tier-1 [System.Text.Json]JsonDocument.GetArrayLength(int32)
Using BDN's `-p EP`, it appears similar for the run on WSL2 and suggests the regression is in `GetArrayIndexElement`. Codegen for this method differs (mainly streamlined layout w/PGO). The PGO code looks better to me. Without more details it is hard to figure out where else to look.
`EnumerateUsingIndexer` inlines `GetArrayLength` with PGO, but the loop invoking `GetArrayIndexElement` is identical and looks like it iterates ~300? times.
Actually, looking at the profile data, it seems wrong. A bit of digging turned up a bug in the 32-bit counter helper on Linux, see #89340.
Fixing that doesn't alter the codegen, it just makes the loop seem a bit less hot (but still very hot).
Seems to be linux ampere only? Other platforms improved w/ PGO.
The recent regression is https://github.com/dotnet/runtime/issues/89259.
Also seems to be linux ampere only?
Yet another one that happens on linux ampere only
There's perhaps a small chance these will improve with https://github.com/dotnet/runtime/pull/89350 -- there's been one data point so far that is not encouraging but too soon to tell.
Bimodal
Was curious if I could track down why this is bimodal. I have "good"/"bad" profiles and they're similar but in the bad case we're spending more time in the key method:
;; good
98.95% 5.279E+07 Tier-1 [System.Private.CoreLib]SpanHelpers.IndexOf(wchar&,int32,wchar&,int32)
00.28% 1.5E+05 native clrjit.dll
00.15% 8E+04 Tier-1 [System.Text.RegularExpressions]Regex.RunSingleMatch(value class System.Text.RegularExpressions.RegexRunnerMode,int32,class System.String,int32,int32,int32)
00.15% 8E+04 native coreclr.dll
00.09% 5E+04 Tier-1 [651af330-5a3e-4220-ba8d-07c806fb3049]Runnable_0.WorkloadActionUnroll(int64)
00.09% 5E+04 Tier-1 [MicroBenchmarks]Perf_Regex_Industry_RustLang_Sherlock.Count()
00.09% 5E+04 Tier-1 [MicroBenchmarks]Perf_Regex_Industry.Count(class System.Text.RegularExpressions.Regex,class System.String)
00.07% 4E+04 native ntoskrnl.exe
;; bad
99.22% 6.976E+07 Tier-1 [System.Private.CoreLib]SpanHelpers.IndexOf(wchar&,int32,wchar&,int32)
00.24% 1.7E+05 native clrjit.dll
00.18% 1.3E+05 native coreclr.dll
00.11% 8E+04 Tier-1 [System.Text.RegularExpressions]Regex.RunSingleMatch(value class System.Text.RegularExpressions.RegexRunnerMode,int32,class System.String,int32,int32,int32)
00.06% 4E+04 Tier-1 [MicroBenchmarks]Perf_Regex_Industry_RustLang_Sherlock.Count()
00.06% 4E+04 native ntdll.dll
However, the PGO run seems to more consistently get the worse result. Investigating.
;; base
Raw samples for [System.Private.CoreLib]SpanHelpers.IndexOf(wchar&,int32,wchar&,int32) at 0x00007FFA3E6DB540 -- 0x00007FFA3E6DB849 (length 0x0309)
0x0031 : 1
0x00A8 : 1
0x0167 : 717
0x016B : 2
0x0170 : 899
0x0174 : 126
0x0179 : 1260
0x017B : 1892
0x0182 : 370
0x0188 : 2
0x0264 : 4
0x026A : 2
0x028B : 2
0x028F : 1
G_M3489_IG17: ;; offset=0161H
vpcmpeqw ymm0, ymm6, ymmword ptr [rsi+2*r14]
lea rcx, [r14+rbp]
vpcmpeqw ymm1, ymm7, ymmword ptr [rsi+2*rcx]
vpand ymm0, ymm0, ymm1
vptest ymm0, ymm0
jne SHORT G_M3489_IG19
;; size=26 bbWeight=4 PerfScore 51.33
G_M3489_IG18: ;; offset=017BH
add r14, 16
cmp r14, r12
je G_M3489_IG29
cmp r14, r15
jle SHORT G_M3489_IG17
mov r14, r15
jmp SHORT G_M3489_IG17
;; diff
Raw samples for [System.Private.CoreLib]SpanHelpers.IndexOf(wchar&,int32,wchar&,int32) at 0x00007FFA3E701F60 -- 0x00007FFA3E70229E (length 0x033E)
0x0000 : 1
0x0010 : 1
0x0031 : 1
0x0061 : 1
0x00A3 : 53
0x00AC : 425
0x00B0 : 4
0x00B5 : 1894
0x00BB : 4574
0x00BF : 11
0x00C2 : 5
0x00D9 : 5
0x00F4 : 1
G_M3489_IG04: ;; offset=009DH
vpcmpeqw ymm0, ymm6, ymmword ptr [rbx+2*r14]
lea rax, [r14+rbp]
vpcmpeqw ymm1, ymm7, ymmword ptr [rbx+2*rax]
vpand ymm8, ymm0, ymm1
vptest ymm8, ymm8
jne G_M3489_IG26 // lots of stalls here
;; size=30 bbWeight=37383.95 PerfScore 479760.74
G_M3489_IG05: ;; offset=00BBH
add r14, 16
cmp r14, r12
je SHORT G_M3489_IG08
;; size=9 bbWeight=37383.95 PerfScore 56075.93
G_M3489_IG06: ;; offset=00C4H
cmp r14, r15
jle SHORT G_M3489_IG04
Wondering if this is because in diff the code straddles a 32 byte boundary?
We wouldn't align anyways as we don't think this is a loop.
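One way to check the straddling question is to test whether each block's byte range spans more than one aligned 32-byte window. The sketch below does that for the offsets and sizes in the listings above. It relies on both method start addresses (ending in `...B540` and `...1F60`) being multiples of 32, so method-relative offsets have the same alignment as absolute addresses; `crosses_window` is a made-up helper name.

```python
WINDOW = 32  # bytes

def crosses_window(offset, size, window=WINDOW):
    """True if the byte range [offset, offset + size) spans more than one aligned window."""
    first_byte = offset
    last_byte = offset + size - 1
    return first_byte // window != last_byte // window

# (offset, size) pairs taken from the listings above.
base_ig17 = crosses_window(0x161, 26)  # base compare block
diff_ig04 = crosses_window(0x09D, 30)  # diff compare block
diff_ig05 = crosses_window(0x0BB, 9)   # diff increment and exit-test block

print(base_ig17, diff_ig04, diff_ig05)
```

With these inputs the base compare block fits inside a single 32-byte window while both diff blocks span two. That is consistent with the question above, but it does not by itself show that the boundary is what causes the stalls.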
No regression here.
(see also https://github.com/dotnet/runtime/issues/84264#issuecomment-1502623532)
Did not regress on windows-x64, but did everywhere else.
Ditto.
These two are quite likely the same issue. Last I looked into sorting, it was layout related; let's look again.
Looks like noise
Improved on x64, regressed on arm64.
Ah, likely the same issue with barriers as noted below: https://github.com/dotnet/runtime/issues/87194#issuecomment-1656276610
And related...
amd64 only
Does not repro on my local Zen3 machine:
BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22621.1992/22H2/2022Update/SunValley2) AMD Ryzen 7 5800H with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores .NET SDK 8.0.100-preview.6.23330.14 [Host] : .NET 6.0.20 (6.0.2023.32017), X64 RyuJIT AVX2 Job-XLXHLH : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2 Job-UGGZHH : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|
Combine_1 | Job-XLXHLH | \base-rel\corerun.exe | 62.43 us | 0.298 us | 0.249 us | 62.44 us | 62.01 us | 63.00 us | 1.00 | - | NA |
Combine_1 | Job-UGGZHH | \diff-rel\corerun.exe | 61.18 us | 0.137 us | 0.128 us | 61.21 us | 60.93 us | 61.35 us | 0.98 | - | NA |
In fact the whole suite looks pretty good, save perhaps `_4`:
BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22621.1992/22H2/2022Update/SunValley2) AMD Ryzen 7 5800H with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores .NET SDK 8.0.100-preview.6.23330.14 [Host] : .NET 6.0.20 (6.0.2023.32017), X64 RyuJIT AVX2 Job-WPDJSJ : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2 Job-YNXXXX : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|
Combine_1 | Job-WPDJSJ | \base-rel\corerun.exe | 62.04 us | 0.139 us | 0.108 us | 62.04 us | 61.80 us | 62.23 us | 1.00 | - | NA |
Combine_1 | Job-YNXXXX | \diff-rel\corerun.exe | 60.92 us | 0.096 us | 0.090 us | 60.86 us | 60.85 us | 61.10 us | 0.98 | - | NA |
Combine_2 | Job-WPDJSJ | \base-rel\corerun.exe | 79.40 us | 0.182 us | 0.162 us | 79.40 us | 79.16 us | 79.74 us | 1.00 | - | NA |
Combine_2 | Job-YNXXXX | \diff-rel\corerun.exe | 78.70 us | 0.091 us | 0.085 us | 78.67 us | 78.62 us | 78.85 us | 0.99 | - | NA |
Combine_3 | Job-WPDJSJ | \base-rel\corerun.exe | 69.93 us | 0.093 us | 0.087 us | 69.90 us | 69.85 us | 70.11 us | 1.00 | - | NA |
Combine_3 | Job-YNXXXX | \diff-rel\corerun.exe | 69.88 us | 0.040 us | 0.034 us | 69.87 us | 69.86 us | 69.97 us | 1.00 | - | NA |
Combine_4 | Job-WPDJSJ | \base-rel\corerun.exe | 83.42 us | 0.035 us | 0.030 us | 83.41 us | 83.39 us | 83.50 us | 1.00 | - | NA |
Combine_4 | Job-YNXXXX | \diff-rel\corerun.exe | 95.03 us | 0.013 us | 0.010 us | 95.03 us | 95.01 us | 95.05 us | 1.14 | - | NA |
Combine_5 | Job-WPDJSJ | \base-rel\corerun.exe | 72.14 us | 0.012 us | 0.011 us | 72.14 us | 72.13 us | 72.17 us | 1.00 | - | NA |
Combine_5 | Job-YNXXXX | \diff-rel\corerun.exe | 72.15 us | 0.059 us | 0.049 us | 72.13 us | 72.12 us | 72.29 us | 1.00 | - | NA |
Combine_6 | Job-WPDJSJ | \base-rel\corerun.exe | 83.46 us | 0.104 us | 0.092 us | 83.40 us | 83.38 us | 83.65 us | 1.00 | - | NA |
Combine_6 | Job-YNXXXX | \diff-rel\corerun.exe | 83.43 us | 0.064 us | 0.057 us | 83.41 us | 83.39 us | 83.55 us | 1.00 | - | NA |
Combine_7 | Job-WPDJSJ | \base-rel\corerun.exe | 94.72 us | 0.112 us | 0.093 us | 94.68 us | 94.66 us | 94.93 us | 1.00 | - | NA |
Combine_7 | Job-YNXXXX | \diff-rel\corerun.exe | 94.86 us | 0.177 us | 0.166 us | 94.81 us | 94.69 us | 95.26 us | 1.00 | - | NA |
Combine_8 | Job-WPDJSJ | \base-rel\corerun.exe | 116.49 us | 0.078 us | 0.065 us | 116.46 us | 116.43 us | 116.66 us | 1.00 | - | NA |
Combine_8 | Job-YNXXXX | \diff-rel\corerun.exe | 117.22 us | 0.260 us | 0.243 us | 117.18 us | 116.88 us | 117.69 us | 1.01 | - | NA |
Win-x64 only
This one repros
BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22621.1992/22H2/2022Update/SunValley2) AMD Ryzen 7 5800H with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores .NET SDK 8.0.100-preview.6.23330.14 [Host] : .NET 6.0.20 (6.0.2023.32017), X64 RyuJIT AVX2 Job-RBZAAU : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2 Job-AJWTXX : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
Method | Job | Toolchain | Segment | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Slice_Repeat | Job-RBZAAU | \base-rel\corerun.exe | Multiple | 32.03 ns | 0.020 ns | 0.017 ns | 32.03 ns | 32.01 ns | 32.06 ns | 1.00 | - | NA |
Slice_Repeat | Job-AJWTXX | \diff-rel\corerun.exe | Multiple | 43.23 ns | 0.274 ns | 0.243 ns | 43.16 ns | 42.93 ns | 43.75 ns | 1.35 | - | NA |
Issue here seems to be that with PGO we mark a call site that takes V05 (struct local) as rare and don't inline it, and so V05 ends up getting address exposed and has more expensive copy semantics.
@egorbo example where not doing an inline in a cold block impacts codegen in a hot block.
base
01.75% 8.8E+05 ? Unknown
56.39% 2.828E+07 Tier-1 [System.Memory]System.Buffers.ReadOnlySequence`1[System.Byte].Slice(int64,int64)
13.86% 6.95E+06 Tier-1 [MicroBenchmarks]ReadOnlySequence.Slice_Repeat()
10.91% 5.47E+06 Tier-1 [System.Private.CoreLib]CastHelpers.ChkCastClassSpecial(void*,class System.Object)
09.71% 4.87E+06 native coreclr.dll
06.28% 3.15E+06 Tier-1 [System.Memory]System.Buffers.ReadOnlySequence`1[System.Byte].SeekMultiSegment(class System.Buffers.ReadOnlySequenceSegment`1<!0>,class System.Object,int32,int64,value class System.ExceptionArgument)
00.62% 3.1E+05 Tier-1 [7c853c35-6121-4c85-8327-3f1f8585f3b1]Runnable_0.WorkloadActionUnroll(int64)
00.28% 1.4E+05 native clrjit.dll
00.12% 6E+04 native ntoskrnl.exe
00.06% 3E+04 native ntdll.dll
diff
00.69% 3.5E+05 ? Unknown
74.57% 3.797E+07 Tier-1 [System.Memory]System.Buffers.ReadOnlySequence`1[System.Byte].Slice(int64,int64)
11.82% 6.02E+06 Tier-1 [MicroBenchmarks]ReadOnlySequence.Slice_Repeat()
05.32% 2.71E+06 native coreclr.dll
03.69% 1.88E+06 Tier-1 [System.Memory]System.Buffers.ReadOnlySequence`1[System.Byte].SeekMultiSegment(class System.Buffers.ReadOnlySequenceSegment`1<!0>,class System.Object,int32,int64,value class System.ExceptionArgument)
03.18% 1.62E+06 Tier-1 [System.Private.CoreLib]CastHelpers.ChkCastClassSpecial(void*,class System.Object)
00.37% 1.9E+05 native ntoskrnl.exe
00.26% 1.3E+05 native clrjit.dll
x64 linux only. Likely the same as https://github.com/dotnet/runtime/issues/87194#issuecomment-1646036419
Windows intel x64 only.
Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
QuickSortSpan | Job-FBLSKS | \base-rel\corerun.exe | 512 | 12.76 us | 2.346 us | 2.304 us | 13.86 us | 8.584 us | 15.80 us | 1.00 | 0.00 | - | NA |
QuickSortSpan | Job-UVWYZR | \diff-rel\corerun.exe | 512 | 14.04 us | 3.065 us | 3.530 us | 13.52 us | 10.049 us | 19.78 us | 1.18 | 0.42 | - | NA |
At first look, it seems like BDN is not iterating this enough... we are measuring Tier0 code.
base
55.66% 2.31E+06 Tier-0 [MicroBenchmarks]Sorting.TestQuickSortSpan(value class System.Span`1<int32>)
18.80% 7.8E+05 native clrjit.dll
15.18% 6.3E+05 Tier-1 [MicroBenchmarks]Sorting.TestQuickSortSpan(value class System.Span`1<int32>)
04.82% 2E+05 native coreclr.dll
diff
37.39% 1.23E+06 Tier-1 [MicroBenchmarks]Sorting.TestQuickSortSpan(value class System.Span`1<int32>)
31.91% 1.05E+06 native coreclr.dll
26.44% 8.7E+05 Tier-0 [MicroBenchmarks]Sorting.TestQuickSortSpan(value class System.Span`1<int32>)
03.04% 1E+05 native clrjit.dll
The per-iteration times tell a similar story:
base
000 1021.818 -- 1052.150 : 30.332
001 1053.657 -- 1083.828 : 30.171
002 1085.532 -- 1115.977 : 30.445
003 1117.422 -- 1147.984 : 30.562
004 1149.518 -- 1180.699 : 31.181
005 1182.397 -- 1213.166 : 30.769
006 1214.624 -- 1245.324 : 30.700
007 1247.707 -- 1273.592 : 25.886
008 1275.115 -- 1283.721 : 8.607
009 1285.118 -- 1293.938 : 8.820
010 1295.437 -- 1304.091 : 8.655
011 1305.529 -- 1314.128 : 8.599
012 1315.535 -- 1324.085 : 8.550
013 1325.503 -- 1334.060 : 8.557
014 1335.444 -- 1343.915 : 8.471
diff
000 816.876 -- 872.133 : 55.258
001 873.602 -- 940.129 : 66.526
002 942.247 -- 997.177 : 54.931
003 998.693 -- 1020.673 : 21.980
004 1022.154 -- 1032.488 : 10.334
005 1033.906 -- 1044.327 : 10.421
006 1045.780 -- 1056.235 : 10.455
007 1059.274 -- 1069.671 : 10.398
008 1071.102 -- 1081.520 : 10.418
009 1082.967 -- 1093.448 : 10.481
010 1094.849 -- 1105.169 : 10.320
011 1106.637 -- 1117.139 : 10.502
012 1118.663 -- 1129.091 : 10.427
013 1130.485 -- 1140.868 : 10.384
014 1142.225 -- 1152.484 : 10.260
In diff (PGO) we eagerly instrument, so the Tier0 code is slower. But even once the intervals settle, the optimized code is slower too (roughly 10.4 vs 8.6 per interval)...
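To put a number on the steady-state gap, here is a minimal sketch that takes the interval durations from the two listings above, drops the warm-up intervals, and compares medians. The `steady_state` helper and its 1.25x cutoff are assumptions made for illustration; they are not part of BDN or of the tooling that produced the listings.

```python
from statistics import median

# Interval durations copied from the base and diff per-iteration listings.
base = [30.332, 30.171, 30.445, 30.562, 31.181, 30.769, 30.700, 25.886,
        8.607, 8.820, 8.655, 8.599, 8.550, 8.557, 8.471]
diff = [55.258, 66.526, 54.931, 21.980, 10.334, 10.421, 10.455, 10.398,
        10.418, 10.481, 10.320, 10.502, 10.427, 10.384, 10.260]


def steady_state(durations, tolerance=1.25):
    """Keep intervals within `tolerance` x the fastest one (heuristic cutoff)."""
    cutoff = tolerance * min(durations)
    return [d for d in durations if d <= cutoff]


base_steady = steady_state(base)
diff_steady = steady_state(diff)

# Ratio of steady-state medians: diff (PGO) relative to base.
ratio = median(diff_steady) / median(base_steady)
print(len(base_steady), len(diff_steady), round(ratio, 2))
```

With this cutoff, 7 base intervals and 11 diff intervals count as steady state, and the diff median is about 1.21x the base median. That is in the same range as the 1.18 Ratio in the BDN table, which also includes the slower early intervals.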
@adamsitnik seems like we should up the iterations per invocation for these tests to something like 10_000 (at 1000, each benchmark interval is only 10ms).
Think this might be caused by JCC errata. Main optimizations are identical, but code layout differs.
;;
cmp dword ptr [rbx+4*r8], edx
jge SHORT G_M24415_IG04
;; size=17 bbWeight=5.09 PerfScore 27.98
G_M24415_IG07: ;; offset=0041H
Note how in diff that `jge` straddles a 32 byte boundary. Profiling shows there is a prominent peak at offset `0x2A` that is not there in the baseline version.
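A quick way to check a layout for this condition is sketched below: a jump is affected when it crosses a 32 byte boundary or ends exactly on one. The helper name is hypothetical, and the offset used is an assumption read off the listing above (a 2-byte short `jge` that is the last instruction before `G_M24415_IG07` at offset `0x41`).

```python
BOUNDARY = 32  # bytes


def touches_boundary(start: int, size: int, boundary: int = BOUNDARY) -> bool:
    """True if the instruction at [start, start + size) crosses a boundary
    or ends exactly on one."""
    last = start + size - 1
    crosses = start // boundary != last // boundary
    ends_on = (start + size) % boundary == 0
    return crosses or ends_on


# Assumption: the short jge is 2 bytes and ends right before offset 0x41,
# so it occupies bytes 0x3F and 0x40.
jge_size = 2
jge_start = 0x41 - jge_size
print(hex(jge_start), touches_boundary(jge_start, jge_size))
```

Under that assumption the jump occupies `0x3F` and `0x40`, so it crosses the boundary at `0x40`. The same check can be run over every branch offset in a JIT disassembly to compare base and diff layouts.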
also win-x64 only
Same underlying issue as https://github.com/dotnet/runtime/issues/87194#issuecomment-1655937834
arm64 only
Repros on my volterra:
BenchmarkDotNet v0.13.7-nightly.20230724.45, Windows 11 (10.0.22621.2070/22H2/2022Update/SunValley2)
Snapdragon 8cx Gen 3 3.0 GHz, 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.100-preview.6.23330.14
[Host] : .NET 8.0.0 (8.0.23.32907), Arm64 RyuJIT AdvSIMD
Job-EGKPHU : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-FFUPKC : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true
IterationTime=250.0000 ms MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ConcurrentBag | Job-EGKPHU | \base-rel\corerun.exe | 512 | 15.78 us | 0.107 us | 0.100 us | 15.73 us | 15.68 us | 16.00 us | 1.00 | 2.6309 | 2.5667 | 0.0642 | 16.16 KB | 1.00 |
ConcurrentBag | Job-FFUPKC | \diff-rel\corerun.exe | 512 | 22.78 us | 0.119 us | 0.112 us | 22.76 us | 22.63 us | 22.99 us | 1.44 | 2.5491 | 2.4547 | - | 16.16 KB | 1.00 |
base
12.70% 4.89E+06 ? Unknown
43.43% 1.672E+07 Tier-1 [System.Collections.Concurrent]System.Collections.Concurrent.ConcurrentBag`1+WorkStealingQueue[System.__Canon].LocalPush(!0,int64&)
32.36% 1.246E+07 native coreclr.dll
03.12% 1.2E+06 native ntoskrnl.exe
02.68% 1.03E+06 Tier-1 [System.Private.CoreLib]System.SZGenericArrayEnumerator`1[System.__Canon].get_Current()
02.05% 7.9E+05 Tier-1 [System.Collections.Concurrent]System.Collections.Concurrent.ConcurrentBag`1[System.__Canon]..ctor(class System.Collections.Generic.IEnumerable`1<!0>)
01.51% 5.8E+05 Tier-1 [System.Private.CoreLib]SZGenericArrayEnumeratorBase.MoveNext()
diff
06.89% 2.71E+06 ? Unknown
60.37% 2.376E+07 Tier-1 [System.Collections.Concurrent]System.Collections.Concurrent.ConcurrentBag`1+WorkStealingQueue[System.__Canon].LocalPush(!0,int64&)
23.04% 9.07E+06 native coreclr.dll
02.69% 1.06E+06 Tier-1 [System.Collections.Concurrent]System.Collections.Concurrent.ConcurrentBag`1[System.__Canon]..ctor(class System.Collections.Generic.IEnumerable`1<!0>)
02.59% 1.02E+06 Tier-1 [System.Private.CoreLib]System.SZGenericArrayEnumerator`1[System.__Canon].get_Current()
02.57% 1.01E+06 native ntoskrnl.exe
So the issue is evidently in `LocalPush`.
In the base (no-pgo) jit we form a CSE and this lets us use `ldapr`; in the diff (pgo) jit we don't do the CSE and end up emitting a separate barrier. That may be the culprit.
BASE
Generating: N507 ( 1, 1) [000436] ----------- t436 = LCL_VAR byref V16 cse0 x28 REG x28 $1c8
/--* t436 byref
Generating: N509 ( 3, 2) [000174] V---GO----- t174 = * IND ref REG x0 $156
IN0062: ldapr x0, [x28]
DIFF
Generating: N347 ( 1, 1) [000172] ----------- t172 = LCL_VAR ref V00 this u:1 x22 REG x22 $80
/--* t172 ref
Generating: N349 ( 3, 4) [000356] -c--------- t356 = * LEA(b+8) byref REG NA
/--* t356 byref
Generating: N351 ( 6, 6) [000174] V---GO----- t174 = * IND ref REG x0 $156
IN00ac: ldr x0, [x22, #0x08]
IN00ad: dmb ishld
In base there are just two sampling hot spots, both tied to `ldapr`:
0x0044 : 1037
0x00DC : 162
IN0005: 000040 swpal w1, w1, [x21]
IN0006: 000044 add x22, x0, #28
IN0007: 000048 ldapr w23, [x22]
IN002b: 0000D8 ldapr w25, [x24]
IN002c: 0000DC ldrb w1, [x0, #0x34]
In diff there are more hot spots, and all the hottest samples are near the `dmb`s.
0x003C : 655
0x0340 : 367
0x00C4 : 239
0x00AC : 201
0x0064 : 208
IN0006: 00003C ldr w1, [x0, #0x1C]
IN0007: 000040 dmb ishld
IN00c6: 00033C dmb ish
IN00c7: 000340 str wzr, [x0, #0x2C]
IN0027: 0000C0 dmb ish
IN0028: 0000C4 str w1, [x0, #0x1C]
IN0021: 0000A8 dmb ishld
IN0022: 0000AC ldr x2, [fp, #0x28] // [V01 arg1]
IN000f: 000060 dmb ishld
IN0010: 000064 ldrb w1, [x0, #0x34]
@EgorBo this is the example we were chatting about.
> @adamsitnik seems like we should up the iterations per invocation for these tests to something like 10_000 (at 1000, each benchmark interval is only 10ms).
I took a look at the benchmark source code and it's safe to do it (the benchmark ID won't change, as the `InvocationsPerIteration` const is not an argument or a parameter for this benchmark).
While there are still a few benchmarks where the analysis is unclear, they are isolated to specific OS/ABI combinations. So I'm going to close this out.
Did BubbleSort2 ever get looked at?
Please keep in mind that both bubble sort and `IndexOf` are heavily dependent on memory alignment, so it can be unrelated to PGO. The easiest way to verify is to allocate aligned memory using `NativeMemory.AlignedAlloc`, create a span out of it and try to repro.
> Please keep in mind that both bubble sort and `IndexOf` are heavily dependent on memory alignment, so it can be unrelated to PGO. The easiest way to verify is to allocate aligned memory using `NativeMemory.AlignedAlloc`, create a span out of it and try to repro.
Makes sense. I am just adding tests here that were not previously included, but regressed over the same commit range as this check-in.
Major instability starting with this commit.
> > Please keep in mind that both bubble sort and `IndexOf` are heavily dependent on memory alignment, so it can be unrelated to PGO. The easiest way to verify is to allocate aligned memory using `NativeMemory.AlignedAlloc`, create a span out of it and try to repro.
>
> Makes sense. I am just adding tests here that were not previously included, but regressed over the same commit range as this check-in.
Also note that bubble sort runs for a very long time, so likely BDN + lab customization is not reliably measuring the Tier1 codegen, but instead some mixture of Tier0, Tier0 + instrumentation, OSR, and/or R2R code.
> Major instability starting with this commit.
Feel free to add this kind of thing to https://github.com/dotnet/runtime/issues/87324
I will do that going forward, didn't know about that issue :)
This is specifically on Windows x86.
This issue tracks investigation into microbenchmarks that have reported regressions with Dynamic PGO enabled. It is a continuation of https://github.com/dotnet/runtime/issues/84264 which tracked regressions from PGO before it was enabled.
The report below is collated from the following autofiling reports.
The table is auto generated by a tool written by @EgorBo but may be edited by hand as regression analysis produces results. The "Score" is the geomean regression across all architectures; benchmarks that did not regress (or get reported) on some architectures are assumed to have produced the same results with and without PGO. "Recent Score" is the current performance (as of 2023-0606) versus the non-PGO result; "Orig Score" is based on the results of auto filing. They will differ if benchmark performance has improved or regressed since the auto filing ran (see for example the results for `System.Text.Json.Tests.Perf_Get.GetByte`, which has improved already).
Only the 36 entries with recent scores >= 1.3 are included; this leaves off approximately 220 more rows with scores below 1.3. Our plan is to prioritize investigation of these benchmarks initially, as they have the largest aggregate regressions. If time permits, we will regenerate this chart to pick up the impact of any fixes and see how much of the remainder we can tackle.
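As a rough illustration of the "Score" described above, the sketch below computes a geometric mean over a fixed set of configurations, filling in 1.0 (no change) wherever a benchmark was not reported. The function name, the configuration labels, and the sample ratio are all hypothetical; the actual scoring tool is not shown in this issue.

```python
import math


def geomean_score(ratios_by_config, all_configs):
    """Geometric mean of PGO/non-PGO ratios; a configuration with no
    reported regression is treated as ratio 1.0."""
    ratios = [ratios_by_config.get(cfg, 1.0) for cfg in all_configs]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical configuration labels and a hypothetical 4x regression
# reported on only one of them.
configs = ["config-a", "config-b", "config-c", "config-d"]
reported = {"config-b": 4.0}

score = geomean_score(reported, configs)
print(round(score, 2))
```

Because unreported configurations count as 1.0, a large regression on one configuration is damped in the overall score: a 4x regression on one of four configurations gives a score of about 1.41.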
Each arch/os result is a hyperlink to the performance data graph for that benchmark. ~~Note we currently have no autofiling data for win-x64-intel. If/when that shows up we will regenerate the table.~~
[edit: had to regenerate the table once already, as the scoring logic was off] [edit: have x64 win intel data now, new table. Note current results have shifted so table is somewhat different...]
cc @dotnet/jit-contrib
[Table: per-benchmark regression scores ("Score", "Recent Score", "Orig Score") with per arch/os results; benchmark names and column layout were lost in extraction.]