Closed jnyrup closed 7 months ago
I've bisected the change to db717e30839c532cad5b269ac11c5d2c91dad639 CC: @jakobbotsch
I haven't figured out how to instruct BDN to use corerun
, but with the benchmark below the numbers are as follow:
7c265c396e6: 67.6 µs (the commit before db717e30839c532cad5b269ac11c5d2c91dad639) db717e30839c532cad5b269ac11c5d2c91dad639: 84.8 µs
var c = new Benchmark();
for (int i = 0; i < 100_000; i++) c.STIntersection();
var ts = Stopwatch.StartNew();
for (int i = 0; i < 100_000; i++) c.STIntersection();
var elapsed = ts.Elapsed / 100_000;
Console.WriteLine(elapsed.TotalMicroseconds);
When I figure out how to get BDN to use corerun
, I'll add the jitted assembly.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
Author: | jnyrup |
---|---|
Assignees: | - |
Labels: | `tenet-performance`, `area-CodeGen-coreclr`, `area-VM-coreclr`, `untriaged` |
Milestone: | - |
I see no difference on my machine:
BenchmarkDotNet v0.13.10, Windows 10 (10.0.19045.3930/22H2/2022Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 8.0.100
[Host] : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
Job-APOFVS : .NET 7.0.15 (7.0.1523.57226), X64 RyuJIT AVX2
Job-TLKCDS : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
EnvironmentVariables=DOTNET_GCDynamicAdaptationMode=0,DOTNET_EnableWriteXorExecute=0,DOTNET_TieredPGO=0,DOTNET_TieredCompilation=0,DOTNET_TC_QuickJit=0 Platform=X64 Server=True
Method | Runtime | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
---|---|---|---|---|---|---|---|
STIntersection | .NET 7.0 | 51.37 us | 0.307 us | 0.288 us | 1.00 | 0.00 | 7,411 B |
STIntersection | .NET 8.0 | 51.01 us | 0.303 us | 0.284 us | 0.99 | 0.01 | 6,980 B |
Since you are on a Comet Lake CPU this could potentially be JCC erratum. I'll try to track down the codegen diffs.
Oddly almost all time on the benchmark is spent inside native code according to https://github.com/AndyAyersMS/InstructionsRetiredExplorer:
Samples for dotnet: 29422 events for Benchmark Intervals
Jitting : 00.00% 0 samples 670 methods
JitInterface : 00.00% 0 samples
Jit-generated code: 00.74% 9.5E+05 samples
Jitted code : 00.74% 9.5E+05 samples
MinOpts code : 00.00% 0 samples
FullOpts code : 00.74% 9.5E+05 samples
Tier-0 code : 00.00% 0 samples
Tier-1 code : 00.00% 0 samples
R2R code : 00.00% 0 samples
81.48% 1.042E+08 native SqlServerSpatial140.dll
10.51% 1.344E+07 native msvcr120.dll
03.99% 5.11E+06 native ntdll.dll
02.68% 3.43E+06 native coreclr.dll
00.20% 2.6E+05 FullOpt [Microsoft.SqlServer.Types]GeoDataPinner.Pin(value class Microsoft.SqlServer.Types.GeoData)
00.17% 2.2E+05 native ntoskrnl.exe
00.13% 1.6E+05 native System.Private.CoreLib.dll
00.06% 8E+04 native kernel32.dll
00.05% 7E+04 FullOpt [Microsoft.SqlServer.Types]SridList.GetEllipsoidParameters(int32)
00.05% 7E+04 FullOpt [Microsoft.SqlServer.Types]GeoData.CreateArrays(int32,int32,int32,int32,int32,int32)
Benchmark: found 15 intervals; mean interval 852.813ms
I'm not really sure how that's possible. Can you try running this on your machine? To do so run BDN with -p ETW
to produce a .etl trace file and run the tool with <tool.exe> <path to ETL file produced> -benchmark
.
The overall diffs I see with and without early liveness (DOTNET_JitEnableEarlyLivenessRange=0
in checked builds) are
The diffs look good, but I'm not really sure where to focus my investigation to look for potential JCC erratum.
Added BenchmarkDotNet.Diagnostics.Windows
and [EtwProfiler]
to the test project and ran:
dotnet build -c Release -f net8.0
.\bin\Release\net8.0\win-x64\STIntersection.exe -p ETW
Analyzing the produced .etl file using InstructionsRetiredExplorer gives:
Mining ETL from C:\dev\stintersection\BenchmarkDotNet.Artifacts\Benchmark.STIntersection-20240114-171746.etl for process dotnet
Found process [22864] dotnet: "C:\Program Files\dotnet\dotnet.exe" exec "C:\Program Files\dotnet\sdk\8.0.101\Roslyn\bincore\VBCSCompiler.dll" "-pipename:McEtIVxF32pD7Svj7l1O9o2MUcz1FGqXcUUCENE92z4"
PMC interval now 10000
Found process [11156] dotnet: "dotnet" bc400ac9-1bf8-410d-aece-32f227398082.dll --anonymousPipes 1332 1348 --benchmarkName Benchmark.STIntersection --job "EnvironmentVariables=DOTNET_GCDynamicAdaptationMode=0,DOTNET_EnableWriteXorExecute=0,DOTNET_TieredPGO=0,DOTNET_TieredCompilation=0,DOTNET_TC_QuickJit=0, Platform=X64, Runtime=.NET 8.0, Server=True, Toolchain=.NET 8.0" --benchmarkId 0
==> benchmark process is [11156]
eh? unknown module ID 140704149553152
eh? unknown module ID 140704149553152
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventSourceAttribute.set_Name(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventSourceAttribute.set_Name(class System.Object,pMT: 00007FF83CFE0BF8<class System.Object>)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Level(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Task(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Opcode(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Level(class System.Object,pMT: 00007FF83CFE0BF8<class System.Object>)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Task(class System.Object,pMT: 00007FF83CFE0BF8<class System.Object>)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Opcode(class System.Object,pMT: 00007FF83CFE0BF8<class System.Object>)
Samples for dotnet: 17977 events for Benchmark Intervals
Jitting : 00,00% 0 samples 670 methods
JitInterface : 00,00% 0 samples
Jit-generated code: 00,57% 6E+05 samples
Jitted code : 00,57% 6E+05 samples
MinOpts code : 00,00% 0 samples
FullOpts code : 00,57% 6E+05 samples
Tier-0 code : 00,00% 0 samples
Tier-1 code : 00,00% 0 samples
R2R code : 00,00% 0 samples
67,84% 7,1E+07 native SqlServerSpatial140.dll
25,26% 2,644E+07 native msvcr120.dll
03,15% 3,3E+06 native ntdll.dll
02,66% 2,78E+06 native coreclr.dll
00,21% 2,2E+05 native System.Private.CoreLib.dll
00,08% 8E+04 FullOpt [Microsoft.SqlServer.Types]GeoDataPinningAllocator.AllocAndPinGeometry(int32,int32,int32,int32)
00,08% 8E+04 native ntoskrnl.exe
00,07% 7E+04 FullOpt [Microsoft.SqlServer.Types]GeoDataPinner.Pin(value class Microsoft.SqlServer.Types.GeoData)
00,06% 6E+04 FullOpt [Perfolizer]dynamicClass.IL_STUB_PInvoke(pMT: 00007FF83D543050,pMT: 00007FF83D542B88,pMT: 00007FF83D542B88,float64,pMT: 00007FF83D5431F0)
Benchmark: found 15 intervals; mean interval 698,169ms
I'll try finding a colleague on Monday with a newer Intel CPU (but otherwise similar setup) and have them re-run the benchmarks.
Running this on a colleague's machine with a Raptor Lake CPU. If it can make any difference, he's running Windows 11 while I'm on Windows 10. We both have Sophos anti-virus installed which can be a CPU hog, but from looking in the Task Manager it seems inactive during the benchmarks.
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3007/23H2/2023Update/SunValley3)
13th Gen Intel Core i9-13900H, 1 CPU, 20 logical and 14 physical cores
.NET SDK 8.0.101
[Host] : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX2
Job-EAAWAC : .NET 7.0.15 (7.0.1523.57226), X64 RyuJIT AVX2
Job-YZXMWS : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX2
EnvironmentVariables=DOTNET_GCDynamicAdaptationMode=0,DOTNET_EnableWriteXorExecute=0,DOTNET_TieredPGO=0,DOTNET_TieredCompilation=0,DOTNET_TC_QuickJit=0 Platform=X64 Server=True
Method | Runtime | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|---|
STIntersection | .NET 7.0 | 41.14 us | 0.778 us | 0.833 us | 1.00 | 0.00 |
STIntersection | .NET 8.0 | 219.47 us | 2.816 us | 2.634 us | 5.32 | 0.12 |
Mining ETL from C:\dev\Jonas\stintersection\BenchmarkDotNet.Artifacts\Benchmark.STIntersection-.NET 8.0-20240115-102154.etl for process dotnet
Found process [28256] dotnet: "C:\Program Files\dotnet\dotnet.exe" exec "C:\Program Files\dotnet\sdk\8.0.101\Roslyn\bincore\VBCSCompiler.dll" "-pipename:VTsk2TzcFC7Bd1HOtZmxdfIvwWm1YzhZJp3qZT3CiFc"
Found process [29456] dotnet: "C:\Program Files\dotnet\dotnet.exe" exec "C:\Program Files\dotnet\sdk\8.0.101\Roslyn\bincore\VBCSCompiler.dll" "-pipename:Le5PLwrPKmY2ZTI3rRdBpEzf4K6hBP+l7pIZ7kU4ZdA"
PMC interval now 10000
Found process [19268] dotnet: "dotnet" 2891f347-4d17-4526-b7e7-5d550281f902.dll --anonymousPipes 1680 1744 --benchmarkName Benchmark.STIntersection --job "EnvironmentVariables=DOTNET_GCDynamicAdaptationMode=0,DOTNET_EnableWriteXorExecute=0,DOTNET_TieredPGO=0,DOTNET_TieredCompilation=0,DOTNET_TC_QuickJit=0, Platform=X64, Runtime=.NET 8.0, Server=True, Toolchain=.NET 8.0" --benchmarkId 0
==> benchmark process is [19268]
eh? unknown module ID 140714432282624
eh? unknown module ID 140714432282624
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventSourceAttribute.set_Name(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Level(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Task(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Opcode(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Task(class System.Object,pMT: 00007FFAA1E40BF8<class System.Object>)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Opcode(class System.Object,pMT: 00007FFAA1E40BF8<class System.Object>)
Samples for dotnet: 23445 events for Benchmark Intervals
Jitting : 00,00% 0 samples 670 methods
JitInterface : 00,00% 0 samples
Jit-generated code: 00,18% 2,4E+05 samples
Jitted code : 00,18% 2,4E+05 samples
MinOpts code : 00,00% 0 samples
FullOpts code : 00,18% 2,4E+05 samples
Tier-0 code : 00,00% 0 samples
Tier-1 code : 00,00% 0 samples
R2R code : 00,00% 0 samples
49,97% 6,796E+07 native SqlServerSpatial140.dll
47,84% 6,507E+07 native msvcr120.dll
01,06% 1,44E+06 native ntdll.dll
00,61% 8,3E+05 native coreclr.dll
00,23% 3,1E+05 native ntoskrnl.exe
00,07% 9E+04 FullOpt [Microsoft.SqlServer.Types]GeoDataPinningAllocator.AllocAndPinGeometry(int32,int32,int32,int32)
00,05% 7E+04 native System.Private.CoreLib.dll
Benchmark: found 15 intervals; mean interval 907,085ms
Since the benchmark spends no time in managed code it seems very unlikely to be a codegen issue, but that's odd since you bisected it to a JIT change above. You can also try comparing the output of InstructionsRetiredExplorer for the .NET 7 and .NET 8 versions to see if there's any difference in where the benchmark is spending time.
Can you run the micro benchmark under a profiler in the base and diff to see exactly where the difference is coming from? For example, you can look at the two .etl files in PerfView, or you can try running the simplified benchmark under vTune.
InstructionsRetiredExplorer on the ETL files produced by BDN
I tried running the simplified benchmarks under VTune using the corerun
built from the two commits.
Here's the overview of comparing the "microarchitecture exploration" runs (which the tool suggested).
VTune/PerfView should be able to tell you more detailed which methods are taking up the time of the benchmark (you may need the .pdb for the native library). If you can reproduce a 5x difference in performance it should hopefully be very apparent where the difference is coming from.
I don't have access to a .pdb file for the native library, so I tried another approach where I decompiled the managed assembly using ILSpy such that I could fiddle with the components used in GLNativeMethods.GeodeticCombine
that handles calling the native code.
By changing whether GeoDataPinningAllocator
should be a class
or a struct
and whether is has a bool _disposed
field for safe double-disposal I got different results.
Benchmarking 7.0.15 vs 8.0.1 using BDN base: | Method | Runtime | Mean | Error | StdDev | Ratio | Code Size |
---|---|---|---|---|---|---|---|
STIntersection | .NET 7.0 | 69.83 us | 0.596 us | 0.529 us | 1.00 | 7,419 B | |
STIntersection | .NET 8.0 | 83.82 us | 0.793 us | 0.742 us | 1.20 | 6,898 B |
Delete bool _disposed and use if (_gchGeoDataPinner.IsAllocated) in Dispose() |
Method | Runtime | Mean | Error | StdDev | Ratio | Code Size |
---|---|---|---|---|---|---|---|
STIntersection | .NET 7.0 | 69.44 us | 0.956 us | 0.894 us | 1.00 | 7,408 B | |
STIntersection | .NET 8.0 | 67.95 us | 0.838 us | 0.784 us | 0.98 | 6,882 B |
Change GeoDataPinningAllocator from class to struct |
Method | Runtime | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
---|---|---|---|---|---|---|---|---|
STIntersection | .NET 7.0 | 84.52 us | 1.652 us | 1.545 us | 1.00 | 0.00 | 7,442 B | |
STIntersection | .NET 8.0 | 83.97 us | 0.651 us | 0.609 us | 0.99 | 0.02 | 6,930 B |
Change GeoDataPinningAllocator from class to struct
Delete bool _disposed and use if (_gchGeoDataPinner.IsAllocated) in Dispose() |
Method | Runtime | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
---|---|---|---|---|---|---|---|---|
STIntersection | .NET 7.0 | 67.57 us | 0.520 us | 0.461 us | 1.00 | 0.00 | 7,428 B | |
STIntersection | .NET 8.0 | 68.70 us | 1.078 us | 0.900 us | 1.02 | 0.02 | 6,919 B |
Running the simplified benchmark using the custom built corerun
base:
7c265c396e6: 66.4 µs
db717e30839: 81.5 µs
Delete bool _disposed
and use if (_gchGeoDataPinner.IsAllocated)
in Dispose()
7c265c396e6: 67.0 µs
db717e30839: 67.2 µs
Change GeoDataPinningAllocator
from class
to struct
7c265c396e6: 81.2 µs
db717e30839: 82.0 µs
Change GeoDataPinningAllocator
from class
to struct
Delete bool _disposed
and use if (_gchGeoDataPinner.IsAllocated)
in Dispose()
7c265c396e6: 68.4 µs
db717e30839: 67.4 µs
Thanks for sharing that. I noticed one interesting data point in your image from vtune above, namely the "Mixing Vectors" row which seems to suggest the diff is doing something odd that the base isn't. I wonder if it is related to our insertion of vzeroupper
.
I am unable to open the vtune project you shared, but perhaps you can dig into the "Mixing Vectors" there and see what it is about?
I'll try to see if I can get access to an Intel machine and reproduce it. In the meantime @tannergooding perhaps has an idea of what "Mixing Vectors" means here and whether or not this could be about vzeroupper
insertion (in particular whether a vzeroupper
issue would make sense given that it only seems to reproduce on Intel CPUs but not AMD ones).
The tooltip of "Mixing Vectors"
The Intel manual specifically mentions vzeroupper
Seemingly related intel community post
I wonder if this is something similar to https://github.com/dotnet/runtime/issues/95954#issuecomment-1856906695 (though some different native entrypoint perhaps).
Can you try running with AVX codegen disabled?
set DOTNET_EnableAVX=0
in your environment, or for BDN, add --envVars DOTNET_EnableAVX:0
(and run tests via BenchmarkSwitcher so it parses the command line).
Couple of questions:
BenchmarkDotNet v0.13.10, Windows 10 (10.0.19045.3930/22H2/2022Update)
Intel Core i7-10750H CPU 2.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK 8.0.200-preview.23624.5
[Host] : .NET 7.0.14 (7.0.1423.51910), X64 RyuJIT AVX2
Job-CYKLDY : .NET 7.0.15 (7.0.1523.57226), X64 RyuJIT AVX2
Job-NPDFDR : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX2
EnvironmentVariables=DOTNET_GCDynamicAdaptationMode=0,DOTNET_EnableWriteXorExecute=0,DOTNET_TieredPGO=0,DOTNET_TieredCompilation=0,DOTNET_TC_QuickJit=0 Platform=X64 Server=True
Method | Runtime | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
---|---|---|---|---|---|---|---|
STIntersection | .NET 7.0 | 72.37 us | 1.230 us | 1.091 us | 1.00 | 0.00 | 7,411 B |
STIntersection | .NET 8.0 | 88.76 us | 0.704 us | 0.624 us | 1.23 | 0.02 | 6,980 B |
Setting DOTNET_EnableAVX=0
BenchmarkDotNet v0.13.10, Windows 10 (10.0.19045.3930/22H2/2022Update)
Intel Core i7-10750H CPU 2.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK 8.0.200-preview.23624.5
[Host] : .NET 7.0.14 (7.0.1423.51910), X64 RyuJIT AVX2
Job-OULJAX : .NET 7.0.15 (7.0.1523.57226), X64 RyuJIT SSE4.2
Job-LKAEUU : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT SSE4.2
EnvironmentVariables=DOTNET_GCDynamicAdaptationMode=0,DOTNET_EnableWriteXorExecute=0,DOTNET_TieredPGO=0,DOTNET_TieredCompilation=0,DOTNET_TC_QuickJit=0,DOTNET_EnableAVX=0 Platform=X64 Server=True
Method | Runtime | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
---|---|---|---|---|---|---|---|
STIntersection | .NET 7.0 | 72.93 us | 1.291 us | 1.207 us | 1.00 | 0.00 | 7,409 B |
STIntersection | .NET 8.0 | 73.13 us | 1.335 us | 1.249 us | 1.00 | 0.02 | 6,985 B |
Setting DOTNET_EnableAVX2=0
(In case that be interesting)
BenchmarkDotNet v0.13.10, Windows 10 (10.0.19045.3930/22H2/2022Update)
Intel Core i7-10750H CPU 2.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK 8.0.200-preview.23624.5
[Host] : .NET 7.0.14 (7.0.1423.51910), X64 RyuJIT AVX2
Job-CRQRKC : .NET 7.0.15 (7.0.1523.57226), X64 RyuJIT AVX
Job-OASVKT : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX
EnvironmentVariables=DOTNET_GCDynamicAdaptationMode=0,DOTNET_EnableWriteXorExecute=0,DOTNET_TieredPGO=0,DOTNET_TieredCompilation=0,DOTNET_TC_QuickJit=0,DOTNET_EnableAVX2=0 Platform=X64 Server=True
Method | Runtime | Mean | Error | StdDev | Ratio | Code Size |
---|---|---|---|---|---|---|
STIntersection | .NET 7.0 | 73.33 us | 1.068 us | 0.999 us | 1.00 | 7,411 B |
STIntersection | .NET 8.0 | 88.21 us | 0.610 us | 0.540 us | 1.21 | 6,980 B |
are you building your native code so it matches the capabilities of the CPU it is running on? Any assembly language bits lurking in there?
I don't have access to the source of the native code, it comes with https://www.nuget.org/packages/Microsoft.SqlServer.Types/14.0.1016.290
in your test I see tired comp, pgo, etc disabled -- was that just to isolate the perf issue or do you actually deploy things that way?
We don't deploy using any of these settings, those were just enabled to isolate the perf issue.
This can be closed as a duplicate of https://github.com/dotnet/runtime/issues/82132
The simple fix is we should be emitting vzeroupper
before transferring control to anything which may be "AVX unaware" (P/Invokes and some R2R methods) and therefore which could use the legacy encoded instructions.
On modern hardware, vzeroupper
tends to be free (handled in register renaming); and as noted above the transition penalty is already not expensive for AMD; but it may still incur cost on older hardware which can be a net negative for method calls which don't use floating-point/SIMD at all.
We have a separate issue (#11496) tracking our existing overuse of vzeroupper
in other areas, which itself really only needs to be before
or after
such transition boundaries, not as part of every managed function. This is because there is no penalty going from 128-bit legacy <-> 128-bit VEX
only when going between 128-bit legacy <-> 256-bit or higher VEX/EVEX
, the diagram for that (in the worst case) is:
vzeroupper
fix is https://github.com/dotnet/runtime/pull/98261, no longer see the regression locally with the fix
Description
In https://github.com/dotnet/runtime/issues/75455#issuecomment-1263821647 I got help to diagnose that due to how SqlServer.Types calls native code, it runs much slower with W^X enabled.
With .NET 8 I re-ran my benchmarks with W^X disabled and saw a 20% regression compared to .NET 7. STIntersection.zip
Do you have an idea what might have caused this performance regression and if it's "by design"?
Configuration
To make as fair a comparison between .NET 7 and .NET 8, I disabled Dynamic PGO, DATAS and W^X.
Regression?
Yes. The earliest .NET 8 runtime I could find was Preview1 which also had the regression. If earlier runtime builds are available, I'd be happy to bisect this further.
Data