SqlServer.Types is 20% slower on .NET 8

jnyrup commented 9 months ago

Description

In https://github.com/dotnet/runtime/issues/75455#issuecomment-1263821647 I got help to diagnose that due to how SqlServer.Types calls native code, it runs much slower with W^X enabled.

With .NET 8 I re-ran my benchmarks with W^X disabled and saw a 20% regression compared to .NET 7. Benchmark project: STIntersection.zip (https://github.com/dotnet/runtime/files/13726984/STIntersection.zip)

Do you have an idea what might have caused this performance regression and if it's "by design"?

Configuration

To make the comparison between .NET 7 and .NET 8 as fair as possible, I disabled Dynamic PGO, DATAS and W^X.

BenchmarkDotNet v0.13.10, Windows 10 (10.0.19045.3803/22H2/2022Update)
Intel Core i7-10750H CPU 2.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-alpha.1.23619.5
  [Host]     : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
  Job-VCKSWC : .NET 7.0.14 (7.0.1423.51910), X64 RyuJIT AVX2
  Job-KIDGWI : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
  Job-WWREOB : .NET 8.0.0 (8.0.23.11008), X64 RyuJIT AVX2
  Job-ROIKLT : .NET 9.0.0 (9.0.23.61807), X64 RyuJIT AVX2

EnvironmentVariables=DOTNET_GCDynamicAdaptationMode=0,DOTNET_EnableWriteXorExecute=0,DOTNET_TieredPGO=0,DOTNET_TieredCompilation=0,DOTNET_TC_QuickJit=0  Platform=X64  Server=True

Regression?

Yes. The earliest .NET 8 runtime I could find was Preview1 which also had the regression. If earlier runtime builds are available, I'd be happy to bisect this further.

Data

Method	Runtime	Mean	Error	StdDev	Ratio	RatioSD	Code Size
STIntersection	.NET 7.0	78.36 us	1.019 us	0.953 us	1.00	0.00	7,411 B
STIntersection	.NET 8.0-Preview1	94.64 us	0.939 us	0.878 us	1.21	0.02	6,980 B
STIntersection	.NET 8.0	96.58 us	1.565 us	1.675 us	1.23	0.03	7,377 B
STIntersection	.NET 9.0	95.33 us	1.588 us	1.486 us	1.22	0.03	7,152 B
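
The Ratio column can be sanity-checked from the rounded means in the table. A minimal sketch, assuming the ratio is simply each mean divided by the .NET 7.0 mean (`RatioToBaseline` is a hypothetical helper, not part of BenchmarkDotNet):

```csharp
using System;

// Means (in microseconds) copied from the table above; .NET 7.0 is the baseline.
var means = new (string Runtime, double MeanUs)[]
{
    (".NET 7.0", 78.36),
    (".NET 8.0-Preview1", 94.64),
    (".NET 8.0", 96.58),
    (".NET 9.0", 95.33),
};

// Hypothetical helper: ratio of a mean to the baseline mean.
static double RatioToBaseline(double meanUs, double baselineUs) => meanUs / baselineUs;

double baseline = means[0].MeanUs;
foreach (var (runtime, meanUs) in means)
{
    double ratio = RatioToBaseline(meanUs, baseline);
    Console.WriteLine($"{runtime,-18} {ratio:F2}x ({(ratio - 1) * 100:F1}% slower)");
}
```

Recomputed this way, the .NET 8 and .NET 9 rows come out roughly 21% to 23% slower than .NET 7, in line with the reported Ratio values.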

jnyrup commented 8 months ago

I've bisected the regression to commit db717e30839c532cad5b269ac11c5d2c91dad639. CC: @jakobbotsch

I haven't figured out how to instruct BDN to use corerun, but with the benchmark below the numbers are as follows:

7c265c396e6: 67.6 µs (the commit before db717e30839c532cad5b269ac11c5d2c91dad639)
db717e30839c532cad5b269ac11c5d2c91dad639: 84.8 µs

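using System;
using System.Diagnostics;

const int Iterations = 100_000;
var c = new Benchmark();

// Warm-up pass so that one-time costs are excluded from the measurement.
for (int i = 0; i < Iterations; i++) c.STIntersection();

// Timed pass: print the mean time per call in microseconds.
var sw = Stopwatch.StartNew();
for (int i = 0; i < Iterations; i++) c.STIntersection();
var elapsed = sw.Elapsed / Iterations;
Console.WriteLine(elapsed.TotalMicroseconds);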

When I figure out how to get BDN to use corerun, I'll add the jitted assembly.

ghost commented 8 months ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch. See info in area-owners.md if you want to be subscribed.

Author:	jnyrup
Assignees:	-
Labels:	`tenet-performance`, `area-CodeGen-coreclr`, `area-VM-coreclr`, `untriaged`
Milestone:	-

jakobbotsch commented 8 months ago

I see no difference on my machine:

BenchmarkDotNet v0.13.10, Windows 10 (10.0.19045.3930/22H2/2022Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 8.0.100
  [Host]     : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
  Job-APOFVS : .NET 7.0.15 (7.0.1523.57226), X64 RyuJIT AVX2
  Job-TLKCDS : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2

EnvironmentVariables=DOTNET_GCDynamicAdaptationMode=0,DOTNET_EnableWriteXorExecute=0,DOTNET_TieredPGO=0,DOTNET_TieredCompilation=0,DOTNET_TC_QuickJit=0  Platform=X64  Server=True

Method	Runtime	Mean	Error	StdDev	Ratio	RatioSD	Code Size
STIntersection	.NET 7.0	51.37 us	0.307 us	0.288 us	1.00	0.00	7,411 B
STIntersection	.NET 8.0	51.01 us	0.303 us	0.284 us	0.99	0.01	6,980 B

Since you are on a Comet Lake CPU, this could potentially be the JCC erratum. I'll try to track down the codegen diffs.
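
For context, Intel's description of the JCC erratum concerns jump instructions that cross a 32-byte boundary or end on one. A minimal sketch of that address check follows, assuming the caller already knows an instruction's start address and encoded length; `IsJccErratumCandidate` is a hypothetical helper and the addresses are illustrative, not taken from this benchmark's code:

```csharp
using System;

// Hypothetical helper: true when the instruction occupying [start, start + length)
// crosses a 32-byte boundary or its last byte is the final byte of a 32-byte block.
static bool IsJccErratumCandidate(ulong start, int length)
{
    const ulong Boundary = 32;
    ulong lastByte = start + (ulong)length - 1;
    bool crosses = start / Boundary != lastByte / Boundary;
    bool endsOnBoundary = (lastByte + 1) % Boundary == 0;
    return crosses || endsOnBoundary;
}

// Illustrative addresses only.
Console.WriteLine(IsJccErratumCandidate(0x1E, 4)); // bytes 0x1E..0x21 cross 0x20: True
Console.WriteLine(IsJccErratumCandidate(0x1E, 2)); // bytes 0x1E..0x1F end on 0x20: True
Console.WriteLine(IsJccErratumCandidate(0x20, 6)); // bytes 0x20..0x25 stay inside one block: False
```

A check like this only flags candidate placements; whether such a jump actually explains a slowdown still has to be confirmed by measurement.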

jakobbotsch commented 8 months ago

Oddly, almost all of the time in the benchmark is spent inside native code according to https://github.com/AndyAyersMS/InstructionsRetiredExplorer:

Samples for dotnet: 29422 events for Benchmark Intervals
Jitting           : 00.00% 0        samples 670 methods
  JitInterface    : 00.00% 0        samples
Jit-generated code: 00.74% 9.5E+05  samples
  Jitted code     : 00.74% 9.5E+05  samples
  MinOpts code    : 00.00% 0        samples
  FullOpts code   : 00.74% 9.5E+05  samples
  Tier-0 code     : 00.00% 0        samples
  Tier-1 code     : 00.00% 0        samples
  R2R code        : 00.00% 0        samples

81.48%   1.042E+08   native   SqlServerSpatial140.dll
10.51%   1.344E+07   native   msvcr120.dll
03.99%   5.11E+06    native   ntdll.dll
02.68%   3.43E+06    native   coreclr.dll
00.20%   2.6E+05     FullOpt  [Microsoft.SqlServer.Types]GeoDataPinner.Pin(value class Microsoft.SqlServer.Types.GeoData)
00.17%   2.2E+05     native   ntoskrnl.exe
00.13%   1.6E+05     native   System.Private.CoreLib.dll
00.06%   8E+04       native   kernel32.dll
00.05%   7E+04       FullOpt  [Microsoft.SqlServer.Types]SridList.GetEllipsoidParameters(int32)
00.05%   7E+04       FullOpt  [Microsoft.SqlServer.Types]GeoData.CreateArrays(int32,int32,int32,int32,int32,int32)

Benchmark: found 15 intervals; mean interval 852.813ms

I'm not really sure how that's possible. Can you try running this on your machine? To do so run BDN with -p ETW to produce a .etl trace file and run the tool with <tool.exe> <path to ETL file produced> -benchmark.

The overall diffs I see with and without early liveness (DOTNET_JitEnableEarlyLivenessRange=0 in checked builds) are:

Diffs are based on 146 contexts (0 MinOpts, 146 FullOpts). Base JIT options: JitEnableEarlyLivenessRange=0

Overall (-1,134 bytes)

|Collection|Base size (bytes)|Diff size (bytes)|
|---|--:|--:|
|col.mch|74,273|-1,134|

FullOpts (-1,134 bytes)

|Collection|Base size (bytes)|Diff size (bytes)|
|---|--:|--:|
|col.mch|74,273|-1,134|

Example diffs

col.mch

-39 (-46.43%) : 72.dasm - Microsoft.SqlServer.Types.CoordinateReversingGeoDataSink:AddLine(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)

```diff
@@ -3,64 +3,50 @@
 ; FullOpts code
 ; optimized code
 ; rsp based frame
-; partially interruptible
+; fully interruptible
 ; No matching PGO data
 ; Final local variable assignments
 ;
 ; V00 this [V00,T01] ( 3, 3 ) ref -> rcx this class-hnd single-def
-; V01 arg1 [V01,T04] ( 3, 3 ) double -> mm1 single-def
-; V02 arg2 [V02,T05] ( 3, 3 ) double -> mm2 single-def
+; V01 arg1 [V01,T03] ( 3, 3 ) double -> mm0 single-def
+; V02 arg2 [V02,T04] ( 3, 3 ) double -> mm2 single-def
 ; V03 arg3 [V03,T00] ( 3, 6 ) byref -> r9 single-def
-; V04 arg4 [V04,T03] ( 1, 2 ) byref -> r11 single-def
-; V05 OutArgs [V05 ] ( 1, 1 ) struct (40) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
+; V04 arg4 [V04,T02] ( 1, 2 ) byref -> r11 single-def
+;# V05 OutArgs [V05 ] ( 1, 1 ) struct ( 0) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;* V06 tmp1 [V06 ] ( 0, 0 ) ubyte -> zero-ref "field V03.hasValue (fldOffset=0x0)" P-INDEP
 ;* V07 tmp2 [V07 ] ( 0, 0 ) double -> zero-ref "field V03.value (fldOffset=0x8)" P-INDEP
 ;* V08 tmp3 [V08 ] ( 0, 0 ) ubyte -> zero-ref "field V04.hasValue (fldOffset=0x0)" P-INDEP
 ;* V09 tmp4 [V09 ] ( 0, 0 ) double -> zero-ref "field V04.value (fldOffset=0x8)" P-INDEP
 ;* V10 tmp5 [V10 ] ( 0, 0 ) struct (16) zero-ref "Promoted implicit byref"
 ;* V11 tmp6 [V11 ] ( 0, 0 ) struct (16) zero-ref "Promoted implicit byref"
-; V12 tmp7 [V12 ] ( 2, 4 ) struct (16) [rsp+0x38] do-not-enreg[XS] addr-exposed "by-value struct argument"
-; V13 tmp8 [V13 ] ( 2, 4 ) struct (16) [rsp+0x28] do-not-enreg[XS] addr-exposed "by-value struct argument"
-; V14 tmp9 [V14,T02] ( 2, 4 ) ref -> rcx single-def "argument with side effect"
-; V15 tmp10 [V15,T06] ( 2, 4 ) double -> mm2 "argument with side effect"
-; V16 tmp11 [V16,T07] ( 2, 4 ) double -> mm0 "argument with side effect"
 ;
-; Lcl frame size = 72
+; Lcl frame size = 0

 G_M13633_IG01: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
-       sub      rsp, 72
        vzeroupper
-       mov      r11, bword ptr [rsp+0x70]
+       vmovaps  xmm0, xmm1
+       mov      r11, bword ptr [rsp+0x28]
        ; byrRegs +[r11]
        ;; size=12 bbWeight=1 PerfScore 2.25
 G_M13633_IG02: ; bbWeight=1, gcrefRegs=0002 {rcx}, byrefRegs=0A00 {r9 r11}, byref
        ; gcrRegs +[rcx]
        ; byrRegs +[r9]
+       nop
+       ;; size=1 bbWeight=1 PerfScore 0.25
+G_M13633_IG03: ; bbWeight=1, nogc, extend
+       mov      bword ptr [rsp+0x28], r11
        mov      rcx, gword ptr [rcx+0x08]
-       vmovaps  xmm0, xmm1
-       vmovups  xmm1, xmmword ptr [r9]
-       vmovups  xmmword ptr [rsp+0x38], xmm1
-       vmovups  xmm1, xmmword ptr [r11]
-       vmovups  xmmword ptr [rsp+0x28], xmm1
-       lea      r9, [rsp+0x28]
-       ; byrRegs -[r9]
-       mov      qword ptr [rsp+0x20], r9
        vmovaps  xmm1, xmm2
        vmovaps  xmm2, xmm0
-       lea      r9, [rsp+0x38]
        mov      r11, 0xD1FFAB1E ; code for
        ; byrRegs -[r11]
-       call     [r11]
-       ; gcrRegs -[rcx]
-       ; gcr arg pop 0
-       nop
-       ;; size=67 bbWeight=1 PerfScore 18.25
-G_M13633_IG03: ; bbWeight=1, epilog, nogc, extend
-       add      rsp, 72
-       ret
-       ;; size=5 bbWeight=1 PerfScore 1.25
+       cmp      dword ptr [rcx], ecx
+       ;; size=29 bbWeight=1 PerfScore 6.75
+G_M13633_IG04: ; bbWeight=1, epilog, nogc, extend
+       tail.jmp [r11]
+       ;; size=3 bbWeight=1 PerfScore 2.00

-; Total bytes of code 84, prolog size 7, PerfScore 30.15, instruction count 19, allocated bytes for code 84 (MethodHash=0b04cabe) for method Microsoft.SqlServer.Types.CoordinateReversingGeoDataSink:AddLine(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
+; Total bytes of code 45, prolog size 12, PerfScore 15.75, instruction count 11, allocated bytes for code 45 (MethodHash=0b04cabe) for method Microsoft.SqlServer.Types.CoordinateReversingGeoDataSink:AddLine(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
 ; ============================================================

 Unwind Info:
@@ -68,9 +54,8 @@ Unwind Info:
   >> End offset : 0xd1ffab1e (not in unwind data)
   Version : 1
   Flags : 0x00
-  SizeOfProlog : 0x04
-  CountOfUnwindCodes: 1
+  SizeOfProlog : 0x00
+  CountOfUnwindCodes: 0
   FrameRegister : none (0)
   FrameOffset : N/A (no FrameRegister) (Value=0)
   UnwindCodes :
-    CodeOffset: 0x04 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 8 * 8 + 8 = 72 = 0x48
```

-39 (-46.43%) : 56.dasm - Microsoft.SqlServer.Types.CoordinateReversingGeoDataSink:BeginFigure(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)

```diff
@@ -3,64 +3,50 @@
 ; FullOpts code
 ; optimized code
 ; rsp based frame
-; partially interruptible
+; fully interruptible
 ; No matching PGO data
 ; Final local variable assignments
 ;
 ; V00 this [V00,T01] ( 3, 3 ) ref -> rcx this class-hnd single-def
-; V01 arg1 [V01,T04] ( 3, 3 ) double -> mm1 single-def
-; V02 arg2 [V02,T05] ( 3, 3 ) double -> mm2 single-def
+; V01 arg1 [V01,T03] ( 3, 3 ) double -> mm0 single-def
+; V02 arg2 [V02,T04] ( 3, 3 ) double -> mm2 single-def
 ; V03 arg3 [V03,T00] ( 3, 6 ) byref -> r9 single-def
-; V04 arg4 [V04,T03] ( 1, 2 ) byref -> r11 single-def
-; V05 OutArgs [V05 ] ( 1, 1 ) struct (40) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
+; V04 arg4 [V04,T02] ( 1, 2 ) byref -> r11 single-def
+;# V05 OutArgs [V05 ] ( 1, 1 ) struct ( 0) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;* V06 tmp1 [V06 ] ( 0, 0 ) ubyte -> zero-ref "field V03.hasValue (fldOffset=0x0)" P-INDEP
 ;* V07 tmp2 [V07 ] ( 0, 0 ) double -> zero-ref "field V03.value (fldOffset=0x8)" P-INDEP
 ;* V08 tmp3 [V08 ] ( 0, 0 ) ubyte -> zero-ref "field V04.hasValue (fldOffset=0x0)" P-INDEP
 ;* V09 tmp4 [V09 ] ( 0, 0 ) double -> zero-ref "field V04.value (fldOffset=0x8)" P-INDEP
 ;* V10 tmp5 [V10 ] ( 0, 0 ) struct (16) zero-ref "Promoted implicit byref"
 ;* V11 tmp6 [V11 ] ( 0, 0 ) struct (16) zero-ref "Promoted implicit byref"
-; V12 tmp7 [V12 ] ( 2, 4 ) struct (16) [rsp+0x38] do-not-enreg[XS] addr-exposed "by-value struct argument"
-; V13 tmp8 [V13 ] ( 2, 4 ) struct (16) [rsp+0x28] do-not-enreg[XS] addr-exposed "by-value struct argument"
-; V14 tmp9 [V14,T02] ( 2, 4 ) ref -> rcx single-def "argument with side effect"
-; V15 tmp10 [V15,T06] ( 2, 4 ) double -> mm2 "argument with side effect"
-; V16 tmp11 [V16,T07] ( 2, 4 ) double -> mm0 "argument with side effect"
 ;
-; Lcl frame size = 72
+; Lcl frame size = 0

 G_M41379_IG01: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
-       sub      rsp, 72
        vzeroupper
-       mov      r11, bword ptr [rsp+0x70]
+       vmovaps  xmm0, xmm1
+       mov      r11, bword ptr [rsp+0x28]
        ; byrRegs +[r11]
        ;; size=12 bbWeight=1 PerfScore 2.25
 G_M41379_IG02: ; bbWeight=1, gcrefRegs=0002 {rcx}, byrefRegs=0A00 {r9 r11}, byref
        ; gcrRegs +[rcx]
        ; byrRegs +[r9]
+       nop
+       ;; size=1 bbWeight=1 PerfScore 0.25
+G_M41379_IG03: ; bbWeight=1, nogc, extend
+       mov      bword ptr [rsp+0x28], r11
        mov      rcx, gword ptr [rcx+0x08]
-       vmovaps  xmm0, xmm1
-       vmovups  xmm1, xmmword ptr [r9]
-       vmovups  xmmword ptr [rsp+0x38], xmm1
-       vmovups  xmm1, xmmword ptr [r11]
-       vmovups  xmmword ptr [rsp+0x28], xmm1
-       lea      r9, [rsp+0x28]
-       ; byrRegs -[r9]
-       mov      qword ptr [rsp+0x20], r9
        vmovaps  xmm1, xmm2
        vmovaps  xmm2, xmm0
-       lea      r9, [rsp+0x38]
        mov      r11, 0xD1FFAB1E ; code for
        ; byrRegs -[r11]
-       call     [r11]
-       ; gcrRegs -[rcx]
-       ; gcr arg pop 0
-       nop
-       ;; size=67 bbWeight=1 PerfScore 18.25
-G_M41379_IG03: ; bbWeight=1, epilog, nogc, extend
-       add      rsp, 72
-       ret
-       ;; size=5 bbWeight=1 PerfScore 1.25
+       cmp      dword ptr [rcx], ecx
+       ;; size=29 bbWeight=1 PerfScore 6.75
+G_M41379_IG04: ; bbWeight=1, epilog, nogc, extend
+       tail.jmp [r11]
+       ;; size=3 bbWeight=1 PerfScore 2.00

-; Total bytes of code 84, prolog size 7, PerfScore 30.15, instruction count 19, allocated bytes for code 84 (MethodHash=7d155e5c) for method Microsoft.SqlServer.Types.CoordinateReversingGeoDataSink:BeginFigure(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
+; Total bytes of code 45, prolog size 12, PerfScore 15.75, instruction count 11, allocated bytes for code 45 (MethodHash=7d155e5c) for method Microsoft.SqlServer.Types.CoordinateReversingGeoDataSink:BeginFigure(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
 ; ============================================================

 Unwind Info:
@@ -68,9 +54,8 @@ Unwind Info:
   >> End offset : 0xd1ffab1e (not in unwind data)
   Version : 1
   Flags : 0x00
-  SizeOfProlog : 0x04
-  CountOfUnwindCodes: 1
+  SizeOfProlog : 0x00
+  CountOfUnwindCodes: 0
   FrameRegister : none (0)
   FrameOffset : N/A (no FrameRegister) (Value=0)
   UnwindCodes :
-    CodeOffset: 0x04 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 8 * 8 + 8 = 72 = 0x48
```

-21 (-21.88%) : 145.dasm - System.ConsolePal+WindowsConsoleStream:Write(System.ReadOnlySpan`1[ubyte]):this (FullOpts)

```diff
@@ -7,51 +7,40 @@
 ; No matching PGO data
 ; Final local variable assignments
 ;
-; V00 this [V00,T01] ( 4, 4 ) ref -> r8 this class-hnd single-def
+; V00 this [V00,T01] ( 4, 4 ) ref -> rcx this class-hnd single-def
 ; V01 arg1 [V01,T00] ( 3, 6 ) byref -> rdx single-def
-; V02 loc0 [V02,T03] ( 3, 2 ) int -> rbx
+; V02 loc0 [V02,T02] ( 3, 2 ) int -> rbx
 ; V03 OutArgs [V03 ] ( 1, 1 ) struct (32) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;* V04 tmp1 [V04 ] ( 0, 0 ) byref -> zero-ref "field V01._reference (fldOffset=0x0)" P-INDEP
 ;* V05 tmp2 [V05 ] ( 0, 0 ) int -> zero-ref "field V01._length (fldOffset=0x8)" P-INDEP
 ;* V06 tmp3 [V06 ] ( 0, 0 ) struct (16) zero-ref "Promoted implicit byref"
-; V07 tmp4 [V07 ] ( 2, 4 ) struct (16) [rsp+0x20] do-not-enreg[XS] addr-exposed "by-value struct argument"
-; V08 tmp5 [V08,T02] ( 2, 4 ) long -> rcx "argument with side effect"
-; V09 tmp6 [V09,T04] ( 2, 0 ) ref -> rdx single-def "argument with side effect"
+; V07 tmp4 [V07,T03] ( 2, 0 ) ref -> rdx single-def "argument with side effect"
 ;
-; Lcl frame size = 48
+; Lcl frame size = 32

 G_M34519_IG01: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
        push     rbx
-       sub      rsp, 48
-       vzeroupper
-       mov      r8, rcx
-       ; gcrRegs +[r8]
-       ;; size=11 bbWeight=1 PerfScore 2.50
-G_M34519_IG02: ; bbWeight=1, gcrefRegs=0100 {r8}, byrefRegs=0004 {rdx}, byref
+       sub      rsp, 32
+       ;; size=5 bbWeight=1 PerfScore 1.25
+G_M34519_IG02: ; bbWeight=1, gcrefRegs=0002 {rcx}, byrefRegs=0004 {rdx}, byref, isz
+       ; gcrRegs +[rcx]
        ; byrRegs +[rdx]
-       mov      rcx, qword ptr [r8+0x18]
-       ;; size=4 bbWeight=1 PerfScore 2.00
-G_M34519_IG03: ; bbWeight=1, nogc, extend
-       vmovdqu  xmm0, xmmword ptr [rdx]
-       vmovdqu  xmmword ptr [rsp+0x20], xmm0
-       ;; size=10 bbWeight=1 PerfScore 5.00
-G_M34519_IG04: ; bbWeight=1, isz, extend
-       lea      rdx, [rsp+0x20]
-       ; byrRegs -[rdx]
-       movzx    r8, byte ptr [r8+0x13]
-       ; gcrRegs -[r8]
+       movzx    r8, byte ptr [rcx+0x13]
+       mov      rcx, qword ptr [rcx+0x18]
+       ; gcrRegs -[rcx]
        call     [System.ConsolePal+WindowsConsoleStream:WriteFileNative(long,System.ReadOnlySpan`1[ubyte],ubyte):int]
+       ; byrRegs -[rdx]
        ; gcr arg pop 0
        mov      ebx, eax
        test     ebx, ebx
-       jne      SHORT G_M34519_IG06
-       ;; size=22 bbWeight=1 PerfScore 7.00
-G_M34519_IG05: ; bbWeight=1, epilog, nogc, extend
-       add      rsp, 48
+       jne      SHORT G_M34519_IG04
+       ;; size=21 bbWeight=1 PerfScore 8.50
+G_M34519_IG03: ; bbWeight=1, epilog, nogc, extend
+       add      rsp, 32
        pop      rbx
        ret
        ;; size=6 bbWeight=1 PerfScore 1.75
-G_M34519_IG06: ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
+G_M34519_IG04: ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
        mov      ecx, 181
        mov      rdx, 0xD1FFAB1E
        call     CORINFO_HELP_STRCNS
@@ -73,7 +62,7 @@ G_M34519_IG06: ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=0000 {
        int3
        ;; size=43 bbWeight=0 PerfScore 0.00

-; Total bytes of code 96, prolog size 8, PerfScore 27.85, instruction count 26, allocated bytes for code 96 (MethodHash=2ba77928) for method System.ConsolePal+WindowsConsoleStream:Write(System.ReadOnlySpan`1[ubyte]):this (FullOpts)
+; Total bytes of code 75, prolog size 5, PerfScore 19.00, instruction count 21, allocated bytes for code 75 (MethodHash=2ba77928) for method System.ConsolePal+WindowsConsoleStream:Write(System.ReadOnlySpan`1[ubyte]):this (FullOpts)
 ; ============================================================

 Unwind Info:
@@ -86,5 +75,5 @@ Unwind Info:
   FrameRegister : none (0)
   FrameOffset : N/A (no FrameRegister) (Value=0)
   UnwindCodes :
-    CodeOffset: 0x05 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 5 * 8 + 8 = 48 = 0x30
+    CodeOffset: 0x05 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 3 * 8 + 8 = 32 = 0x20
     CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rbx (3)
```

+0 (0.00%) : 46.dasm - Microsoft.SqlServer.Types.Validator:BeginGeo(ubyte):this (FullOpts)

```diff
@@ -9,22 +9,22 @@
 ;
 ; V00 this [V00,T00] ( 19, 17 ) ref -> rbx this class-hnd single-def
 ; V01 arg1 [V01,T02] ( 5, 3 ) ubyte -> rsi single-def
-; V02 loc0 [V02,T04] ( 4, 0 ) ref -> r14 class-hnd exact single-def <>
-; V03 loc1 [V03,T05] ( 4, 0 ) ref -> r14 class-hnd exact single-def <>
-; V04 loc2 [V04,T03] ( 2, 2 ) ubyte -> rcx single-def
+; V02 loc0 [V02,T03] ( 4, 0 ) ref -> r14 class-hnd exact single-def <>
+; V03 loc1 [V03,T04] ( 4, 0 ) ref -> r14 class-hnd exact single-def <>
+;* V04 loc2 [V04 ] ( 0, 0 ) ubyte -> zero-ref
 ; V05 OutArgs [V05 ] ( 1, 1 ) struct (32) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
-; V06 tmp1 [V06,T10] ( 2, 0 ) ref -> rdi class-hnd single-def "non-inline candidate call" <>
-; V07 tmp2 [V07,T11] ( 2, 0 ) ref -> rbp class-hnd exact single-def "impAppendStmt" <>
-; V08 tmp3 [V08,T12] ( 2, 0 ) ref -> rdx class-hnd exact single-def "Strict ordering of exceptions for Array store" <>
-; V09 tmp4 [V09,T06] ( 3, 0 ) ref -> rax class-hnd exact single-def "Single-def Box Helper" <>
-; V10 tmp5 [V10,T07] ( 3, 0 ) ref -> rbx class-hnd exact single-def "NewObj constructor temp" <>
-; V11 tmp6 [V11,T13] ( 2, 0 ) ref -> rdi class-hnd single-def "non-inline candidate call" <>
-; V12 tmp7 [V12,T14] ( 2, 0 ) ref -> rbp class-hnd exact single-def "impAppendStmt" <>
-; V13 tmp8 [V13,T15] ( 2, 0 ) ref -> rdx class-hnd exact single-def "Strict ordering of exceptions for Array store" <>
-; V14 tmp9 [V14,T08] ( 3, 0 ) ref -> rax class-hnd exact single-def "Single-def Box Helper" <>
-; V15 tmp10 [V15,T09] ( 3, 0 ) ref -> rbx class-hnd exact single-def "NewObj constructor temp" <>
-; V16 tmp11 [V16,T16] ( 2, 0 ) ref -> rdx single-def "argument with side effect"
-; V17 tmp12 [V17,T17] ( 2, 0 ) ref -> rdx single-def "argument with side effect"
+; V06 tmp1 [V06,T09] ( 2, 0 ) ref -> rdi class-hnd single-def "non-inline candidate call" <>
+; V07 tmp2 [V07,T10] ( 2, 0 ) ref -> rbp class-hnd exact single-def "impAppendStmt" <>
+; V08 tmp3 [V08,T11] ( 2, 0 ) ref -> rdx class-hnd exact single-def "Strict ordering of exceptions for Array store" <>
+; V09 tmp4 [V09,T05] ( 3, 0 ) ref -> rax class-hnd exact single-def "Single-def Box Helper" <>
+; V10 tmp5 [V10,T06] ( 3, 0 ) ref -> rbx class-hnd exact single-def "NewObj constructor temp" <>
+; V11 tmp6 [V11,T12] ( 2, 0 ) ref -> rdi class-hnd single-def "non-inline candidate call" <>
+; V12 tmp7 [V12,T13] ( 2, 0 ) ref -> rbp class-hnd exact single-def "impAppendStmt" <>
+; V13 tmp8 [V13,T14] ( 2, 0 ) ref -> rdx class-hnd exact single-def "Strict ordering of exceptions for Array store" <>
+; V14 tmp9 [V14,T07] ( 3, 0 ) ref -> rax class-hnd exact single-def "Single-def Box Helper" <>
+; V15 tmp10 [V15,T08] ( 3, 0 ) ref -> rbx class-hnd exact single-def "NewObj constructor temp" <>
+; V16 tmp11 [V16,T15] ( 2, 0 ) ref -> rdx single-def "argument with side effect"
+; V17 tmp12 [V17,T16] ( 2, 0 ) ref -> rdx single-def "argument with side effect"
 ; V18 rat0 [V18,T01] ( 3, 6 ) int -> rcx "ReplaceWithLclVar is creating a new local variable"
 ;
 ; Lcl frame size = 32
```

+0 (0.00%) : 55.dasm - Microsoft.SqlServer.Types.WellKnownTextReader:RecognizeOptionalDouble():System.Nullable`1[double]:this (FullOpts)

```diff
@@ -13,11 +13,11 @@
 ;* V02 loc0 [V02 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op
 ; V03 OutArgs [V03 ] ( 1, 1 ) struct (32) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;* V04 tmp1 [V04 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op "NewObj constructor temp"
-; V05 tmp2 [V05,T03] ( 2, 2 ) double -> mm0 "Inlining Arg"
+;* V05 tmp2 [V05 ] ( 0, 0 ) double -> zero-ref "Inlining Arg"
 ;* V06 tmp3 [V06 ] ( 0, 0 ) ubyte -> zero-ref single-def "field V02.hasValue (fldOffset=0x0)" P-INDEP
 ;* V07 tmp4 [V07 ] ( 0, 0 ) double -> zero-ref single-def "field V02.value (fldOffset=0x8)" P-INDEP
-;* V08 tmp5 [V08,T02] ( 0, 0 ) ubyte -> zero-ref "field V04.hasValue (fldOffset=0x0)" P-INDEP
-; V09 tmp6 [V09,T04] ( 2, 1 ) double -> mm0 "field V04.value (fldOffset=0x8)" P-INDEP
+;* V08 tmp5 [V08,T02] ( 0, 0 ) ubyte -> zero-ref single-def "field V04.hasValue (fldOffset=0x0)" P-INDEP
+; V09 tmp6 [V09,T03] ( 2, 1 ) double -> mm0 single-def "field V04.value (fldOffset=0x8)" P-INDEP
 ;
 ; Lcl frame size = 40
```

+0 (0.00%) : 27.dasm - Microsoft.SqlServer.Types.Validator:Execute(int):this (FullOpts)

```diff
@@ -25,27 +25,27 @@
 ;* V14 loc12 [V14 ] ( 0, 0 ) int -> zero-ref
 ;* V15 loc13 [V15 ] ( 0, 0 ) int -> zero-ref
 ;* V16 loc14 [V16 ] ( 0, 0 ) int -> zero-ref
-;* V17 loc15 [V17 ] ( 0, 0 ) int -> zero-ref
-;* V18 loc16 [V18 ] ( 0, 0 ) int -> zero-ref
-;* V19 loc17 [V19 ] ( 0, 0 ) int -> zero-ref
-;* V20 loc18 [V20 ] ( 0, 0 ) int -> zero-ref
-;* V21 loc19 [V21 ] ( 0, 0 ) int -> zero-ref
-;* V22 loc20 [V22 ] ( 0, 0 ) int -> zero-ref
-;* V23 loc21 [V23 ] ( 0, 0 ) int -> zero-ref
-;* V24 loc22 [V24 ] ( 0, 0 ) int -> zero-ref
-;* V25 loc23 [V25 ] ( 0, 0 ) int -> zero-ref
-;* V26 loc24 [V26 ] ( 0, 0 ) int -> zero-ref
-;* V27 loc25 [V27 ] ( 0, 0 ) int -> zero-ref
-;* V28 loc26 [V28 ] ( 0, 0 ) int -> zero-ref
-;* V29 loc27 [V29 ] ( 0, 0 ) int -> zero-ref
-;* V30 loc28 [V30 ] ( 0, 0 ) int -> zero-ref
-;* V31 loc29 [V31 ] ( 0, 0 ) int -> zero-ref
-;* V32 loc30 [V32 ] ( 0, 0 ) int -> zero-ref
-;* V33 loc31 [V33 ] ( 0, 0 ) int -> zero-ref
-;* V34 loc32 [V34 ] ( 0, 0 ) int -> zero-ref
-;* V35 loc33 [V35 ] ( 0, 0 ) int -> zero-ref
-;* V36 loc34 [V36 ] ( 0, 0 ) int -> zero-ref
-;* V37 loc35 [V37 ] ( 0, 0 ) int -> zero-ref
+;* V17 loc15 [V17 ] ( 0, 0 ) int -> zero-ref single-def
+;* V18 loc16 [V18 ] ( 0, 0 ) int -> zero-ref single-def
+;* V19 loc17 [V19 ] ( 0, 0 ) int -> zero-ref single-def
+;* V20 loc18 [V20 ] ( 0, 0 ) int -> zero-ref single-def
+;* V21 loc19 [V21 ] ( 0, 0 ) int -> zero-ref single-def
+;* V22 loc20 [V22 ] ( 0, 0 ) int -> zero-ref single-def
+;* V23 loc21 [V23 ] ( 0, 0 ) int -> zero-ref single-def
+;* V24 loc22 [V24 ] ( 0, 0 ) int -> zero-ref single-def
+;* V25 loc23 [V25 ] ( 0, 0 ) int -> zero-ref single-def
+;* V26 loc24 [V26 ] ( 0, 0 ) int -> zero-ref single-def
+;* V27 loc25 [V27 ] ( 0, 0 ) int -> zero-ref single-def
+;* V28 loc26 [V28 ] ( 0, 0 ) int -> zero-ref single-def
+;* V29 loc27 [V29 ] ( 0, 0 ) int -> zero-ref single-def
+;* V30 loc28 [V30 ] ( 0, 0 ) int -> zero-ref single-def
+;* V31 loc29 [V31 ] ( 0, 0 ) int -> zero-ref single-def
+;* V32 loc30 [V32 ] ( 0, 0 ) int -> zero-ref single-def
+;* V33 loc31 [V33 ] ( 0, 0 ) int -> zero-ref single-def
+;* V34 loc32 [V34 ] ( 0, 0 ) int -> zero-ref single-def
+;* V35 loc33 [V35 ] ( 0, 0 ) int -> zero-ref single-def
+;* V36 loc34 [V36 ] ( 0, 0 ) int -> zero-ref single-def
+;* V37 loc35 [V37 ] ( 0, 0 ) int -> zero-ref single-def
 ; V38 OutArgs [V38 ] ( 1, 1 ) struct (56) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ; V39 tmp1 [V39,T195] ( 2, 0 ) ref -> rdi class-hnd single-def "non-inline candidate call" <>
 ; V40 tmp2 [V40,T196] ( 2, 0 ) ref -> rbp class-hnd exact single-def "impAppendStmt" <>
@@ -67,10 +67,10 @@
 ; V56 tmp18 [V56,T193] ( 3, 0 ) ref -> rsi class-hnd exact "NewObj constructor temp" <>
 ; V57 tmp19 [V57,T205] ( 2, 0 ) ref -> rbx class-hnd "non-inline candidate call" <>
 ; V58 tmp20 [V58,T194] ( 3, 0 ) ref -> rsi class-hnd exact "NewObj constructor temp" <>
-; V59 tmp21 [V59,T07] ( 3, 24 ) int -> rax "Inline return value spill temp"
+;* V59 tmp21 [V59 ] ( 0, 0 ) int -> zero-ref "Inline return value spill temp"
 ; V60 tmp22 [V60,T03] ( 4, 48 ) ref -> rax class-hnd "Inlining Arg"
 ; V61 tmp23 [V61,T08] ( 3, 24 ) int -> rdx "Inline stloc first use temp"
-; V62 tmp24 [V62,T06] ( 3, 24 ) ref -> r8 class-hnd "Inline stloc first use temp" <>
+; V62 tmp24 [V62,T07] ( 3, 24 ) ref -> r8 class-hnd "Inline stloc first use temp" <>
 ; V63 tmp25 [V63,T16] ( 7, 7 ) ref -> rcx class-hnd single-def "Inlining Arg"
 ; V64 tmp26 [V64,T87] ( 4, 2 ) int -> rdx single-def "Inline stloc first use temp"
 ; V65 tmp27 [V65,T105] ( 3, 1.50) ref -> rax class-hnd single-def "Inline stloc first use temp" <>
@@ -185,18 +185,18 @@
 ; V174 tmp136 [V174,T53] ( 7, 7 ) ref -> rcx class-hnd single-def "Inlining Arg"
 ; V175 tmp137 [V175,T84] ( 5, 2.50) int -> rdx single-def "Inline stloc first use temp"
 ; V176 tmp138 [V176,T142] ( 3, 1.50) ref -> rax class-hnd single-def "Inline stloc first use temp" <>
-; V177 tmp139 [V177,T05] ( 7, 28 ) ref -> rcx class-hnd single-def "Inlining Arg"
+; V177 tmp139 [V177,T06] ( 7, 28 ) ref -> rcx class-hnd single-def "Inlining Arg"
 ; V178 tmp140 [V178,T14] ( 5, 10 ) int -> rdx single-def "Inline stloc first use temp"
-; V179 tmp141 [V179,T15] ( 4, 8 ) ref -> rax class-hnd "Inline stloc first use temp" <>
+; V179 tmp141 [V179,T15] ( 4, 8 ) ref -> rax class-hnd single-def "Inline stloc first use temp" <>
 ; V180 tmp142 [V180,T54] ( 7, 7 ) ref -> rcx class-hnd single-def "Inlining Arg"
 ; V181 tmp143 [V181,T85] ( 5, 2.50) int -> rdx single-def "Inline stloc first use temp"
 ; V182 tmp144 [V182,T143] ( 3, 1.50) ref -> rax class-hnd single-def "Inline stloc first use temp" <>
 ; V183 tmp145 [V183,T55] ( 7, 7 ) ref -> rcx class-hnd single-def "Inlining Arg"
 ; V184 tmp146 [V184,T86] ( 5, 2.50) int -> rdx single-def "Inline stloc first use temp"
 ; V185 tmp147 [V185,T144] ( 3, 1.50) ref -> rax class-hnd single-def "Inline stloc first use temp" <>
-; V186 tmp148 [V186,T04] ( 7, 44 ) ref -> rcx class-hnd single-def "Inlining Arg"
+; V186 tmp148 [V186,T05] ( 7, 44 ) ref -> rcx class-hnd single-def "Inlining Arg"
 ; V187 tmp149 [V187,T11] ( 5, 14 ) int -> rdx single-def "Inline stloc first use temp"
-; V188 tmp150 [V188,T12] ( 4, 12 ) ref -> rax class-hnd "Inline stloc first use temp" <>
+; V188 tmp150 [V188,T12] ( 4, 12 ) ref -> rax class-hnd single-def "Inline stloc first use temp" <>
 ; V189 tmp151 [V189,T206] ( 2, 0 ) ref -> rdx single-def "argument with side effect"
 ; V190 tmp152 [V190,T207] ( 2, 0 ) ref -> rdi "argument with side effect"
 ; V191 tmp153 [V191,T208] ( 2, 0 ) ref -> r8 "argument with side effect"
@@ -265,16 +265,17 @@
 ; V254 cse40 [V254,T168] ( 3, 1.50) int -> r8 "CSE - conservative"
 ; V255 cse41 [V255,T02] ( 58, 49 ) ref -> rcx "CSE - aggressive"
 ; V256 cse42 [V256,T10] ( 2, 16 ) int -> r10 "CSE - aggressive"
-; V257 rat0 [V257,T57] ( 3, 3 ) int -> rsi "ReplaceWithLclVar is creating a new local variable"
-; V258 rat1 [V258,T58] ( 3, 3 ) int -> rdi "ReplaceWithLclVar is creating a new local variable"
-; V259 rat2 [V259,T103] ( 2, 2 ) int -> rax "ReplaceWithLclVar is creating a new local variable"
-; V260 rat3 [V260,T59] ( 3, 3 ) int -> rbp "ReplaceWithLclVar is creating a new local variable"
-; V261 rat4 [V261,T60] ( 3, 3 ) int -> r14 "ReplaceWithLclVar is creating a new local variable"
-; V262 rat5 [V262,T61] ( 3, 3 ) int -> r15 "ReplaceWithLclVar is creating a new local variable"
-; V263 rat6 [V263,T62] ( 3, 3 ) int -> r13 "ReplaceWithLclVar is creating a new local variable"
-; V264 rat7 [V264,T104] ( 2, 2 ) int -> rdx "ReplaceWithLclVar is creating a new local variable"
-; V265 rat8 [V265,T13] ( 3, 12 ) int -> r12 "ReplaceWithLclVar is creating a new local variable"
-; V266 rat9 [V266,T09] ( 3, 24 ) int -> rax "ReplaceWithLclVar is creating a new local variable"
+; V257 rat0 [V257,T04] ( 3, 48 ) int -> rax "ReplaceWithLclVar is creating a new local variable"
+; V258 rat1 [V258,T57] ( 3, 3 ) int -> rsi "ReplaceWithLclVar is creating a new local variable"
+; V259 rat2 [V259,T58] ( 3, 3 ) int -> rdi "ReplaceWithLclVar is creating a new local variable"
+; V260 rat3 [V260,T103] ( 2, 2 ) int -> rax "ReplaceWithLclVar is creating a new local variable"
+; V261 rat4 [V261,T59] ( 3, 3 ) int -> rbp "ReplaceWithLclVar is creating a new local variable"
+; V262 rat5 [V262,T60] ( 3, 3 ) int -> r14 "ReplaceWithLclVar is creating a new local variable"
+; V263 rat6 [V263,T61] ( 3, 3 ) int -> r15 "ReplaceWithLclVar is creating a new local variable"
+; V264 rat7 [V264,T62] ( 3, 3 ) int -> r13 "ReplaceWithLclVar is creating a new local variable"
+; V265 rat8 [V265,T104] ( 2, 2 ) int -> rdx "ReplaceWithLclVar is creating a new local variable"
+; V266 rat9 [V266,T13] ( 3, 12 ) int -> r12 "ReplaceWithLclVar is creating a new local variable"
+; V267 rat10 [V267,T09] ( 3, 24 ) int -> rax "ReplaceWithLclVar is creating a new local variable"
 ;
 ; Lcl frame size = 56
```

Details

#### Improvements/regressions per collection

|Collection|Contexts with diffs|Improvements|Regressions|Same size|Improvements (bytes)|Regressions (bytes)|
|---|--:|--:|--:|--:|--:|--:|
|col.mch|49|39|0|10|-1,134|+0|

---

#### Context information

|Collection|Diffed contexts|MinOpts|FullOpts|Missed, base|Missed, diff|
|---|--:|--:|--:|--:|--:|
|col.mch|146|0|146|0 (0.00%)|0 (0.00%)|

---

#### jit-analyze output

col.mch

To reproduce these diffs on Windows x64:

```
superpmi.py asmdiffs -target_os windows -target_arch x64 -arch x64
```

```
Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 74273 (overridden on cmd)
Total bytes of diff: 73139 (overridden on cmd)
Total bytes of delta: -1134 (-1.53 % of base)
    diff is an improvement.
    relative diff is an improvement.
```

Detail diffs

```
Top file improvements (bytes):
        -221 : 112.dasm (-24.13% of base)
        -119 : 110.dasm (-31.48% of base)
         -90 : 114.dasm (-8.25% of base)
         -60 : 75.dasm (-6.62% of base)
         -53 : 101.dasm (-2.94% of base)
         -48 : 102.dasm (-12.94% of base)
         -44 : 3.dasm (-12.79% of base)
         -43 : 59.dasm (-10.97% of base)
         -39 : 56.dasm (-46.43% of base)
         -39 : 72.dasm (-46.43% of base)
         -37 : 100.dasm (-10.72% of base)
         -36 : 65.dasm (-17.31% of base)
         -34 : 2.dasm (-13.82% of base)
         -30 : 124.dasm (-3.72% of base)
         -29 : 60.dasm (-17.58% of base)
         -27 : 73.dasm (-16.17% of base)
         -27 : 57.dasm (-16.17% of base)
         -24 : 74.dasm (-14.91% of base)
         -24 : 58.dasm (-14.55% of base)
         -21 : 145.dasm (-21.88% of base)

36 total files with Code Size differences (36 improved, 0 regressed), 10 unchanged.

Top method improvements (bytes):
        -221 (-24.13% of base) : 112.dasm - Microsoft.SqlServer.Types.GLNativeMethods:GeodeticCombine(int,Microsoft.SqlServer.Types.GeoData,Microsoft.SqlServer.Types.GeoData,double):Microsoft.SqlServer.Types.GeoData (FullOpts)
        -119 (-31.48% of base) : 110.dasm - Microsoft.SqlServer.Types.SqlGeography:STIntersection(Microsoft.SqlServer.Types.SqlGeography):Microsoft.SqlServer.Types.SqlGeography:this (FullOpts)
         -90 (-8.25% of base) : 114.dasm - (dynamicClass):IL_STUB_PInvoke(int,Microsoft.SqlServer.Types.GeoMarshalData,Microsoft.SqlServer.Types.GeoMarshalData,double,Microsoft.SqlServer.Types.GeoDataPinningAllocator):int (FullOpts)
         -60 (-6.62% of base) : 75.dasm - Microsoft.SqlServer.Types.GeoDataBuilder:AddLine(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
         -53 (-2.94% of base) : 101.dasm - Microsoft.SqlServer.Types.GeoDataPinner:Pin(Microsoft.SqlServer.Types.GeoData):Microsoft.SqlServer.Types.GeoMarshalData:this (FullOpts)
         -48 (-12.94% of base) : 102.dasm - (dynamicClass):IL_STUB_PInvoke(Microsoft.SqlServer.Types.GeoMarshalData,double,ubyte,byref,byref):int (FullOpts)
         -44 (-12.79% of base) : 3.dasm - Microsoft.SqlServer.Types.SqlGeography:Parse(System.Data.SqlTypes.SqlString):Microsoft.SqlServer.Types.SqlGeography (FullOpts)
         -43 (-10.97% of base) : 59.dasm - Microsoft.SqlServer.Types.GeographyValidator:ValidatePoint(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
         -39 (-46.43% of base) : 72.dasm - Microsoft.SqlServer.Types.CoordinateReversingGeoDataSink:AddLine(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
         -39 (-46.43% of base) : 56.dasm - Microsoft.SqlServer.Types.CoordinateReversingGeoDataSink:BeginFigure(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
         -37 (-10.72% of base) : 100.dasm - Microsoft.SqlServer.Types.GLNativeMethods:GeodeticIsValid(byref,double,ubyte):ubyte (FullOpts)
         -36 (-17.31% of base) : 65.dasm - Microsoft.SqlServer.Types.GeoDataBuilder:BeginFigure(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
         -34 (-13.82% of base) : 2.dasm - Benchmark:.cctor() (FullOpts)
         -30 (-3.72% of base) : 124.dasm - Microsoft.SqlServer.Types.GeoDataPinner:GetGeoData():Microsoft.SqlServer.Types.GeoData:this (FullOpts)
         -29 (-17.58% of base) : 60.dasm - Microsoft.SqlServer.Types.Validator:ValidatePoint(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
         -27 (-16.17% of base) : 73.dasm - Microsoft.SqlServer.Types.ForwardingGeoDataSink:AddLine(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
         -27 (-16.17% of base) : 57.dasm - Microsoft.SqlServer.Types.ForwardingGeoDataSink:BeginFigure(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
         -24 (-14.91% of base) : 74.dasm - Microsoft.SqlServer.Types.Validator:AddLine(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
         -24 (-14.55% of base) : 58.dasm - Microsoft.SqlServer.Types.Validator:BeginFigure(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
         -21 (-21.88% of base) : 145.dasm - System.ConsolePal+WindowsConsoleStream:Write(System.ReadOnlySpan`1[ubyte]):this (FullOpts)

Top method improvements (percentages):
         -39 (-46.43% of base) : 72.dasm - Microsoft.SqlServer.Types.CoordinateReversingGeoDataSink:AddLine(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
         -39 (-46.43% of base) : 56.dasm - Microsoft.SqlServer.Types.CoordinateReversingGeoDataSink:BeginFigure(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
        -119 (-31.48% of base) : 110.dasm - Microsoft.SqlServer.Types.SqlGeography:STIntersection(Microsoft.SqlServer.Types.SqlGeography):Microsoft.SqlServer.Types.SqlGeography:this (FullOpts)
        -221 (-24.13% of base) : 112.dasm - Microsoft.SqlServer.Types.GLNativeMethods:GeodeticCombine(int,Microsoft.SqlServer.Types.GeoData,Microsoft.SqlServer.Types.GeoData,double):Microsoft.SqlServer.Types.GeoData (FullOpts)
         -21 (-21.88% of base) : 145.dasm - System.ConsolePal+WindowsConsoleStream:Write(System.ReadOnlySpan`1[ubyte]):this (FullOpts)
         -29 (-17.58% of base) : 60.dasm - Microsoft.SqlServer.Types.Validator:ValidatePoint(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
         -36 (-17.31% of base) : 65.dasm - Microsoft.SqlServer.Types.GeoDataBuilder:BeginFigure(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
         -27 (-16.17% of base) : 73.dasm - Microsoft.SqlServer.Types.ForwardingGeoDataSink:AddLine(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
         -27 (-16.17% of base) : 57.dasm - Microsoft.SqlServer.Types.ForwardingGeoDataSink:BeginFigure(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
         -24 (-14.91% of base) : 74.dasm - Microsoft.SqlServer.Types.Validator:AddLine(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
         -24 (-14.55% of base) : 58.dasm - Microsoft.SqlServer.Types.Validator:BeginFigure(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
         -34 (-13.82% of base) : 2.dasm - Benchmark:.cctor() (FullOpts)
         -48 (-12.94% of base) : 102.dasm - (dynamicClass):IL_STUB_PInvoke(Microsoft.SqlServer.Types.GeoMarshalData,double,ubyte,byref,byref):int (FullOpts)
         -44 (-12.79% of base) : 3.dasm - Microsoft.SqlServer.Types.SqlGeography:Parse(System.Data.SqlTypes.SqlString):Microsoft.SqlServer.Types.SqlGeography (FullOpts)
         -43 (-10.97% of base) : 59.dasm - Microsoft.SqlServer.Types.GeographyValidator:ValidatePoint(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
         -37 (-10.72% of base) : 100.dasm - Microsoft.SqlServer.Types.GLNativeMethods:GeodeticIsValid(byref,double,ubyte):ubyte (FullOpts)
          -6 (-9.09% of base) : 96.dasm - Microsoft.SqlServer.Types.GeoData:IsEmpty():ubyte:this (FullOpts)
         -90 (-8.25% of base) : 114.dasm - (dynamicClass):IL_STUB_PInvoke(int,Microsoft.SqlServer.Types.GeoMarshalData,Microsoft.SqlServer.Types.GeoMarshalData,double,Microsoft.SqlServer.Types.GeoDataPinningAllocator):int (FullOpts)
          -8 (-6.90% of base) : 106.dasm - Microsoft.SqlServer.Types.GeoData:ContainsCurvedShapes():ubyte:this (FullOpts)
         -60 (-6.62% of base) : 75.dasm - Microsoft.SqlServer.Types.GeoDataBuilder:AddLine(double,double,System.Nullable`1[double],System.Nullable`1[double]):this (FullOpts)
```

--------------------------------------------------------------------------------

The diffs look good, but I'm not really sure where to focus my investigation to look for potential JCC erratum effects.

jnyrup commented 8 months ago

Added BenchmarkDotNet.Diagnostics.Windows and [EtwProfiler] to the test project and ran:

dotnet build -c Release -f net8.0
.\bin\Release\net8.0\win-x64\STIntersection.exe -p ETW

Analyzing the produced .etl file using InstructionsRetiredExplorer gives:

Mining ETL from C:\dev\stintersection\BenchmarkDotNet.Artifacts\Benchmark.STIntersection-20240114-171746.etl for process dotnet
Found process [22864] dotnet: "C:\Program Files\dotnet\dotnet.exe" exec "C:\Program Files\dotnet\sdk\8.0.101\Roslyn\bincore\VBCSCompiler.dll" "-pipename:McEtIVxF32pD7Svj7l1O9o2MUcz1FGqXcUUCENE92z4"
PMC interval now 10000
Found process [11156] dotnet: "dotnet" bc400ac9-1bf8-410d-aece-32f227398082.dll --anonymousPipes 1332 1348 --benchmarkName Benchmark.STIntersection --job "EnvironmentVariables=DOTNET_GCDynamicAdaptationMode=0,DOTNET_EnableWriteXorExecute=0,DOTNET_TieredPGO=0,DOTNET_TieredCompilation=0,DOTNET_TC_QuickJit=0, Platform=X64, Runtime=.NET 8.0, Server=True, Toolchain=.NET 8.0" --benchmarkId 0

==> benchmark process is [11156]

eh? unknown module ID 140704149553152
eh? unknown module ID 140704149553152
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventSourceAttribute.set_Name(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventSourceAttribute.set_Name(class System.Object,pMT: 00007FF83CFE0BF8<class System.Object>)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Level(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Task(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Opcode(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Level(class System.Object,pMT: 00007FF83CFE0BF8<class System.Object>)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Task(class System.Object,pMT: 00007FF83CFE0BF8<class System.Object>)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Opcode(class System.Object,pMT: 00007FF83CFE0BF8<class System.Object>)
Samples for dotnet: 17977 events for Benchmark Intervals
Jitting           : 00,00% 0        samples 670 methods
  JitInterface    : 00,00% 0        samples
Jit-generated code: 00,57% 6E+05    samples
  Jitted code     : 00,57% 6E+05    samples
  MinOpts code    : 00,00% 0        samples
  FullOpts code   : 00,57% 6E+05    samples
  Tier-0 code     : 00,00% 0        samples
  Tier-1 code     : 00,00% 0        samples
  R2R code        : 00,00% 0        samples

67,84%   7,1E+07     native   SqlServerSpatial140.dll
25,26%   2,644E+07   native   msvcr120.dll
03,15%   3,3E+06     native   ntdll.dll
02,66%   2,78E+06    native   coreclr.dll
00,21%   2,2E+05     native   System.Private.CoreLib.dll
00,08%   8E+04       FullOpt  [Microsoft.SqlServer.Types]GeoDataPinningAllocator.AllocAndPinGeometry(int32,int32,int32,int32)
00,08%   8E+04       native   ntoskrnl.exe
00,07%   7E+04       FullOpt  [Microsoft.SqlServer.Types]GeoDataPinner.Pin(value class Microsoft.SqlServer.Types.GeoData)
00,06%   6E+04       FullOpt  [Perfolizer]dynamicClass.IL_STUB_PInvoke(pMT: 00007FF83D543050,pMT: 00007FF83D542B88,pMT: 00007FF83D542B88,float64,pMT: 00007FF83D5431F0)

Benchmark: found 15 intervals; mean interval 698,169ms

I'll try finding a colleague on Monday with a newer Intel CPU (but otherwise similar setup) and have them re-run the benchmarks.

jnyrup commented 8 months ago

I ran this on a colleague's machine with a Raptor Lake CPU. In case it makes any difference, he's running Windows 11 while I'm on Windows 10. We both have Sophos anti-virus installed, which can be a CPU hog, but judging by Task Manager it seems inactive during the benchmarks.

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3007/23H2/2023Update/SunValley3)
13th Gen Intel Core i9-13900H, 1 CPU, 20 logical and 14 physical cores
.NET SDK 8.0.101
  [Host]     : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX2
  Job-EAAWAC : .NET 7.0.15 (7.0.1523.57226), X64 RyuJIT AVX2
  Job-YZXMWS : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX2

EnvironmentVariables=DOTNET_GCDynamicAdaptationMode=0,DOTNET_EnableWriteXorExecute=0,DOTNET_TieredPGO=0,DOTNET_TieredCompilation=0,DOTNET_TC_QuickJit=0  Platform=X64  Server=True

Method	Runtime	Mean	Error	StdDev	Ratio	RatioSD
STIntersection	.NET 7.0	41.14 us	0.778 us	0.833 us	1.00	0.00
STIntersection	.NET 8.0	219.47 us	2.816 us	2.634 us	5.32	0.12

Mining ETL from C:\dev\Jonas\stintersection\BenchmarkDotNet.Artifacts\Benchmark.STIntersection-.NET 8.0-20240115-102154.etl for process dotnet
Found process [28256] dotnet: "C:\Program Files\dotnet\dotnet.exe" exec "C:\Program Files\dotnet\sdk\8.0.101\Roslyn\bincore\VBCSCompiler.dll" "-pipename:VTsk2TzcFC7Bd1HOtZmxdfIvwWm1YzhZJp3qZT3CiFc"
Found process [29456] dotnet: "C:\Program Files\dotnet\dotnet.exe" exec "C:\Program Files\dotnet\sdk\8.0.101\Roslyn\bincore\VBCSCompiler.dll" "-pipename:Le5PLwrPKmY2ZTI3rRdBpEzf4K6hBP+l7pIZ7kU4ZdA"
PMC interval now 10000
Found process [19268] dotnet: "dotnet" 2891f347-4d17-4526-b7e7-5d550281f902.dll --anonymousPipes 1680 1744 --benchmarkName Benchmark.STIntersection --job "EnvironmentVariables=DOTNET_GCDynamicAdaptationMode=0,DOTNET_EnableWriteXorExecute=0,DOTNET_TieredPGO=0,DOTNET_TieredCompilation=0,DOTNET_TC_QuickJit=0, Platform=X64, Runtime=.NET 8.0, Server=True, Toolchain=.NET 8.0" --benchmarkId 0

==> benchmark process is [19268]

eh? unknown module ID 140714432282624
eh? unknown module ID 140714432282624
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventSourceAttribute.set_Name(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Level(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Task(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Opcode(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Task(class System.Object,pMT: 00007FFAA1E40BF8<class System.Object>)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Opcode(class System.Object,pMT: 00007FFAA1E40BF8<class System.Object>)
Samples for dotnet: 23445 events for Benchmark Intervals
Jitting           : 00,00% 0        samples 670 methods
  JitInterface    : 00,00% 0        samples
Jit-generated code: 00,18% 2,4E+05  samples
  Jitted code     : 00,18% 2,4E+05  samples
  MinOpts code    : 00,00% 0        samples
  FullOpts code   : 00,18% 2,4E+05  samples
  Tier-0 code     : 00,00% 0        samples
  Tier-1 code     : 00,00% 0        samples
  R2R code        : 00,00% 0        samples

49,97%   6,796E+07   native   SqlServerSpatial140.dll
47,84%   6,507E+07   native   msvcr120.dll
01,06%   1,44E+06    native   ntdll.dll
00,61%   8,3E+05     native   coreclr.dll
00,23%   3,1E+05     native   ntoskrnl.exe
00,07%   9E+04       FullOpt  [Microsoft.SqlServer.Types]GeoDataPinningAllocator.AllocAndPinGeometry(int32,int32,int32,int32)
00,05%   7E+04       native   System.Private.CoreLib.dll

Benchmark: found 15 intervals; mean interval 907,085ms

jakobbotsch commented 8 months ago

Since the benchmark spends no time in managed code it seems very unlikely to be a codegen issue, but that's odd since you bisected it to a JIT change above. You can also try comparing the output of InstructionsRetiredExplorer for the .NET 7 and .NET 8 versions to see if there's any difference in where the benchmark is spending time.

Can you run the micro benchmark under a profiler in the base and diff to see exactly where the difference is coming from? For example, you can look at the two .etl files in PerfView, or you can try running the simplified benchmark under VTune.

jnyrup commented 8 months ago

InstructionsRetiredExplorer on the ETL files produced by BDN

.NET 7.0.15

```
Mining ETL from .\BenchmarkDotNet.Artifacts\Benchmark.STIntersection-.NET 7.0-20240115-115914.etl for process dotnet
Found process [19052] dotnet: "C:\Program Files\dotnet\dotnet.exe" run -c Release -f net8.0
Found process [8748] dotnet: "C:\Program Files\dotnet\dotnet.exe" exec "C:\Program Files\dotnet\sdk\8.0.101\Roslyn\bincore\VBCSCompiler.dll" "-pipename:McEtIVxF32pD7Svj7l1O9o2MUcz1FGqXcUUCENE92z4"
PMC interval now 10000
Found process [22168] dotnet: "dotnet" 710d23d8-a1d5-4e13-b757-f784fa658813.dll --anonymousPipes 1772 1796 --benchmarkName Benchmark.STIntersection --job "EnvironmentVariables=DOTNET_GCDynamicAdaptationMode=0,DOTNET_EnableWriteXorExecute=0,DOTNET_TieredPGO=0,DOTNET_TieredCompilation=0,DOTNET_TC_QuickJit=0, Platform=X64, Runtime=.NET 7.0, Server=True, Toolchain=.NET 7.0" --benchmarkId 0

==> benchmark process is [22168]

eh? unknown module ID 140703154716672
eh? unknown module ID 140703154716672
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventSourceAttribute.set_Name(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Level(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Task(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Opcode(class System.Object,class System.Object,int*)
Samples for dotnet: 15794 events for Benchmark Intervals
Jitting           : 00,00% 0        samples 673 methods
  JitInterface    : 00,00% 0        samples
Jit-generated code: 00,90% 8E+05    samples
  Jitted code     : 00,90% 8E+05    samples
  MinOpts code    : 00,00% 0        samples
  FullOpts code   : 00,90% 8E+05    samples
  Tier-0 code     : 00,00% 0        samples
  Tier-1 code     : 00,00% 0        samples
  R2R code        : 00,00% 0        samples

00,46%   4,1E+05     ?        Unknown
76,03%   6,744E+07   native   SqlServerSpatial140.dll
10,89%   9,66E+06    native   msvcr120.dll
07,93%   7,03E+06    native   ntdll.dll
03,08%   2,73E+06    native   coreclr.dll
00,34%   3E+05       native   System.Private.CoreLib.dll
00,29%   2,6E+05     native   ntoskrnl.exe
00,10%   9E+04       FullOpt  [Perfolizer]dynamicClass.IL_STUB_PInvoke(pMT: 00007FF802058BE8,pMT: 00007FF8020584A8,pMT: 00007FF8020584A8,float64,pMT: 00007FF802058E38)
00,10%   9E+04       FullOpt  [Perfolizer]dynamicClass.IL_STUB_StructMarshal(int8&,int8*,int32,pMT: 00007FF8020595F0&)
00,10%   9E+04       FullOpt  [Microsoft.SqlServer.Types]GeoDataPinningAllocator.AllocAndPinGeometry(int32,int32,int32,int32)
00,07%   6E+04       FullOpt  [Microsoft.SqlServer.Types]SqlGeography.set_Srid(int32)
00,07%   6E+04       FullOpt  [Microsoft.SqlServer.Types]GeoDataPinner.Pin(value class Microsoft.SqlServer.Types.GeoData)
00,06%   5E+04       FullOpt  [Microsoft.SqlServer.Types]SqlGeography.STIntersection(class Microsoft.SqlServer.Types.SqlGeography)
00,06%   5E+04       FullOpt  [Microsoft.SqlServer.Types]GLNativeMethods.GeodeticCombine(value class CombineMode,value class Microsoft.SqlServer.Types.GeoData,value class Microsoft.SqlServer.Types.GeoData,float64)

Benchmark: found 15 intervals; mean interval 591,976ms
```

.NET 8.0.1

```
Mining ETL from .\BenchmarkDotNet.Artifacts\Benchmark.STIntersection-.NET 8.0-20240115-115914.etl for process dotnet
Found process [19052] dotnet: "C:\Program Files\dotnet\dotnet.exe" run -c Release -f net8.0
Found process [8748] dotnet: "C:\Program Files\dotnet\dotnet.exe" exec "C:\Program Files\dotnet\sdk\8.0.101\Roslyn\bincore\VBCSCompiler.dll" "-pipename:McEtIVxF32pD7Svj7l1O9o2MUcz1FGqXcUUCENE92z4"
PMC interval now 10000
Found process [23900] dotnet: "dotnet" aa0a0509-1850-433a-8a9a-d0b47e200d9c.dll --anonymousPipes 1904 1920 --benchmarkName Benchmark.STIntersection --job "EnvironmentVariables=DOTNET_GCDynamicAdaptationMode=0,DOTNET_EnableWriteXorExecute=0,DOTNET_TieredPGO=0,DOTNET_TieredCompilation=0,DOTNET_TC_QuickJit=0, Platform=X64, Runtime=.NET 8.0, Server=True, Toolchain=.NET 8.0" --benchmarkId 0

==> benchmark process is [23900]

eh? unknown module ID 140704652017664
eh? unknown module ID 140704652017664
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventSourceAttribute.set_Name(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventSourceAttribute.set_Name(class System.Object,pMT: 00007FF85AF10BF8)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Level(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Task(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Opcode(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Level(class System.Object,pMT: 00007FF85AF10BF8)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Task(class System.Object,pMT: 00007FF85AF10BF8)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Opcode(class System.Object,pMT: 00007FF85AF10BF8)
Samples for dotnet: 19682 events for Benchmark Intervals
Jitting           : 00,00% 0        samples 670 methods
  JitInterface    : 00,00% 0        samples
Jit-generated code: 00,56% 5,8E+05  samples
  Jitted code     : 00,56% 5,8E+05  samples
  MinOpts code    : 00,00% 0        samples
  FullOpts code   : 00,56% 5,8E+05  samples
  Tier-0 code     : 00,00% 0        samples
  Tier-1 code     : 00,00% 0        samples
  R2R code        : 00,00% 0        samples

00,27%   2,8E+05     ?        Unknown
67,67%   7,04E+07    native   SqlServerSpatial140.dll
25,02%   2,603E+07   native   msvcr120.dll
03,35%   3,48E+06    native   ntdll.dll
02,59%   2,69E+06    native   coreclr.dll
00,25%   2,6E+05     native   ntoskrnl.exe
00,22%   2,3E+05     native   System.Private.CoreLib.dll
00,09%   9E+04       FullOpt  [Perfolizer]dynamicClass.IL_STUB_StructMarshal(int8&,int8*,int32,pMT: 00007FF85B473930&)
00,07%   7E+04       FullOpt  [Microsoft.SqlServer.Types]GeoDataPinningAllocator.AllocAndPinGeometry(int32,int32,int32,int32)
00,06%   6E+04       FullOpt  [Microsoft.SqlServer.Types]GeoDataPinner.Pin(value class Microsoft.SqlServer.Types.GeoData)

Benchmark: found 15 intervals; mean interval 694,246ms
```
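To make the shift between the two listings easier to see, here is a minimal Python sketch that diffs the per-module sample counts. The numbers are copied from the two outputs above; the helper names (`parse_count`, `deltas`) are made up for illustration and are not part of InstructionsRetiredExplorer.

```python
# Sample counts per native module, copied from the two listings above.
# The tool prints decimal commas (e.g. "6,744E+07"), so they are kept as text.
NET7 = {
    "SqlServerSpatial140.dll": "6,744E+07",
    "msvcr120.dll": "9,66E+06",
    "ntdll.dll": "7,03E+06",
    "coreclr.dll": "2,73E+06",
}
NET8 = {
    "SqlServerSpatial140.dll": "7,04E+07",
    "msvcr120.dll": "2,603E+07",
    "ntdll.dll": "3,48E+06",
    "coreclr.dll": "2,69E+06",
}


def parse_count(text: str) -> float:
    """Parse a count printed with a decimal comma, e.g. '6,744E+07'."""
    return float(text.replace(",", "."))


def deltas(base: dict, diff: dict) -> dict:
    """Return diff minus base per module, for modules present in both."""
    return {m: parse_count(diff[m]) - parse_count(base[m]) for m in base if m in diff}


changes = deltas(NET7, NET8)

# Print the modules ordered by how much their sample count grew.
for module, delta in sorted(changes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{module:26s} {delta:+.3e}")

# Module whose sample count grew the most between the two runs.
largest = max(changes, key=changes.get)
```

On these numbers the largest increase is in msvcr120.dll, while the counts for coreclr.dll barely move, which fits the observation that the extra time is spent in native code.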

I tried running the simplified benchmarks under VTune using the corerun built from the two commits. Here's the overview of comparing the "microarchitecture exploration" runs (which the tool suggested).

net7.zip net8.zip

jakobbotsch commented 8 months ago

VTune/PerfView should be able to tell you in more detail which methods are taking up the time of the benchmark (you may need the .pdb for the native library). If you can reproduce a 5x difference in performance it should hopefully be very apparent where the difference is coming from.

jnyrup commented 8 months ago

I don't have access to a .pdb file for the native library, so I tried another approach where I decompiled the managed assembly using ILSpy so that I could fiddle with the components used in GLNativeMethods.GeodeticCombine, which handles calling the native code. By changing whether GeoDataPinningAllocator is a class or a struct and whether it has a bool _disposed field for safe double-disposal, I got different results.

stintersection.zip

Benchmarking 7.0.15 vs 8.0.1 using BDN

base:

Method	Runtime	Mean	Error	StdDev	Ratio	Code Size
STIntersection	.NET 7.0	69.83 us	0.596 us	0.529 us	1.00	7,419 B
STIntersection	.NET 8.0	83.82 us	0.793 us	0.742 us	1.20	6,898 B

Delete `bool _disposed` and use `if (_gchGeoDataPinner.IsAllocated)` in `Dispose()`:

Method	Runtime	Mean	Error	StdDev	Ratio	Code Size
STIntersection	.NET 7.0	69.44 us	0.956 us	0.894 us	1.00	7,408 B
STIntersection	.NET 8.0	67.95 us	0.838 us	0.784 us	0.98	6,882 B

Change `GeoDataPinningAllocator` from `class` to `struct`:

Method	Runtime	Mean	Error	StdDev	Ratio	RatioSD	Code Size
STIntersection	.NET 7.0	84.52 us	1.652 us	1.545 us	1.00	0.00	7,442 B
STIntersection	.NET 8.0	83.97 us	0.651 us	0.609 us	0.99	0.02	6,930 B

Change `GeoDataPinningAllocator` from `class` to `struct`, delete `bool _disposed` and use `if (_gchGeoDataPinner.IsAllocated)` in `Dispose()`:

Method	Runtime	Mean	Error	StdDev	Ratio	RatioSD	Code Size
STIntersection	.NET 7.0	67.57 us	0.520 us	0.461 us	1.00	0.00	7,428 B
STIntersection	.NET 8.0	68.70 us	1.078 us	0.900 us	1.02	0.02	6,919 B

Running the simplified benchmark using the custom built corerun

base:
7c265c396e6: 66.4 µs
db717e30839: 81.5 µs

Delete `bool _disposed` and use `if (_gchGeoDataPinner.IsAllocated)` in `Dispose()`:
7c265c396e6: 67.0 µs
db717e30839: 67.2 µs

Change `GeoDataPinningAllocator` from `class` to `struct`:
7c265c396e6: 81.2 µs
db717e30839: 82.0 µs

Change `GeoDataPinningAllocator` from `class` to `struct`, delete `bool _disposed` and use `if (_gchGeoDataPinner.IsAllocated)` in `Dispose()`:
7c265c396e6: 68.4 µs
db717e30839: 67.4 µs
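As a quick cross-check of the four BDN variants, this small Python sketch recomputes the .NET 8 / .NET 7 ratio from the reported means and flags the variants that regress. The variant labels and the 10% threshold are arbitrary choices for illustration, not something BDN defines.

```python
# (variant label, .NET 7 mean in us, .NET 8 mean in us), from the BDN tables above.
RUNS = [
    ("base", 69.83, 83.82),
    ("no _disposed field", 69.44, 67.95),
    ("struct", 84.52, 83.97),
    ("struct, no _disposed field", 67.57, 68.70),
]

# Hypothetical cut-off: call anything more than 10% slower on .NET 8 a regression.
THRESHOLD = 1.10


def ratios(runs):
    """Map each variant to its .NET 8 / .NET 7 mean-time ratio."""
    return {name: net8 / net7 for name, net7, net8 in runs}


result = ratios(RUNS)
regressed = [name for name, ratio in result.items() if ratio > THRESHOLD]

for name, ratio in result.items():
    print(f"{name:28s} {ratio:.2f}")
print("regressed:", regressed)
```

Only the unmodified code regresses between runtimes. The struct-only variant is about as slow on both runtimes, so the ratio alone hides that it is slow everywhere.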

jakobbotsch commented 8 months ago

Thanks for sharing that. I noticed one interesting data point in your VTune image above, namely the "Mixing Vectors" row, which seems to suggest the diff is doing something odd that the base isn't. I wonder if it is related to our insertion of vzeroupper. I am unable to open the VTune project you shared, but perhaps you can dig into the "Mixing Vectors" there and see what it is about?

I'll try to see if I can get access to an Intel machine and reproduce it. In the meantime @tannergooding perhaps has an idea of what "Mixing Vectors" means here and whether or not this could be about vzeroupper insertion (in particular whether a vzeroupper issue would make sense given that it only seems to reproduce on Intel CPUs but not AMD ones).

jnyrup commented 7 months ago

The tooltip of "Mixing Vectors"

The Intel manual specifically mentions vzeroupper

Seemingly related intel community post

AndyAyersMS commented 7 months ago

I wonder if this is something similar to https://github.com/dotnet/runtime/issues/95954#issuecomment-1856906695 (though some different native entrypoint perhaps).

Can you try running with AVX codegen disabled?

Set DOTNET_EnableAVX=0 in your environment or, for BDN, add --envVars DOTNET_EnableAVX:0 (and run the tests via BenchmarkSwitcher so it parses the command line).
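For scripting such runs, here is a minimal Python sketch of both options: an environment copy with the override applied, and the `--envVars KEY:VALUE` arguments for BDN. The helper names are hypothetical, and the `dotnet run` command line is an assumption based on the invocation visible in the logs earlier in the thread.

```python
import os


def child_env(overrides: dict) -> dict:
    """Copy the current environment and apply the given overrides as strings."""
    env = dict(os.environ)
    env.update({key: str(value) for key, value in overrides.items()})
    return env


def bdn_env_args(overrides: dict) -> list:
    """Build BenchmarkDotNet '--envVars KEY:VALUE ...' command-line arguments."""
    if not overrides:
        return []
    return ["--envVars"] + [f"{key}:{value}" for key, value in overrides.items()]


overrides = {"DOTNET_EnableAVX": 0}

# Option 1: pass env=child_env(overrides) when launching the process yourself.
env = child_env(overrides)

# Option 2: let BDN set the variable on the benchmark process it spawns.
# Assumed command line; adjust the configuration and framework to your project.
cmd = ["dotnet", "run", "-c", "Release", "-f", "net8.0", "--"] + bdn_env_args(overrides)
print(" ".join(cmd))
```

Launching is left out on purpose; something like `subprocess.run(cmd, check=True)` would start the run.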

Couple of questions:

Are you building your native code so it matches the capabilities of the CPU it is running on? Any assembly language bits lurking in there?
In your test I see tiered comp, PGO, etc. disabled. Was that just to isolate the perf issue, or do you actually deploy things that way?

jnyrup commented 7 months ago

BenchmarkDotNet v0.13.10, Windows 10 (10.0.19045.3930/22H2/2022Update)
Intel Core i7-10750H CPU 2.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK 8.0.200-preview.23624.5
  [Host]     : .NET 7.0.14 (7.0.1423.51910), X64 RyuJIT AVX2
  Job-CYKLDY : .NET 7.0.15 (7.0.1523.57226), X64 RyuJIT AVX2
  Job-NPDFDR : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX2

EnvironmentVariables=DOTNET_GCDynamicAdaptationMode=0,DOTNET_EnableWriteXorExecute=0,DOTNET_TieredPGO=0,DOTNET_TieredCompilation=0,DOTNET_TC_QuickJit=0  Platform=X64  Server=True

Method	Runtime	Mean	Error	StdDev	Ratio	RatioSD	Code Size
STIntersection	.NET 7.0	72.37 us	1.230 us	1.091 us	1.00	0.00	7,411 B
STIntersection	.NET 8.0	88.76 us	0.704 us	0.624 us	1.23	0.02	6,980 B

Setting DOTNET_EnableAVX=0

BenchmarkDotNet v0.13.10, Windows 10 (10.0.19045.3930/22H2/2022Update)
Intel Core i7-10750H CPU 2.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK 8.0.200-preview.23624.5
  [Host]     : .NET 7.0.14 (7.0.1423.51910), X64 RyuJIT AVX2
  Job-OULJAX : .NET 7.0.15 (7.0.1523.57226), X64 RyuJIT SSE4.2
  Job-LKAEUU : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT SSE4.2

EnvironmentVariables=DOTNET_GCDynamicAdaptationMode=0,DOTNET_EnableWriteXorExecute=0,DOTNET_TieredPGO=0,DOTNET_TieredCompilation=0,DOTNET_TC_QuickJit=0,DOTNET_EnableAVX=0  Platform=X64  Server=True

Method	Runtime	Mean	Error	StdDev	Ratio	RatioSD	Code Size
STIntersection	.NET 7.0	72.93 us	1.291 us	1.207 us	1.00	0.00	7,409 B
STIntersection	.NET 8.0	73.13 us	1.335 us	1.249 us	1.00	0.02	6,985 B

Setting DOTNET_EnableAVX2=0 (in case that is interesting)

BenchmarkDotNet v0.13.10, Windows 10 (10.0.19045.3930/22H2/2022Update)
Intel Core i7-10750H CPU 2.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK 8.0.200-preview.23624.5
  [Host]     : .NET 7.0.14 (7.0.1423.51910), X64 RyuJIT AVX2
  Job-CRQRKC : .NET 7.0.15 (7.0.1523.57226), X64 RyuJIT AVX
  Job-OASVKT : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX

EnvironmentVariables=DOTNET_GCDynamicAdaptationMode=0,DOTNET_EnableWriteXorExecute=0,DOTNET_TieredPGO=0,DOTNET_TieredCompilation=0,DOTNET_TC_QuickJit=0,DOTNET_EnableAVX2=0  Platform=X64  Server=True

Method	Runtime	Mean	Error	StdDev	Ratio	Code Size
STIntersection	.NET 7.0	73.33 us	1.068 us	0.999 us	1.00	7,411 B
STIntersection	.NET 8.0	88.21 us	0.610 us	0.540 us	1.21	6,980 B

> are you building your native code so it matches the capabilities of the CPU it is running on? Any assembly language bits lurking in there?

I don't have access to the source of the native code, it comes with https://www.nuget.org/packages/Microsoft.SqlServer.Types/14.0.1016.290

> in your test I see tiered comp, pgo, etc disabled -- was that just to isolate the perf issue or do you actually deploy things that way?

We don't deploy using any of these settings; they were only set to isolate the perf issue.

tannergooding commented 7 months ago

This can be closed as a duplicate of https://github.com/dotnet/runtime/issues/82132

The simple fix is we should be emitting vzeroupper before transferring control to anything which may be "AVX unaware" (P/Invokes and some R2R methods) and therefore which could use the legacy encoded instructions.
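As a rough illustration of why the placement matters, here is a toy Python model. It is not how the JIT or the hardware works: it only tracks whether the upper halves of the vector registers are "dirty" and counts legacy-encoded SSE instructions executed in that state. The instruction labels and the function name are invented for this sketch, and real penalties depend on the microarchitecture.

```python
def count_penalized(ops):
    """Count legacy SSE ops executed while the upper YMM state is dirty.

    ops is a sequence of labels: 'vex256', 'vex128', 'sse' or 'vzeroupper'.
    Toy rules: a 256-bit VEX op dirties the upper state, vzeroupper cleans it,
    and 128-bit VEX ops neither dirty it nor pay a penalty.
    """
    dirty = False
    penalized = 0
    for op in ops:
        if op == "vex256":
            dirty = True
        elif op == "vzeroupper":
            dirty = False
        elif op == "sse" and dirty:
            penalized += 1
        elif op not in ("sse", "vex128"):
            raise ValueError(f"unknown op: {op}")
    return penalized


managed = ["vex256"]         # managed code that touched 256-bit registers
native = ["sse", "sse", "sse"]  # "AVX unaware" native code behind a P/Invoke

without_fix = count_penalized(managed + native)
with_fix = count_penalized(managed + ["vzeroupper"] + native)
print(without_fix, with_fix)
```

In this model a single vzeroupper at the transition removes every penalized instruction in the native callee, which is the idea behind emitting it before the P/Invoke.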

On modern hardware, vzeroupper tends to be free (handled in register renaming); and as noted above the transition penalty is already not expensive for AMD; but it may still incur cost on older hardware which can be a net negative for method calls which don't use floating-point/SIMD at all.

We have a separate issue (#11496) tracking our existing overuse of vzeroupper in other areas; it really only needs to be emitted before or after such transition boundaries, not as part of every managed function. This is because there is no penalty going from 128-bit legacy <-> 128-bit VEX, only when going between 128-bit legacy <-> 256-bit or higher VEX/EVEX. The diagram for that (in the worst case) is:

tannergooding commented 7 months ago

The vzeroupper fix is https://github.com/dotnet/runtime/pull/98261; I no longer see the regression locally with the fix.
