Open stephentoub opened 3 weeks ago
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
I assume the diff is on the left? I see calls to `EqualsIgnoreCase_Vector` in that version that don't exist in the base version, so it also looks like there were some changes in the C# code here.
Presumably https://github.com/dotnet/runtime/pull/93116
From a layout standpoint, the only thing that stands out is the placement of the `G_M000_IG14` code. If it turns out to be warm, then keeping it in line as we did before is likely an improvement.
We will need to see some sample annotation to understand better. Annoyingly, `perf` doesn't work under WSL, so for that I'll have to find a native Linux host. I suppose I can just look at the PGO data.
FWIW, the regression for me is on Windows.
Ok. The linked diff was on unix... let me get an updated diff first.
I can repro at least...
BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-preview.7.24406.3
  [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-SBTAVT : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-FSPBZZ : .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX2
Method | Runtime | Mean | Ratio |
---|---|---|---|
Contains_Iterate | .NET 8.0 | 277.7 us | 1.00 |
Contains_Iterate | .NET 9.0 | 357.2 us | 1.29 |
In case you try adding `-p ETW` to the BDN command line to sample, and get an error like
Unhandled exception. System.Runtime.InteropServices.COMException (0x800700AA): The requested resource is in use. (0x800700AA)
it is probably https://github.com/dotnet/BenchmarkDotNet/issues/2537 (aka https://github.com/microsoft/perfview/issues/1723) ... apparently Windows Defender can tie up the kernel session.
Here's the final block layout on .NET 9:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight IBC [IL range] [jump] [EH region] [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000] 1 0.99 16 [000..00F)-> BB08(1) (always) i IBC idxlen nullcheck
BB08 [0001] 2 BB01,BB27 21.92 351 [00F..019)-> BB10(1) (always) i IBC loophead idxlen bwd bwd-target
BB10 [0002] 2 BB08,BB25 170.04 2721 [019..01F)-> BB11(1),BB31(0) ( cond ) i IBC loophead idxlen bwd bwd-target
BB11 [0022] 1 BB10 170.04 2721 [01E..01F)-> BB13(0.2),BB12(0.8) ( cond ) i IBC bwd
BB12 [0027] 1 BB11 136.03 2176 [01E..01F)-> BB13(1) (always) i IBC hascall gcsafe bwd
BB13 [0028] 2 BB11,BB12 170.04 2721 [01E..01F)-> BB15(0.48),BB14(0.52) ( cond ) i IBC idxlen bwd
BB14 [0031] 1 BB13 88.42 1415 [01E..01F)-> BB18(1) (always) i IBC bwd
BB18 [0033] 2 BB14,BB15 170.04 2721 [000..000)-> BB25(0.00825),BB19(0.992) ( cond ) i IBC idxlen bwd
BB19 [0058] 1 BB18 168.63 2698 [000..000)-> BB21(0.899),BB20(0.101) ( cond ) i IBC internal bwd
BB21 [0066] 1 BB19 151.53 2424 [000..000)-> BB23(0.332),BB22(0.668) ( cond ) i IBC internal bwd
BB22 [0071] 1 BB21 101.11 1618 [000..000)-> BB24(1) (always) i IBC internal hascall gcsafe bwd
BB24 [0073] 3 BB20,BB22,BB23 168.63 2698 [000..035)-> BB25(0.995),BB32(0.00501) ( cond ) i IBC idxlen bwd
BB25 [0059] 2 BB18,BB24 170.04 2721 [000..041)-> BB10(0.875),BB27(0.125) ( cond ) i IBC idxlen bwd
BB27 [0078] 1 BB25 25.19 403 [041..04F)-> BB08(0.994),BB29(0.00586) ( cond ) i IBC bwd
BB29 [0079] 1 BB27 0.15 2 [04F..051) (return) i IBC
BB15 [0032] 1 BB13 81.62 1306 [01E..01F)-> BB18(1) (always) i IBC idxlen nullcheck bwd
BB23 [0072] 1 BB21 50.36 806 [000..000)-> BB24(1) (always) i IBC internal hascall gcsafe bwd
BB20 [0065] 1 BB19 17.09 273 [000..000)-> BB24(1) (always) i IBC internal hascall gcsafe bwd
BB32 [0003] 1 BB24 0.85 14 [035..037) (return) i IBC
BB31 [0021] 1 BB10 0 0 [01E..01F) (throw ) i IBC rare hascall gcsafe bwd
BB33 [0080] 0 0 [???..???) (throw ) i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
The profile data suggests there are a few conditional branches in the inner loop where both successors are hot (`BB13`, `BB21`, and to a lesser degree, `BB19`). The initial RPO layout keeps the successors in-line, like so:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight IBC [IL range] [jump] [EH region] [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000] 1 0.99 16 [000..00F)-> BB08(1) (always) i IBC idxlen nullcheck
BB08 [0001] 2 BB01,BB27 21.92 351 [00F..019)-> BB10(1) (always) i IBC loophead idxlen bwd bwd-target
BB10 [0002] 2 BB08,BB25 170.04 2721 [019..01F)-> BB11(1),BB31(0) ( cond ) i IBC loophead idxlen bwd bwd-target
BB11 [0022] 1 BB10 170.04 2721 [01E..01F)-> BB13(0.2),BB12(0.8) ( cond ) i IBC bwd
BB12 [0027] 1 BB11 136.03 2176 [01E..01F)-> BB13(1) (always) i IBC hascall gcsafe bwd
BB13 [0028] 2 BB11,BB12 170.04 2721 [01E..01F)-> BB15(0.48),BB14(0.52) ( cond ) i IBC idxlen bwd
BB14 [0031] 1 BB13 88.42 1415 [01E..01F)-> BB18(1) (always) i IBC bwd
BB15 [0032] 1 BB13 81.62 1306 [01E..01F)-> BB18(1) (always) i IBC idxlen nullcheck bwd
BB18 [0033] 2 BB14,BB15 170.04 2721 [000..000)-> BB25(0.00825),BB19(0.992) ( cond ) i IBC idxlen bwd
BB19 [0058] 1 BB18 168.63 2698 [000..000)-> BB21(0.899),BB20(0.101) ( cond ) i IBC internal bwd
BB21 [0066] 1 BB19 151.53 2424 [000..000)-> BB23(0.332),BB22(0.668) ( cond ) i IBC internal bwd
BB22 [0071] 1 BB21 101.11 1618 [000..000)-> BB24(1) (always) i IBC internal hascall gcsafe bwd
BB23 [0072] 1 BB21 50.36 806 [000..000)-> BB24(1) (always) i IBC internal hascall gcsafe bwd
BB20 [0065] 1 BB19 17.09 273 [000..000)-> BB24(1) (always) i IBC internal hascall gcsafe bwd
BB24 [0073] 3 BB20,BB22,BB23 168.63 2698 [000..035)-> BB25(0.995),BB32(0.00501) ( cond ) i IBC idxlen bwd
BB32 [0003] 1 BB24 0.85 14 [035..037) (return) i IBC
BB25 [0059] 2 BB18,BB24 170.04 2721 [000..041)-> BB10(0.875),BB27(0.125) ( cond ) i IBC idxlen bwd
BB27 [0078] 1 BB25 25.19 403 [041..04F)-> BB08(0.994),BB29(0.00586) ( cond ) i IBC bwd
BB29 [0079] 1 BB27 0.15 2 [04F..051) (return) i IBC
BB31 [0021] 1 BB10 0 0 [01E..01F) (throw ) i IBC rare hascall gcsafe bwd
BB33 [0080] 0 0 [???..???) (throw ) i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Then, `Compiler::fgMoveHotJumps` prioritizes the hottest path, thus forcing the less-likely successors to the end of the method. I suspect all those newly-introduced backward jumps are the source of the regression. I'll get a .NET 8 dump to compare this with shortly.
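To make the trade-off concrete, here is a rough sketch (not JIT code) that scores the two .NET 9 block orders shown above. It assumes an edge's weight is the source block's weight times the likelihood printed in the dump; `LayoutScore`, `Score`, and the two metrics are hypothetical names chosen for illustration only.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the JIT: compares two block orders
// using edge data transcribed from the .NET 9 dumps in this thread.
public static class LayoutScore
{
    // (from, to, source block weight, edge likelihood)
    public static readonly (int From, int To, double Weight, double Likelihood)[] Edges =
    {
        (1, 8, 0.99, 1.0), (8, 10, 21.92, 1.0),
        (10, 11, 170.04, 1.0), (10, 31, 170.04, 0.0),
        (11, 13, 170.04, 0.2), (11, 12, 170.04, 0.8),
        (12, 13, 136.03, 1.0),
        (13, 15, 170.04, 0.48), (13, 14, 170.04, 0.52),
        (14, 18, 88.42, 1.0), (15, 18, 81.62, 1.0),
        (18, 25, 170.04, 0.00825), (18, 19, 170.04, 0.992),
        (19, 21, 168.63, 0.899), (19, 20, 168.63, 0.101),
        (20, 24, 17.09, 1.0),
        (21, 23, 151.53, 0.332), (21, 22, 151.53, 0.668),
        (22, 24, 101.11, 1.0), (23, 24, 50.36, 1.0),
        (24, 25, 168.63, 0.995), (24, 32, 168.63, 0.00501),
        (25, 10, 170.04, 0.875), (25, 27, 170.04, 0.125),
        (27, 8, 25.19, 0.994), (27, 29, 25.19, 0.00586),
    };

    // Initial RPO order and final order, as listed in the two tables.
    public static readonly int[] RpoOrder =
        { 1, 8, 10, 11, 12, 13, 14, 15, 18, 19, 21, 22, 23, 20, 24, 32, 25, 27, 29, 31, 33 };
    public static readonly int[] FinalOrder =
        { 1, 8, 10, 11, 12, 13, 14, 18, 19, 21, 22, 24, 25, 27, 29, 15, 23, 20, 32, 31, 33 };

    // Returns the total weight of edges that fall through to the next block,
    // and the total weight of edges that jump to an earlier block.
    public static (double Fallthrough, double Backward) Score(int[] order)
    {
        var position = new Dictionary<int, int>();
        for (int i = 0; i < order.Length; i++) position[order[i]] = i;

        double fallthrough = 0, backward = 0;
        foreach (var e in Edges)
        {
            double edgeWeight = e.Weight * e.Likelihood; // assumption: weight x likelihood
            int delta = position[e.To] - position[e.From];
            if (delta == 1) fallthrough += edgeWeight;
            else if (delta < 0) backward += edgeWeight;
        }
        return (fallthrough, backward);
    }

    public static void Main()
    {
        foreach (var (name, order) in new[] { ("RPO", RpoOrder), ("final", FinalOrder) })
        {
            var (fallthrough, backward) = Score(order);
            Console.WriteLine($"{name,-5} fallthrough={fallthrough:F1} backward={backward:F1}");
        }
    }
}
```

Under these assumptions the final layout gains fallthrough weight on the hot path (roughly 1354 versus 1096), but the weight of taken backward jumps also rises (roughly 323 versus 174), because BB15, BB23, and BB20 now jump back into the loop body.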
Profiling suggests that the `Scalar` method is the biggest culprit here (though all 3 methods end up slower).
Note that I am seeing some run-to-run variation when profiling; sometimes the regression is only about 4%. So there could be a data alignment issue here too.
dotnet run -c Release -f net8.0 -- --runtimes net8.0 net9.0 -p ETW -f * --apples --iterationCount 20
NET 8
39.45% 5.188E+07 Tier-1 [System.Private.CoreLib]Ordinal.EqualsIgnoreCase_Vector128(wchar&,wchar&,int32)
37.22% 4.894E+07 Tier-1 [bench]Tests.Contains_Iterate()
22.83% 3.002E+07 Tier-1 [System.Private.CoreLib]Ordinal.EqualsIgnoreCase_Scalar(wchar&,wchar&,int32)
00.21% 2.8E+05 native clrjit.dll
00.12% 1.6E+05 native coreclr.dll
00.11% 1.4E+05 native ntoskrnl.exe
Benchmark: found 20 intervals; mean interval 655.585ms
NET 9
35.10% 5.627E+07 Tier-1 [System.Private.CoreLib]Ordinal.EqualsIgnoreCase_Vector(wchar&,wchar&,int32)
33.50% 5.37E+07 Tier-1 [bench]Tests.Contains_Iterate()
30.97% 4.964E+07 Tier-1 [System.Private.CoreLib]Ordinal.EqualsIgnoreCase_Scalar(wchar&,wchar&,int32)
00.16% 2.6E+05 native clrjit.dll
00.14% 2.2E+05 native coreclr.dll
00.07% 1.2E+05 native ntoskrnl.exe
Benchmark: found 20 intervals; mean interval 799.078ms
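As a quick sanity check on the "Scalar is the biggest culprit" reading, the sketch below divides the .NET 9 sample counts by the .NET 8 ones for each method. It assumes the raw counts are comparable across the two runs (both used `--apples` and 20 iterations); `SampleGrowth` and `Ratio` are illustrative names, not part of any tool.

```csharp
using System;

// Hypothetical helper: per-method growth in sample counts, .NET 9 vs .NET 8,
// using the figures from the two profiles above.
public static class SampleGrowth
{
    public static double Ratio(double net8, double net9) => net9 / net8;

    public static void Main()
    {
        var rows = new (string Method, double Net8, double Net9)[]
        {
            ("EqualsIgnoreCase_Vector128 / _Vector", 5.188e7, 5.627e7),
            ("Contains_Iterate",                     4.894e7, 5.37e7),
            ("EqualsIgnoreCase_Scalar",              3.002e7, 4.964e7),
        };

        foreach (var row in rows)
        {
            Console.WriteLine($"{row.Method,-38} x{Ratio(row.Net8, row.Net9):F2}");
        }
    }
}
```

On these numbers the Scalar path grows far more than the other two, which each grow by under 10%.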
Here's the .NET 8 layout. The old dump formatting isn't as explicit: If you don't see a jump target or type listed, the block falls into the next block. For conditional blocks, the true target is listed, and the false target is always the next block.
-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight IBC lp [IL range] [jump] [EH region] [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000] 1 1 10 [000..00F)-> BB28 ( cond ) i idxlen nullcheck IBC
BB07 [0072] 2 BB01,BB32 11587. 115866 [00F..019) i idxlen bwd IBC
BB10 [0002] 2 BB07,BB26 80987. 809866 1 [019..01F)-> BB29 ( cond ) i Loop Loop0 idxlen bwd bwd-target IBC
BB11 [0022] 1 BB10 80987. 809866 1 [01E..01F)-> BB13 ( cond ) i bwd IBC
BB12 [0027] 1 BB11 1296k 1 [01E..01F) i hascall gcsafe bwd
BB13 [0028] 2 BB11,BB12 80987. 809866 1 [01E..01F)-> BB15 ( cond ) i idxlen bwd IBC
BB14 [0031] 1 BB13 1296k 1 [01E..01F)-> BB18 (always) i bwd
BB15 [0032] 1 BB13 1296k 1 [01E..01F) i idxlen nullcheck bwd
BB18 [0033] 2 BB14,BB15 80987. 809866 1 [000..000)-> BB23 ( cond ) i idxlen bwd IBC
BB19 [0058] 1 BB18 80966. 809661 1 [000..000)-> BB21 ( cond ) i internal bwd IBC
BB20 [0065] 1 BB19 53582. 535821 1 [000..000) i internal hascall gcsafe bwd IBC
BB22 [0067] 2 BB20,BB21 80987. 809866 1 [000..035)-> BB30 ( cond ) i gcsafe idxlen bwd IBC
BB26 [0004] 2 BB22,BB23 80987. 809866 1 [037..041)-> BB10 ( cond ) i idxlen bwd IBC
BB27 [0006] 1 BB26 11587. 115866 0 [041..04F)-> BB28 ( cond ) i bwd IBC
BB32 [0074] 1 BB27 11587. 0 [???..???)-> BB07 (always) internal
BB21 [0066] 1 BB19 27083. 270831 1 [000..000)-> BB22 (always) i internal hascall gcsafe bwd IBC
BB23 [0059] 1 BB18 321.46 3215 1 [000..000)-> BB26 (always) i internal bwd IBC
BB30 [0070] 1 BB22 321.46 3215 [035..037) (return) i IBC
BB28 [0008] 2 BB01,BB27 0 0 [04F..051) (return) i rare IBC
BB29 [0021] 1 BB10 0 0 [01E..01F) (throw ) i rare hascall gcsafe bwd IBC
-----------------------------------------------------------------------------------------------------------------------------------------
The paths are much more interleaved. It's not obvious to me which one is better: more fallthrough, or fewer jumps out and back into the loop.
This sounds very much like Defender TDT, but it seems that folks are having trouble disabling it. I'll bring this up with the OS folks again.
Codegen for all 3 methods
There are numerous differences... going to try and correlate sample hits back to the code, but (in release) I can't get per-instruction offsets in the disassembly, so it may be a little painful.
Codegen for `Scalar` is almost identical -- later block placement changes the size of two jumps (net9 on the right below)
The remainder of the method is not hit with any frequency. So it seems like any perf difference here must be some microarchitectural issue.
Going to try running this under WSL and on other boxes.
Not seeing anything like this on other XArch cpus...
BenchmarkDotNet v0.14.0, Ubuntu 20.04.6 LTS (Focal Fossa) WSL
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-preview.7.24407.12
  [Host]     : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
  Job-DQFOXV : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
  Job-SFEZPB : .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX2
Method | Runtime | Mean | Ratio |
---|---|---|---|
Contains_Iterate | .NET 8.0 | 321.0 us | 1.00 |
Contains_Iterate | .NET 9.0 | 339.3 us | 1.06 |
BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3)
AMD Ryzen 7 5800H with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100-preview.7.24407.12
  [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-BZKTWK : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-DDWZYG : .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX2
Method | Runtime | Mean | Ratio |
---|---|---|---|
Contains_Iterate | .NET 8.0 | 216.8 us | 1.00 |
Contains_Iterate | .NET 9.0 | 211.0 us | 0.97 |
BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3) (Hyper-V)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100-preview.7.24407.12
  [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-NOEHZQ : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-MOBSYO : .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Method | Runtime | Mean | Ratio |
---|---|---|---|
Contains_Iterate | .NET 8.0 | 294.2 us | 1.00 |
Contains_Iterate | .NET 9.0 | 281.5 us | 0.96 |
BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3)
Intel Core i9-9900T CPU 2.10GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100-preview.7.24407.12
  [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-VXXHSV : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-FXMEWY : .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX2
Method | Runtime | Mean | Ratio |
---|---|---|---|
Contains_Iterate | .NET 8.0 | 346.6 us | 1.00 |
Contains_Iterate | .NET 9.0 | 354.4 us | 1.02 |
@EgorBot -amd -intel --runtimes net8.0 net9.0 --apples --iterationCount 50
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private static readonly string s_input = new HttpClient().GetStringAsync("https://gutenberg.org/cache/epub/2600/pg2600.txt").Result;
    private static readonly string[] s_daysOfWeek = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"];

    [Benchmark]
    public bool Contains_Iterate()
    {
        ReadOnlySpan<char> input = s_input;
        for (int i = 0; i < input.Length; i++)
        {
            foreach (string dow in s_daysOfWeek)
            {
                if (input.Slice(i).StartsWith(dow, StringComparison.OrdinalIgnoreCase))
                {
                    return true;
                }
            }
        }
        return false;
    }
}
@AndyAyersMS it seems that the bot sees a small regression too (on two different cpus)
For the benchmark method itself it seems credible that the changed layout is causing perf issues. With a checked jit there are several branches that might be hitting JCC errata (the .NET 8 version has none):
G_M30171_IG01: ;; offset=0x0000
00007ffa`b64269a0 push rdi
00007ffa`b64269a1 push rsi
00007ffa`b64269a2 push rbp
00007ffa`b64269a3 push rbx
00007ffa`b64269a4 sub rsp, 40
;; size=8 bbWeight=0.99 PerfScore 4.22
G_M30171_IG02: ;; offset=0x0008
00007ffa`b64269a8 mov rcx, 0x1AA4F400D58 ; const ptr
00007ffa`b64269b2 mov rbx, gword ptr [rcx]
00007ffa`b64269b5 add rbx, 12
00007ffa`b64269b9 xor esi, esi
;; size=19 bbWeight=0.99 PerfScore 2.73
G_M30171_IG03: ;; offset=0x001B
00007ffa`b64269bb mov rcx, 0x1AA4F400D60 ; const ptr
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mov: 5) 32B boundary ...............................
00007ffa`b64269c5 mov rdi, gword ptr [rcx]
00007ffa`b64269c8 add rdi, 16
00007ffa`b64269cc mov ebp, 7
;; size=22 bbWeight=21.63 PerfScore 59.48
G_M30171_IG04: ;; offset=0x0031
00007ffa`b64269d1 mov rcx, gword ptr [rdi]
00007ffa`b64269d4 cmp esi, 0x3241AE
00007ffa`b64269da ja G_M30171_IG20
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (ja: 0 ; jcc erratum) 32B boundary ...............................
00007ffa`b64269e0 mov edx, esi
00007ffa`b64269e2 lea rax, bword ptr [rbx+2*rdx]
00007ffa`b64269e6 mov edx, 0x3241AE
00007ffa`b64269eb sub edx, esi
00007ffa`b64269ed test rcx, rcx
00007ffa`b64269f0 jne SHORT G_M30171_IG14
;; size=33 bbWeight=170.49 PerfScore 980.34
G_M30171_IG05: ;; offset=0x0052
00007ffa`b64269f2 xor r10, r10
00007ffa`b64269f5 xor r8d, r8d
;; size=6 bbWeight=88.66 PerfScore 44.33
G_M30171_IG06: ;; offset=0x0058
00007ffa`b64269f8 cmp r8d, edx
00007ffa`b64269fb jg SHORT G_M30171_IG10
;; size=5 bbWeight=170.49 PerfScore 213.12
G_M30171_IG07: ;; offset=0x005D
00007ffa`b64269fd cmp r8d, 8
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (cmp: 1 ; jcc erratum) 32B boundary ...............................
00007ffa`b6426a01 jge SHORT G_M30171_IG15
;; size=6 bbWeight=169.06 PerfScore 211.33
G_M30171_IG08: ;; offset=0x0063
00007ffa`b6426a03 mov rcx, rax
00007ffa`b6426a06 mov rdx, r10
00007ffa`b6426a09 call [System.Globalization.Ordinal:EqualsIgnoreCase_Scalar(byref,byref,int):ubyte]
;; size=12 bbWeight=104.67 PerfScore 366.35
G_M30171_IG09: ;; offset=0x006F
00007ffa`b6426a0f test eax, eax
00007ffa`b6426a11 jne SHORT G_M30171_IG18
;; size=4 bbWeight=169.06 PerfScore 211.33
G_M30171_IG10: ;; offset=0x0073
00007ffa`b6426a13 add rdi, 8
00007ffa`b6426a17 dec ebp
00007ffa`b6426a19 jne SHORT G_M30171_IG04
;; size=8 bbWeight=170.49 PerfScore 255.74
G_M30171_IG11: ;; offset=0x007B
00007ffa`b6426a1b inc esi
00007ffa`b6426a1d cmp esi, 0x3241AE
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (cmp: 3 ; jcc erratum) 32B boundary ...............................
00007ffa`b6426a23 jl SHORT G_M30171_IG03
which could explain the relatively poor results on Coffee Lake / Windows.
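For readers unfamiliar with the erratum: as I understand Intel's published description, a jump (or a macro-fused compare-and-branch pair) is affected when its bytes cross a 32-byte boundary or end exactly on one. The sketch below applies that rule to a few branches from the listing above. `JccErratum` and `TouchesBoundary` are hypothetical names, and the instruction lengths are inferred from the addresses of the following instructions in the dump.

```csharp
using System;

// Hypothetical helper illustrating the 32-byte boundary rule; not JIT code.
public static class JccErratum
{
    // True if the bytes [start, start + length) cross a 32-byte boundary
    // or if the instruction ends exactly on a 32-byte boundary.
    public static bool TouchesBoundary(ulong start, int length)
    {
        ulong lastByte = start + (ulong)length - 1;
        bool crosses = (start >> 5) != (lastByte >> 5);
        bool endsOnBoundary = ((lastByte + 1) & 31) == 0;
        return crosses || endsOnBoundary;
    }

    public static void Main()
    {
        // (description, start address, total length in bytes of the jump or fused pair)
        var samples = new (string Name, ulong Start, int Length)[]
        {
            ("IG04: ja IG20",                0x00007FFAB64269DA, 6), // ends at ...69E0
            ("IG04: test rcx,rcx + jne",     0x00007FFAB64269ED, 5), // stays inside one 32B chunk
            ("IG07: cmp r8d,8 + jge",        0x00007FFAB64269FD, 6), // crosses ...6A00
            ("IG10: dec ebp + jne",          0x00007FFAB6426A17, 4), // stays inside one 32B chunk
            ("IG11: cmp esi,0x3241AE + jl",  0x00007FFAB6426A1D, 8), // crosses ...6A20
        };

        foreach (var s in samples)
        {
            Console.WriteLine($"{s.Name,-30} affected={TouchesBoundary(s.Start, s.Length)}");
        }
    }
}
```

The three branches flagged here are the same three the checked JIT annotates with "jcc erratum" in the listing; the other two are included only as unaffected counterexamples.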
Also re-enabling "old layout" in .NET 9 closes the gap:
BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-preview.7.24406.3
  [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-EWCTBT : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-WFMVIE : .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX2
EnvironmentVariables=DOTNET_JitDoReversePostOrderLayout=0
Method | Runtime | Mean | Ratio |
---|---|---|---|
Contains_Iterate | .NET 8.0 | 347.8 us | 1.00 |
Contains_Iterate | .NET 9.0 | 368.7 us | 1.06 |
And perhaps this causes some sample skid or something into the prologs of the `Scalar` and `Vector` versions; their first instructions have unusually high counts...
;; net 8
Raw samples for [System.Private.CoreLib]Ordinal.EqualsIgnoreCase_Scalar(wchar&,wchar&,int32) at 0x00007FFACB5F48E0 -- 0x00007FFACB5F4A60 (length 0x0180)
[0x48E0] 0x0000 : 247
[0x48E1] 0x0001 : 214
[0x48E5] 0x0005 : 118
[0x48E8] 0x0008 : 97
Raw samples for [System.Private.CoreLib]Ordinal.EqualsIgnoreCase_Vector128(wchar&,wchar&,int32) at 0x00007FFACB5F4A80 -- 0x00007FFACB5F4BF0 (length 0x0170)
[0x4A80] 0x0000 : 163
[0x4A81] 0x0001 : 161
[0x4A82] 0x0002 : 20
[0x4A86] 0x0006 : 138
;; net 9
Raw samples for [System.Private.CoreLib]Ordinal.EqualsIgnoreCase_Scalar(wchar&,wchar&,int32) at 0x00007FFABE78AB00 -- 0x00007FFABE78AC75 (length 0x0175)
[0xAB00] 0x0000 : 467
[0xAB01] 0x0001 : 94
[0xAB05] 0x0005 : 290
[0xAB0B] 0x000B : 90
Raw samples for [System.Private.CoreLib]Ordinal.EqualsIgnoreCase_Vector(wchar&,wchar&,int32) at 0x00007FFABE78ADE0 -- 0x00007FFABE78AF5F (length 0x017F)
[0xADE0] 0x0000 : 483
[0xADE4] 0x0004 : 202
[0xADEA] 0x000A : 866
[0xADF4] 0x0014 : 19
So it looks like there is a modest (~10%) overall regression (perhaps in the `Vector` path) across several different HW models, and then additional, more sizeable regressions on older Intel HW from JCC errata, the latter caused (indirectly) by the new block layout.
It also seems possible we're not getting PGO data for the benchmark method (or are getting it and resynthesizing on top), those 48/52 likelihood splits are suspicious... let me look at that next.
The lack of profile data comes from inlining ``System.String:op_Implicit(System.String):System.ReadOnlySpan`1[ushort]``.
This method is R2R and does tier up:
1557: JIT compiled System.String:op_Implicit(System.String) [Instrumented Tier1, IL size=31, code size=32]
1614: JIT compiled System.String:op_Implicit(System.String) [Tier1, IL size=31, code size=32]
Since `Contains_Iterate` has loops and the inner loop iterates a lot, it seems possible we're diverting execution to the OSR methods (which would inline the above), so we never see the instrumented version get called:
669: JIT compiled Tests:Contains_Iterate() [Instrumented Tier0, IL size=81, code size=516]
1516: JIT compiled Tests:Contains_Iterate() [Tier1-OSR @0x3b with Dynamic PGO, IL size=81, code size=301]
1534: JIT compiled Tests:Contains_Iterate() [Tier1-OSR @0x45 with Dynamic PGO, IL size=81, code size=253]
1596: JIT compiled Tests:Contains_Iterate() [Instrumented Tier0, IL size=81, code size=465]
1601: JIT compiled Tests:Contains_Iterate() [Tier1-OSR @0x3b with Dynamic PGO, IL size=81, code size=301]
1611: JIT compiled Tests:Contains_Iterate() [Tier1-OSR @0x45 with Dynamic PGO, IL size=81, code size=253]
1620: JIT compiled Tests:Contains_Iterate() [Tier1 with Dynamic PGO, IL size=81, code size=209]
Though it seems odd the runtime tells the jit there is no profile data instead of profile data that's all zero (or something).
Looking with VTune (not `--apples`, so raw numbers are not directly comparable) on my Coffee Lake box, the .NET 9 code is much less CPU efficient, and the prominent issue is frontend stalls. CPI grows from 0.322 to 0.399, which is about 24%...
Unfortunately, there's not much we can do here in .NET 9. We have discussed JCC errata mitigation but don't have a design yet on how this would be accomplished, or whether it could be enabled by default.
https://github.com/dotnet/runtime/issues/93243
Going to move this to future.
Reading between the lines, does this mean that the front-end stalls you mentioned in your previous comment are because of the JCC errata?
Yes. With .NET 9 and the new block layout we end up with poorly aligned branches and on older Intel CPUs this incurs performance penalties. It's not the fault of the layout per se; any mitigation we'd do would happen later on. Looking across our benchmark suite, there doesn't seem to be any systematic impact from JCC errata and the new layout, just "random" improvements and regressions on the older Intel machines.
This benchmark regresses by ~25% for me between .NET 8 and .NET 9:
This is on a machine without AVX512.
@EgorBo confirmed he also sees the same regression: "From a quick look at ASM diffs (https://www.diffchecker.com/7Teb6dKS/) it looks like some BB layout reshuffling (Aman/Andy) or strength reduction/IV (Jakob)"
cc: @AndyAyersMS , @amanasifkhalid , @jakobbotsch