25% regression in string search benchmark

stephentoub commented 3 weeks ago

This benchmark regresses by ~25% for me between .NET 8 and .NET 9:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running; 

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); 

[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private static readonly string s_input = new HttpClient().GetStringAsync("https://gutenberg.org/cache/epub/2600/pg2600.txt").Result;
    private static readonly string[] s_daysOfWeek = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"];

    [Benchmark]
    public bool Contains_Iterate()
    {
        ReadOnlySpan<char> input = s_input;
        for (int i = 0; i < input.Length; i++)
        {
            foreach (string dow in s_daysOfWeek)
            {
                if (input.Slice(i).StartsWith(dow, StringComparison.OrdinalIgnoreCase))
                {
                    return true;
                }
            }
        }
        return false;
    }
}

This is on a machine without AVX512.

@EgorBo confirmed he also sees the same regression: "From a quick look at ASM diffs (https://www.diffchecker.com/7Teb6dKS/) it looks like some BB layout reshuffling (Aman/Andy) or strength reduction/IV (Jakob)"

cc: @AndyAyersMS , @amanasifkhalid , @jakobbotsch

dotnet-policy-service[bot] commented 3 weeks ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

jakobbotsch commented 3 weeks ago

I assume the diff is on the left? I see calls to EqualsIgnoreCase_Vector in that version that don't exist in the base version, so it also looks like there were some changes in the C# code here.

stephentoub commented 3 weeks ago

> so it also looks like there were some changes in the C# code here

Presumably https://github.com/dotnet/runtime/pull/93116

AndyAyersMS commented 3 weeks ago

From a layout standpoint the only thing that stands out is the placement of the G_M000_IG14 code. If it turns out it is warm then keeping it in line like we did before is likely an improvement.

We will need to see some sample annotation to understand better. Annoyingly perf doesn't work under WSL so for that I'll have to find a native Linux host. I suppose I can just look at the PGO data.

stephentoub commented 3 weeks ago

> Annoyingly perf doesn't work under WSL so for that I'll have to find a native linux host. I suppose I can just look at the PGO data.

FWIW, the regression for me is on Windows.

AndyAyersMS commented 3 weeks ago

> > Annoyingly perf doesn't work under WSL so for that I'll have to find a native linux host. I suppose I can just look at the PGO data.
>
> FWIW, the regression for me is on Windows.

Ok. The linked diff was on unix... let me get an updated diff first.

AndyAyersMS commented 3 weeks ago

I can repro at least...

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-preview.7.24406.3
  [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-SBTAVT : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-FSPBZZ : .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX2

Method	Runtime	Mean	Ratio
Contains_Iterate	.NET 8.0	277.7 us	1.00
Contains_Iterate	.NET 9.0	357.2 us	1.29

AndyAyersMS commented 3 weeks ago

In case you try adding -p ETW to the BDN command line to sample, and get an error like

Unhandled exception. System.Runtime.InteropServices.COMException (0x800700AA): The requested resource is in use. (0x800700AA)

it is probably https://github.com/dotnet/BenchmarkDotNet/issues/2537 (aka https://github.com/microsoft/perfview/issues/1723) ... apparently Windows Defender can tie up the kernel session.

amanasifkhalid commented 3 weeks ago

Here's the final block layout on .NET 9:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             0.99   16 [000..00F)-> BB08(1)                 (always)                     i IBC idxlen nullcheck
BB08 [0001]  2       BB01,BB27            21.92  351 [00F..019)-> BB10(1)                 (always)                     i IBC loophead idxlen bwd bwd-target
BB10 [0002]  2       BB08,BB25           170.04 2721 [019..01F)-> BB11(1),BB31(0)         ( cond )                     i IBC loophead idxlen bwd bwd-target
BB11 [0022]  1       BB10                170.04 2721 [01E..01F)-> BB13(0.2),BB12(0.8)     ( cond )                     i IBC bwd
BB12 [0027]  1       BB11                136.03 2176 [01E..01F)-> BB13(1)                 (always)                     i IBC hascall gcsafe bwd
BB13 [0028]  2       BB11,BB12           170.04 2721 [01E..01F)-> BB15(0.48),BB14(0.52)   ( cond )                     i IBC idxlen bwd
BB14 [0031]  1       BB13                 88.42 1415 [01E..01F)-> BB18(1)                 (always)                     i IBC bwd
BB18 [0033]  2       BB14,BB15           170.04 2721 [000..000)-> BB25(0.00825),BB19(0.992)   ( cond )                     i IBC idxlen bwd
BB19 [0058]  1       BB18                168.63 2698 [000..000)-> BB21(0.899),BB20(0.101) ( cond )                     i IBC internal bwd
BB21 [0066]  1       BB19                151.53 2424 [000..000)-> BB23(0.332),BB22(0.668) ( cond )                     i IBC internal bwd
BB22 [0071]  1       BB21                101.11 1618 [000..000)-> BB24(1)                 (always)                     i IBC internal hascall gcsafe bwd
BB24 [0073]  3       BB20,BB22,BB23      168.63 2698 [000..035)-> BB25(0.995),BB32(0.00501)   ( cond )                     i IBC idxlen bwd
BB25 [0059]  2       BB18,BB24           170.04 2721 [000..041)-> BB10(0.875),BB27(0.125) ( cond )                     i IBC idxlen bwd
BB27 [0078]  1       BB25                 25.19  403 [041..04F)-> BB08(0.994),BB29(0.00586)   ( cond )                     i IBC bwd
BB29 [0079]  1       BB27                  0.15    2 [04F..051)                           (return)                     i IBC
BB15 [0032]  1       BB13                 81.62 1306 [01E..01F)-> BB18(1)                 (always)                     i IBC idxlen nullcheck bwd
BB23 [0072]  1       BB21                 50.36  806 [000..000)-> BB24(1)                 (always)                     i IBC internal hascall gcsafe bwd
BB20 [0065]  1       BB19                 17.09  273 [000..000)-> BB24(1)                 (always)                     i IBC internal hascall gcsafe bwd
BB32 [0003]  1       BB24                  0.85   14 [035..037)                           (return)                     i IBC
BB31 [0021]  1       BB10                  0       0 [01E..01F)                           (throw )                     i IBC rare hascall gcsafe bwd
BB33 [0080]  0                             0         [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

The profile data suggests there are a few conditional branches in the inner loop where both successors are hot (BB13, BB21, and to a lesser degree, BB19). The initial RPO layout keeps the successors in-line, like so:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             0.99   16 [000..00F)-> BB08(1)                 (always)                     i IBC idxlen nullcheck
BB08 [0001]  2       BB01,BB27            21.92  351 [00F..019)-> BB10(1)                 (always)                     i IBC loophead idxlen bwd bwd-target
BB10 [0002]  2       BB08,BB25           170.04 2721 [019..01F)-> BB11(1),BB31(0)         ( cond )                     i IBC loophead idxlen bwd bwd-target
BB11 [0022]  1       BB10                170.04 2721 [01E..01F)-> BB13(0.2),BB12(0.8)     ( cond )                     i IBC bwd
BB12 [0027]  1       BB11                136.03 2176 [01E..01F)-> BB13(1)                 (always)                     i IBC hascall gcsafe bwd
BB13 [0028]  2       BB11,BB12           170.04 2721 [01E..01F)-> BB15(0.48),BB14(0.52)   ( cond )                     i IBC idxlen bwd
BB14 [0031]  1       BB13                 88.42 1415 [01E..01F)-> BB18(1)                 (always)                     i IBC bwd
BB15 [0032]  1       BB13                 81.62 1306 [01E..01F)-> BB18(1)                 (always)                     i IBC idxlen nullcheck bwd
BB18 [0033]  2       BB14,BB15           170.04 2721 [000..000)-> BB25(0.00825),BB19(0.992)   ( cond )                     i IBC idxlen bwd
BB19 [0058]  1       BB18                168.63 2698 [000..000)-> BB21(0.899),BB20(0.101) ( cond )                     i IBC internal bwd
BB21 [0066]  1       BB19                151.53 2424 [000..000)-> BB23(0.332),BB22(0.668) ( cond )                     i IBC internal bwd
BB22 [0071]  1       BB21                101.11 1618 [000..000)-> BB24(1)                 (always)                     i IBC internal hascall gcsafe bwd
BB23 [0072]  1       BB21                 50.36  806 [000..000)-> BB24(1)                 (always)                     i IBC internal hascall gcsafe bwd
BB20 [0065]  1       BB19                 17.09  273 [000..000)-> BB24(1)                 (always)                     i IBC internal hascall gcsafe bwd
BB24 [0073]  3       BB20,BB22,BB23      168.63 2698 [000..035)-> BB25(0.995),BB32(0.00501)   ( cond )                     i IBC idxlen bwd
BB32 [0003]  1       BB24                  0.85   14 [035..037)                           (return)                     i IBC
BB25 [0059]  2       BB18,BB24           170.04 2721 [000..041)-> BB10(0.875),BB27(0.125) ( cond )                     i IBC idxlen bwd
BB27 [0078]  1       BB25                 25.19  403 [041..04F)-> BB08(0.994),BB29(0.00586)   ( cond )                     i IBC bwd
BB29 [0079]  1       BB27                  0.15    2 [04F..051)                           (return)                     i IBC
BB31 [0021]  1       BB10                  0       0 [01E..01F)                           (throw )                     i IBC rare hascall gcsafe bwd
BB33 [0080]  0                             0         [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Then, Compiler::fgMoveHotJumps prioritizes the hottest path, thus forcing the less-likely successors to the end of the method. I suspect all those newly-introduced backward jumps are the source of the regression. I'll get a .NET 8 dump to compare this with shortly.
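To put a number on "newly-introduced backward jumps", here is a minimal sketch that counts flow edges whose target sits at or before its source in a given block order. The block orders and edges are transcribed by hand from the two .NET 9 dumps above; `LayoutStats` and `CountBackwardEdges` are hypothetical names for illustration, not JIT APIs, and the count ignores edge weights entirely.

```csharp
using System;
using System.Linq;

// Print the backward-edge count for both block orders.
Console.WriteLine($"RPO layout:   {LayoutStats.CountBackwardEdges(LayoutStats.RpoLayout, LayoutStats.Edges)}");
Console.WriteLine($"Final layout: {LayoutStats.CountBackwardEdges(LayoutStats.FinalLayout, LayoutStats.Edges)}");

static class LayoutStats
{
    // Block order of the initial RPO layout (BB numbers from the second dump).
    public static readonly int[] RpoLayout =
        { 1, 8, 10, 11, 12, 13, 14, 15, 18, 19, 21, 22, 23, 20, 24, 32, 25, 27, 29, 31, 33 };

    // Block order of the final .NET 9 layout (BB numbers from the first dump).
    public static readonly int[] FinalLayout =
        { 1, 8, 10, 11, 12, 13, 14, 18, 19, 21, 22, 24, 25, 27, 29, 15, 23, 20, 32, 31, 33 };

    // Flow edges (from, to) transcribed from the [jump] column; identical in both dumps.
    public static readonly (int From, int To)[] Edges =
    {
        (1, 8), (8, 10), (10, 11), (10, 31), (11, 13), (11, 12), (12, 13),
        (13, 15), (13, 14), (14, 18), (15, 18), (18, 25), (18, 19),
        (19, 21), (19, 20), (21, 23), (21, 22), (22, 24), (23, 24), (20, 24),
        (24, 25), (24, 32), (25, 10), (25, 27), (27, 8), (27, 29),
    };

    // Counts edges whose target is placed at or before the source in the layout.
    public static int CountBackwardEdges(int[] layout, (int From, int To)[] edges)
    {
        var position = layout
            .Select((bb, index) => (bb, index))
            .ToDictionary(p => p.bb, p => p.index);
        return edges.Count(e => position[e.To] <= position[e.From]);
    }
}
```

With this data the RPO order has 2 backward edges (the two loop back-edges, BB25 to BB10 and BB27 to BB08), while the final order has 5: moving BB15, BB23, and BB20 to the end of the method makes each of their jumps into BB18 or BB24 point backward.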

AndyAyersMS commented 3 weeks ago

Profiling suggests that the Scalar method is the biggest culprit here (though all 3 methods end up slower).

Note I am seeing some run to run variation when profiling, sometimes the regression is only about 4%. So there could be a data alignment issue here too.

dotnet run -c Release -f net8.0 -- --runtimes net8.0 net9.0 -p ETW -f * --apples --iterationCount 20

.NET 8

39.45%   5.188E+07   Tier-1   [System.Private.CoreLib]Ordinal.EqualsIgnoreCase_Vector128(wchar&,wchar&,int32)
37.22%   4.894E+07   Tier-1   [bench]Tests.Contains_Iterate()
22.83%   3.002E+07   Tier-1   [System.Private.CoreLib]Ordinal.EqualsIgnoreCase_Scalar(wchar&,wchar&,int32)
00.21%   2.8E+05     native   clrjit.dll
00.12%   1.6E+05     native   coreclr.dll
00.11%   1.4E+05     native   ntoskrnl.exe

Benchmark: found 20 intervals; mean interval 655.585ms

.NET 9

35.10%   5.627E+07   Tier-1   [System.Private.CoreLib]Ordinal.EqualsIgnoreCase_Vector(wchar&,wchar&,int32)
33.50%   5.37E+07    Tier-1   [bench]Tests.Contains_Iterate()
30.97%   4.964E+07   Tier-1   [System.Private.CoreLib]Ordinal.EqualsIgnoreCase_Scalar(wchar&,wchar&,int32)
00.16%   2.6E+05     native   clrjit.dll
00.14%   2.2E+05     native   coreclr.dll
00.07%   1.2E+05     native   ntoskrnl.exe

Benchmark: found 20 intervals; mean interval 799.078ms
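As a rough check on which method grew the most, the sketch below computes the relative change in the second column of the two profiles above. The numbers are copied from the listings; `SampleGrowth` and `PercentChange` are hypothetical names, and given the run-to-run variation noted above these ratios are only indicative.

```csharp
using System;

// Print the relative growth of each method's count between .NET 8 and .NET 9.
foreach (var row in SampleGrowth.Rows)
{
    double change = SampleGrowth.PercentChange(row.Net8, row.Net9);
    Console.WriteLine($"{row.Method,-28} {change,6:F1}%");
}

static class SampleGrowth
{
    // Second-column values from the two ETW profiles above.
    // The vectorized helper is named EqualsIgnoreCase_Vector128 in the .NET 8 profile.
    public static readonly (string Method, double Net8, double Net9)[] Rows =
    {
        ("EqualsIgnoreCase_Vector(128)", 5.188e7, 5.627e7),
        ("Contains_Iterate",             4.894e7, 5.370e7),
        ("EqualsIgnoreCase_Scalar",      3.002e7, 4.964e7),
    };

    // Relative change from 'before' to 'after', in percent.
    public static double PercentChange(double before, double after) =>
        (after - before) / before * 100.0;
}
```

On these numbers the Scalar method grows by roughly 65%, while the other two grow by under 10% each, which is consistent with Scalar being the biggest contributor in this particular run.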

amanasifkhalid commented 3 weeks ago

Here's the .NET 8 layout. The old dump formatting isn't as explicit: if you don't see a jump target or type listed, the block falls into the next block. For conditional blocks, the true target is listed, and the false target is always the next block.

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight      IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1        10    [000..00F)-> BB28 ( cond )                     i idxlen nullcheck IBC 
BB07 [0072]  2       BB01,BB32           11587. 115866    [00F..019)                                     i idxlen bwd IBC 
BB10 [0002]  2       BB07,BB26           80987. 809866  1 [019..01F)-> BB29 ( cond )                     i Loop Loop0 idxlen bwd bwd-target IBC 
BB11 [0022]  1       BB10                80987. 809866  1 [01E..01F)-> BB13 ( cond )                     i bwd IBC 
BB12 [0027]  1       BB11                 1296k         1 [01E..01F)                                     i hascall gcsafe bwd 
BB13 [0028]  2       BB11,BB12           80987. 809866  1 [01E..01F)-> BB15 ( cond )                     i idxlen bwd IBC 
BB14 [0031]  1       BB13                 1296k         1 [01E..01F)-> BB18 (always)                     i bwd 
BB15 [0032]  1       BB13                 1296k         1 [01E..01F)                                     i idxlen nullcheck bwd 
BB18 [0033]  2       BB14,BB15           80987. 809866  1 [000..000)-> BB23 ( cond )                     i idxlen bwd IBC 
BB19 [0058]  1       BB18                80966. 809661  1 [000..000)-> BB21 ( cond )                     i internal bwd IBC 
BB20 [0065]  1       BB19                53582. 535821  1 [000..000)                                     i internal hascall gcsafe bwd IBC 
BB22 [0067]  2       BB20,BB21           80987. 809866  1 [000..035)-> BB30 ( cond )                     i gcsafe idxlen bwd IBC 
BB26 [0004]  2       BB22,BB23           80987. 809866  1 [037..041)-> BB10 ( cond )                     i idxlen bwd IBC 
BB27 [0006]  1       BB26                11587. 115866  0 [041..04F)-> BB28 ( cond )                     i bwd IBC 
BB32 [0074]  1       BB27                11587.         0 [???..???)-> BB07 (always)                     internal 
BB21 [0066]  1       BB19                27083. 270831  1 [000..000)-> BB22 (always)                     i internal hascall gcsafe bwd IBC 
BB23 [0059]  1       BB18                321.46   3215  1 [000..000)-> BB26 (always)                     i internal bwd IBC 
BB30 [0070]  1       BB22                321.46   3215    [035..037)        (return)                     i IBC 
BB28 [0008]  2       BB01,BB27             0         0    [04F..051)        (return)                     i rare IBC 
BB29 [0021]  1       BB10                  0         0    [01E..01F)        (throw )                     i rare hascall gcsafe bwd IBC 
-----------------------------------------------------------------------------------------------------------------------------------------

The paths are much more interleaved. It's not obvious to me which one is better: more fallthrough, or fewer jumps out and back into the loop.

brianrob commented 3 weeks ago

> In case you try adding -p ETW to the BDN command line to sample, and get an error like
>
> Unhandled exception. System.Runtime.InteropServices.COMException (0x800700AA): The requested resource is in use. (0x800700AA)
>
> it is probably dotnet/BenchmarkDotNet#2537 (aka https://github.com/microsoft/perfview/issues/1723) ... apparently Windows Defender can tie up the kernel session.

This sounds very much like Defender TDT, but it seems that folks are having trouble disabling it. I'll bring this up with the OS folks again.

AndyAyersMS commented 3 weeks ago

Codegen for all 3 methods

There are numerous differences... going to try and correlate sample hits back to the code, but (in release) can't get per-instruction offsets in the disassembly, so it may be a little painful.

AndyAyersMS commented 3 weeks ago

Codegen for Scalar is almost identical -- later block placement changes the size of two jumps (net9 on the right below)

Remainder of the method is not hit with any frequency. So it seems like any perf difference here must be some microarchitectural issue.

Going to try running this under WSL and on other boxes.

AndyAyersMS commented 3 weeks ago

Not seeing anything like this on other XArch CPUs...

Coffee Lake (Ubuntu/WSL -- Same HW, different OS/ABI)

BenchmarkDotNet v0.14.0, Ubuntu 20.04.6 LTS (Focal Fossa) WSL
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-preview.7.24407.12
  [Host]     : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
  Job-DQFOXV : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
  Job-SFEZPB : .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX2

Method	Runtime	Mean	Ratio
Contains_Iterate	.NET 8.0	321.0 us	1.00
Contains_Iterate	.NET 9.0	339.3 us	1.06

AMD Zen3

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3)
AMD Ryzen 7 5800H with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100-preview.7.24407.12
  [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-BZKTWK : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-DDWZYG : .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX2

Method	Runtime	Mean	Ratio
Contains_Iterate	.NET 8.0	216.8 us	1.00
Contains_Iterate	.NET 9.0	211.0 us	0.97

Intel Cascade Lake

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3) (Hyper-V)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100-preview.7.24407.12
  [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-NOEHZQ : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-MOBSYO : .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Method	Runtime	Mean	Ratio
Contains_Iterate	.NET 8.0	294.2 us	1.00
Contains_Iterate	.NET 9.0	281.5 us	0.96

Intel Coffee Lake (different HW part)

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3)
Intel Core i9-9900T CPU 2.10GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100-preview.7.24407.12
  [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-VXXHSV : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-FXMEWY : .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX2

Method	Runtime	Mean	Ratio
Contains_Iterate	.NET 8.0	346.6 us	1.00
Contains_Iterate	.NET 9.0	354.4 us	1.02

EgorBo commented 3 weeks ago

@EgorBot -amd -intel --runtimes net8.0 net9.0 --apples --iterationCount 50

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private static readonly string s_input = new HttpClient().GetStringAsync("https://gutenberg.org/cache/epub/2600/pg2600.txt").Result;
    private static readonly string[] s_daysOfWeek = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"];

    [Benchmark]
    public bool Contains_Iterate()
    {
        ReadOnlySpan<char> input = s_input;

        for (int i = 0; i < input.Length; i++)
        {
            foreach (string dow in s_daysOfWeek)
            {
                if (input.Slice(i).StartsWith(dow, StringComparison.OrdinalIgnoreCase))
                {
                    return true;
                }
            }
        }

        return false;
    }
}

EgorBot commented 3 weeks ago

Benchmark results on Amd

```
BenchmarkDotNet v0.14.0, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
AMD EPYC 7763, 1 CPU, 16 logical and 8 physical cores
  Job-NDYYMW : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-BLEUMT : .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX2
EvaluateOverhead=False  OutlierMode=DontRemove  InvocationCount=2048
IterationCount=50  UnrollFactor=16  WarmupCount=1
```

| Method           | Runtime  |     Mean | Ratio |
|----------------- |--------- |---------:|------:|
| Contains_Iterate | .NET 8.0 | 267.7 μs |  1.00 |
| Contains_Iterate | .NET 9.0 | 289.3 μs |  1.08 |

[BDN_Artifacts.zip](https://telegafiles.blob.core.windows.net/telega/BDN_Artifacts_4ca48b17.zip)

EgorBot commented 3 weeks ago

Benchmark results on Intel

```
BenchmarkDotNet v0.14.0, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores
  Job-ZIZYLZ : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-RQGZIS : .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
EvaluateOverhead=False  OutlierMode=DontRemove  InvocationCount=2048
IterationCount=50  UnrollFactor=16  WarmupCount=1
```

| Method           | Runtime  |     Mean | Ratio |
|----------------- |--------- |---------:|------:|
| Contains_Iterate | .NET 8.0 | 273.8 μs |  1.00 |
| Contains_Iterate | .NET 9.0 | 317.2 μs |  1.16 |

[BDN_Artifacts.zip](https://telegafiles.blob.core.windows.net/telega/BDN_Artifacts_0f8296d0.zip)

EgorBo commented 3 weeks ago

@AndyAyersMS it seems that the bot sees a small regression too (on two different CPUs)

AndyAyersMS commented 3 weeks ago

For the benchmark method itself it seems credible that the changed layout is causing perf issues. With a checked jit there are several branches that might be hitting JCC errata (the .NET 8 version has none):

G_M30171_IG01:              ;; offset=0x0000
 00007ffa`b64269a0        push     rdi
 00007ffa`b64269a1        push     rsi
 00007ffa`b64269a2        push     rbp
 00007ffa`b64269a3        push     rbx
 00007ffa`b64269a4        sub      rsp, 40
                        ;; size=8 bbWeight=0.99 PerfScore 4.22
G_M30171_IG02:              ;; offset=0x0008
 00007ffa`b64269a8        mov      rcx, 0x1AA4F400D58      ; const ptr
 00007ffa`b64269b2        mov      rbx, gword ptr [rcx]
 00007ffa`b64269b5        add      rbx, 12
 00007ffa`b64269b9        xor      esi, esi
                        ;; size=19 bbWeight=0.99 PerfScore 2.73
G_M30171_IG03:              ;; offset=0x001B
 00007ffa`b64269bb        mov      rcx, 0x1AA4F400D60      ; const ptr
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mov: 5) 32B boundary ...............................
 00007ffa`b64269c5        mov      rdi, gword ptr [rcx]
 00007ffa`b64269c8        add      rdi, 16
 00007ffa`b64269cc        mov      ebp, 7
                        ;; size=22 bbWeight=21.63 PerfScore 59.48
G_M30171_IG04:              ;; offset=0x0031
 00007ffa`b64269d1        mov      rcx, gword ptr [rdi]
 00007ffa`b64269d4        cmp      esi, 0x3241AE
 00007ffa`b64269da        ja       G_M30171_IG20
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (ja: 0 ; jcc erratum) 32B boundary ...............................
 00007ffa`b64269e0        mov      edx, esi
 00007ffa`b64269e2        lea      rax, bword ptr [rbx+2*rdx]
 00007ffa`b64269e6        mov      edx, 0x3241AE
 00007ffa`b64269eb        sub      edx, esi
 00007ffa`b64269ed        test     rcx, rcx
 00007ffa`b64269f0        jne      SHORT G_M30171_IG14
                        ;; size=33 bbWeight=170.49 PerfScore 980.34
G_M30171_IG05:              ;; offset=0x0052
 00007ffa`b64269f2        xor      r10, r10
 00007ffa`b64269f5        xor      r8d, r8d
                        ;; size=6 bbWeight=88.66 PerfScore 44.33
G_M30171_IG06:              ;; offset=0x0058
 00007ffa`b64269f8        cmp      r8d, edx
 00007ffa`b64269fb        jg       SHORT G_M30171_IG10
                        ;; size=5 bbWeight=170.49 PerfScore 213.12
G_M30171_IG07:              ;; offset=0x005D
 00007ffa`b64269fd        cmp      r8d, 8
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (cmp: 1 ; jcc erratum) 32B boundary ...............................
 00007ffa`b6426a01        jge      SHORT G_M30171_IG15
                        ;; size=6 bbWeight=169.06 PerfScore 211.33
G_M30171_IG08:              ;; offset=0x0063
 00007ffa`b6426a03        mov      rcx, rax
 00007ffa`b6426a06        mov      rdx, r10
 00007ffa`b6426a09        call     [System.Globalization.Ordinal:EqualsIgnoreCase_Scalar(byref,byref,int):ubyte]
                        ;; size=12 bbWeight=104.67 PerfScore 366.35
G_M30171_IG09:              ;; offset=0x006F
 00007ffa`b6426a0f        test     eax, eax
 00007ffa`b6426a11        jne      SHORT G_M30171_IG18
                        ;; size=4 bbWeight=169.06 PerfScore 211.33
G_M30171_IG10:              ;; offset=0x0073
 00007ffa`b6426a13        add      rdi, 8
 00007ffa`b6426a17        dec      ebp
 00007ffa`b6426a19        jne      SHORT G_M30171_IG04
                        ;; size=8 bbWeight=170.49 PerfScore 255.74
G_M30171_IG11:              ;; offset=0x007B
 00007ffa`b6426a1b        inc      esi
 00007ffa`b6426a1d        cmp      esi, 0x3241AE
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (cmp: 3 ; jcc erratum) 32B boundary ...............................
 00007ffa`b6426a23        jl       SHORT G_M30171_IG03

which could explain the relatively poor results on Coffee Lake / Windows.
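For reference, the JCC erratum applies to a jump (or a macro-fused compare-and-jump pair) that crosses a 32-byte boundary or ends on one. A minimal sketch of that check against the addresses in the listing above follows. `JccCheck` and `IsAffected` are hypothetical names; byte lengths are derived from the address deltas in the listing, the final `jl` is assumed to be the 2-byte short form, and each preceding `cmp`/`test` is assumed to fuse with its jump.

```csharp
using System;

// Report which branches from the listing touch a 32-byte boundary.
foreach (var b in JccCheck.Branches)
{
    Console.WriteLine($"{b.Name,-18} affected: {JccCheck.IsAffected(b.Start, b.Length)}");
}

static class JccCheck
{
    // (name, start address, total byte length) for each branch, where the span
    // includes the preceding cmp/test that is assumed to macro-fuse with the jump.
    public static readonly (string Name, ulong Start, int Length)[] Branches =
    {
        ("cmp+ja   -> IG20", 0x00007ffab64269d4UL, 12), // ends exactly at ...69e0
        ("test+jne -> IG14", 0x00007ffab64269edUL, 5),  // stays inside one 32B window
        ("cmp+jge  -> IG15", 0x00007ffab64269fdUL, 6),  // crosses ...6a00
        ("cmp+jl   -> IG03", 0x00007ffab6426a1dUL, 8),  // crosses ...6a20 (jl assumed 2 bytes)
    };

    // True if the byte span [start, start + length) crosses a boundary
    // or its last byte is immediately followed by one.
    public static bool IsAffected(ulong start, int length, uint boundary = 32)
    {
        ulong lastByte = start + (ulong)length - 1;
        bool crosses = start / boundary != lastByte / boundary;
        bool endsOnBoundary = (lastByte + 1) % boundary == 0;
        return crosses || endsOnBoundary;
    }
}
```

Under these assumptions the check flags the same three branches that the checked JIT annotates with "jcc erratum" and leaves the `test`/`jne` pair at `...69ed` unflagged.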

Also re-enabling "old layout" in .NET 9 closes the gap:

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-preview.7.24406.3
  [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-EWCTBT : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-WFMVIE : .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX2

EnvironmentVariables=DOTNET_JitDoReversePostOrderLayout=0

Method	Runtime	Mean	Ratio
Contains_Iterate	.NET 8.0	347.8 us	1.00
Contains_Iterate	.NET 9.0	368.7 us	1.06

AndyAyersMS commented 3 weeks ago

And perhaps this causes some sample skid or something into the prologs of the Scalar and Vector versions; their first instructions have unusually high counts...

;; net 8

Raw samples for [System.Private.CoreLib]Ordinal.EqualsIgnoreCase_Scalar(wchar&,wchar&,int32) at 0x00007FFACB5F48E0 -- 0x00007FFACB5F4A60 (length 0x0180)
[0x48E0] 0x0000 : 247
[0x48E1] 0x0001 : 214
[0x48E5] 0x0005 : 118
[0x48E8] 0x0008 : 97

Raw samples for [System.Private.CoreLib]Ordinal.EqualsIgnoreCase_Vector128(wchar&,wchar&,int32) at 0x00007FFACB5F4A80 -- 0x00007FFACB5F4BF0 (length 0x0170)
[0x4A80] 0x0000 : 163
[0x4A81] 0x0001 : 161
[0x4A82] 0x0002 : 20
[0x4A86] 0x0006 : 138

;; net 9

Raw samples for [System.Private.CoreLib]Ordinal.EqualsIgnoreCase_Scalar(wchar&,wchar&,int32) at 0x00007FFABE78AB00 -- 0x00007FFABE78AC75 (length 0x0175)
[0xAB00] 0x0000 : 467
[0xAB01] 0x0001 : 94
[0xAB05] 0x0005 : 290
[0xAB0B] 0x000B : 90

Raw samples for [System.Private.CoreLib]Ordinal.EqualsIgnoreCase_Vector(wchar&,wchar&,int32) at 0x00007FFABE78ADE0 -- 0x00007FFABE78AF5F (length 0x017F)
[0xADE0] 0x0000 : 483
[0xADE4] 0x0004 : 202
[0xADEA] 0x000A : 866
[0xADF4] 0x0014 : 19

AndyAyersMS commented 3 weeks ago

So looks like there is a modest (~10% ish) overall regression (perhaps in the Vector path) across several different HW models, and then additional more sizeable regressions on older Intel HW from JCC errata, the latter caused (indirectly) by the new block layout.

It also seems possible we're not getting PGO data for the benchmark method (or are getting it and resynthesizing on top); those 48/52 likelihood splits are suspicious... let me look at that next.

AndyAyersMS commented 3 weeks ago

The lack of profile data comes from inlining System.String:op_Implicit(System.String):System.ReadOnlySpan`1[ushort].

This method is R2R and does tier up:

1557: JIT compiled System.String:op_Implicit(System.String) [Instrumented Tier1, IL size=31, code size=32]
1614: JIT compiled System.String:op_Implicit(System.String) [Tier1, IL size=31, code size=32]

Since Contains_Iterate has loops and the inner loop iterates a lot, it seems possible we're diverting execution to the OSR methods (which would inline the above), and so we never see the instrumented version get called:

 669: JIT compiled Tests:Contains_Iterate() [Instrumented Tier0, IL size=81, code size=516]
1516: JIT compiled Tests:Contains_Iterate() [Tier1-OSR @0x3b with Dynamic PGO, IL size=81, code size=301]
1534: JIT compiled Tests:Contains_Iterate() [Tier1-OSR @0x45 with Dynamic PGO, IL size=81, code size=253]
1596: JIT compiled Tests:Contains_Iterate() [Instrumented Tier0, IL size=81, code size=465]
1601: JIT compiled Tests:Contains_Iterate() [Tier1-OSR @0x3b with Dynamic PGO, IL size=81, code size=301]
1611: JIT compiled Tests:Contains_Iterate() [Tier1-OSR @0x45 with Dynamic PGO, IL size=81, code size=253]
1620: JIT compiled Tests:Contains_Iterate() [Tier1 with Dynamic PGO, IL size=81, code size=209]

Though it seems odd the runtime tells the jit there is no profile data instead of profile data that's all zero (or something).

AndyAyersMS commented 1 week ago

Looking with VTune (not --apples so raw numbers not directly comparable) on my Coffee Lake box, the .NET 9 code is much less CPU efficient, and the prominent issue is frontend stalls. CPI grows from 0.322 to 0.399 which is 25%ish...

[VTune summary screenshot: .NET 8]

[VTune summary screenshot: .NET 9]

AndyAyersMS commented 1 week ago

Unfortunately, there's not much we can do here in .NET 9. We have discussed JCC errata mitigation but don't have a design yet on how this would be accomplished, or whether it could be enabled by default.

https://github.com/dotnet/runtime/issues/93243

Going to move this to future.

stephentoub commented 1 week ago

> Unfortunately, there's not much we can do here in .NET 9. We have discussed JCC errata mitigation but don't have a design yet on how this would be accomplished, or whether it could be enabled by default.

Reading between the lines, does this mean that the front-end stalls you mentioned in your previous comment are because of the JCC errata?

AndyAyersMS commented 1 week ago

> > Unfortunately, there's not much we can do here in .NET 9. We have discussed JCC errata mitigation but don't have a design yet on how this would be accomplished, or whether it could be enabled by default.
>
> Reading between the lines, does this mean that the front-end stalls you mentioned in your previous comment are because of the JCC errata?

Yes. With .NET 9 and the new block layout we end up with poorly aligned branches and on older Intel CPUs this incurs performance penalties. It's not the fault of the layout per se; any mitigation we'd do would happen later on. Looking across our benchmark suite, there doesn't seem to be any systematic impact from JCC errata and the new layout, just "random" improvements and regressions on the older Intel machines.
