3x perf regression on 13th Gen Intel Core (i7-13800H) maybe in readonly struct passed by ref

idg10 commented 2 months ago

Description

One of the benchmarks in our Corvus.JsonSchema library has wildly different performance characteristics on .NET 9.0 on certain hardware:

CPU	.NET 8.0	.NET 9.0 (9.0.24.40507)
Intel Core i9-9900K CPU 3.60GHz (Coffee Lake)	22ms	19ms
Intel Core i7-13800H	15ms	42ms

On the (rather old) Coffee Lake CPU, we see what you'd hope: .NET 9.0 is significantly faster than .NET 8.0.

On the much newer CPU (in a Surface Laptop Studio 2), in .NET 8.0 the benchmark runs a lot faster than on the old CPU, as is typical with a CPU that much newer. But running the same code on .NET 9.0 on that newer CPU is almost 3 times slower than with .NET 8.0 on the same CPU. (It's significantly slower even than .NET 8.0 on the much older CPU.)

To reproduce this, clone the repo from commit 2621745, run the Corvus.Json.Benchmark project, and select the ValidateLargeDocument benchmark.

The figures in the table above are for the ValidateLargeArrayCorvusV3 benchmark, but we see similar regressions (again, only on the newer CPU) for the ValidateLargeArrayCorvusV4 and ValidateLargeArrayCorvusValidator benchmarks.

We haven't yet succeeded in isolating whatever it is about this benchmark that produces these effects. So far our attempts to profile the code outside of BenchmarkDotNet haven't reproduced the issue. The code in question makes heavy use of System.Text.Json, so that's where we suspect the issue lies, but we can't prove that.

Configuration

Which version of .NET is the code running on?

NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2 and .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX2

What OS version, and what distro if applicable?

On the Coffee Lake machine (which doesn't have this problem) we're running Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3).

The newer machine (which exhibits the problem) is running Windows 11 (10.0.22621.4037/22H2/2022Update/SunValley2).

What is the architecture (x64, x86, ARM, ARM64)?

x64, Intel Core.

If relevant, what are the specs of the machine?

On the machine that does not reproduce the problem, the CPU is as already described. The machine has 64GB of memory. (I can provide more details if required.)

The machine on which we see the problem is a Surface Laptop Studio 2 with 64GB of RAM.

Regression

We're seeing the regression from .NET 8.0.7 to .NET 9.0.0-preview.7.24405.7 (BenchmarkDotNet reports this as 9.0.24.40506.)

We first observed this with .NET 9.0.0-preview.6.

Data

The Throughput benchmark results on the older (Coffee Lake) CPU (on which we don't see the regression) are:

| Method                            | Runtime  | RunStrategy | UnrollFactor | Mean      | Error     | StdDev    | Median    | Ratio | RatioSD | Gen0       | Gen1       | Gen2      | Allocated   | Alloc Ratio  |
|---------------------------------- |--------- |------------ |------------- |----------:|----------:|----------:|----------:|------:|--------:|-----------:|-----------:|----------:|------------:|-------------:|
| ValidateLargeArrayCorvusV3        | .NET 8.0 | Throughput  | 16           |  21.94 ms |  0.324 ms |  0.303 ms |  21.97 ms |  1.00 |    0.00 |          - |          - |         - |        23 B |         1.00 |
| ValidateLargeArrayCorvusV4        | .NET 8.0 | Throughput  | 16           |  21.39 ms |  0.246 ms |  0.218 ms |  21.36 ms |  0.97 |    0.01 |          - |          - |         - |        23 B |         1.00 |
| ValidateLargeArrayCorvusValidator | .NET 8.0 | Throughput  | 16           |  37.38 ms |  0.710 ms |  0.697 ms |  37.20 ms |  1.70 |    0.03 |          - |          - |         - |       961 B |        41.78 |
| ValidateLargeArrayJsonEverything  | .NET 8.0 | Throughput  | 16           | 674.64 ms | 12.023 ms | 11.246 ms | 677.29 ms | 30.76 |    0.70 | 26000.0000 | 21000.0000 | 3000.0000 | 193961040 B | 8,433,088.70 |
| ValidateLargeArrayCorvusV3        | .NET 9.0 | Throughput  | 16           |  18.58 ms |  0.271 ms |  0.241 ms |  18.50 ms |  0.85 |    0.01 |          - |          - |         - |        34 B |         1.48 |
| ValidateLargeArrayCorvusV4        | .NET 9.0 | Throughput  | 16           |  17.97 ms |  0.232 ms |  0.193 ms |  17.96 ms |  0.82 |    0.01 |          - |          - |         - |        34 B |         1.48 |
| ValidateLargeArrayCorvusValidator | .NET 9.0 | Throughput  | 16           |  38.59 ms |  2.037 ms |  5.975 ms |  37.90 ms |  1.63 |    0.24 |          - |          - |         - |      4936 B |       214.61 |
| ValidateLargeArrayJsonEverything  | .NET 9.0 | Throughput  | 16           | 677.66 ms | 10.140 ms |  8.468 ms | 677.89 ms | 30.89 |    0.56 | 25000.0000 | 20000.0000 | 2000.0000 | 192861320 B | 8,385,274.78 |

As you can see, on that CPU .NET 9.0 does better than .NET 8.0.

Here are the same results for the 13th gen CPU in the Surface Laptop Studio 2:

| Method                            | Runtime  | RunStrategy | UnrollFactor | Mean      | Error    | StdDev    | Ratio | RatioSD | Gen0       | Gen1       | Gen2      | Allocated   | Alloc Ratio   |
|---------------------------------- |--------- |------------ |------------- |----------:|---------:|----------:|------:|--------:|-----------:|-----------:|----------:|------------:|--------------:|
| ValidateLargeArrayCorvusV3        | .NET 8.0 | Throughput  | 16           |  14.63 ms | 0.263 ms |  0.205 ms |  1.00 |    0.00 |          - |          - |         - |        12 B |          1.00 |
| ValidateLargeArrayCorvusV4        | .NET 8.0 | Throughput  | 16           |  13.39 ms | 0.075 ms |  0.070 ms |  0.92 |    0.01 |          - |          - |         - |        12 B |          1.00 |
| ValidateLargeArrayCorvusValidator | .NET 8.0 | Throughput  | 16           |  22.67 ms | 0.060 ms |  0.054 ms |  1.55 |    0.02 |          - |          - |         - |       902 B |         75.17 |
| ValidateLargeArrayJsonEverything  | .NET 8.0 | Throughput  | 16           | 390.16 ms | 7.662 ms | 15.651 ms | 27.00 |    1.44 | 17000.0000 | 14000.0000 | 2000.0000 | 193948488 B | 16,162,374.00 |
| ValidateLargeArrayCorvusV3        | .NET 9.0 | Throughput  | 16           |  42.31 ms | 0.151 ms |  0.141 ms |  2.90 |    0.04 |          - |          - |         - |        61 B |          5.08 |
| ValidateLargeArrayCorvusV4        | .NET 9.0 | Throughput  | 16           |  38.11 ms | 0.158 ms |  0.148 ms |  2.61 |    0.04 |          - |          - |         - |        56 B |          4.67 |
| ValidateLargeArrayCorvusValidator | .NET 9.0 | Throughput  | 16           |  59.78 ms | 0.105 ms |  0.098 ms |  4.09 |    0.06 |          - |          - |         - |      1306 B |        108.83 |
| ValidateLargeArrayJsonEverything  | .NET 9.0 | Throughput  | 16           | 397.82 ms | 7.837 ms | 13.931 ms | 27.55 |    0.93 | 17000.0000 | 14000.0000 | 2000.0000 | 192858296 B | 16,071,524.67 |

As you can see, the .NET 9.0 numbers here (same preview build of .NET 9.0 - preview.7) for the first 3 benchmarks are significantly slower than for .NET 8.0. (And they are significantly slower than the same benchmarks on the much older CPU for either .NET 8.0 or 9.0 preview 7.)

EgorBo commented 2 months ago

I think we may be seeing AVX transition penalties at play. T

Nice catch! I guess you can then add _mm256_zeroupper there to close this issue 🙂

AndyAyersMS commented 2 months ago

Ok, well if that's the problem, it is good news, it should be fixable in time for the release. It's not the first time we've been burned by SSE2 in the native parts of the runtime, and probably not the last.

I was looking to see if there was any intel doc saying the AVX transition penalty was especially bad on the E (Gracemont) cores but no luck so far. @anthonycanino sorry to ping you randomly, but any chance you happen to know?

tannergooding commented 2 months ago

We should be annotating any CT_HELPERS that are known to use legacy encoded SIMD here in GenTreeCall::NeedsVzeroupper(Compiler* comp): https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp

We default to checking for any that use floating-point returns/registers, but if they don't use it directly then we need to annotate explicitly instead.

mwadams commented 2 months ago

Thanks @AndyAyersMS and @EgorBo for persisting with this. It's a big deal for us, and it looks like this is actually going to have a positive impact rather than just eliminating a hit!

EgorBo commented 2 months ago

We should be annotating any CT_HELPERS that are known to use legacy encoded SIMD here in GenTreeCall::NeedsVzeroupper(Compiler* comp): https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp

We default to checking for any that use floating-point returns/registers, but if they don't use it directly then we need to annotate explicitly instead.

I think it's better to insert it in the method itself - the helper is not only invoked by JIT, VM also call it from various places - not sure if the native compiler is smart enough to emit it where needed (e.g. we use AVX in GC code and may compile runtime with AVX in future)

AndyAyersMS commented 2 months ago

@EgorBo not sure if can we use vzeroupper in the runtime itself -- is it a no-op on SSE2 HW?

EgorBo commented 2 months ago

@EgorBo not sure if can we use vzeroupper in the runtime itself -- is it a no-op on SSE2 HW?

ah good point, that is jit then

tannergooding commented 2 months ago

It is not, it will fault on hardware without AVX support.

We could add a separate code path that checks for AVX support and then uses VEX encoded instructions (either a separate VEX loop or using vzeroupper) but that's more work and risk, and in the case of it being called from a method not using upper state may add hidden cost.

The NeedsVzeroupper API and its use-sites are setup to follow the official optimization manual guidance on where/when to insert vzeroupper, so if we're annotating the helpers correctly things should just work

AndyAyersMS commented 2 months ago

Thanks @AndyAyersMS and @EgorBo for persisting with this. It's a big deal for us, and it looks like this is actually going to have a positive impact rather than just eliminating a hit!

And thanks to you and @idg10 for trying preview 7 and reporting the problem. Likely this might have gone unnoticed until the official release came out.

AndyAyersMS commented 2 months ago

@EgorBo are you going to work up a fix or do you want me to do it?

EgorBo commented 2 months ago

@EgorBo are you going to work up a fix or do you want me to do it?

I can if you want me to do so 🙂 but if Tanner is right, it is supposed to be a single-line like change in NeedsVzeroupper

mwadams commented 2 months ago

And thanks to you and @idg10 for trying preview 7 and reporting the problem. Likely this might have gone unnoticed until the official release came out.

We just want these graphs to keep going in the right direction :)

https://endjin.com/blog/2023/12/how-dotnet-8-boosted-json-schema-performance-by-20-percent-for-free https://endjin.com/blog/2023/11/how-dotnet-8-boosted-ais-dotnet-performance-by-27-percent-for-free

AndyAyersMS commented 2 months ago

@EgorBo are you going to work up a fix or do you want me to do it?

I can if you want me to do so 🙂 but if Tanner is right, it is supposed to be a single-line like change in NeedsVzeroupper

Ok, I can put up a fix then.

mwadams commented 2 months ago

Let me know if there is a build you want me to help verify.

AndyAyersMS commented 2 months ago

I can get you a 9p7 compatible jit with a fix, if you're comfortable monkey-patching it onto your existing runtime install. Otherwise there won't be a fixed 9.0 version until RC2 (mid october).

EgorBo commented 2 months ago

Let me change my benchmark to do some AVX heavy-lifting first:

@EgorBot -intel --runtimes net8.0 net9.0

using System;
using BenchmarkDotNet.Attributes;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<MyBench>(args: args);

public class MyBench
{

    [Benchmark]
    public void NotInHeap()
    {
        Struct1 s1 = new();
        Struct1 s2 = new();
        ByrefCopy(ref s1, s2);
    }

    Struct1 dst1;

    [Benchmark]
    public void InHeap_Empty()
    {
        ByrefCopy(ref dst1, default);
    }

    [Benchmark]
    public void InHeap_Ephemeral()
    {
        ByrefCopy(ref dst1, new Struct1(1, null, 3, new Struct2(4, null, 6, null)));
    }

    private byte[] array = new byte[1000];

    [MethodImpl(MethodImplOptions.NoInlining)]
    public bool ByrefCopy(ref Struct1 dst, Struct1 src)
    {
        // Do some inline AVX:
        array.AsSpan(0, 128).Clear();

        dst = src;
        return true;
    }
}

public record struct Struct1(
    object a1, object a2,
    long a3, Struct2 g);

public record struct Struct2(
    object a1, object a2, object a3,
    object a4);

EgorBot commented 2 months ago

Benchmark results on Intel

``` BenchmarkDotNet v0.14.0, Ubuntu 22.04.4 LTS (Jammy Jellyfish) Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores Job-LLCCCZ : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI Job-YEZQNO : .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI ``` | Method | Runtime | Mean | Error | Ratio | |----------------- |--------- |---------:|---------:|------:| | NotInHeap | .NET 8.0 | 21.78 ns | 0.039 ns | 1.00 | | NotInHeap | .NET 9.0 | 12.33 ns | 0.002 ns | 0.57 | | | | | | | | InHeap_Empty | .NET 8.0 | 26.23 ns | 0.055 ns | 1.00 | | InHeap_Empty | .NET 9.0 | 12.04 ns | 0.012 ns | 0.46 | | | | | | | | InHeap_Ephemeral | .NET 8.0 | 44.00 ns | 0.450 ns | 1.00 | | InHeap_Ephemeral | .NET 9.0 | 29.37 ns | 0.459 ns | 0.67 | [BDN_Artifacts.zip](https://telegafiles.blob.core.windows.net/telega/BDN_Artifacts_70cda8eb.zip)

mwadams commented 2 months ago

I can get you a 9p7 compatible jit with a fix, if you're comfortable monkey-patching it onto your existing runtime install. Otherwise there won't be a fixed 9.0 version until RC2 (mid october).

Quite happy to monkey patch (with instructions!)

mwadams commented 2 months ago

@EgorBo Running your benchmark: those numbers are quite definitive

Method	Runtime	Mean	Error	StdDev	Ratio	RatioSD
NotInHeap	.NET 8.0	10.88 ns	0.095 ns	0.084 ns	1.00	0.01
NotInHeap	.NET 9.0	84.15 ns	0.583 ns	0.545 ns	7.74	0.08

InHeap_Empty	.NET 8.0	12.00 ns	0.105 ns	0.093 ns	1.00	0.01
InHeap_Empty	.NET 9.0	94.68 ns	1.862 ns	1.992 ns	7.89	0.17

InHeap_Ephemeral	.NET 8.0	21.63 ns	0.299 ns	0.265 ns	1.00	0.02
InHeap_Ephemeral	.NET 9.0	100.32 ns	0.900 ns	0.702 ns	4.64	0.06

EgorBo commented 2 months ago

@EgorBo Running your benchmark: those numbers are quite definitive

Thanks! that definitely confirms the issue @AndyAyersMS found. It also likely means that Linux is not affected

AndyAyersMS commented 2 months ago

I can get you a 9p7 compatible jit with a fix, if you're comfortable monkey-patching it onto your existing runtime install. Otherwise there won't be a fixed 9.0 version until RC2 (mid october).

Quite happy to monkey patch (with instructions!)

Save and unzip the attached. As admin, copy this DLL to the preview 7 folder (you may want to copy off the existing jit first so you can put things back later). For me this is at

C:\Program Files\dotnet\shared\Microsoft.NETCore.App\9.0.0-preview.7.24405.7

clrjit.zip

This is a checked build of the jit so it will be 5MB+ when unpacked. If you enable disassembly with this in place you will see more verbose outputs.

EgorBo commented 2 months ago

For me this is at

Can be obtained via

dotnet --list-runtimes | Select-String -Pattern "Microsoft.NETCore.App 9.0"

(powershell)

mwadams commented 2 months ago

With the monkey-patched DLL

Method	Runtime	Mean	Error	StdDev	Ratio
ProcessString	.NET 8.0	212.71 ns	2.491 ns	1.648 ns	1.00
ProcessString	.NET 9.0	95.75 ns	0.827 ns	0.733 ns	0.45

And the original 3x perf hit benchmark:

Method	Runtime	Mean	Error	StdDev	Ratio	RatioSD	Allocated	Alloc Ratio
ValidateLargeArrayCorvusV4	.NET 8.0	13.48 ms	0.187 ms	0.175 ms	1.00	0.02	12 B	1.00
ValidateLargeArrayCorvusV4	.NET 9.0	11.23 ms	0.049 ms	0.045 ms	0.83	0.01	17 B	1.42

We're now running 17% faster on .NET 9.0 (which puts the graph back on track!)

dotnet / runtime