Closed idg10 closed 2 months ago
I think we may be seeing AVX transition penalties at play. T
Nice catch! I guess you can then add _mm256_zeroupper
there to close this issue 🙂
Ok, well if that's the problem, it is good news, it should be fixable in time for the release. It's not the first time we've been burned by SSE2 in the native parts of the runtime, and probably not the last.
I was looking to see if there was any intel doc saying the AVX transition penalty was especially bad on the E (Gracemont) cores but no luck so far. @anthonycanino sorry to ping you randomly, but any chance you happen to know?
We should be annotating any CT_HELPERS
that are known to use legacy encoded SIMD here in GenTreeCall::NeedsVzeroupper(Compiler* comp)
: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp
We default to checking for any that use floating-point returns/registers, but if they don't use it directly then we need to annotate explicitly instead.
Thanks @AndyAyersMS and @EgorBo for persisting with this. It's a big deal for us, and it looks like this is actually going to have a positive impact rather than just eliminating a hit!
We should be annotating any
CT_HELPERS
that are known to use legacy encoded SIMD here inGenTreeCall::NeedsVzeroupper(Compiler* comp)
: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cppWe default to checking for any that use floating-point returns/registers, but if they don't use it directly then we need to annotate explicitly instead.
I think it's better to insert it in the method itself - the helper is not only invoked by JIT, VM also call it from various places - not sure if the native compiler is smart enough to emit it where needed (e.g. we use AVX in GC code and may compile runtime with AVX in future)
@EgorBo not sure if can we use vzeroupper
in the runtime itself -- is it a no-op on SSE2 HW?
@EgorBo not sure if can we use
vzeroupper
in the runtime itself -- is it a no-op on SSE2 HW?
ah good point, that is jit then
It is not, it will fault on hardware without AVX support.
We could add a separate code path that checks for AVX support and then uses VEX encoded instructions (either a separate VEX loop or using vzeroupper
) but that's more work and risk, and in the case of it being called from a method not using upper state may add hidden cost.
The NeedsVzeroupper
API and its use-sites are setup to follow the official optimization manual guidance on where/when to insert vzeroupper, so if we're annotating the helpers correctly things should just work
Thanks @AndyAyersMS and @EgorBo for persisting with this. It's a big deal for us, and it looks like this is actually going to have a positive impact rather than just eliminating a hit!
And thanks to you and @idg10 for trying preview 7 and reporting the problem. Likely this might have gone unnoticed until the official release came out.
@EgorBo are you going to work up a fix or do you want me to do it?
@EgorBo are you going to work up a fix or do you want me to do it?
I can if you want me to do so 🙂 but if Tanner is right, it is supposed to be a single-line like change in NeedsVzeroupper
And thanks to you and @idg10 for trying preview 7 and reporting the problem. Likely this might have gone unnoticed until the official release came out.
We just want these graphs to keep going in the right direction :)
https://endjin.com/blog/2023/12/how-dotnet-8-boosted-json-schema-performance-by-20-percent-for-free https://endjin.com/blog/2023/11/how-dotnet-8-boosted-ais-dotnet-performance-by-27-percent-for-free
@EgorBo are you going to work up a fix or do you want me to do it?
I can if you want me to do so 🙂 but if Tanner is right, it is supposed to be a single-line like change in
NeedsVzeroupper
Ok, I can put up a fix then.
Let me know if there is a build you want me to help verify.
I can get you a 9p7 compatible jit with a fix, if you're comfortable monkey-patching it onto your existing runtime install. Otherwise there won't be a fixed 9.0 version until RC2 (mid october).
Let me change my benchmark to do some AVX heavy-lifting first:
@EgorBot -intel --runtimes net8.0 net9.0
using System;
using BenchmarkDotNet.Attributes;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Running;
BenchmarkRunner.Run<MyBench>(args: args);
public class MyBench
{
[Benchmark]
public void NotInHeap()
{
Struct1 s1 = new();
Struct1 s2 = new();
ByrefCopy(ref s1, s2);
}
Struct1 dst1;
[Benchmark]
public void InHeap_Empty()
{
ByrefCopy(ref dst1, default);
}
[Benchmark]
public void InHeap_Ephemeral()
{
ByrefCopy(ref dst1, new Struct1(1, null, 3, new Struct2(4, null, 6, null)));
}
private byte[] array = new byte[1000];
[MethodImpl(MethodImplOptions.NoInlining)]
public bool ByrefCopy(ref Struct1 dst, Struct1 src)
{
// Do some inline AVX:
array.AsSpan(0, 128).Clear();
dst = src;
return true;
}
}
public record struct Struct1(
object a1, object a2,
long a3, Struct2 g);
public record struct Struct2(
object a1, object a2, object a3,
object a4);
I can get you a 9p7 compatible jit with a fix, if you're comfortable monkey-patching it onto your existing runtime install. Otherwise there won't be a fixed 9.0 version until RC2 (mid october).
Quite happy to monkey patch (with instructions!)
@EgorBo Running your benchmark: those numbers are quite definitive
Method | Runtime | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|---|
NotInHeap | .NET 8.0 | 10.88 ns | 0.095 ns | 0.084 ns | 1.00 | 0.01 |
NotInHeap | .NET 9.0 | 84.15 ns | 0.583 ns | 0.545 ns | 7.74 | 0.08 |
InHeap_Empty | .NET 8.0 | 12.00 ns | 0.105 ns | 0.093 ns | 1.00 | 0.01 |
InHeap_Empty | .NET 9.0 | 94.68 ns | 1.862 ns | 1.992 ns | 7.89 | 0.17 |
InHeap_Ephemeral | .NET 8.0 | 21.63 ns | 0.299 ns | 0.265 ns | 1.00 | 0.02 |
InHeap_Ephemeral | .NET 9.0 | 100.32 ns | 0.900 ns | 0.702 ns | 4.64 | 0.06 |
@EgorBo Running your benchmark: those numbers are quite definitive
Thanks! that definitely confirms the issue @AndyAyersMS found. It also likely means that Linux is not affected
I can get you a 9p7 compatible jit with a fix, if you're comfortable monkey-patching it onto your existing runtime install. Otherwise there won't be a fixed 9.0 version until RC2 (mid october).
Quite happy to monkey patch (with instructions!)
Save and unzip the attached. As admin, copy this DLL to the preview 7 folder (you may want to copy off the existing jit first so you can put things back later). For me this is at
C:\Program Files\dotnet\shared\Microsoft.NETCore.App\9.0.0-preview.7.24405.7
This is a checked build of the jit so it will be 5MB+ when unpacked. If you enable disassembly with this in place you will see more verbose outputs.
For me this is at
Can be obtained via
dotnet --list-runtimes | Select-String -Pattern "Microsoft.NETCore.App 9.0"
(powershell)
With the monkey-patched DLL
Method | Runtime | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|
ProcessString | .NET 8.0 | 212.71 ns | 2.491 ns | 1.648 ns | 1.00 |
ProcessString | .NET 9.0 | 95.75 ns | 0.827 ns | 0.733 ns | 0.45 |
And the original 3x perf hit benchmark:
Method | Runtime | Mean | Error | StdDev | Ratio | RatioSD | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|
ValidateLargeArrayCorvusV4 | .NET 8.0 | 13.48 ms | 0.187 ms | 0.175 ms | 1.00 | 0.02 | 12 B | 1.00 |
ValidateLargeArrayCorvusV4 | .NET 9.0 | 11.23 ms | 0.049 ms | 0.045 ms | 0.83 | 0.01 | 17 B | 1.42 |
We're now running 17% faster on .NET 9.0 (which puts the graph back on track!)
Description
One of the benchmarks in our
Corvus.JsonSchema
library has wildly different performance characteristics on .NET 9.0 on certain hardware:On the (rather old) Coffee Lake CPU, we see what you'd hope: .NET 9.0 is significantly faster than .NET 8.0.
On the much newer CPU (in a Surface Laptop Studio 2), in .NET 8.0 the benchmark runs a lot faster than on the old CPU, as is typical with a CPU that much newer. But running the same code on .NET 9.0 on that newer CPU is almost 3 times slower than with .NET 8.0 on the same CPU. (It's significantly slower even than .NET 8.0 on the much older CPU.)
To reproduce this, clone the repo from commit 2621745, run the
Corvus.Json.Benchmark
project, and select theValidateLargeDocument
benchmark.The figures in the table above are for the
ValidateLargeArrayCorvusV3
benchmark, but we see similar regressions (again, only on the newer CPU) for theValidateLargeArrayCorvusV4
andValidateLargeArrayCorvusValidator
benchmarks.We haven't yet succeeded in isolating whatever it is about this benchmark that produces these effects. So far our attempts to profile the code outside of BenchmarkDotNet haven't reproduced the issue. The code in question makes heavy use of
System.Text.Json
, so that's where we suspect the issue lies, but we can't prove that.Configuration
NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2 and .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX2
On the Coffee Lake machine (which doesn't have this problem) we're running Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3).
The newer machine (which exhibits the problem) is running Windows 11 (10.0.22621.4037/22H2/2022Update/SunValley2).
x64, Intel Core.
On the machine that does not reproduce the problem, the CPU is as already described. The machine has 64GB of memory. (I can provide more details if required.)
The machine on which we see the problem is a Surface Laptop Studio 2 with 64GB of RAM.
Regression
We're seeing the regression from .NET 8.0.7 to .NET 9.0.0-preview.7.24405.7 (BenchmarkDotNet reports this as 9.0.24.40506.)
We first observed this with .NET 9.0.0-preview.6.
Data
The
Throughput
benchmark results on the older (Coffee Lake) CPU (on which we don't see the regression) are:As you can see, on that CPU .NET 9.0 does better than .NET 8.0.
Here are the same results for the 13th gen CPU in the Surface Laptop Studio 2:
As you can see, the .NET 9.0 numbers here (same preview build of .NET 9.0 - preview.7) for the first 3 benchmarks are significantly slower than for .NET 8.0. (And they are significantly slower than the same benchmarks on the much older CPU for either .NET 8.0 or 9.0 preview 7.)