dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.97k stars 4.66k forks source link

3x perf regression on 13th Gen Intel Core (i7-13800H) maybe in readonly struct passed by ref #106679

Closed idg10 closed 3 weeks ago

idg10 commented 1 month ago

Description

One of the benchmarks in our Corvus.JsonSchema library has wildly different performance characteristics on .NET 9.0 on certain hardware:

CPU .NET 8.0 .NET 9.0 (9.0.24.40507)
Intel Core i9-9900K CPU 3.60GHz (Coffee Lake) 22ms 19ms
Intel Core i7-13800H 15ms 42ms

On the (rather old) Coffee Lake CPU, we see what you'd hope: .NET 9.0 is significantly faster than .NET 8.0.

On the much newer CPU (in a Surface Laptop Studio 2), in .NET 8.0 the benchmark runs a lot faster than on the old CPU, as is typical with a CPU that much newer. But running the same code on .NET 9.0 on that newer CPU is almost 3 times slower than with .NET 8.0 on the same CPU. (It's significantly slower even than .NET 8.0 on the much older CPU.)

To reproduce this, clone the repo from commit 2621745, run the Corvus.Json.Benchmark project, and select the ValidateLargeDocument benchmark.

The figures in the table above are for the ValidateLargeArrayCorvusV3 benchmark, but we see similar regressions (again, only on the newer CPU) for the ValidateLargeArrayCorvusV4 and ValidateLargeArrayCorvusValidator benchmarks.

We haven't yet succeeded in isolating whatever it is about this benchmark that produces these effects. So far our attempts to profile the code outside of BenchmarkDotNet haven't reproduced the issue. The code in question makes heavy use of System.Text.Json, so that's where we suspect the issue lies, but we can't prove that.

Configuration

Which version of .NET is the code running on?

NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2 and .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX2

What OS version, and what distro if applicable?

On the Coffee Lake machine (which doesn't have this problem) we're running Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3).

The newer machine (which exhibits the problem) is running Windows 11 (10.0.22621.4037/22H2/2022Update/SunValley2).

What is the architecture (x64, x86, ARM, ARM64)?

x64, Intel Core.

If relevant, what are the specs of the machine?

On the machine that does not reproduce the problem, the CPU is as already described. The machine has 64GB of memory. (I can provide more details if required.)

The machine on which we see the problem is a Surface Laptop Studio 2 with 64GB of RAM.

Regression

We're seeing the regression from .NET 8.0.7 to .NET 9.0.0-preview.7.24405.7 (BenchmarkDotNet reports this as 9.0.24.40506.)

We first observed this with .NET 9.0.0-preview.6.

Data

The Throughput benchmark results on the older (Coffee Lake) CPU (on which we don't see the regression) are:

| Method                            | Runtime  | RunStrategy | UnrollFactor | Mean      | Error     | StdDev    | Median    | Ratio | RatioSD | Gen0       | Gen1       | Gen2      | Allocated   | Alloc Ratio  |
|---------------------------------- |--------- |------------ |------------- |----------:|----------:|----------:|----------:|------:|--------:|-----------:|-----------:|----------:|------------:|-------------:|
| ValidateLargeArrayCorvusV3        | .NET 8.0 | Throughput  | 16           |  21.94 ms |  0.324 ms |  0.303 ms |  21.97 ms |  1.00 |    0.00 |          - |          - |         - |        23 B |         1.00 |
| ValidateLargeArrayCorvusV4        | .NET 8.0 | Throughput  | 16           |  21.39 ms |  0.246 ms |  0.218 ms |  21.36 ms |  0.97 |    0.01 |          - |          - |         - |        23 B |         1.00 |
| ValidateLargeArrayCorvusValidator | .NET 8.0 | Throughput  | 16           |  37.38 ms |  0.710 ms |  0.697 ms |  37.20 ms |  1.70 |    0.03 |          - |          - |         - |       961 B |        41.78 |
| ValidateLargeArrayJsonEverything  | .NET 8.0 | Throughput  | 16           | 674.64 ms | 12.023 ms | 11.246 ms | 677.29 ms | 30.76 |    0.70 | 26000.0000 | 21000.0000 | 3000.0000 | 193961040 B | 8,433,088.70 |
| ValidateLargeArrayCorvusV3        | .NET 9.0 | Throughput  | 16           |  18.58 ms |  0.271 ms |  0.241 ms |  18.50 ms |  0.85 |    0.01 |          - |          - |         - |        34 B |         1.48 |
| ValidateLargeArrayCorvusV4        | .NET 9.0 | Throughput  | 16           |  17.97 ms |  0.232 ms |  0.193 ms |  17.96 ms |  0.82 |    0.01 |          - |          - |         - |        34 B |         1.48 |
| ValidateLargeArrayCorvusValidator | .NET 9.0 | Throughput  | 16           |  38.59 ms |  2.037 ms |  5.975 ms |  37.90 ms |  1.63 |    0.24 |          - |          - |         - |      4936 B |       214.61 |
| ValidateLargeArrayJsonEverything  | .NET 9.0 | Throughput  | 16           | 677.66 ms | 10.140 ms |  8.468 ms | 677.89 ms | 30.89 |    0.56 | 25000.0000 | 20000.0000 | 2000.0000 | 192861320 B | 8,385,274.78 |

As you can see, on that CPU .NET 9.0 does better than .NET 8.0.

Here are the same results for the 13th gen CPU in the Surface Laptop Studio 2:

| Method                            | Runtime  | RunStrategy | UnrollFactor | Mean      | Error    | StdDev    | Ratio | RatioSD | Gen0       | Gen1       | Gen2      | Allocated   | Alloc Ratio   |
|---------------------------------- |--------- |------------ |------------- |----------:|---------:|----------:|------:|--------:|-----------:|-----------:|----------:|------------:|--------------:|
| ValidateLargeArrayCorvusV3        | .NET 8.0 | Throughput  | 16           |  14.63 ms | 0.263 ms |  0.205 ms |  1.00 |    0.00 |          - |          - |         - |        12 B |          1.00 |
| ValidateLargeArrayCorvusV4        | .NET 8.0 | Throughput  | 16           |  13.39 ms | 0.075 ms |  0.070 ms |  0.92 |    0.01 |          - |          - |         - |        12 B |          1.00 |
| ValidateLargeArrayCorvusValidator | .NET 8.0 | Throughput  | 16           |  22.67 ms | 0.060 ms |  0.054 ms |  1.55 |    0.02 |          - |          - |         - |       902 B |         75.17 |
| ValidateLargeArrayJsonEverything  | .NET 8.0 | Throughput  | 16           | 390.16 ms | 7.662 ms | 15.651 ms | 27.00 |    1.44 | 17000.0000 | 14000.0000 | 2000.0000 | 193948488 B | 16,162,374.00 |
| ValidateLargeArrayCorvusV3        | .NET 9.0 | Throughput  | 16           |  42.31 ms | 0.151 ms |  0.141 ms |  2.90 |    0.04 |          - |          - |         - |        61 B |          5.08 |
| ValidateLargeArrayCorvusV4        | .NET 9.0 | Throughput  | 16           |  38.11 ms | 0.158 ms |  0.148 ms |  2.61 |    0.04 |          - |          - |         - |        56 B |          4.67 |
| ValidateLargeArrayCorvusValidator | .NET 9.0 | Throughput  | 16           |  59.78 ms | 0.105 ms |  0.098 ms |  4.09 |    0.06 |          - |          - |         - |      1306 B |        108.83 |
| ValidateLargeArrayJsonEverything  | .NET 9.0 | Throughput  | 16           | 397.82 ms | 7.837 ms | 13.931 ms | 27.55 |    0.93 | 17000.0000 | 14000.0000 | 2000.0000 | 192858296 B | 16,071,524.67 |

As you can see, the .NET 9.0 numbers here (same preview build of .NET 9.0 - preview.7) for the first 3 benchmarks are significantly slower than for .NET 8.0. (And they are significantly slower than the same benchmarks on the much older CPU for either .NET 8.0 or 9.0 preview 7.)

huoyaoyuan commented 1 month ago

How about installing 9.0 package of System.Text.Json on 8.0 runtime? This helps identifying the problem in JIT or new code pattern used in System.Text.Json.

idg10 commented 1 month ago

Using the 9.0 System.Text.Json package on 8.0 doesn't seem to change anything:

| Method                            | Runtime  | RunStrategy | UnrollFactor | Mean      | Error    | StdDev    | Ratio | RatioSD | Gen0       | Gen1       | Gen2      | Allocated   | Alloc Ratio   |
|---------------------------------- |--------- |------------ |------------- |----------:|---------:|----------:|------:|--------:|-----------:|-----------:|----------:|------------:|--------------:|
| ValidateLargeArrayCorvusV3        | .NET 8.0 | Throughput  | 16           |  12.66 ms | 0.085 ms |  0.080 ms |  1.00 |    0.01 |          - |          - |         - |        12 B |          1.00 |
| ValidateLargeArrayCorvusV4        | .NET 8.0 | Throughput  | 16           |  12.93 ms | 0.140 ms |  0.131 ms |  1.02 |    0.01 |          - |          - |         - |        12 B |          1.00 |
| ValidateLargeArrayCorvusValidator | .NET 8.0 | Throughput  | 16           |  21.74 ms | 0.049 ms |  0.046 ms |  1.72 |    0.01 |          - |          - |         - |       920 B |         76.67 |
| ValidateLargeArrayJsonEverything  | .NET 8.0 | Throughput  | 16           | 408.42 ms | 8.010 ms |  8.226 ms | 32.27 |    0.66 | 17000.0000 | 14000.0000 | 2000.0000 | 195061544 B | 16,255,128.67 |
| ValidateLargeArrayCorvusV3        | .NET 9.0 | Throughput  | 16           |  42.09 ms | 0.093 ms |  0.073 ms |  3.33 |    0.02 |          - |          - |         - |        65 B |          5.42 |
| ValidateLargeArrayCorvusV4        | .NET 9.0 | Throughput  | 16           |  38.18 ms | 0.144 ms |  0.135 ms |  3.02 |    0.02 |          - |          - |         - |        56 B |          4.67 |
| ValidateLargeArrayCorvusValidator | .NET 9.0 | Throughput  | 16           |  60.64 ms | 0.253 ms |  0.237 ms |  4.79 |    0.03 |          - |          - |         - |      1338 B |        111.50 |
| ValidateLargeArrayJsonEverything  | .NET 9.0 | Throughput  | 16           | 392.64 ms | 7.740 ms | 13.956 ms | 31.02 |    1.11 | 17000.0000 | 14000.0000 | 2000.0000 | 192693192 B | 16,057,766.00 |

So this looks likely to be a JIT issue of some kind. (It might still be something that System.Text.Json runs into, but it doesn't appear to be a change in that library that causes this.)

mwadams commented 1 month ago

Running the same assemblies compiled for net80 on both the .NET 8 and .NET 9 runtimes produces the same results, so it looks very likely that this is a JIT-related regression.

| Method                            | Runtime  | RunStrategy | UnrollFactor | Mean      | Error    | StdDev    | Ratio | RatioSD | Gen0       | Gen1       | Gen2      | Allocated   | Alloc Ratio   |
|---------------------------------- |--------- |------------ |------------- |----------:|---------:|----------:|------:|--------:|-----------:|-----------:|----------:|------------:|--------------:|
| ValidateLargeArrayCorvusV3        | .NET 8.0 | Throughput  | 16           |  12.64 ms | 0.096 ms |  0.085 ms |  1.00 |    0.00 |          - |          - |         - |        12 B |          1.00 |
| ValidateLargeArrayCorvusV4        | .NET 8.0 | Throughput  | 16           |  13.01 ms | 0.200 ms |  0.187 ms |  1.03 |    0.01 |          - |          - |         - |        12 B |          1.00 |
| ValidateLargeArrayCorvusValidator | .NET 8.0 | Throughput  | 16           |  22.39 ms | 0.071 ms |  0.067 ms |  1.77 |    0.01 |          - |          - |         - |       920 B |         76.67 |
| ValidateLargeArrayJsonEverything  | .NET 8.0 | Throughput  | 16           | 389.74 ms | 7.713 ms | 15.582 ms | 29.71 |    0.83 | 17000.0000 | 14000.0000 | 2000.0000 | 193948480 B | 16,162,373.33 |
| ValidateLargeArrayCorvusV3        | .NET 9.0 | Throughput  | 16           |  42.43 ms | 0.261 ms |  0.244 ms |  3.36 |    0.03 |          - |          - |         - |        65 B |          5.42 |
| ValidateLargeArrayCorvusV4        | .NET 9.0 | Throughput  | 16           |  38.33 ms | 0.151 ms |  0.134 ms |  3.03 |    0.02 |          - |          - |         - |        77 B |          6.42 |
| ValidateLargeArrayCorvusValidator | .NET 9.0 | Throughput  | 16           |  60.24 ms | 0.253 ms |  0.224 ms |  4.76 |    0.04 |          - |          - |         - |      1338 B |        111.50 |
| ValidateLargeArrayJsonEverything  | .NET 9.0 | Throughput  | 16           | 415.50 ms | 8.054 ms |  7.533 ms | 32.83 |    0.69 | 17000.0000 | 14000.0000 | 2000.0000 | 192858880 B | 16,071,573.33 |
Tornhoof commented 1 month ago

That CPU is one of those P/E Core thingies isn't it? You could try with --affinity, as described here https://github.com/dotnet/BenchmarkDotNet/blob/master/docs/articles/guides/console-args.md something like --affinity 1 and --affinity 65536 to see if there are any differences

huoyaoyuan commented 1 month ago

Note the difference of Allocated column. It's probably caused by higher GC pressure.

You can try to use allocation profiler of VS to diagnose what's allocated more.

EgorBo commented 1 month ago

You may want to disable DATAS as well to see if the regression coming from that

<GarbageCollectionAdaptationMode>0</GarbageCollectionAdaptationMode>

or

DOTNET_GCDynamicAdaptationMode=0

envvar.

mwadams commented 4 weeks ago

Note the difference of Allocated column. It's probably caused by higher GC pressure.

You can try to use allocation profiler of VS to diagnose what's allocated more.

You can disregard the JsonEverything rows. There is essentially zero allocation and actual no GCs in the Corvus code (which is the code under test). JsonEverything is Greg Dennis's JSON schema code which is our "competitive comparison".

mwadams commented 4 weeks ago

That CPU is one of those P/E Core thingies isn't it? You could try with --affinity, as described here https://github.com/dotnet/BenchmarkDotNet/blob/master/docs/articles/guides/console-args.md something like --affinity 1 and --affinity 65536 to see if there are any differences

We saw no difference when setting a particular processor affinity.

We've also correlated the benchmark runs with the processor ETW data and determined that it is not being throttled in either case.

mwadams commented 4 weeks ago

Where we are now is that we've added the Microsoft Performance CPU usage Diagnoser to the benchmark, so we can use the perf viewer to figure out what is going on.

The code emitted is quite different; notably, the System.Text.Json.TextEquals method seems much slower on .NET 9 and yet that is just a bit of scaffolding around RoS<byte>.SequenceEqual

My current suspicion is that it is a regression related to memory alignment or cacheability of the buffers STJ is allocating (which is also why we see the exact same behaviour with code compiled for net80 or net90 under the net90 runtime).

EgorBo commented 4 weeks ago

Please attach the complete set of steps to build & run the repro (or, better, a self-contained one)

EgorBo commented 4 weeks ago

Although, I managed to run it on the given commit and can't reproduce your numbers on Ryzen 7950x (the only x64 CPU I have). Perhaps, someone in @dotnet/jit-contrib has a similar to Intel Core i7-13800H cpu?

mwadams commented 4 weeks ago

Attached is a standalone project with the single relevant benchmark.

Repro106679.zip

The machine on which this reproduces is: Surface Laptop Studio 2 Processor 13th Gen Intel(R) Core(TM) i7-13800H 2.90 GHz Installed RAM 64.0 GB (63.8 GB usable) System type 64-bit operating system, x64-based processor

huoyaoyuan commented 4 weeks ago

I can also reproduce this on i9-13900K, which is also Rapter Lake:

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3)
13th Gen Intel Core i9-13900K, 1 CPU, 32 logical and 24 physical cores
.NET SDK 9.0.100-preview.7.24407.12
  [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-XCKTZZ : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-QRHGPG : .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX2

OutlierMode=RemoveAll  RunStrategy=Throughput
Method Runtime Mean Error StdDev Ratio RatioSD Allocated Alloc Ratio
ValidateLargeArrayCorvusV4 .NET 8.0 12.30 ms 0.057 ms 0.053 ms 1.00 0.01 12 B 1.00
ValidateLargeArrayCorvusV4 .NET 9.0 35.54 ms 0.195 ms 0.183 ms 2.89 0.02 71 B 5.92
huoyaoyuan commented 4 weeks ago

I'm seeing the same bahavior when using System.Text.Json 9.0 with 8.0 runtime.

JsonDocument.TextEquals contains calls to array pool, which doesn't belong to S.T.Json, can cause allocation and may behave differently by CPU architectures. It looks suspicious.

huoyaoyuan commented 4 weeks ago

Note that this reproduction is still not good enough. It contains non-trivial amount of third party code at Corvus.Json.ExtendedTypes. Ideally all the interesting path should be available as source.

And also a kindly reminder that ImmutableList<T> is not a performant type. You probably want ImmutableArray<T>.

mwadams commented 4 weeks ago

Note that this reproduction is still not good enough. It contains non-trivial amount of third party code at Corvus.Json.ExtendedTypes. Ideally all the interesting path should be available as source.

And also a kindly reminder that ImmutableList<T> is not a performant type. You probably want ImmutableArray<T>.

Corvus.Json.ExtendedTypes is our code under test in this benchmark, so it is not really reducible. The source is available in the commit at the link provided in the OP.

Thanks for the observation about ImmutableArray<T>. Note that the ImmutableList<T> is just a transient in the setup code - it isn't the bit that is under test. And although this is not relevant to the issue at hand, you might be interested in the reason for the use of ImmutableList<T>.

Our readonly structs are essentially unions of a JsonElement and a native dotnet type - e.g. string for JSON strings. In the usual (perf-sensitive) case, we are simply a wrapper around the JsonElement so we benefit from the low-allocation behaviour of that type, and the other half of the union is "default/empty". However, if you mutate a part of the data structure, we use the other half of the union, and create an instance of the non-JsonElement type to hold the mutated part (with the JsonElement now set to default). The unchanged parts continue to wrap the JsonElement backing (this is a little like JsonNode, but as a struct; it also predates JsonNode).

For the non-JsonElement backing for objects and arrays, we require an array-like collection. We have to use ImmutableList<T> not ImmutableArray<T> because the data structure is recursive (i.e. the items/properties of arrays/objects can themselves be arrays or objects) and you cannot create an ImmutableArray<T> where instances of T themselves contain an ImmutableArray<T>.

huoyaoyuan commented 4 weeks ago

I think CPU profiling is misleading in this case.

In detailed statement-level view, it shows significant difference in return statements of two methods:

image image

They can potentially be GC-hijacked returns and actually representing the time occupied by GC.

mwadams commented 4 weeks ago

I think CPU profiling is misleading in this case.

In detailed statement-level view, it shows significant difference in return statements of two methods:

image image

They can potentially be GC-hijacked returns and actually representing the time occupied by GC.

I agree that the mapping of the CPU profiling to methods is a bit misleading. If you look at the JITted code it jumps around all over the place, and looks quite challenging to tie down precisely! One thing we notice is that the jumps in the hot path are now slightly longer in the net9.0 as the code has been slightly reordered, but I don't know if that is significant.

Also, if you correlate the benchmarks with ETW data, I don't believe that there are any GCs in the benchmark period as there is so little GC pressure. [I haven't confirmed that today, though.]

mwadams commented 4 weeks ago

They can potentially be GC-hijacked returns and actually representing the time occupied by GC.

I did wonder if this case might be caused by unnecessary copying - that is a static readonly instance of a readonly struct and it should be possible to optimize it all away up and down the stack to a reference to the instance; but I haven't looked at the JITted code to see what is actually happening.

huoyaoyuan commented 4 weeks ago

Meanwhile, I can't find anything allocated on the heap in the core loop by doing memory profiling or heap snapshots. It's also strange why BDN reports heap allocation difference.

huoyaoyuan commented 4 weeks ago

I have really no more time to help on this. Here are some conclusions so far.

Corvus.Json.ExtendedTypes is our code under test in this benchmark, so it is not really reducible. The source is available in the commit at the link provided in the OP.

Yes, but it's distributed as NuGet package instead of source code in your reproduction. We need source code to ease more AdHoc investigations.


Commenting out ArrayValidationHandler and following part in PersonArray.Validate will result in a more dramatic regression:

Method Runtime Mean Error StdDev Ratio RatioSD Code Size Allocated Alloc Ratio
ValidateLargeArrayCorvusV4 .NET 8.0 9.814 ns 0.0312 ns 0.0260 ns 1.00 0.00 1,064 B - NA
ValidateLargeArrayCorvusV4 .NET 9.0 88.357 ns 0.3182 ns 0.2976 ns 9.00 0.04 1,057 B - NA

Looking at the disassembly with [DisassemblyDiagnoser(maxDepth: 20)], the beginning part of Corvus.Json.ValidateWithoutCoreType.TypeArray shows a significant difference:

.NET 9:

```asm push rdi push rsi push rbx sub rsp,50 xor eax,eax mov [rsp+28],rax vxorps xmm4,xmm4,xmm4 vmovdqu ymmword ptr [rsp+30],ymm4 mov rbx,rcx mov edi,edx mov rsi,r8 cmp dil,2 jne short M02_L02 cmp r9d,3 je short M02_L01 mov rcx,rbx cmp [rcx],cl mov rdx,rsi cmp [rdx],dl mov r8d,38 call qword ptr [7FF8C1225620]; System.Buffer.BulkMoveWithWriteBarrier(Byte ByRef, Byte ByRef, UIntPtr) M02_L00: mov rax,rbx add rsp,50 pop rbx pop rsi pop rdi ret ```

.NET 8:

```asm push rdi push rsi push rbp push rbx sub rsp,48 vzeroupper vxorps xmm4,xmm4,xmm4 vmovdqa xmmword ptr [rsp+20],xmm4 vmovdqa xmmword ptr [rsp+30],xmm4 xor eax,eax mov [rsp+40],rax mov rbx,rcx mov ebp,edx mov rsi,r8 cmp bpl,2 jne short M02_L01 cmp r9d,3 je near ptr M02_L10 mov rdi,rbx call CORINFO_HELP_ASSIGN_BYREF call CORINFO_HELP_ASSIGN_BYREF movsq call CORINFO_HELP_ASSIGN_BYREF call CORINFO_HELP_ASSIGN_BYREF call CORINFO_HELP_ASSIGN_BYREF call CORINFO_HELP_ASSIGN_BYREF M02_L00: mov rax,rbx add rsp,48 pop rbx pop rbp pop rsi pop rdi ret ```

But I'm really not sure whether it's the culprit. It also helps if the methods are directly editable as source.


My further suggestion is to split the reproduction into smaller pieces and see the performance and disassembly of them. It needs to be narrowed down to become more actionable.

You can also try each preview of .NET 9 to help identify which commit range causes the difference.

mwadams commented 4 weeks ago

I have pulled in all the essential pieces as source in a single project.

Repro106679-2.zip

I have then added several additional benchmarks.

Doing a simple string validation is 2x faster on .NET 9 v. .NET 8

Method Runtime Mean Error StdDev Ratio Allocated Alloc Ratio
ValidateStringCorvusV4 .NET 8.0 10.749 ns 0.0332 ns 0.0294 ns 1.00 - NA
ValidateStringCorvusV4 .NET 9.0 5.164 ns 0.0263 ns 0.0246 ns 0.48 - NA

However, calling Validate on PersonNameElement is 3x slower on .NET 9 v. .NET 8

Method Runtime Mean Error StdDev Ratio RatioSD Allocated Alloc Ratio
ValidatePersonNameElementCorvusV4 .NET 8.0 126.273 ns 0.6925 ns 0.6477 ns 1.00 0.01 - NA
ValidatePersonNameElementCorvusV4 .NET 9.0 392.391 ns 0.9471 ns 0.7909 ns 3.11 0.02 - NA

However! If we comment out the code that gets the actual raw JSON text we see that the .NET 9 code is back to being 2x faster.

Method Runtime Mean Error StdDev Ratio RatioSD Allocated Alloc Ratio
ValidatePersonNameElement2CorvusV4 .NET 8.0 11.072 ns 0.0411 ns 0.0384 ns 0.09 0.00 - NA
ValidatePersonNameElement2CorvusV4 .NET 9.0 4.132 ns 0.0286 ns 0.0267 ns 0.03 0.00 - NA

Driling into this, I then tried benchmarking the code that gets the raw JSON text. Again that is faster under .NET 9 than .NET 8. (GetString)

Method Runtime Mean Error StdDev Ratio Gen0 Allocated Alloc Ratio
ProcessString .NET 8.0 70.16 ns 0.263 ns 0.233 ns 1.00 0.0050 64 B 1.00
ProcessString .NET 9.0 66.86 ns 0.488 ns 0.407 ns 0.95 0.0050 64 B 1.00

Similarly, it is still faster if I create and pass in the same context as I use in the real validation code. (GetStringWithValidationContext)

Method Runtime Mean Error StdDev Ratio Gen0 Allocated Alloc Ratio
ProcessString .NET 8.0 180.8 ns 1.31 ns 1.23 ns 1.00 0.0050 64 B 1.00
ProcessString .NET 9.0 157.5 ns 0.92 ns 0.77 ns 0.87 0.0050 64 B 1.00

Howver, if I update the validation context and return it, the result is 33% slower in .NET 9 than .NET 8.

This suggests to me that there is a regression with readonly structs passed by ref.

Method Runtime Mean Error StdDev Ratio Gen0 Allocated Alloc Ratio
ProcessString .NET 8.0 195.3 ns 0.93 ns 0.87 ns 1.00 0.0050 64 B 1.00
ProcessString .NET 9.0 265.7 ns 1.21 ns 1.13 ns 1.36 0.0048 64 B 1.00
AndyAyersMS commented 3 weeks ago

@mwadams the above is all still measured on the i7-13000H? If so we are still tring to see if anyone around here has something like this.

A few notes:

  1. When I try the ValidateString benchmark from zip I get errors :

    System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation.
    ---> System.Text.Json.JsonReaderException: 'M' is an invalid start of a value. LineNumber: 0 | BytePositionInLine: 0.
    at System.Text.Json.ThrowHelper.ThrowJsonReaderException(Utf8JsonReader& json, ExceptionResource resource, Byte nextByte, ReadOnlySpan`1 bytes)
    at System.Text.Json.Utf8JsonReader.ConsumeValue(Byte marker)
    at System.Text.Json.Utf8JsonReader.ReadFirstToken(Byte first)
    at System.Text.Json.Utf8JsonReader.ReadSingleSegment()
    at System.Text.Json.Utf8JsonReader.Read()
    at System.Text.Json.JsonDocument.TryParseValue(Utf8JsonReader& reader, JsonDocument& document, Boolean shouldThrow, Boolean useArrayPools)
    at System.Text.Json.JsonElement.ParseValue(Utf8JsonReader& reader)
    at Corvus.Json.IJsonValue`1.ParseValue(Utf8JsonReader& reader) in C:\bugs\r106679\Repro106679-2\Repro106679-2\Repro106679\FromExtendedTypes\IJsonValue{T}.cs:line 187
  2. Can you try running as admin and enabling the ETW profiler and share out the resulting trace data?

  3. Another hunch is that perhaps you're seeing new store-forwarding stalls with .Net 9; perhaps they are especially costly on E cores or something. Can you try a run with DOTNET_JitEnablePhysicalPromotion=0?

mwadams commented 3 weeks ago

I'll fix that benchmark - and try this out tonight.

mwadams commented 3 weeks ago

For some reason it had lost its \" in the string. Fixed that here.

Repro106679-3.zip

mwadams commented 3 weeks ago

With

image

We get the same results

Method Runtime Mean Error StdDev Ratio Allocated Alloc Ratio
ProcessString .NET 8.0 187.0 ns 0.64 ns 0.60 ns 1.00 - NA
ProcessString .NET 9.0 258.3 ns 1.23 ns 1.15 ns 1.38 - NA
mwadams commented 3 weeks ago

And here's the zip file containing the 2 ETW traces, and the log file.

Corvus.Json.Benchmarking.GetStringWithValidationContextModifyingContext.ProcessString-.NET 8.0 and 9.0-20240822-201415.zip

mwadams commented 3 weeks ago

@AndyAyersMS - see above for the requested info.

This is a Surface Laptop 2 Studio if that helps people dig out a device!

GerardSmit commented 3 weeks ago

Results on my machine:

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22621.4037/22H2/2022Update/SunValley2)
AMD Ryzen 9 5900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 9.0.100-preview.7.24407.12
  [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-HWRMHO : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-TWQIYS : .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX2

OutlierMode=RemoveAll  RunStrategy=Throughput

I had to remove [CPUUsageDiagnoser] from the benchmarks, otherwise I was getting the following exception:

$ dotnet run -c Release
...
You should select the target benchmark(s).
...
[entered: GetString]
...
Unhandled exception. System.InvalidOperationException: Tool CPUUsageTool is already added.
   at Microsoft.VSDiagnostics.VSDiagnosticsDiagnoser..ctor(IDiagnosticsTool tool, String outputFileName)

GetString

Method Runtime Mean Error StdDev Ratio Gen0 Allocated Alloc Ratio
ProcessString .NET 8.0 95.04 ns 0.945 ns 0.884 ns 1.00 0.0038 64 B 1.00
ProcessString .NET 9.0 82.83 ns 0.252 ns 0.224 ns 0.87 0.0038 64 B 1.00

GetStringWithValidationContextModifyingContext

Added after https://github.com/dotnet/runtime/issues/106679#issuecomment-2305498044

Method Runtime Mean Error StdDev Ratio Allocated Alloc Ratio
ProcessString .NET 8.0 161.0 ns 0.69 ns 0.64 ns 1.00 - NA
ProcessString .NET 9.0 211.7 ns 0.54 ns 0.48 ns 1.32 - NA
mwadams commented 3 weeks ago

@AndyAyersMS - Just eyeballing the callstacks, the .NET 9.0 code for the inner callback is astronomically slower, and appears to have an additional block copy which should be wholly unnecessary.

.NET 8.0 image

.NET 9.0 image

mwadams commented 3 weeks ago

@GerardSmit If you try GetStringWithValidationContextModifyingContext that has the minimal repro for the problem.

(I believe that the overall 3x imapct for the "large" benchmark in the OP is the cumulative impact of the apparent regression with the refs to readonly structs.)

mwadams commented 3 weeks ago

@GerardSmit - so you are seeing an issue with plain GetString on your CPU?

The one with

    /// <summary>
    /// Validates using the Corvus V4 types.
    /// </summary>
    [Benchmark]
    public bool ProcessString()
    {
        return this.jsonString.TryGetValue(ProcessString, default(object?), out int length);
    }

    private bool ProcessString(ReadOnlySpan<char> span, in object? state, out int value)
    {
        value = span.Length;
        return true;
    }

For GetString I get this - which is actually a small perf improvement under .NET 9.0 which is what I was expecting, knowing what has changed in S.T.J in .NET 9.0.

Method Runtime Mean Error StdDev Ratio Gen0 Allocated Alloc Ratio
ProcessString .NET 8.0 72.15 ns 0.295 ns 0.261 ns 1.00 0.0050 64 B 1.00
ProcessString .NET 9.0 67.10 ns 0.425 ns 0.377 ns 0.93 0.0050 64 B 1.00
GerardSmit commented 3 weeks ago

@mwadams Ah.. Benchmark.NET selected the benchmark "GetStringWithValidationContextModifyingContext" when typing in "GetString" 😅. When running the real "GetString" it's also faster on my machine. I've updated my comment with the correct times to prevent confusion.

AndyAyersMS commented 3 weeks ago

From the ETL I get the following profile data (limited to benchmark intervals)

============= .NET 8

01.73%   2.02E+06    ?        Unknown
38.52%   4.493E+07   native   coreclr.dll
34.24%   3.993E+07   Tier-1   [Repro106679]GetStringWithValidationContextModifyingContext.ProcessString()
03.64%   4.25E+06    Tier-1   [Repro106679]JsonElementExtensions.ProcessDecodedText(value class System.ReadOnlySpan`1<unsigned int8>,value class ParserStateWrapper`2<!!0,!!1>&,!!1&)
03.30%   3.85E+06    Tier-1   [Repro106679]LowAllocJsonUtils.ProcessRawText(value class System.Text.Json.JsonElement,!!0&,class Corvus.Json.Utf8Parser`2<!!0,!!1>&,!!1&)
02.43%   2.83E+06    Tier-1   [Repro106679]LowAllocJsonUtils+PooledWriter.Reset()
01.89%   2.2E+06     Tier-1   [System.Text.Json]JsonDocument.WriteString(value class DbRow&,class System.Text.Json.Utf8JsonWriter)
01.74%   2.03E+06    Tier-1   [Repro106679]JsonElementExtensions.ProcessRawText(value class System.ReadOnlySpan`1<unsigned int8>,value class ParserStateWrapper`2<!!0,!!1>&,!!1&)
01.37%   1.6E+06     Tier-1   [Repro106679]LowAllocJsonUtils+PooledWriter.Get()
01.04%   1.21E+06    Tier-1   [Repro106679]JsonReaderHelper.TranscodeHelper(value class System.ReadOnlySpan`1<unsigned int8>,value class System.Span`1<wchar>)
01.00%   1.17E+06    Tier-1   [System.Text.Json]JsonReaderHelper.ToValueKind(value class System.Text.Json.JsonTokenType)
00.91%   1.06E+06    Tier-1   [System.Text.Json]Utf8JsonWriter.WriteStringMinimized(value class System.ReadOnlySpan`1<unsigned int8>)

coreclr!memset                  22%
coreclr!JIT_ByRefWriteBarrier    4%

============= .NET 9

01.40%   1.16E+06    ?        Unknown
40.26%   3.339E+07   native   coreclr.dll
37.34%   3.097E+07   Tier-1   [Repro106679]GetStringWithValidationContextModifyingContext.ProcessString()
04.56%   3.78E+06    Tier-1   [Repro106679]LowAllocJsonUtils.ProcessRawText(value class System.Text.Json.JsonElement,!!0&,class Corvus.Json.Utf8Parser`2<!!0,!!1>&,!!1&)
01.77%   1.47E+06    Tier-1   [System.Text.Json]JsonDocument.WriteString(value class DbRow&,class System.Text.Json.Utf8JsonWriter)
01.72%   1.43E+06    Tier-1   [Repro106679]JsonElementExtensions.ProcessDecodedText(value class System.ReadOnlySpan`1<unsigned int8>,value class ParserStateWrapper`2<!!0,!!1>&,!!1&)
01.35%   1.12E+06    Tier-1   [Repro106679]JsonElementExtensions.ProcessRawText(value class System.ReadOnlySpan`1<unsigned int8>,value class ParserStateWrapper`2<!!0,!!1>&,!!1&)
01.09%   9E+05       Tier-1   [System.Text.Json]Utf8JsonWriter.WriteStringMinimized(value class System.ReadOnlySpan`1<unsigned int8>)
01.06%   8.8E+05     Tier-1   [Repro106679]JsonReaderHelper.TranscodeHelper(value class System.ReadOnlySpan`1<unsigned int8>,value class System.Span`1<wchar>)
01.05%   8.7E+05     Tier-1   [System.Private.CoreLib]SpanHelpers.ClearWithoutReferences(unsigned int8&,unsigned int)

coreclr is

coreclr!Buffer::BulkMoveWithWriteBarrier      32%
coreclr!COMInterlocked::CompareExchangeObject 7%

(only percentages significant, raw counts differ since BDN used different iteration patterns).

Suggests possibly some interaction with the changes in https://github.com/dotnet/runtime/pull/101761.

Even in .NET 8 this benchmark is spending a lot of time copying struct data.

Can you get the disassembly for ProcessString on your CPU?

EgorBot commented 3 weeks ago
Benchmark results on Amd ``` BenchmarkDotNet v0.14.0, Ubuntu 22.04.4 LTS (Jammy Jellyfish) AMD EPYC 7763, 1 CPU, 16 logical and 8 physical cores Job-TAJACL : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2 Job-ZLJINB : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2 ``` | Method | Toolchain | Mean | Ratio | |----------------- |------------------------ |---------:|------:| | Contains_Iterate | Before | 298.9 μs | 1.00 | | Contains_Iterate | After | 311.8 μs | 1.04 | [BDN_Artifacts.zip](https://telegafiles.blob.core.windows.net/telega/BDN_Artifacts_aa650167.zip) Flame graphs: [Main](https://telegafiles.blob.core.windows.net/telega/base_flamegraph_aa650167.svg) vs [PR](https://telegafiles.blob.core.windows.net/telega/diff_flamegraph_aa650167.svg) 🔥 Hot asm: [Main](https://gist.github.com/EgorBot/3c2bcf01c7809edc5afcb0c961c56550) vs [PR](https://gist.github.com/EgorBot/12e6f0eb4f6ca5bbafb311fdf6cd4099) Hot functions: [Main](https://gist.github.com/EgorBot/a518c976304d5481fc988d00ca93a9de) vs [PR](https://gist.github.com/EgorBot/e1ec6b0688b0b7318e4f3ba2fd01d69b) _For clean `perf` results, make sure you have just one `[Benchmark]` in your app._
AndyAyersMS commented 3 weeks ago

@EgorBo that's the benchmark from #106624 -- not sure if you think it is also relevant here or you meant to run this over there.

EgorBo commented 3 weeks ago

@EgorBo that's the benchmark from #106624 -- not sure if you think it is also relevant here or you meant to run this over there.

Yes, sorry, it's not relevant here

mwadams commented 3 weeks ago

@AndyAyersMS

Here's the .NET 8 code

; Total bytes of code 375
; Assembly listing for method Corvus.Json.Benchmarking.GetStringWithValidationContextModifyingContext:ProcessString(System.ReadOnlySpan`1[ushort],byref,byref):bool (Tier1)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; Tier1 code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
G_M000_IG01:                ;; offset=0x0000
       push     rdi
       push     rsi

G_M000_IG02:                ;; offset=0x0002
       lea      rsi, bword ptr [rdx+0x08]
       mov      rdi, r8
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       movsq
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       mov      eax, 1

G_M000_IG03:                ;; offset=0x002E
       pop      rsi
       pop      rdi
       ret

and here's the .NET 9 code

; Total bytes of code 372
; Assembly listing for method Corvus.Json.Benchmarking.GetStringWithValidationContextModifyingContext:ProcessString(System.ReadOnlySpan`1[ushort],byref,byref):ubyte (Tier1)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; Tier1 code
; optimized code
; optimized using Synthesized PGO
; rsp based frame
; partially interruptible
; with Synthesized PGO: fgCalledCount is 100
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
G_M000_IG01:                ;; offset=0x0000
       sub      rsp, 40

G_M000_IG02:                ;; offset=0x0004
       add      rdx, 8
       cmp      byte  ptr [rdx], dl
       cmp      byte  ptr [r8], r8b
       mov      rcx, r8
       mov      r8d, 56
       call     [CORINFO_HELP_BULK_WRITEBARRIER]
       mov      eax, 1

G_M000_IG03:                ;; offset=0x0021
       add      rsp, 40
       ret
mwadams commented 3 weeks ago

@AndyAyersMS:

Which looks to me like .NET 8 is copying the individual parts of the struct with the CORINFO_HELP_ASSIGNBYREF, and .NET 9 is trying a blit - which does look like the code in the issue you linked.

What are the additional comparisons doing?

To your point - we do a fair bit of struct copying (and the structs are fairly large but not huge); and they do contain gc-able fields. Immutable structs like this proved to be significantly faster than (gc-heavy) classes.

If we are paying that 30% extra cost for every immutable struct we return, that would easily account for the 3x perf hit.

AndyAyersMS commented 3 weeks ago

That sort of change is expected given https://github.com/dotnet/runtime/pull/101761. So the mystery is likely what about that change causes such a big perf impact on your HW.

When I run your benchmark from Repro106679-3.zip I see quite different code:

; Assembly listing for method Corvus.Json.Benchmarking.GetStringWithValidationContext:ProcessString(System.ReadOnlySpan`1[ushort],byref,byref):ubyte:this (Tier1)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; Tier1 code
; optimized code
; optimized using Synthesized PGO
; rsp based frame
; partially interruptible
; with Synthesized PGO: fgCalledCount is 100
; No PGO data

G_M000_IG01:                ;; offset=0x0000

G_M000_IG02:                ;; offset=0x0000
       mov      eax, dword ptr [rdx+0x08]
       mov      dword ptr [r9], eax
       mov      eax, 1

G_M000_IG03:                ;; offset=0x000B
       ret   

Just to confirm: you have this source for ProcessString? Or is it different?

   private bool ProcessString(ReadOnlySpan<char> span, in Corvus.Json.Validate.ValidationContextWrapper context, out int value)
    {
        value = span.Length;
        return true;
    }
}
mwadams commented 3 weeks ago

@AndyAyersMS I will run it again - on that run I actually had (note I had made the method static)

 private static bool ProcessString(ReadOnlySpan<char> span, in Corvus.Json.Validate.ValidationContextWrapper context, out ValidationContext result)
 {
     result = context.Context;
     return true;
 }

For

private bool ProcessString(ReadOnlySpan<char> span, in Corvus.Json.Validate.ValidationContextWrapper context, out ValidationContext result)
 {
     result = context.Context;
     return true;
 }

.NET 8

; Total bytes of code 241
; Assembly listing for method Corvus.Json.Benchmarking.GetStringWithValidationContextModifyingContext:ProcessString(System.ReadOnlySpan`1[ushort],byref,byref):bool:this (Tier1)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; Tier1 code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
G_M000_IG01:                ;; offset=0x0000
       push     rdi
       push     rsi

G_M000_IG02:                ;; offset=0x0002
       lea      rsi, bword ptr [r8+0x08]
       mov      rdi, r9
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       movsq
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       mov      eax, 1

G_M000_IG03:                ;; offset=0x002E
       pop      rsi
       pop      rdi
       ret

.NET 9

; Total bytes of code 238
; Assembly listing for method Corvus.Json.Benchmarking.GetStringWithValidationContextModifyingContext:ProcessString(System.ReadOnlySpan`1[ushort],byref,byref):ubyte:this (Tier1)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; Tier1 code
; optimized code
; optimized using Synthesized PGO
; rsp based frame
; partially interruptible
; with Synthesized PGO: fgCalledCount is 100
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
G_M000_IG01:                ;; offset=0x0000
       sub      rsp, 40

G_M000_IG02:                ;; offset=0x0004
       lea      rdx, bword ptr [r8+0x08]
       cmp      byte  ptr [rdx], dl
       cmp      byte  ptr [r9], r9b
       mov      rcx, r9
       mov      r8d, 56
       call     [CORINFO_HELP_BULK_WRITEBARRIER]
       mov      eax, 1

G_M000_IG03:                ;; offset=0x0021
       add      rsp, 40
       ret

Unsurprisingly, this is similar :)

mwadams commented 3 weeks ago

@AndyAyersMS - you've run GetStringWithValidationContext not GetStringWithValidationContextModifyingContext on that occasion.

AndyAyersMS commented 3 weeks ago

Ah, that explains it. Let me rerun then.

mwadams commented 3 weeks ago

(It's a good test - because it is not passing the struct in that is the problem - it is passing it out again. GetStringWithValidationContext is faster on .NET 9 than .NET 8 as you'd hope!)

EgorBo commented 3 weeks ago

Can someone run this self-contained benchmark on the problematic hardware? It sort of simulates the bulk write barrier case:

@EgorBot -intel --runtimes net8.0 net9.0

using BenchmarkDotNet.Attributes;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<MyBench>(args: args);

public class MyBench
{
    [Benchmark]
    public void NotInHeap()
    {
        Struct1 s1 = new();
        Struct1 s2 = new();
        ByrefCopy(ref s1, s2);
    }

    Struct1 dst1;

    [Benchmark]
    public void InHeap_Empty()
    {
        ByrefCopy(ref dst1, default);
    }

    [Benchmark]
    public void InHeap_Ephemeral()
    {
        ByrefCopy(ref dst1, new Struct1(1, null, 3, new Struct2(4, null, 6, null)));
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public bool ByrefCopy(ref Struct1 dst, Struct1 src)
    {
        dst = src;
        return true;
    }
}

public record struct Struct1(
    object a1, object a2,
    long a3, Struct2 g);

public record struct Struct2(
    object a1, object a2, object a3,
    object a4);
EgorBot commented 3 weeks ago
Benchmark results on Intel ``` BenchmarkDotNet v0.14.0, Ubuntu 22.04.4 LTS (Jammy Jellyfish) Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores Job-QCOHAS : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI Job-NFVYUV : .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI ``` | Method | Runtime | Mean | Error | Ratio | |----------------- |--------- |---------:|---------:|------:| | NotInHeap | .NET 8.0 | 14.04 ns | 0.144 ns | 1.00 | | NotInHeap | .NET 9.0 | 11.27 ns | 0.009 ns | 0.80 | | | | | | | | InHeap_Empty | .NET 8.0 | 16.16 ns | 0.017 ns | 1.00 | | InHeap_Empty | .NET 9.0 | 12.84 ns | 0.051 ns | 0.79 | | | | | | | | InHeap_Ephemeral | .NET 8.0 | 30.25 ns | 0.562 ns | 1.00 | | InHeap_Ephemeral | .NET 9.0 | 27.50 ns | 0.278 ns | 0.91 | [BDN_Artifacts.zip](https://telegafiles.blob.core.windows.net/telega/BDN_Artifacts_79fe3dba.zip)
mwadams commented 3 weeks ago

@EgorBo - here you go. All faster on .NET 9.0

Method Runtime Mean Error StdDev Ratio RatioSD
NotInHeap .NET 8.0 10.696 ns 0.1925 ns 0.1801 ns 1.00 0.02
NotInHeap .NET 9.0 4.518 ns 0.0917 ns 0.0766 ns 0.42 0.01
InHeap_Empty .NET 8.0 11.646 ns 0.1662 ns 0.1388 ns 1.00 0.02
InHeap_Empty .NET 9.0 7.239 ns 0.0557 ns 0.0435 ns 0.62 0.01
InHeap_Ephemeral .NET 8.0 20.402 ns 0.2355 ns 0.2088 ns 1.00 0.01
InHeap_Ephemeral .NET 9.0 13.120 ns 0.1106 ns 0.0980 ns 0.64 0.01
AndyAyersMS commented 3 weeks ago

I think we may be seeing AVX transition penalties at play. The bulk write barrier code uses SSE2, and the code just before it (ProcessString gets inlined because of delegate GDV) is using AVX2.

G_M000_IG13:                ;; offset=0x014F
       vmovdqu  ymm0, ymmword ptr [rsp+0x188]
       vmovdqu  ymmword ptr [rsp+0xB8], ymm0
       vmovdqu  ymm0, ymmword ptr [rsp+0x1A8]
       vmovdqu  ymmword ptr [rsp+0xD8], ymm0

G_M000_IG14:                ;; offset=0x0173
       lea      rcx, bword ptr [rsp+0x68]
       lea      rdx, [rsp+0xB0]
       mov      r8d, 72
       call     [CORINFO_HELP_BULK_WRITEBARRIER]
       mov      gword ptr [rsp+0x68], rbx
       mov      rcx, 0x1E0644014F8
       mov      rax, gword ptr [rcx]
       test     rax, rax
       je       G_M000_IG27

Or perhaps it has to do with store-forwarding, not sure which. Seems a bit odd we do a wide write to rsp+0xb8 and then pass rsp+0xb0 to the helper, which presumably does an overlapping read?

@mwadams what happens if you run with DOTNET_EnableAVX=0?

AndyAyersMS commented 3 weeks ago

FWIW the native helper code is here:

https://github.com/dotnet/runtime/blob/d2c19db316b3757c5df60224b55fdc60d6ff69c3/src/coreclr/classlibnative/bcltype/arraynative.inl#L16-L331

mwadams commented 3 weeks ago

With

image

And running GetStringWithValidationContextModifyingContext

Method Runtime Mean Error StdDev Median Ratio RatioSD
ProcessString .NET 8.0 121.0 ns 2.29 ns 1.79 ns 120.23 ns 1.00 0.02
ProcessString .NET 9.0 105.3 ns 4.13 ns 12.18 ns 98.36 ns 0.87 0.10

So - I think you're right! That's way faster on both .NET 8.0 and .NET 9.0 - and significantly faster on .NET 9.