Open BlackxSnow opened 1 month ago
On .NET 8, both forms of the add compile to the same single instruction:
add dword ptr [rbx+0x08], r15d
While on .NET 7, it's
mov edi, r14d
add edi, dword ptr [rbx+08H]
mov dword ptr [rbx+08H], edi
and
mov edi, dword ptr [rbx+08H]
add edi, r14d
mov dword ptr [rbx+08H], edi
, respectively.
https://godbolt.org/z/51ooerfGr
I'm unsure why the first one would be slower.
This may be microarchitecture-specific behavior in how memory operands are handled. A tight loop also makes it more likely that the branch predictor and out-of-order execution skew the results.
On my Ice Lake-SP there's virtually no difference:
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.4169/23H2/2023Update/SunValley3)
Intel Core i9-10900X CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 9.0.100-rc.1.24452.12
[Host] : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL
Job-TIZHRT : .NET 6.0.33 (6.0.3324.36610), X64 RyuJIT AVX2
Job-RVIHQI : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL
Job-UKHFVG : .NET 9.0.0 (9.0.24.43107), X64 RyuJIT AVX-512F+CD+BW+DQ+VL
| Method | Runtime | Mean | Error | StdDev |
|-------------------------------------- |--------- |---------:|----------:|----------:|
| Property_ReadWrite_Write_Add | .NET 6.0 | 1.193 us | 0.0063 us | 0.0053 us |
| Property_ReadWrite_Write_Add_Separate | .NET 6.0 | 1.201 us | 0.0034 us | 0.0030 us |
| Property_ReadWrite_Write_Add | .NET 8.0 | 1.185 us | 0.0102 us | 0.0090 us |
| Property_ReadWrite_Write_Add_Separate | .NET 8.0 | 1.197 us | 0.0161 us | 0.0134 us |
| Property_ReadWrite_Write_Add | .NET 9.0 | 1.213 us | 0.0114 us | 0.0101 us |
| Property_ReadWrite_Write_Add_Separate | .NET 9.0 | 1.210 us | 0.0163 us | 0.0144 us |
Manually unrolling the loop so that it updates 8 properties in a row may also bring the results closer together.
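A minimal sketch of what such an unrolled variant might look like. The class name `UnrolledProps` and the properties `P0` to `P7` are hypothetical and not taken from the benchmark in this thread; the two methods mirror the `Add` and `Add_Separate` forms, each touching eight independent properties per iteration.

```csharp
using System;

public class UnrolledProps
{
    // Eight hypothetical int auto-properties (names are illustrative only).
    public int P0 { get; set; }
    public int P1 { get; set; }
    public int P2 { get; set; }
    public int P3 { get; set; }
    public int P4 { get; set; }
    public int P5 { get; set; }
    public int P6 { get; set; }
    public int P7 { get; set; }

    public const int N = 1000;

    // Compound form: eight independent read-modify-write chains per iteration.
    public int AddCompound()
    {
        for (int i = 0; i < N; i++)
        {
            P0 += i; P1 += i; P2 += i; P3 += i;
            P4 += i; P5 += i; P6 += i; P7 += i;
        }
        return P0 + P1 + P2 + P3 + P4 + P5 + P6 + P7;
    }

    // Separate form: explicit read into a local, then write back.
    public int AddSeparate()
    {
        for (int i = 0; i < N; i++)
        {
            int v0 = P0; P0 = v0 + i;
            int v1 = P1; P1 = v1 + i;
            int v2 = P2; P2 = v2 + i;
            int v3 = P3; P3 = v3 + i;
            int v4 = P4; P4 = v4 + i;
            int v5 = P5; P5 = v5 + i;
            int v6 = P6; P6 = v6 + i;
            int v7 = P7; P7 = v7 + i;
        }
        return P0 + P1 + P2 + P3 + P4 + P5 + P6 + P7;
    }
}

public static class Program
{
    public static void Main()
    {
        // Sanity check only: both forms must compute the same sums.
        var a = new UnrolledProps();
        var b = new UnrolledProps();
        Console.WriteLine(a.AddCompound() == b.AddSeparate());
    }
}
```

With eight independent targets, each iteration no longer depends on a single store-to-load chain through one memory location, which is presumably why the two forms would end up closer. To measure it, the two methods would still need to be marked `[Benchmark]` and run under BenchmarkDotNet.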
.NET 7.0:
| Method | Mean | Error | StdDev |
|-------------------------------------- |-----------:|--------:|--------:|
| Property_ReadWrite_Write_Add | 1,379.6 ns | 6.49 ns | 5.42 ns |
| Property_ReadWrite_Write_Add_Separate | 196.1 ns | 2.78 ns | 2.47 ns |
.NET 8.0 (same on .NET 9.0):
| Method | Mean | Error | StdDev |
|-------------------------------------- |---------:|--------:|--------:|
| Property_ReadWrite_Write_Add | 194.6 ns | 2.45 ns | 2.17 ns |
| Property_ReadWrite_Write_Add_Separate | 193.7 ns | 1.31 ns | 1.02 ns |
So it looks like everything is okay?
@EgorBot -intel -arm64 --runtimes net7.0 net8.0 net9.0
using System;
using BenchmarkDotNet.Attributes;

public class FieldVsProperty
{
    public int Prop_ReadWrite { get; set; } = Random.Shared.Next();
    public static int N = 1000;

    // Compound assignment through the property.
    [Benchmark]
    public int Property_ReadWrite_Write_Add()
    {
        for (int i = 0; i < N; i++)
        {
            Prop_ReadWrite += i;
        }
        return Prop_ReadWrite;
    }

    // Same operation, with the read and the write spelled out separately.
    [Benchmark]
    public int Property_ReadWrite_Write_Add_Separate()
    {
        for (int i = 0; i < N; i++)
        {
            var val = Prop_ReadWrite;
            Prop_ReadWrite = val + i;
        }
        return Prop_ReadWrite;
    }
}
I can reproduce the same regression on Raptor Lake:
BenchmarkDotNet v0.13.12, Windows 11 (10.0.26100.1742)
13th Gen Intel Core i9-13900K, 1 CPU, 32 logical and 24 physical cores
.NET SDK 9.0.100-rc.1.24452.12
[Host] : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
Job-ZJTSND : .NET 6.0.33 (6.0.3324.36610), X64 RyuJIT AVX2
Job-EGSVEF : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
| Method | Runtime | Mean | Error | StdDev | Code Size |
|-------------------------------------- |--------- |-----------:|---------:|---------:|----------:|
| Property_ReadWrite_Write_Add | .NET 6.0 | 237.5 ns | 1.08 ns | 0.90 ns | 65 B |
| Property_ReadWrite_Write_Add_Separate | .NET 6.0 | 215.0 ns | 0.66 ns | 0.61 ns | 65 B |
| Property_ReadWrite_Write_Add | .NET 8.0 | 1,254.5 ns | 3.28 ns | 2.91 ns | 25 B |
| Property_ReadWrite_Write_Add_Separate | .NET 8.0 | 1,250.5 ns | 24.55 ns | 25.21 ns | 25 B |
When affinitized to E-cores:
BenchmarkDotNet v0.13.12, Windows 11 (10.0.26100.1742)
13th Gen Intel Core i9-13900K, 1 CPU, 32 logical and 24 physical cores
.NET SDK 9.0.100-rc.1.24452.12
[Host] : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
Job-FAOOUD : .NET 6.0.33 (6.0.3324.36610), X64 RyuJIT AVX2
Job-PUVIKJ : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
Affinity=00000000000000010000000000000000
| Method | Runtime | Mean | Error | StdDev | Code Size |
|-------------------------------------- |--------- |---------:|--------:|--------:|----------:|
| Property_ReadWrite_Write_Add | .NET 6.0 | 440.0 ns | 7.06 ns | 5.52 ns | 65 B |
| Property_ReadWrite_Write_Add_Separate | .NET 6.0 | 428.0 ns | 2.43 ns | 2.03 ns | 65 B |
| Property_ReadWrite_Write_Add | .NET 8.0 | 341.3 ns | 4.36 ns | 3.86 ns | 25 B |
| Property_ReadWrite_Write_Add_Separate | .NET 8.0 | 346.7 ns | 6.78 ns | 9.28 ns | 25 B |
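For anyone who wants to repeat the affinitized run, a rough sketch using BenchmarkDotNet's `Job.WithAffinity`. The mask value is an assumption read off the `Affinity=` line above (only bit 16 set), and it further assumes that logical processor 16 is an E-core on the machine in question; check your own topology before reusing it. The benchmark class is trimmed to one method for brevity.

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

public class FieldVsProperty
{
    public int Prop_ReadWrite { get; set; }
    public static int N = 1000;

    [Benchmark]
    public int Property_ReadWrite_Write_Add()
    {
        for (int i = 0; i < N; i++)
        {
            Prop_ReadWrite += i;
        }
        return Prop_ReadWrite;
    }
}

public static class Program
{
    public static void Main(string[] args)
    {
        // Assumption: bit 16 selects the first E-core's logical processor,
        // matching Affinity=00000000000000010000000000000000 in the report above.
        var eCoreMask = new IntPtr(1 << 16);

        // Pin the benchmark process to that single logical processor.
        var config = DefaultConfig.Instance
            .AddJob(Job.Default.WithAffinity(eCoreMask));

        BenchmarkRunner.Run<FieldVsProperty>(config);
    }
}
```

Pinning to a P-core instead (for example `new IntPtr(1)`, again an assumption about the topology) would give the matching P-core numbers from the same build.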
Apparently the Golden Cove cores are unhappy with something here. The E-cores perform much better than the P-cores!
I remember reading in Intel's optimization guide that newer CPU models will fuse `mov {reg1}, [mem]; {op} {reg1}, {reg2}` into a three-operand non-destructive form, `{op} {reg1}, [mem], {reg2}`, which bypasses a register rename holding up retirement of `{reg1}`. Perhaps swapping the memory operand's position inhibits this optimization?
Your results exclude .NET 7. I'm wondering whether you see the same bump in execution time for `Add` (compared to `Add_Separate`) that I do.
Your results exclude .NET 7.
I just ran it on the runtimes I have on my machine. 6.0 stands in for pre-8.0, which doesn't include the codegen change.
I'm wondering whether you see the same bump in execution time for `Add` (compared to `Add_Separate`) that I do.
The behavior seems consistent for each micro-architecture.
Ice Lake-SP, Gracemont: everything looks fine.
Zen 4: Significantly slow for the pre-8.0 `Add` codegen, fine for others.
Golden Cove: Significantly slow for the post-8.0 codegen.
@BruceForstall, PTAL when we get Meteor Lake laptops this year. cc @dotnet/jit-contrib.
What architecture were these run on? Yours are the only results I've seen that mimic my system.
I remember Egor uses R9-7950X. It's also Zen 4.
(I originally detailed this issue on StackOverflow, here.)
Description
The following two snippets produce wildly different benchmark results from each other, as well as between different machines and major runtime versions (where `SomeProperty` is an `int` auto-property):
The benchmark (below), when run on my machine, showed poor performance for the former case on .NET 7 but otherwise expected results. Two others ran the benchmark, and both saw poor performance for both cases on .NET 8 but not on .NET 7. The host version did not appear to make a difference in these cases. I've included the benchmark results and system configurations below the benchmark code.
Possibly relevant, though not directly related: I've noticed (but not been able to isolate) significant performance issues when writing data through a native memory pointer obtained by mapping a Direct3D sub-resource. These issues weren't present on .NET 8, or on any of my colleagues' machines on .NET 7. That issue appears to be linked more strongly to the number of assignments through the pointer than to the amount of data assigned.
Benchmark
Data
My machine (also ran this on my Arch Linux install, with no notable difference):
The two other machines:
Analysis
The most notable difference is between the CPU vendors, but the data is pretty limited.