dotnet / runtime


Significant performance difference between x += y and x = x + y on properties, differing between hardware and runtime version (7 / 8) #108227

Open BlackxSnow opened 1 month ago

BlackxSnow commented 1 month ago

(I originally detailed this issue on StackOverflow, here.)

Description

The following two snippets produce wildly different benchmark results from each other, and the results also differ between machines and major runtime versions (SomeProperty is an int auto-property):

SomeProperty += i;

and

var propertyValue = SomeProperty;
SomeProperty = propertyValue + i;

The benchmark (below), when run on my machine, showed poor performance for the former case on .NET 7 but otherwise expected results. Two others ran the benchmark and saw poor performance for both cases on .NET 8, but not on .NET 7. The host version did not appear to make a difference in these cases. I've included the benchmark results and system configurations below the benchmark code.

Possibly relevant, though not directly related: I've noticed (but not been able to isolate) significant performance issues when writing data through a native memory pointer obtained by mapping a Direct3D sub-resource. These issues weren't present on .NET 8, or on any of my colleagues' machines on .NET 7. They appear to be more strongly linked to the number of assignments through the pointer than to the amount of data assigned.

Benchmark

using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;

namespace Benchmarks;

[DisassemblyDiagnoser(maxDepth: 1, printSource:true)]
[Config(typeof(Config))]
public class FieldVsProperty
{
    public int Prop_ReadWrite { get; set; } = Random.Shared.Next();

    public static int N = 1000;

    [Benchmark]
    public int Property_ReadWrite_Write_Add()
    {
        for (int i = 0; i < N; i++)
        {
            Prop_ReadWrite += i;
        }
        return Prop_ReadWrite;
    }
    [Benchmark]
    public int Property_ReadWrite_Write_Add_Separate()
    {
        for (int i = 0; i < N; i++)
        {
            var val = Prop_ReadWrite;
            Prop_ReadWrite = val + i;
        }
        return Prop_ReadWrite;
    }

    private class Config : ManualConfig
    {
        public Config()
        {
            AddJob(Job.Default.WithRuntime(CoreRuntime.Core70));
            AddJob(Job.Default.WithRuntime(CoreRuntime.Core80));
        }
    }
}

Data

My machine (also ran this on my Arch Linux install, with no notable difference):

BenchmarkDotNet v0.13.12, Windows 10 (10.0.19045.4412/22H2/2022Update)
AMD Ryzen 9 7900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 8.0.302
  [Host]     : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-XRUBGY : .NET 7.0.15 (7.0.1523.57226), X64 RyuJIT AVX2
  Job-TPIWHS : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

| Method                                | Runtime  | Mean       | Error   | StdDev  | Code Size |
|-------------------------------------- |--------- |-----------:|--------:|--------:|----------:|
| Property_ReadWrite_Write_Add          | .NET 7.0 | 1,352.2 ns | 5.95 ns | 5.57 ns |      33 B |
| Property_ReadWrite_Write_Add_Separate | .NET 7.0 |   186.9 ns | 0.53 ns | 0.50 ns |      33 B |
| Property_ReadWrite_Write_Add          | .NET 8.0 |   186.8 ns | 0.45 ns | 0.40 ns |      25 B |
| Property_ReadWrite_Write_Add_Separate | .NET 8.0 |   186.9 ns | 0.40 ns | 0.35 ns |      25 B |

The two other machines:

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3)
12th Gen Intel Core i9-12900HK, 1 CPU, 20 logical and 14 physical cores
.NET SDK 8.0.302
  [Host]     : .NET 7.0.16 (7.0.1624.6629), X64 RyuJIT AVX2
  Job-TVXXNG : .NET 7.0.16 (7.0.1624.6629), X64 RyuJIT AVX2
  Job-YOOWAN : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2

| Method                                | Runtime  | Mean       | Error   | StdDev  | Code Size |
|-------------------------------------- |--------- |-----------:|--------:|--------:|----------:|
| Property_ReadWrite_Write_Add          | .NET 7.0 |   284.5 ns | 5.59 ns | 8.19 ns |      33 B |
| Property_ReadWrite_Write_Add_Separate | .NET 7.0 |   256.1 ns | 3.04 ns | 2.54 ns |      33 B |
| Property_ReadWrite_Write_Add          | .NET 8.0 | 1,488.2 ns | 6.57 ns | 5.49 ns |      25 B |
| Property_ReadWrite_Write_Add_Separate | .NET 8.0 | 1,496.4 ns | 7.92 ns | 7.02 ns |      25 B |

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.3880/23H2/2023Update/SunValley3)
12th Gen Intel Core i7-12650H, 1 CPU, 16 logical and 10 physical cores
.NET SDK 8.0.303
  [Host]     : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
  Job-WMSFHF : .NET 7.0.20 (7.0.2024.26716), X64 RyuJIT AVX2
  Job-HBRVHQ : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2

| Method                                | Runtime  | Mean       | Error    | StdDev   | Code Size |
|-------------------------------------- |--------- |-----------:|---------:|---------:|----------:|
| Property_ReadWrite_Write_Add          | .NET 7.0 |   407.7 ns |  8.21 ns | 16.40 ns |      33 B |
| Property_ReadWrite_Write_Add_Separate | .NET 7.0 |   348.1 ns |  6.68 ns |  7.70 ns |      33 B |
| Property_ReadWrite_Write_Add          | .NET 8.0 | 1,878.6 ns | 36.54 ns | 51.22 ns |      25 B |
| Property_ReadWrite_Write_Add_Separate | .NET 8.0 | 1,870.5 ns | 37.06 ns | 42.68 ns |      25 B |

Analysis

The most notable difference is between the CPU vendors, but the data is pretty limited.
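
To put rough numbers on the gaps in the tables above, the means can be turned into slowdown ratios and per-iteration times. This is only a minimal sketch: the class and method names are illustrative and not part of the benchmark, and the means are copied from the tables above.

```csharp
using System;

public static class SlowdownRatios
{
    // How many times slower the slow case is than the fast case.
    public static double Ratio(double slowMeanNs, double fastMeanNs) => slowMeanNs / fastMeanNs;

    // Mean time per loop iteration; the benchmark loops N = 1000 times per invocation.
    public static double PerIterationNs(double meanNs, int iterations = 1000) => meanNs / iterations;

    public static void PrintSummary()
    {
        // Ryzen 9 7900X: .NET 7 Add vs. Add_Separate.
        Console.WriteLine($"7900X   .NET 7 Add / Add_Separate: {Ratio(1352.2, 186.9):F2}x");

        // Intel machines: .NET 8 Add vs. .NET 7 Add.
        Console.WriteLine($"12900HK .NET 8 Add / .NET 7 Add:   {Ratio(1488.2, 284.5):F2}x");
        Console.WriteLine($"12650H  .NET 8 Add / .NET 7 Add:   {Ratio(1878.6, 407.7):F2}x");

        // Per-iteration cost of the fast and slow cases on the 7900X.
        Console.WriteLine($"7900X   fast: {PerIterationNs(186.9):F3} ns/iter, slow: {PerIterationNs(1352.2):F3} ns/iter");
    }
}
```

On these figures the slow cases come out between about 4.6x and 7.2x slower than the corresponding fast ones.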

karakasa commented 1 month ago

On .NET 8, both forms of the addition compile to a single instruction:

       add dword ptr [rbx+0x08], r15d

While on .NET 7, it's

       mov      edi, r14d
       add      edi, dword ptr [rbx+08H]
       mov      dword ptr [rbx+08H], edi

and

       mov      edi, dword ptr [rbx+08H]
       add      edi, r14d
       mov      dword ptr [rbx+08H], edi

, respectively.

https://godbolt.org/z/51ooerfGr

I'm unsure why the first one would be slower.
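
For anyone who wants to look at the codegen locally rather than on godbolt, a stripped-down repro along these lines may help. It is a sketch, not the original benchmark: the Counter class and its method names are hypothetical. As far as I know, release builds of the runtime from .NET 7 onward honor the DOTNET_JitDisasm environment variable (for example DOTNET_JitDisasm="AddCompound AddSeparate"), and setting DOTNET_TieredCompilation=0 should make the JIT emit fully optimized code on the first call; treat both as assumptions to verify on your setup.

```csharp
using System;
using System.Runtime.CompilerServices;

public class Counter
{
    public int Value { get; set; }

    // NoInlining keeps each loop in its own method so its disassembly is listed separately.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public int AddCompound(int n)
    {
        for (int i = 0; i < n; i++)
        {
            Value += i; // compound assignment on the property
        }
        return Value;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public int AddSeparate(int n)
    {
        for (int i = 0; i < n; i++)
        {
            var val = Value; // explicit read...
            Value = val + i; // ...followed by an explicit write
        }
        return Value;
    }
}
```

Both methods compute the same sum, so any timing difference comes from the generated code and how the hardware executes it, not from the work being done.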

huoyaoyuan commented 1 month ago

This may be micro-architecture-specific behavior in how memory operands are handled. A tight loop also makes it more likely that branch prediction and out-of-order execution skew the results.

On my Ice Lake-SP there's almost no difference:

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.4169/23H2/2023Update/SunValley3)
Intel Core i9-10900X CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 9.0.100-rc.1.24452.12
  [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL
  Job-TIZHRT : .NET 6.0.33 (6.0.3324.36610), X64 RyuJIT AVX2
  Job-RVIHQI : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL
  Job-UKHFVG : .NET 9.0.0 (9.0.24.43107), X64 RyuJIT AVX-512F+CD+BW+DQ+VL

| Method                                | Runtime  | Mean     | Error     | StdDev    |
|-------------------------------------- |--------- |---------:|----------:|----------:|
| Property_ReadWrite_Write_Add          | .NET 6.0 | 1.193 us | 0.0063 us | 0.0053 us |
| Property_ReadWrite_Write_Add_Separate | .NET 6.0 | 1.201 us | 0.0034 us | 0.0030 us |
| Property_ReadWrite_Write_Add          | .NET 8.0 | 1.185 us | 0.0102 us | 0.0090 us |
| Property_ReadWrite_Write_Add_Separate | .NET 8.0 | 1.197 us | 0.0161 us | 0.0134 us |
| Property_ReadWrite_Write_Add          | .NET 9.0 | 1.213 us | 0.0114 us | 0.0101 us |
| Property_ReadWrite_Write_Add_Separate | .NET 9.0 | 1.210 us | 0.0163 us | 0.0144 us |

Manually unrolling the loop to update 8 properties in a row may also bring the performance closer.
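
One possible reading of the unrolling suggestion is sketched below. The class and property names are hypothetical, and this variant has not been benchmarked in the thread. The idea is that each iteration touches eight different memory locations, so consecutive read-modify-write operations no longer all depend on the same location.

```csharp
using System;

public class UnrolledProperties
{
    public int P0 { get; set; }
    public int P1 { get; set; }
    public int P2 { get; set; }
    public int P3 { get; set; }
    public int P4 { get; set; }
    public int P5 { get; set; }
    public int P6 { get; set; }
    public int P7 { get; set; }

    // Updates eight independent auto-properties per iteration instead of one.
    public int AddUnrolled(int n)
    {
        for (int i = 0; i < n; i++)
        {
            P0 += i;
            P1 += i;
            P2 += i;
            P3 += i;
            P4 += i;
            P5 += i;
            P6 += i;
            P7 += i;
        }
        // Combine the results so none of the updates can be treated as dead code.
        return P0 + P1 + P2 + P3 + P4 + P5 + P6 + P7;
    }
}
```

Note that this does eight times the work per iteration, so the per-update cost, not the raw mean, is what should be compared against the original benchmark.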

EgorBo commented 1 month ago

.net 7.0:

| Method                                | Mean       | Error   | StdDev  |
|-------------------------------------- |-----------:|--------:|--------:|
| Property_ReadWrite_Write_Add          | 1,379.6 ns | 6.49 ns | 5.42 ns |
| Property_ReadWrite_Write_Add_Separate |   196.1 ns | 2.78 ns | 2.47 ns |

.net 8.0 (same on net9.0):

| Method                                | Mean     | Error   | StdDev  |
|-------------------------------------- |---------:|--------:|--------:|
| Property_ReadWrite_Write_Add          | 194.6 ns | 2.45 ns | 2.17 ns |
| Property_ReadWrite_Write_Add_Separate | 193.7 ns | 1.31 ns | 1.02 ns |

so looks like everything is okay?

EgorBo commented 1 month ago

@EgorBot -intel -arm64 --runtimes net7.0 net8.0 net9.0

using BenchmarkDotNet.Attributes;

public class FieldVsProperty
{
    public int Prop_ReadWrite { get; set; } = Random.Shared.Next();

    public static int N = 1000;

    [Benchmark]
    public int Property_ReadWrite_Write_Add()
    {
        for (int i = 0; i < N; i++)
        {
            Prop_ReadWrite += i;
        }
        return Prop_ReadWrite;
    }
    [Benchmark]
    public int Property_ReadWrite_Write_Add_Separate()
    {
        for (int i = 0; i < N; i++)
        {
            var val = Prop_ReadWrite;
            Prop_ReadWrite = val + i;
        }
        return Prop_ReadWrite;
    }
}

huoyaoyuan commented 1 month ago

I can reproduce the same regression on Raptor Lake:

BenchmarkDotNet v0.13.12, Windows 11 (10.0.26100.1742)
13th Gen Intel Core i9-13900K, 1 CPU, 32 logical and 24 physical cores
.NET SDK 9.0.100-rc.1.24452.12
  [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-ZJTSND : .NET 6.0.33 (6.0.3324.36610), X64 RyuJIT AVX2
  Job-EGSVEF : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2

| Method                                | Runtime  | Mean       | Error    | StdDev   | Code Size |
|-------------------------------------- |--------- |-----------:|---------:|---------:|----------:|
| Property_ReadWrite_Write_Add          | .NET 6.0 |   237.5 ns |  1.08 ns |  0.90 ns |      65 B |
| Property_ReadWrite_Write_Add_Separate | .NET 6.0 |   215.0 ns |  0.66 ns |  0.61 ns |      65 B |
| Property_ReadWrite_Write_Add          | .NET 8.0 | 1,254.5 ns |  3.28 ns |  2.91 ns |      25 B |
| Property_ReadWrite_Write_Add_Separate | .NET 8.0 | 1,250.5 ns | 24.55 ns | 25.21 ns |      25 B |

when affinitized to E-Cores:

BenchmarkDotNet v0.13.12, Windows 11 (10.0.26100.1742)
13th Gen Intel Core i9-13900K, 1 CPU, 32 logical and 24 physical cores
.NET SDK 9.0.100-rc.1.24452.12
  [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-FAOOUD : .NET 6.0.33 (6.0.3324.36610), X64 RyuJIT AVX2
  Job-PUVIKJ : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2

Affinity=00000000000000010000000000000000

| Method                                | Runtime  | Mean     | Error   | StdDev  | Code Size |
|-------------------------------------- |--------- |---------:|--------:|--------:|----------:|
| Property_ReadWrite_Write_Add          | .NET 6.0 | 440.0 ns | 7.06 ns | 5.52 ns |      65 B |
| Property_ReadWrite_Write_Add_Separate | .NET 6.0 | 428.0 ns | 2.43 ns | 2.03 ns |      65 B |
| Property_ReadWrite_Write_Add          | .NET 8.0 | 341.3 ns | 4.36 ns | 3.86 ns |      25 B |
| Property_ReadWrite_Write_Add_Separate | .NET 8.0 | 346.7 ns | 6.78 ns | 9.28 ns |      25 B |

Apparently the Golden Cove cores are unhappy with something here. The E-cores perform much better than the P-cores!
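
The Affinity line in the second run is a bit mask over logical processors. The sketch below decodes such a mask and shows one way to pin the current process with Process.ProcessorAffinity. The helper names are hypothetical, and it assumes the mask is printed most-significant bit first, in which case the single set bit above would select logical processor 16. Whether that index is an E-core depends on how the OS enumerates the cores.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class AffinityMask
{
    // Parses a binary mask string, most-significant bit first (assumed print format).
    public static long Parse(string bits) => Convert.ToInt64(bits, 2);

    // Returns the zero-based logical processor indices selected by the mask.
    public static List<int> LogicalProcessors(long mask)
    {
        var cpus = new List<int>();
        for (int bit = 0; bit < 64; bit++)
        {
            if ((mask & (1L << bit)) != 0)
            {
                cpus.Add(bit);
            }
        }
        return cpus;
    }

    // Pins the current process to the processors in the mask.
    // Process.ProcessorAffinity is supported on Windows and Linux only.
    public static void PinCurrentProcess(long mask)
    {
        Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)mask;
    }
}
```

For example, AffinityMask.LogicalProcessors(AffinityMask.Parse("00000000000000010000000000000000")) lists the processors the second run was restricted to.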

colejohnson66 commented 1 month ago

I remember reading something in Intel's optimization guide that newer CPU models will fuse mov {reg1}, [mem]; {op} {reg1}, {reg2} into a three-operand non-destructive form of {op} {reg1}, [mem], {reg2}, which bypasses a register rename holding up retirement of {reg1}. Perhaps swapping the memory operand's position inhibits this optimization?

BlackxSnow commented 1 month ago

> This may be micro-architecture-specific behavior in how memory operands are handled. A tight loop also makes it more likely that branch prediction and out-of-order execution skew the results.
>
> On my Ice Lake-SP there's almost no difference:
>
> BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.4169/23H2/2023Update/SunValley3)
> Intel Core i9-10900X CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
> .NET SDK 9.0.100-rc.1.24452.12
>   [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL
>   Job-TIZHRT : .NET 6.0.33 (6.0.3324.36610), X64 RyuJIT AVX2
>   Job-RVIHQI : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL
>   Job-UKHFVG : .NET 9.0.0 (9.0.24.43107), X64 RyuJIT AVX-512F+CD+BW+DQ+VL
>
> | Method                                | Runtime  | Mean     | Error     | StdDev    |
> |-------------------------------------- |--------- |---------:|----------:|----------:|
> | Property_ReadWrite_Write_Add          | .NET 6.0 | 1.193 us | 0.0063 us | 0.0053 us |
> | Property_ReadWrite_Write_Add_Separate | .NET 6.0 | 1.201 us | 0.0034 us | 0.0030 us |
> | Property_ReadWrite_Write_Add          | .NET 8.0 | 1.185 us | 0.0102 us | 0.0090 us |
> | Property_ReadWrite_Write_Add_Separate | .NET 8.0 | 1.197 us | 0.0161 us | 0.0134 us |
> | Property_ReadWrite_Write_Add          | .NET 9.0 | 1.213 us | 0.0114 us | 0.0101 us |
> | Property_ReadWrite_Write_Add_Separate | .NET 9.0 | 1.210 us | 0.0163 us | 0.0144 us |
>
> Manually unrolling the loop to update 8 properties in a row may also bring the performance closer.

Your results exclude .NET 7. I'm wondering whether you see the same bump in execution time for Add (compared to Add_Separate) that I do.

huoyaoyuan commented 1 month ago

> Your results exclude .NET 7.

I just ran what I have installed on my machine. 6.0 stands in for pre-8.0, which doesn't include the codegen change.

> I'm wondering whether you see the same bump in execution time for Add (compared to Add_Separate) that I do.

The behavior seems consistent within each micro-architecture:

Ice Lake-SP, Gracemont: everything looks fine.
Zen 4: significantly slow for the pre-8.0 Add codegen, fine for the others.
Golden Cove: significantly slow for the post-8.0 codegen.

JulieLeeMSFT commented 1 month ago

@BruceForstall, PTAL when we get Meteor Lake laptops this year. cc @dotnet/jit-contrib.

BlackxSnow commented 1 month ago

> .net 7.0:
>
> | Method                                | Mean       | Error   | StdDev  |
> |-------------------------------------- |-----------:|--------:|--------:|
> | Property_ReadWrite_Write_Add          | 1,379.6 ns | 6.49 ns | 5.42 ns |
> | Property_ReadWrite_Write_Add_Separate |   196.1 ns | 2.78 ns | 2.47 ns |
>
> .net 8.0 (same on net9.0):
>
> | Method                                | Mean     | Error   | StdDev  |
> |-------------------------------------- |---------:|--------:|--------:|
> | Property_ReadWrite_Write_Add          | 194.6 ns | 2.45 ns | 2.17 ns |
> | Property_ReadWrite_Write_Add_Separate | 193.7 ns | 1.31 ns | 1.02 ns |
>
> so looks like everything is okay?

What architecture were these run on? Yours are the only results I've seen that mimic my system.

huoyaoyuan commented 1 month ago

> What architecture were these run on? Yours are the only results I've seen that mimic my system.

As I recall, Egor uses an R9-7950X, which is also Zen 4.