dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.47k stars 4.76k forks

Unexpected performance loss when using Vector<T> with different code order #77092

Open AsakusaRinne opened 2 years ago

AsakusaRinne commented 2 years ago

Description

I've recently started using Vector<T> and Vector256<T>. Most things work well, but I see a strange performance loss with Vector<T> on my Ubuntu server: Vector<T> becomes up to about 2.5 times slower when I simply swap the order of two lines of my code. Furthermore, on my Windows machine the same swap causes no obvious performance loss, which is very confusing.

Configuration

On Linux, my configuration is listed below. Both .NET 7.0 RC and .NET 6.0 were tested.

BenchmarkDotNet=v0.13.2, OS=ubuntu 22.04
Intel Xeon Gold 6148 CPU 2.40GHz, 1 CPU, 2 logical and 2 physical cores
.NET SDK=7.0.100-rc.2.22477.23
  [Host]     : .NET 7.0.0 (7.0.22.47203) and .NET 6.0.10 (6.0.1022.47605), X64 RyuJIT AVX2
  DefaultJob : .NET 7.0.0 (7.0.22.47203) and .NET 6.0.10 (6.0.1022.47605), X64 RyuJIT AVX2

On Windows, my configuration is listed below.

BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22000.1098/21H2)
12th Gen Intel Core i7-12700, 1 CPU, 20 logical and 12 physical cores
.NET SDK=7.0.100-preview.5.22307.18
  [Host]     : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT AVX2
  DefaultJob : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT AVX2

Regression?

Sorry, I'm not sure.

Data

In my benchmark, I use a Vector256 implementation as the baseline and compare the Vector<T> implementation against it. I'll show the core of the code and the results first, then append the full code at the end of this section.

The main part of the test is the code below.

public struct TestStruct<TData, TWrapper> where TData : unmanaged where TWrapper : Addable<TData>, new()
{
    public void TestVector(TData[] data, ref TData sum, int count)
    {
        TWrapper wrapper = new();  // Line A
        var r = new Vector<TData>();  // Line B
        int i = 0;
        for (; i < count - 8; i += 8)
        {
            var v = new Vector<TData>(data, i);
            r += v;
        }
        var r256 = r.AsVector256();
        for (int j = 0; j < 8; j++)
        {
            sum = wrapper.Add(sum, r256.GetElement(j));
        }
        for (; i < count; i++)
        {
            sum = wrapper.Add(sum, data[i]);
        }
    }
}

public interface Addable<T> where T : unmanaged
{
    T Add(T a, T b);
}
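A side note on the loop above: it hard-codes a stride of 8, which matches Vector<float>.Count only when Vector<T> is 256 bits wide, and `i < count - 8` leaves the last full block to the scalar tail. The following is a minimal sketch of a width-independent variant, not the code that was benchmarked; the class and method names (`VectorSum`, `Sum`) are hypothetical.

```cs
using System;
using System.Numerics;

public static class VectorSum
{
    // Hypothetical helper: sums a float array with Vector<float>,
    // stepping by the lane count reported at run time instead of a fixed 8.
    public static float Sum(float[] data)
    {
        int lanes = Vector<float>.Count;   // 8 when Vector<T> is 256 bits wide
        var acc = Vector<float>.Zero;
        int i = 0;

        // Consume every full block, including the last one.
        for (; i <= data.Length - lanes; i += lanes)
        {
            acc += new Vector<float>(data, i);
        }

        // Horizontal reduction through the Vector<T> indexer.
        float sum = 0f;
        for (int j = 0; j < lanes; j++)
        {
            sum += acc[j];
        }

        // Scalar tail for the remaining elements.
        for (; i < data.Length; i++)
        {
            sum += data[i];
        }
        return sum;
    }
}
```

Because the reduction goes through the Vector<T> indexer, this version needs neither AsVector256 nor the assumption of 8 lanes.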

The results on my Linux server are listed below (.NET 7). The .NET 6 results are almost the same.

| Method | Count | Mean | Error | StdDev |
|---|---:|---:|---:|---:|
| VectorT | 100 | 80.20 ns | 1.642 ns | 2.135 ns |
| Vector256 | 100 | 53.91 ns | 1.008 ns | 1.035 ns |
| VectorT | 1000 | 271.49 ns | 1.463 ns | 1.368 ns |
| Vector256 | 1000 | 238.35 ns | 1.763 ns | 1.649 ns |
| VectorT | 200000 | 33,103.32 ns | 190.716 ns | 178.396 ns |
| Vector256 | 200000 | 32,928.86 ns | 102.214 ns | 90.610 ns |

However, if we just swap Line A and Line B above, Vector<T> becomes much slower:

| Method | Count | Mean | Error | StdDev |
|---|---:|---:|---:|---:|
| VectorT | 100 | 117.86 ns | 0.795 ns | 0.705 ns |
| Vector256 | 100 | 59.31 ns | 0.714 ns | 0.668 ns |
| VectorT | 1000 | 615.73 ns | 3.470 ns | 3.245 ns |
| Vector256 | 1000 | 236.52 ns | 1.142 ns | 1.012 ns |
| VectorT | 200000 | 82,712.72 ns | 306.849 ns | 287.027 ns |
| Vector256 | 200000 | 32,898.00 ns | 85.778 ns | 76.040 ns |

What's more, this swap has little impact on my Windows machine, which is very confusing.

| Method | Count | Mean | Error | StdDev |
|---|---:|---:|---:|---:|
| VectorT | 100 | 33.19 ns | 0.342 ns | 0.320 ns |
| Vector256 | 100 | 20.86 ns | 0.061 ns | 0.054 ns |
| VectorT | 1000 | 84.42 ns | 0.618 ns | 0.516 ns |
| Vector256 | 1000 | 59.56 ns | 0.654 ns | 0.579 ns |
| VectorT | 200000 | 10,613.39 ns | 56.762 ns | 47.399 ns |
| Vector256 | 200000 | 10,769.77 ns | 139.315 ns | 130.315 ns |

The full code of my benchmark is listed below.

using System;
using System.Data;
using System.Diagnostics.CodeAnalysis;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

var summary = BenchmarkRunner.Run(typeof(Program).Assembly);
Console.WriteLine(summary);
public class UnitTestCompare
{
    [Params(100, 1000, 200000)]
    public int Count { get; set; }
    public float[] Data { get; set; }
    public float sum = 0;
    TestStruct<float, FloatMethodWrapper> t;
    [GlobalSetup]
    public void Setup()
    {
        Data = new float[Count];
        for (int i = 0; i < Count; i++)
        {
            Data[i] = new Random().NextSingle();
        }
    }
    [Benchmark]
    public void VectorT()
    {
        t.TestVector(Data, ref sum, Count);
    }
    [Benchmark]
    public unsafe void Vector256()
    {
        float[] res = new float[8];
        fixed (float* p = res)
        {
            fixed (float* d = Data)
            {
                var r = Avx2.LoadVector256(p);
                int i = 0;
                for (; i < Count - 8; i += 8)
                {
                    var v = Avx2.LoadVector256(d + i);
                    r = Avx2.Add(r, v);
                }
                unchecked
                {
                    for (int j = 0; j < 8; j++)
                    {
                        sum += r.GetElement(j);
                    }
                    for (; i < Count; i++)
                    {
                        sum += Data[i];
                    }
                }
            }
        }
    }
}

public struct TestStruct<TData, TWrapper> where TData : unmanaged where TWrapper : Addable<TData>, new()
{
    public void TestVector(TData[] data, ref TData sum, int count)
    {
        TWrapper wrapper = new();
        var r = new Vector<TData>();
        int i = 0;
        for (; i < count - 8; i += 8)
        {
            var v = new Vector<TData>(data, i);
            r += v;
        }
        var r256 = r.AsVector256();
        for (int j = 0; j < 8; j++)
        {
            sum = wrapper.Add(sum, r256.GetElement(j));
        }
        for (; i < count; i++)
        {
            sum = wrapper.Add(sum, data[i]);
        }
    }
}

public interface Addable<T> where T : unmanaged
{
    T Add(T a, T b);
}

public class FloatMethodWrapper : Addable<float>
{
    public float Add(float a, float b)
    {
        unchecked { return a + b; }
    }
}

I also tried adding [MethodImpl(MethodImplOptions.AggressiveOptimization)] and [MethodImpl(MethodImplOptions.AggressiveInlining)], but neither made a difference.

Analysis

Sadly, I'm too confused to figure out the problem. My understanding is that Vector<T> is effectively a wrapper over Vector256<T> when AVX2 is supported, so Vector<T> should be slightly slower than Vector256 on small inputs and close to it on large inputs. Is that right?

I would appreciate it if anyone could help explain this or point me to some references.
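One way to check the assumption that Vector<T> is 256 bits wide on a given machine is to print what the runtime reports. This is a minimal sketch; `VectorWidthCheck` and its methods are hypothetical names, and the output will differ between machines and runtime settings.

```cs
using System;
using System.Numerics;
using System.Runtime.Intrinsics.X86;

public static class VectorWidthCheck
{
    // Width of Vector<float> in bits, as chosen by the runtime on this machine.
    public static int LaneBits() => Vector<float>.Count * sizeof(float) * 8;

    // Hypothetical helper: one-line summary of the SIMD configuration.
    public static string Describe()
    {
        return $"Vector.IsHardwareAccelerated={Vector.IsHardwareAccelerated}, " +
               $"Vector<float>.Count={Vector<float>.Count} ({LaneBits()} bits), " +
               $"Avx2.IsSupported={Avx2.IsSupported}";
    }
}
```

Calling `Console.WriteLine(VectorWidthCheck.Describe())` on both the Linux and the Windows machine would show whether the two runs use the same vector width.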

dotnet-issue-labeler[bot] commented 2 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics See info in area-owners.md if you want to be subscribed.

Issue Details
Author: AsakusaRinne
Assignees: -
Labels: `area-System.Runtime.Intrinsics`, `tenet-performance`, `untriaged`
Milestone: -
gfoidl commented 2 years ago

Besides the codegen issue, and since you're on .NET 7: instead of TWrapper wrapper = new(); you could use static abstract interface members (coming with C# 11 and .NET 7) to avoid that line entirely.
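A minimal sketch of that suggestion, assuming C# 11 and .NET 7 or later. The names `IStaticAddable`, `FloatAdder`, and `StaticAbstractSum` are hypothetical and do not come from the issue:

```cs
using System;
using System.Numerics;

// Hypothetical interface: Add is a static abstract member,
// so no wrapper instance is ever created.
public interface IStaticAddable<T> where T : unmanaged
{
    static abstract T Add(T a, T b);
}

public readonly struct FloatAdder : IStaticAddable<float>
{
    public static float Add(float a, float b) => a + b;
}

public static class StaticAbstractSum
{
    public static T Sum<T, TOps>(T[] data, T seed)
        where T : unmanaged
        where TOps : IStaticAddable<T>
    {
        int lanes = Vector<T>.Count;
        var acc = Vector<T>.Zero;
        int i = 0;
        for (; i <= data.Length - lanes; i += lanes)
        {
            acc += new Vector<T>(data, i);
        }

        T sum = seed;
        for (int j = 0; j < lanes; j++)
        {
            sum = TOps.Add(sum, acc[j]);    // static call through the type parameter
        }
        for (; i < data.Length; i++)
        {
            sum = TOps.Add(sum, data[i]);   // scalar tail
        }
        return sum;
    }
}
```

A call would look like `StaticAbstractSum.Sum<float, FloatAdder>(data, 0f)`; there is no `new()` constraint and no `TWrapper wrapper = new();` line.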

AsakusaRinne commented 2 years ago

Yes, thank you. It's a nice feature of C# 11, and I'm currently updating my library to C# 11. That would avoid the performance loss, but I think the reason behind this behavior may still be interesting :)

By the way, if I need to keep my library compatible with earlier .NET Core versions, I cannot use the new interfaces such as IAdditionOperators, which all the numeric types implement. Do you think there's still a good way to use static abstract interfaces to optimize the code? My library is full of code like TWrapper above, and that's annoying.🤣
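For targets that predate static abstract members, one commonly used alternative is to make the wrapper a struct and constrain the type parameter to `struct`, so that `default` yields an instance without any allocation. The sketch below only illustrates that pattern; the names `IAdder`, `FloatStructAdder`, and `StructWrapperSum` are hypothetical, and whether it changes the numbers in this issue is untested.

```cs
using System;

// Hypothetical instance-based interface, usable on older C# versions.
public interface IAdder<T> where T : unmanaged
{
    T Add(T a, T b);
}

// The wrapper is a struct, so it can be created with `default`.
public readonly struct FloatStructAdder : IAdder<float>
{
    public float Add(float a, float b) => a + b;
}

public static class StructWrapperSum
{
    public static T Sum<T, TAdder>(T[] data, T seed)
        where T : unmanaged
        where TAdder : struct, IAdder<T>
    {
        TAdder adder = default;   // no heap allocation, no new() constraint
        T sum = seed;
        for (int i = 0; i < data.Length; i++)
        {
            sum = adder.Add(sum, data[i]);
        }
        return sum;
    }
}
```

The design choice here is the `struct` constraint: the runtime compiles a separate instantiation for each value-type argument, so the call to `Add` does not have to go through interface dispatch.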