dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

TensorPrimitives does not exploit its full potential #101326

Open msedi opened 7 months ago

msedi commented 7 months ago

Description

We do a lot of data processing on large arrays. On .NET Framework we had to create our own library (ArrayMath) to avoid repetitive code. We started with pointer-based arithmetic since, at the time, regular loops were too slow, then moved over to `Vector<T>` and `Span<T>`, which now gives us the best performance.

Since the introduction of TensorPrimitives I thought we could get rid of our own implementation, so I ran a few benchmarks. I found that our Max implementation was really poor (TensorPrimitives is almost 4-5x faster), since we only implemented it with Vector128.

So I did further benchmarks comparing ArrayMath.MulAdd and TensorPrimitives.MultiplyAdd on images of 512x512 floats.

So this is currently the result:

| Method                    | Mean      | Error    | StdDev    | Median    | Allocated |
|-------------------------- |----------:|---------:|----------:|----------:|----------:|
| Add_With_ArrayMath        |  89.55 us | 4.863 us | 14.339 us |  79.59 us |         - |
| Add_With_TensorPrimitives | 154.35 us | 2.870 us |  4.951 us | 151.50 us |         - |

Configuration

BenchmarkDotNet v0.13.13-nightly.20240213.132, Windows 11 (10.0.22631.3447/23H2/2023Update/SunValley3)
Intel Core i9-10900X CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.200
  [Host]     : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX-512F+CD+BW+DQ+VL
  DefaultJob : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX-512F+CD+BW+DQ+VL

Regression?

There is currently no regression, but I think it would be worth checking this.

Data

Here's the BenchmarkDotNet code:

  [MemoryDiagnoser]
  public class TensorPrimitivesBenchmark
  {
      public float[] Image1;
      public float[] Image2;
      public float[] Image3;
      public const int N = 512;

      [GlobalSetup]
      public void GlobalSetup()
      {
          Image1 = new float[N * N];
          Image2 = new float[N * N];
          Image3 = new float[N * N];

          for (int i = 0; i < Image1.Length; i++)
          {
              Image1[i] = i;
              Image2[i] = i;
              Image3[i] = i;
          }
      }

      [Benchmark]
      public void Add_With_ArrayMath() 
          => ArrayMath.MulAdd(Image1, Image2, Image3, Image1);

      [Benchmark]
      public void Add_With_TensorPrimitives()
          => TensorPrimitives.MultiplyAdd(Image1, Image2, Image3, Image1);    
  }
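For completeness, the issue doesn't show how the benchmark is launched; a typical entry point (my sketch, assuming the standard BenchmarkDotNet workflow) would look like:

```csharp
using BenchmarkDotNet.Running;

public class Program
{
    // BenchmarkDotNet spawns a child process per benchmark; a Release build
    // is required for meaningful numbers.
    public static void Main()
        => BenchmarkRunner.Run<TensorPrimitivesBenchmark>();
}
```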

And this is what our ArrayMath.MulAdd looks like:

 [MethodImpl(MethodImplOptions.AggressiveOptimization)]
 [SkipLocalsInit]
 public static Span<T> MulAdd<T>(in ReadOnlySpan<T> A, in ReadOnlySpan<T> B, in ReadOnlySpan<T> C, Span<T> result = default) where T : struct
 {
     result = New(A, B, C, result);

     int i = 0;
     int N = A.Length;

     if (IsVectorSupported<T>.AddMul)
     {
         ReadOnlySpan<Vector<T>> aAVec = MemoryMarshal.Cast<T, Vector<T>>(A);

         if (aAVec.Length > 0)
         {
             int S = Vector<T>.Count;
             int R = N / S;
             i = R * S;

             ReadOnlySpan<Vector<T>> aBVec = MemoryMarshal.Cast<T, Vector<T>>(B);
             ReadOnlySpan<Vector<T>> aCVec = MemoryMarshal.Cast<T, Vector<T>>(C);
             Span<Vector<T>> aResVec = MemoryMarshal.Cast<T, Vector<T>>(result);

             for (int idx = 0; idx < R; idx++)
                 aResVec[idx] = (aAVec[idx] * aBVec[idx]) + aCVec[idx];
         }
     }

     // Scalar tail for any elements remaining after the vectorized loop.
     for (; (uint)i < (uint)N; i++)
     {
         result[i] = MathUtil.MulAdd(A[i], B[i], C[i]);
     }

     return result;
 }
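The helpers `New(...)`, `MathUtil.MulAdd(...)`, and `IsVectorSupported<T>` are not shown in the issue; a minimal sketch of what they might look like (assumptions, not the actual ArrayMath internals) is:

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;

internal static class MathUtil
{
    // Assumed scalar fallback: (a * b) + c. The real ArrayMath constrains T to
    // `struct`; INumber<T> (.NET 7+) is used here just to keep the sketch short.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static T MulAdd<T>(T a, T b, T c) where T : INumber<T>
        => (a * b) + c;
}

internal static class ArrayMathHelpers
{
    // Assumed behavior of New(...): return the caller-supplied buffer, or
    // allocate one matching the input length when none was passed.
    public static Span<T> New<T>(ReadOnlySpan<T> a, Span<T> result) where T : struct
        => result.IsEmpty ? new T[a.Length] : result;
}
```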

It seems that with tremendously less effort (no offense!) our library gets faster results than TensorPrimitives, and I find it a bit sad that the effort invested in TensorPrimitives doesn't pay off. Maybe there's a good explanation, but I couldn't find one.

dotnet-policy-service[bot] commented 7 months ago

Tagging subscribers to this area: @dotnet/area-system-numerics See info in area-owners.md if you want to be subscribed.

tannergooding commented 7 months ago

Short Response

This is definitely something for us to look more into, but the difference seen can largely be summed up as a quirk of how the benchmark was written and the results being heavily skewed by that.

Long Response

Collapsed since this is indeed long and goes into a deeper dive into how this all works.

> It seems that with tremendously less effort (no offense!) in our library we get faster results compared to the TensorPrimitives, which I find a bit sad that the efforts in TensorPrimitives do not pay out. Maybe there's a good explanation, but I couldn't find one.

It's somewhat problematic to run a single microbenchmark and take its result as proof that another implementation is faster. This doesn't factor in that there can be a significant difference in the results when considering the total set of targetable hardware, input sizes, contention levels, etc.

For the below, I went into this having a good guess as to what the issue was, but I've walked through it all to hopefully help others have a better understanding as well.

### Initial benchmark

It would be easier to give a more exact comparison if you shared the implementations of `New(...)` and `MathUtil.MulAdd(...)` so that the benchmark could be reproduced locally. However, I'm assuming below that `New(...)` does something like `(result == default) ? new T[...] : result` and that `MathUtil.MulAdd(...)` does something like `(A[i] * B[i]) + C[i]`...

Given that, I ran the benchmark for various input sizes: `[1, 4, 16, 64, 256, 1024, 4096, 16384, 65536, 262144]`, giving the following:

```
// * Summary *

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3447/23H2/2023Update/SunValley3)
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 9.0.100-preview.3.24204.13
  [Host]     : .NET 9.0.0 (9.0.24.17209), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 9.0.0 (9.0.24.17209), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
```

| Method                    | Length |          Mean |       Error |      StdDev | Allocated |
|-------------------------- |------- |--------------:|------------:|------------:|----------:|
| Add_With_ArrayMath        | 1      |      4.191 ns |   0.0734 ns |   0.0687 ns |         - |
| Add_With_TensorPrimitives | 1      |      4.769 ns |   0.0176 ns |   0.0156 ns |         - |
| Add_With_ArrayMath        | 4      |      6.182 ns |   0.0745 ns |   0.0697 ns |         - |
| Add_With_TensorPrimitives | 4      |      4.756 ns |   0.0636 ns |   0.0595 ns |         - |
| Add_With_ArrayMath        | 16     |      6.117 ns |   0.0578 ns |   0.0541 ns |         - |
| Add_With_TensorPrimitives | 16     |      8.756 ns |   0.0748 ns |   0.0699 ns |         - |
| Add_With_ArrayMath        | 64     |     10.873 ns |   0.0782 ns |   0.0653 ns |         - |
| Add_With_TensorPrimitives | 64     |     11.423 ns |   0.0627 ns |   0.0556 ns |         - |
| Add_With_ArrayMath        | 256    |     27.721 ns |   0.5426 ns |   0.4810 ns |         - |
| Add_With_TensorPrimitives | 256    |     21.301 ns |   0.0922 ns |   0.0862 ns |         - |
| Add_With_ArrayMath        | 1024   |     93.974 ns |   1.0594 ns |   0.9392 ns |         - |
| Add_With_TensorPrimitives | 1024   |     62.080 ns |   0.7017 ns |   0.6563 ns |         - |
| Add_With_ArrayMath        | 4096   |    479.508 ns |   2.6640 ns |   2.4919 ns |         - |
| Add_With_TensorPrimitives | 4096   |    305.224 ns |   1.1319 ns |   1.0588 ns |         - |
| Add_With_ArrayMath        | 16384  |  1,800.037 ns |   7.8216 ns |   7.3163 ns |         - |
| Add_With_TensorPrimitives | 16384  |  1,168.430 ns |  11.4901 ns |  10.7478 ns |         - |
| Add_With_ArrayMath        | 65536  |  7,275.639 ns |  46.2484 ns |  36.1077 ns |         - |
| Add_With_TensorPrimitives | 65536  |  5,254.325 ns |  64.9089 ns |  57.5401 ns |         - |
| Add_With_ArrayMath        | 262144 | 29,758.965 ns | 558.1937 ns | 548.2209 ns |         - |
| Add_With_TensorPrimitives | 262144 | 48,203.521 ns | 359.9023 ns | 319.0440 ns |         - |

This shows that outside of some minor noise for very small sizes, `TensorPrimitives` is clearly faster up until the last entry.

### So why is it slower?

The answer comes down to understanding how `TensorPrimitives` operates, how BenchmarkDotNet works, and finally what the benchmark in question is actually testing.

For the first, `TensorPrimitives` basically dispatches to the largest supported vector size out of V128, V256, and V512. It uses some clever logic to minimize overall branching, do the least number of operations, and generally try to stay fast. This includes attempting to align data to minimize cache-line splits for stores, and it allows the use of non-temporal stores if the total data length is over a specific threshold. This threshold happens to be [256KB](https://source.dot.net/#System.Numerics.Tensors/System/Numerics/Tensors/netcore/Common/TensorPrimitives.IUnaryOperator.cs,1ad6502a29024d1c), which just so happens to be the bucket the last benchmark falls into.

Now, a non-temporal store is just like a regular store, only it is allowed to bypass the cache. This means that at the surface level it is slower, but it actually ends up being significantly faster in real-world code, especially when other threads are doing work. This is because any time you do a load or store, data ends up polluting the cache, and that necessitates eviction of older data from the cache. If you're operating on more than half the L3's capacity, you're going to be forcing all subsequent loads/stores to be significantly slower. By avoiding polluting the cache, you can avoid "cache thrashing" and help keep the overall workload faster.

We then need to understand [how BenchmarkDotNet works](https://benchmarkdotnet.org/articles/guides/how-it-works.html), which can basically be summed up as: each `Benchmark` is launched in its own process.
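As an aside, the non-temporal stores described above map to instructions like MOVNTPS; a minimal illustration (my sketch, not the TensorPrimitives implementation) of issuing them through the x86 intrinsics surface:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static unsafe class NonTemporalDemo
{
    // Copies src to dst with stores that bypass the cache hierarchy.
    // dst must be 32-byte aligned; length must be a multiple of Vector256<float>.Count.
    public static void CopyNonTemporal(float* src, float* dst, int length)
    {
        if (Avx.IsSupported)
        {
            for (int i = 0; i < length; i += Vector256<float>.Count)
                Avx.StoreAlignedNonTemporal(dst + i, Avx.LoadVector256(src + i));

            Sse.StoreFence(); // order the non-temporal writes before later accesses
        }
        else
        {
            // Portable fallback for hardware without AVX.
            new ReadOnlySpan<float>(src, length).CopyTo(new Span<float>(dst, length));
        }
    }
}
```

An aligned destination for this would typically come from `NativeMemory.AlignedAlloc(byteCount, 32)`.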
However, iterations of a benchmark run within that process, and thus if you have a `[GlobalSetup]`, such fields are reused across each iteration of a benchmark (but not across 2 distinct benchmarks).

Given that, and since the benchmark in question uses a `[GlobalSetup]` to initialize 3 arrays to the appropriate length, the first iteration of a benchmark starts from a "clean slate" while the second and subsequent iterations are slightly skewed by reuse of the same arrays, which are now in the cache. That, in turn, means the use of non-temporal stores and the reuse of `Image1` as both a source and the destination will skew the results.

This can somewhat be seen by changing the test to use a different buffer for the destination (that is, adding a fourth array `Result`):

| Method                    | Length |          Mean |       Error |      StdDev | Allocated |
|-------------------------- |------- |--------------:|------------:|------------:|----------:|
| Add_With_ArrayMath        | 1      |      4.234 ns |   0.0337 ns |   0.0281 ns |         - |
| Add_With_TensorPrimitives | 1      |      4.592 ns |   0.0253 ns |   0.0224 ns |         - |
| Add_With_ArrayMath        | 4      |      6.177 ns |   0.0490 ns |   0.0458 ns |         - |
| Add_With_TensorPrimitives | 4      |      4.731 ns |   0.0464 ns |   0.0411 ns |         - |
| Add_With_ArrayMath        | 16     |      5.863 ns |   0.0184 ns |   0.0153 ns |         - |
| Add_With_TensorPrimitives | 16     |      8.014 ns |   0.0898 ns |   0.0840 ns |         - |
| Add_With_ArrayMath        | 64     |     10.070 ns |   0.0300 ns |   0.0280 ns |         - |
| Add_With_TensorPrimitives | 64     |     10.684 ns |   0.0577 ns |   0.0539 ns |         - |
| Add_With_ArrayMath        | 256    |     28.272 ns |   0.5779 ns |   1.4283 ns |         - |
| Add_With_TensorPrimitives | 256    |     24.961 ns |   0.2472 ns |   0.2312 ns |         - |
| Add_With_ArrayMath        | 1024   |    101.812 ns |   1.3147 ns |   1.2298 ns |         - |
| Add_With_TensorPrimitives | 1024   |     71.354 ns |   0.4869 ns |   0.4554 ns |         - |
| Add_With_ArrayMath        | 4096   |    466.880 ns |   2.8492 ns |   2.6651 ns |         - |
| Add_With_TensorPrimitives | 4096   |    405.381 ns |   2.9016 ns |   2.7141 ns |         - |
| Add_With_ArrayMath        | 16384  |  6,258.703 ns |  13.5436 ns |  10.5740 ns |         - |
| Add_With_TensorPrimitives | 16384  |  2,550.121 ns |  17.8553 ns |  16.7019 ns |         - |
| Add_With_ArrayMath        | 65536  |  7,998.220 ns |  52.1685 ns |  46.2461 ns |         - |
| Add_With_TensorPrimitives | 65536  |  7,040.351 ns |  19.4208 ns |  15.1624 ns |         - |
| Add_With_ArrayMath        | 262144 | 34,084.083 ns | 128.1989 ns | 107.0518 ns |         - |
| Add_With_TensorPrimitives | 262144 | 33,502.424 ns | 194.1741 ns | 172.1303 ns |         - |

You'll note that not only are the runs slower on average for both, but the gap between them shrinks as we reach the non-temporal threshold. However, `TensorPrimitives` remains faster for the largest length rather than being measurably slower, which helps show off some of the nuance in how non-temporal stores affect run-to-run relevance.

### Is there anything to be done?

The answer to this next question is, "well, maybe, but it's impossible to say for sure". This is simply because the "best" implementation depends on a huge number of factors that will be scenario specific. What we could do to improve things is maybe two-fold:

1. We could consider handling non-temporal operations differently if we know that the destination buffer and a source buffer overlap. In this scenario, we know that we're polluting the cache anyway due to the loads, and so we could potentially win some perf back by either also doing non-temporal loads or by skipping the non-temporal stores.
2. We could update the `NonTemporalByteThreshold` to be more exact for your CPU. Many modern CPUs have significantly larger L3 caches (some even going up to 64-256MB), and so querying that and using it as the actual threshold, rather than an "average" threshold, can help avoid pessimizing CPUs with larger caches.

But, as said above, whether either of these will be that beneficial in practice is hard to determine.
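The first option above could be approximated with an aliasing check before choosing the store strategy; a hypothetical sketch (the names `ShouldUseNonTemporalStores` and `nonTemporalByteThreshold` are illustrative, not TensorPrimitives internals) built on the existing `MemoryExtensions.Overlaps` API:

```csharp
using System;
using System.Runtime.CompilerServices;

internal static class NonTemporalDispatch
{
    // Only pick non-temporal stores when the data is large enough AND the
    // destination does not alias either source; overlapping buffers pollute
    // the cache via the loads anyway, so the non-temporal path buys little.
    public static bool ShouldUseNonTemporalStores<T>(
        ReadOnlySpan<T> x,
        ReadOnlySpan<T> y,
        Span<T> destination,
        long nonTemporalByteThreshold)
    {
        long totalBytes = (long)destination.Length * Unsafe.SizeOf<T>();

        // Below the threshold, regular cached stores win.
        if (totalBytes < nonTemporalByteThreshold)
            return false;

        ReadOnlySpan<T> dest = destination;
        return !dest.Overlaps(x) && !dest.Overlaps(y);
    }
}
```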
If you're only doing single-threaded work, then non-temporal accesses basically don't help much and can overall hurt things. They can also hurt some for overlapping buffers. On the other hand, as the length gets bigger and the risk of cache contention increases, they can allow for significantly faster overall application performance, even if the single thread is now slower.

### Bonus Perf Results

Just to complete the analysis, here are the results for running the tests with a completely new array allocation for each iteration (which adds a little more baseline overhead). This helps remove some of the noise between the runs, but not necessarily all of it, and so can give a slightly better view of the impact when you're not factoring in as much cache pollution from a previous run. Also keep in mind that results on your own machine may vary a bit and that `NonTemporalByteThreshold` isn't a perfect match for your exact hardware (so it's kicking in sooner than strictly appropriate on my AMD Ryzen 7950X, for example).
#### Image1 and Result are the same buffer

| Method                    | Length |          Mean |        Error |        StdDev |     Gen0 |     Gen1 |     Gen2 | Allocated |
|-------------------------- |------- |--------------:|-------------:|--------------:|---------:|---------:|---------:|----------:|
| Add_With_ArrayMath        | 1      |      14.47 ns |     0.299 ns |      0.320 ns |   0.0057 |        - |        - |      96 B |
| Add_With_TensorPrimitives | 1      |      14.89 ns |     0.238 ns |      0.198 ns |   0.0057 |        - |        - |      96 B |
| Add_With_ArrayMath        | 4      |      15.98 ns |     0.166 ns |      0.139 ns |   0.0072 |        - |        - |     120 B |
| Add_With_TensorPrimitives | 4      |      15.72 ns |     0.332 ns |      0.369 ns |   0.0072 |        - |        - |     120 B |
| Add_With_ArrayMath        | 16     |      17.17 ns |     0.210 ns |      0.186 ns |   0.0158 |        - |        - |     264 B |
| Add_With_TensorPrimitives | 16     |      18.86 ns |     0.325 ns |      0.304 ns |   0.0158 |        - |        - |     264 B |
| Add_With_ArrayMath        | 64     |      29.30 ns |     0.592 ns |      0.886 ns |   0.0502 |   0.0002 |        - |     840 B |
| Add_With_TensorPrimitives | 64     |      29.05 ns |     0.366 ns |      0.325 ns |   0.0502 |   0.0002 |        - |     840 B |
| Add_With_ArrayMath        | 256    |      74.32 ns |     0.581 ns |      0.485 ns |   0.1879 |   0.0026 |        - |    3144 B |
| Add_With_TensorPrimitives | 256    |      70.54 ns |     1.422 ns |      1.521 ns |   0.1879 |   0.0026 |        - |    3144 B |
| Add_With_ArrayMath        | 1024   |     293.33 ns |     5.205 ns |      4.614 ns |   0.7362 |   0.0405 |        - |   12360 B |
| Add_With_TensorPrimitives | 1024   |     242.21 ns |     4.848 ns |      4.762 ns |   0.7365 |   0.0408 |        - |   12360 B |
| Add_With_ArrayMath        | 4096   |     804.96 ns |    11.538 ns |     10.792 ns |   2.9325 |   0.7324 |        - |   49224 B |
| Add_With_TensorPrimitives | 4096   |     636.99 ns |    11.172 ns |     10.450 ns |   2.9325 |   0.7324 |        - |   49224 B |
| Add_With_ArrayMath        | 16384  |   2,577.16 ns |    24.193 ns |     22.630 ns |  11.7188 |   7.8125 |        - |  196680 B |
| Add_With_TensorPrimitives | 16384  |   2,037.59 ns |    15.738 ns |     13.952 ns |  11.7188 |   7.8125 |        - |  196680 B |
| Add_With_ArrayMath        | 65536  |  37,638.10 ns |   329.949 ns |    308.635 ns | 249.9390 | 249.9390 | 249.9390 |  786588 B |
| Add_With_TensorPrimitives | 65536  |  35,651.46 ns |   408.862 ns |    382.450 ns | 249.9390 | 249.9390 | 249.9390 |  786588 B |
| Add_With_ArrayMath        | 262144 | 201,937.37 ns | 3,968.504 ns |  8,195.649 ns | 495.3613 | 495.3613 | 495.3613 | 3145931 B |
| Add_With_TensorPrimitives | 262144 | 244,793.83 ns | 5,416.888 ns | 15,801.317 ns | 499.0234 | 499.0234 | 499.0234 | 3145934 B |

#### Image1 and Result are distinct buffers

| Method                    | Length |          Mean |         Error |        StdDev |     Gen0 |     Gen1 |     Gen2 | Allocated |
|-------------------------- |------- |--------------:|--------------:|--------------:|---------:|---------:|---------:|----------:|
| Add_With_ArrayMath        | 1      |      15.74 ns |      0.333 ns |      0.574 ns |   0.0076 |        - |        - |     128 B |
| Add_With_TensorPrimitives | 1      |      17.17 ns |      0.247 ns |      0.206 ns |   0.0076 |        - |        - |     128 B |
| Add_With_ArrayMath        | 4      |      17.26 ns |      0.189 ns |      0.168 ns |   0.0095 |        - |        - |     160 B |
| Add_With_TensorPrimitives | 4      |      17.09 ns |      0.257 ns |      0.241 ns |   0.0095 |        - |        - |     160 B |
| Add_With_ArrayMath        | 16     |      19.61 ns |      0.394 ns |      0.454 ns |   0.0210 |        - |        - |     352 B |
| Add_With_TensorPrimitives | 16     |      20.59 ns |      0.222 ns |      0.197 ns |   0.0210 |        - |        - |     352 B |
| Add_With_ArrayMath        | 64     |      34.80 ns |      0.693 ns |      0.681 ns |   0.0669 |   0.0002 |        - |    1120 B |
| Add_With_TensorPrimitives | 64     |      34.18 ns |      0.701 ns |      0.656 ns |   0.0669 |   0.0002 |        - |    1120 B |
| Add_With_ArrayMath        | 256    |      91.31 ns |      1.400 ns |      1.241 ns |   0.2506 |   0.0039 |        - |    4192 B |
| Add_With_TensorPrimitives | 256    |      87.29 ns |      1.396 ns |      1.306 ns |   0.2506 |   0.0039 |        - |    4192 B |
| Add_With_ArrayMath        | 1024   |     327.98 ns |      6.580 ns |     12.988 ns |   0.9818 |   0.0610 |        - |   16480 B |
| Add_With_TensorPrimitives | 1024   |     306.69 ns |      5.853 ns |     13.331 ns |   0.9818 |   0.0610 |        - |   16480 B |
| Add_With_ArrayMath        | 4096   |     909.82 ns |     18.064 ns |     34.804 ns |   3.9101 |   0.9766 |        - |   65632 B |
| Add_With_TensorPrimitives | 4096   |     813.29 ns |     16.052 ns |     23.528 ns |   3.9101 |   0.9766 |        - |   65632 B |
| Add_With_ArrayMath        | 16384  |   3,074.07 ns |     54.632 ns |     53.656 ns |  15.6250 |  15.6212 |        - |  262240 B |
| Add_With_TensorPrimitives | 16384  |   2,728.83 ns |     53.497 ns |     50.041 ns |  15.6250 |  15.6212 |        - |  262240 B |
| Add_With_ArrayMath        | 65536  |  49,784.38 ns |    933.758 ns |  1,368.691 ns | 210.5713 | 210.5103 | 210.5103 | 1048709 B |
| Add_With_TensorPrimitives | 65536  |  51,611.09 ns |    900.472 ns |    842.302 ns | 208.2520 | 208.1909 | 208.1909 | 1048708 B |
| Add_With_ArrayMath        | 262144 | 540,826.40 ns | 21,972.949 ns | 64,787.736 ns | 957.0313 | 957.0313 | 957.0313 | 4194698 B |
| Add_With_TensorPrimitives | 262144 | 612,220.95 ns | 21,652.022 ns | 63,841.476 ns | 971.6797 | 971.6797 | 971.6797 | 4194703 B |