dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

WebAssembly vectorization perf on interpreter: observations/questions #76738

Open SteveSandersonMS opened 2 years ago

SteveSandersonMS commented 2 years ago

Description

Vector operations like Vector128.Max are ~15x slower than manually doing the equivalent non-vector operations on each component in series. I know the interpreter does not yet support vectorization (and that it's an AOT-only feature); this is just a question about whether the interpreter should be expected to slow down so much compared with non-vectorized code on the interpreter.

Repro

- Create an empty Blazor WebAssembly app using a .NET 8.0 nightly SDK (I tried 8.0.100-alpha.1.22478.26)
- Change `Index.razor` to the code below
- Set the SIMD- and AOT-related properties to `true` in the csproj
- Compare performance in an interpreter build (`dotnet run`) and an AOT build (`dotnet publish -c Release`)

Results:

|                    | Interpreter | AOT |
| ------------------ | ----------- | --- |
| **Non-vectorized** | 420ms       | 8ms |
| **Vectorized**     | 6200ms      | 3ms |

It's worth pointing out that the performance with AOT is really superb and the runtime team should be very proud of this! It's within 20% of the numbers I get when running the same code in release-mode on desktop CoreCLR with JIT, which is phenomenal.

The only question is: should we really see the 6200ms number in the bottom left, or could it be closer to the 420ms in the top left?

Code:

```razor
@page "/"
@using System.Diagnostics
@using System.Runtime.Intrinsics
@using System.Runtime.InteropServices

<h1>Find the biggest number</h1>
<p>Count: <input type="number" @bind="count" /></p>
<button @onclick="FindMax_NonVectorized">Non-vectorized</button>
<button @onclick="FindMax_Vectorized">Vectorized</button>

<p>@message</p>

@code {
    int count = 5_000_000;
    float[]? values;
    string? message;

    void CreateNumbers()
    {
        if (values?.Length == count)
            return;
        values = new float[count];
        var rng = new Random(0);
        for (var i = 0; i < values.Length; i++)
            values[i] = (float)(1000000 * rng.NextDouble());
    }

    void FindMax_NonVectorized()
    {
        CreateNumbers();

        var sw = new Stopwatch();
        sw.Start();

        var max = float.MinValue;
        for (var i = 0; i < values!.Length; i++)
            if (values[i] > max)
                max = values[i];

        sw.Stop();
        message = $"Max value is {max} in {sw.ElapsedMilliseconds:F0}ms";
    }

    void FindMax_Vectorized()
    {
        CreateNumbers();

        var sw = new Stopwatch();
        sw.Start();

        var values_vec = MemoryMarshal.Cast<float, Vector128<float>>(values);

        var max_vec = values_vec[0];
        for (var i = 1; i < values_vec.Length; i++)
            max_vec = Vector128.Max(max_vec, values_vec[i]);

        var max = float.MinValue;
        for (var i = 0; i < 4; i++)
            max = Math.Max(max, max_vec[i]);

        sw.Stop();
        message = $"Max value is {max} in {sw.ElapsedMilliseconds:F0}ms";
    }
}
```

Analysis

I guess this could be understood as "wow look the interpreter has great optimizations for pure numeric operations", and that the additional cost of calling Vector.Max etc (versus a max over each component in series) is simply the cost of doing a .NET method call.

If that's the case maybe there isn't much to be done about it. Theoretically I guess it would be possible for the interpreter to recognize Vector.* calls and translate them into an equivalent series of inline numerical operations. But if we're going to such lengths, maybe the interpreter should translate them into vectorized operations and that's the longer term plan anyway.
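For illustration, the kind of scalar expansion described above (a hypothetical sketch, not anything the interpreter actually does today) would turn a single `Vector128.Max` call over floats into four inline per-lane comparisons:

```csharp
using System.Runtime.Intrinsics;

static class ScalarExpansionSketch
{
    // Hypothetical: what Vector128.Max(left, right) for floats would look like
    // if the interpreter expanded it into per-lane scalar operations inline,
    // avoiding the method-call overhead per vector operation.
    public static Vector128<float> MaxExpanded(Vector128<float> left, Vector128<float> right)
    {
        return Vector128.Create(
            left.GetElement(0) > right.GetElement(0) ? left.GetElement(0) : right.GetElement(0),
            left.GetElement(1) > right.GetElement(1) ? left.GetElement(1) : right.GetElement(1),
            left.GetElement(2) > right.GetElement(2) ? left.GetElement(2) : right.GetElement(2),
            left.GetElement(3) > right.GetElement(3) ? left.GetElement(3) : right.GetElement(3));
    }
}
```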

ghost commented 2 years ago

Tagging subscribers to 'arch-wasm': @lewing See info in area-owners.md if you want to be subscribed.

Author: SteveSandersonMS
Assignees: -
Labels: `arch-wasm`, `tenet-performance`
Milestone: -
dotnet-issue-labeler[bot] commented 2 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 2 years ago

Tagging subscribers to this area: @brzvlad See info in area-owners.md if you want to be subscribed.

Issue Details
### Description Vector operations like `Vector128.Max` are ~15x slower than manually doing the equivalent non-vector operations on each component in series. I know the interpreter does not yet support vectorization (and that it's an AOT-only feature); this is just a question about whether the interpreter should be expected to slow down so much compared with non-vectorized code on the interpreter. ### Repro * Create an Empty Blazor WebAssembly app using .NET 8.0 nightly SDK (I tried 8.0.100-alpha.1.22478.26) * Change `Index.razor` to the code below * Add `true` and `true` to the csproj * Compare performance in both interpreter builds (`dotnet run`) and AOT builds (via `dotnet publish -c Release`) Results: | | Interpreter | AOT | | ------------- | ------------- |------------- | | **Non-vectorized** | 420ms | 8ms | | **Vectorized** | 6200ms | 3ms | It's worth pointing out that **the performance with AOT is really superb and the runtime team should be very proud of this**! It's within 20% of the numbers I get when running the same code in release-mode on desktop CoreCLR with JIT, which is phenomenal. The only question is: should we really see the 6200ms number in the bottom left, or could it be closer to the 420ms in the top left. Code: ```razor @page "/" @using System.Diagnostics @using System.Runtime.Intrinsics @using System.Runtime.InteropServices

Find the biggest number

Count:

@message

@code { int count = 5_000_000; float[]? values; string? message; void CreateNumbers() { if (values?.Length == count) return; values = new float[count]; var rng = new Random(0); for (var i = 0; i < values.Length; i++) values[i] = (float)(1000000 * rng.NextDouble()); } void FindMax_NonVectorized() { CreateNumbers(); var sw = new Stopwatch(); sw.Start(); var max = float.MinValue; for (var i = 0; i < values!.Length; i++) if (values[i] > max) max = values[i]; sw.Stop(); message = $"Max value is {max} in {sw.ElapsedMilliseconds:F0}ms"; } void FindMax_Vectorized() { CreateNumbers(); var sw = new Stopwatch(); sw.Start(); var values_vec = MemoryMarshal.Cast>(values); var max_vec = values_vec[0]; for (var i = 1; i < values_vec.Length; i++) max_vec = Vector128.Max(max_vec, values_vec[i]); var max = float.MinValue; for (var i = 0; i < 4; i++) max = Math.Max(max, max_vec[i]); sw.Stop(); message = $"Max value is {max} in {sw.ElapsedMilliseconds:F0}ms"; } } ``` ### Analysis I guess this could be understood as "wow look the interpreter has great optimizations for pure numeric operations", and that the additional cost of calling `Vector.Max` etc (versus a max over each component in series) is simply the cost of doing a .NET method call. If that's the case maybe there isn't much to be done about it. Theoretically I guess it would be possible for the interpreter to recognize `Vector.*` calls and translate them into an equivalent series of inline numerical operations. But if we're going to such lengths, maybe the interpreter should translate them into vectorized operations and that's the longer term plan anyway.
Author: SteveSandersonMS
Assignees: -
Labels: `arch-wasm`, `tenet-performance`, `untriaged`, `area-Codegen-Interpreter-mono`
Milestone: -
BrzVlad commented 2 years ago

Thanks for the test case. This area has never really been addressed in the interpreter. I plan to work on improving HW intrinsics / vector operations for .NET 8.

vargaz commented 2 years ago

This probably happens because the fallback code for the vector operations is written assuming an optimizing compiler, like for `Max()`:

```csharp
for (int index = 0; index < Vector128<T>.Count; index++)
{
    T value = Scalar<T>.GreaterThan(left.GetElementUnsafe(index), right.GetElementUnsafe(index))
        ? left.GetElementUnsafe(index)
        : right.GetElementUnsafe(index);
    result.SetElementUnsafe(index, value);
}
```
tannergooding commented 2 years ago

> This probably happens because the fallback code for the vector operations is written assuming an optimizing compiler, like for Max():

It's not written assuming an optimizing compiler. It's written to ensure the code can be easily maintained.

We ultimately have 12 types to support for each operation, and we are working in a generic context. We can either have simple implementations like this, or methods that are hundreds of lines long and manually unroll everything.

Regardless of optimization support, that much code is not maintainable, adds risk and opportunities for bugs, and is not something we are going to check in. An optimizing compiler "can" do the right thing here (such as unrolling or simple inlining), but a non-optimizing compiler will have just as many (if not more) problems with the alternative larger implementation.

Ultimately, an optimizing compiler should prefer intrinsic recognition and transformation of these exact methods directly to the corresponding SIMD operation. Something like an AOT compiler should recognize the existing and fairly standardized coding patterns around generics (e.g. the typeof(T) == typeof(...) checks) and do the basic dead code optimization.
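The generic coding pattern referred to above looks roughly like the following (a simplified sketch, not the actual BCL source). Because `typeof(T) == typeof(...)` is a JIT/AOT-time constant per instantiation, a compiler doing basic dead-code elimination reduces each instantiation to a single branch-free path:

```csharp
using System;

static class GenericDispatchSketch
{
    // Simplified sketch of the standardized typeof(T)-check pattern.
    // The (TFrom)(object) casts are the usual boxing trick for generic
    // reinterpretation; optimizing compilers eliminate both the checks
    // and the boxing for each concrete T.
    public static T Max<T>(T left, T right) where T : struct
    {
        if (typeof(T) == typeof(float))
            return (float)(object)left > (float)(object)right ? left : right;
        if (typeof(T) == typeof(double))
            return (double)(object)left > (double)(object)right ? left : right;
        if (typeof(T) == typeof(int))
            return (int)(object)left > (int)(object)right ? left : right;
        // ...remaining supported primitive types...
        throw new NotSupportedException();
    }
}
```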

BrzVlad commented 1 year ago

I took a quick look at this test case. I find it hard to believe that these numbers were ever accurate for the interpreter. Currently on main, for this test case, the vectorized path is about 3-4 times slower on desktop, and normally about 4-5 times slower on wasm. However, jiterp makes the vectorized path actually 3 times faster than the non-vectorized path (I assume V8 has something to do with optimizing our bad jitted code).

We added actual Vector128 support, but some of the use cases, like floats, have no support planned yet, since they are not really used internally in the BCL.

On the interpreter, the vectorized path would get a significant boost once we implement loop unrolling, maybe for .NET 9. Keeping this open to check again afterwards.
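For context, the loop unrolling mentioned above would amount to processing several vectors per iteration with independent accumulators, roughly like this hand-written sketch (hypothetical illustration, not interpreter output; assumes the span holds at least four vectors):

```csharp
using System;
using System.Runtime.Intrinsics;

static class UnrolledMaxSketch
{
    // Hand-unrolled variant of the vectorized max loop from the test case:
    // four independent accumulators per iteration reduce per-element loop
    // overhead, which is what dominates on the interpreter.
    public static float MaxUnrolled(ReadOnlySpan<Vector128<float>> vecs)
    {
        var acc0 = vecs[0];
        var acc1 = vecs[1];
        var acc2 = vecs[2];
        var acc3 = vecs[3];

        int i = 4;
        for (; i + 3 < vecs.Length; i += 4)
        {
            acc0 = Vector128.Max(acc0, vecs[i]);
            acc1 = Vector128.Max(acc1, vecs[i + 1]);
            acc2 = Vector128.Max(acc2, vecs[i + 2]);
            acc3 = Vector128.Max(acc3, vecs[i + 3]);
        }
        for (; i < vecs.Length; i++) // remainder loop
            acc0 = Vector128.Max(acc0, vecs[i]);

        var acc = Vector128.Max(Vector128.Max(acc0, acc1), Vector128.Max(acc2, acc3));
        var max = float.MinValue;
        for (int lane = 0; lane < 4; lane++)
            max = Math.Max(max, acc.GetElement(lane));
        return max;
    }
}
```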