Open SteveSandersonMS opened 2 years ago
Tagging subscribers to 'arch-wasm': @lewing See info in area-owners.md if you want to be subscribed.
| Author: | SteveSandersonMS |
|---|---|
| Assignees: | - |
| Labels: | `arch-wasm`, `tenet-performance` |
| Milestone: | - |
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to this area: @brzvlad See info in area-owners.md if you want to be subscribed.
| Author: | SteveSandersonMS |
|---|---|
| Assignees: | - |
| Labels: | `arch-wasm`, `tenet-performance`, `untriaged`, `area-Codegen-Interpreter-mono` |
| Milestone: | - |
Thanks for the testcase. This area was never really addressed in the interpreter. I plan to work on improving HW intrinsics / vector operations for .NET 8.
This probably happens because the fallback code for the vector operations is written assuming an optimizing compiler, like for Max():
for (int index = 0; index < Vector128<T>.Count; index++)
{
T value = Scalar<T>.GreaterThan(left.GetElementUnsafe(index), right.GetElementUnsafe(index)) ? left.GetElementUnsafe(index) : right.GetElementUnsafe(index);
result.SetElementUnsafe(index, value);
}
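To make that concrete, here is a minimal sketch (my illustration for float lanes, not actual compiler output) of the straight-line form an optimizing compiler effectively reduces the loop above to. The interpreter instead executes the loop test, the `Scalar<T>` call, and both element accessors on every iteration.

```csharp
using System.Runtime.Intrinsics;

static class UnrolledMaxSketch
{
    // Roughly the unrolled/inlined form an optimizing compiler produces from
    // the generic fallback loop for Vector128<float> (4 lanes), before any
    // recognition of the whole method as a single SIMD instruction.
    public static Vector128<float> Max(Vector128<float> left, Vector128<float> right)
    {
        float e0 = left.GetElement(0) > right.GetElement(0) ? left.GetElement(0) : right.GetElement(0);
        float e1 = left.GetElement(1) > right.GetElement(1) ? left.GetElement(1) : right.GetElement(1);
        float e2 = left.GetElement(2) > right.GetElement(2) ? left.GetElement(2) : right.GetElement(2);
        float e3 = left.GetElement(3) > right.GetElement(3) ? left.GetElement(3) : right.GetElement(3);
        return Vector128.Create(e0, e1, e2, e3);
    }
}
```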
> This probably happens because the fallback code for the vector operations is written assuming an optimizing compiler, like for Max():
It's not written assuming an optimizing compiler. It's written to ensure the code can be easily maintained.
We ultimately have 12 types we have to support for each operation, and we are working in a generic context. We can either have simple implementations like this, or methods that are hundreds of lines long and manually unroll everything.
Regardless of optimization support, that much code is not maintainable, adds additional risk and bug opportunities, and is not something we are going to check in. An optimizing compiler "can" do the right thing here (such as unrolling or simple inlining), but a non-optimizing compiler will have just as many (if not more) problems with the alternative, larger implementation.
Ultimately, an optimizing compiler should prefer intrinsic recognition and transformation of these exact methods directly to the corresponding SIMD operation. Something like an AOT compiler should recognize the existing and fairly standardized coding patterns around generics (e.g. the `typeof(T) == typeof(...)` checks) and do the basic dead code optimization.
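As a sketch of that pattern (simplified; not the actual BCL source), the `typeof(T) == typeof(...)` checks are constants once `T` is known, so an optimizing compiler keeps exactly one branch and inlines it, while a non-optimizing interpreter evaluates the whole chain at run time:

```csharp
using System;
using System.Runtime.CompilerServices;

static class ScalarSketch
{
    // Simplified sketch of the generic dispatch pattern; the real
    // Scalar<T>.GreaterThan covers all 12 supported primitive types.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool GreaterThan<T>(T left, T right)
    {
        // typeof(T) == typeof(float) is a compile-time constant after
        // specialization, so all but one branch is dead code to a JIT/AOT.
        if (typeof(T) == typeof(float))
            return (float)(object)left > (float)(object)right;
        if (typeof(T) == typeof(double))
            return (double)(object)left > (double)(object)right;
        if (typeof(T) == typeof(int))
            return (int)(object)left > (int)(object)right;
        // ...remaining primitive types follow the same shape...
        throw new NotSupportedException();
    }
}
```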
I took a quick look at this testcase. I find it hard to believe that these numbers were ever real for the interpreter. Currently on main, for this testcase, the vectorized path is about 3-4 times slower on desktop, and normally about 4-5 times slower on wasm. However, jiterp makes the vectorized path actually 3 times faster than the non-vectorized path (I assume v8 has something to do with optimizing our bad jitted code).
We added actual Vector128 support, but some use cases have no support planned yet, like those for floats, since they are not really used internally in the BCL.
On the interpreter, the vectorized path would get a significant boost once we implement loop unrolling, maybe for .NET 9. Keeping this open to check again afterwards.
Description
Vector operations like `Vector128.Max` are ~15x slower than manually doing the equivalent non-vector operations on each component in series. I know the interpreter does not yet support vectorization (and that it's an AOT-only feature); this is just a question about whether the interpreter should be expected to slow down so much compared with non-vectorized code on the interpreter.

Repro

- Set `Index.razor` to the code below
- Add `<WasmEnableSIMD>true</WasmEnableSIMD>` and `<RunAOTCompilation>true</RunAOTCompilation>` to the csproj (see the fragment after this list)
- Compare interpreted builds (via `dotnet run`) and AOT builds (via `dotnet publish -c Release`)
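For reference, the two properties as a csproj fragment (a minimal sketch; placement inside your existing `PropertyGroup` is assumed):

```xml
<PropertyGroup>
  <WasmEnableSIMD>true</WasmEnableSIMD>
  <RunAOTCompilation>true</RunAOTCompilation>
</PropertyGroup>
```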
Results:
It's worth pointing out that the performance with AOT is really superb and the runtime team should be very proud of this! It's within 20% of the numbers I get when running the same code in release-mode on desktop CoreCLR with JIT, which is phenomenal.
The only question is: should we really see the 6200ms number in the bottom left, or could it be closer to the 420ms in the top left?
Code:
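The original listing is only partially recoverable here: it breaks off inside `FindMax_Vectorized` at `MemoryMarshal.Cast`. Below is a sketch of the component, with the page markup and the remainder of `FindMax_Vectorized` reconstructed as assumptions (casting the array to `Vector128<float>` lanes and reducing with `Vector128.Max`):

```razor
@page "/"
@using System.Diagnostics
@using System.Runtime.InteropServices
@using System.Runtime.Intrinsics

@* Markup is assumed; the original page showed a count and a message. *@
<p>Count: @count</p>
<button @onclick="FindMax_NonVectorized">Find max (non-vectorized)</button>
<button @onclick="FindMax_Vectorized">Find max (vectorized)</button>
<p>@message</p>

@code {
    int count = 5_000_000;
    float[]? values;
    string? message;

    void CreateNumbers()
    {
        if (values?.Length == count) return;
        values = new float[count];
        var rng = new Random(0);
        for (var i = 0; i < values.Length; i++)
            values[i] = (float)(1000000 * rng.NextDouble());
    }

    void FindMax_NonVectorized()
    {
        CreateNumbers();
        var sw = new Stopwatch();
        sw.Start();
        var max = float.MinValue;
        for (var i = 0; i < values!.Length; i++)
            if (values[i] > max) max = values[i];
        sw.Stop();
        message = $"Max value is {max} in {sw.ElapsedMilliseconds:F0}ms";
    }

    void FindMax_Vectorized()
    {
        CreateNumbers();
        var sw = new Stopwatch();
        sw.Start();

        // Reconstructed from here on: reinterpret the float[] as
        // Vector128<float> lanes (count is a multiple of 4, so there
        // is no scalar remainder to handle).
        var values_vec = MemoryMarshal.Cast<float, Vector128<float>>(values!);
        var max_vec = Vector128.Create(float.MinValue);
        for (var i = 0; i < values_vec.Length; i++)
            max_vec = Vector128.Max(max_vec, values_vec[i]);

        // Reduce the four lanes to a single scalar max.
        var max = float.MinValue;
        for (var i = 0; i < Vector128<float>.Count; i++)
            if (max_vec.GetElement(i) > max) max = max_vec.GetElement(i);

        sw.Stop();
        message = $"Max value is {max} in {sw.ElapsedMilliseconds:F0}ms";
    }
}
```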
Analysis
I guess this could be understood as "wow, look, the interpreter has great optimizations for pure numeric operations", and that the additional cost of calling `Vector.Max` etc. (versus a max over each component in series) is simply the cost of doing a .NET method call.

If that's the case, maybe there isn't much to be done about it. Theoretically I guess it would be possible for the interpreter to recognize `Vector.*` calls and translate them into an equivalent series of inline numerical operations. But if we're going to such lengths, maybe the interpreter should translate them into vectorized operations, and that's the longer-term plan anyway.