dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[wasm] broad regressions in vector performance under interpreter #99554

Open kg opened 6 months ago

kg commented 6 months ago

I've been unable to identify the cause so far, but https://github.com/dotnet/perf-autofiling-issues/issues/29881 shows regressions in the range of 1.1x to 1.3x across various vector microbenchmarks, along with a few algorithms that are likely being hit due to vectorization. For two examples, see https://pvscmdupload.blob.core.windows.net/reports/allTestHistory/refs/heads/main_x64_ubuntu%2022.04_CompilationMode=wasm_RunKind=micro/System.Runtime.Intrinsics.Tests.Perf_Vector128Of(UInt16).LessThanOrEqualAnyBenchmark.html (definitely this) and https://pvscmdupload.blob.core.windows.net/reports/allTestHistory/refs/heads/main_x64_ubuntu%2022.04_CompilationMode=wasm_RunKind=micro/System.Buffers.Tests.SearchValuesCharTests.IndexOfAnyExcept(Values%3a%20%22abcdefABCDEF0123456789%22).html (possibly this).

I looked into the LessThanOrEqualAny scenario to examine the IR and generated code. Jiterp seems to be doing a good job and most of the interp optimizations are working, but there's a lot of room for improvement in the code; see this gist: https://gist.github.com/kg/0514083d03ad8dce4bfdd93ecacd63a1 It contains sample code to repro it, annotated interp IR, and pseudocode for the C# that is running post-optimization.

Will try to diff the interp IR against the old pre-SSA version to see if I can spot anything, and update this issue if I do.

dotnet-policy-service[bot] commented 6 months ago

Tagging subscribers to this area: @brzvlad, @kotlarmilos
See info in area-owners.md if you want to be subscribed.

dotnet-policy-service[bot] commented 6 months ago

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

kg commented 6 months ago

For LessThanOrEqualAny, I compared the current interp IR with the IR generated at the start of the diff range. Most of the differences seem unimportant, and in general the new IR is higher quality and more efficient.

One difference I noticed that might matter is that the backbranches for the loop condition used to not contain safepoints, and now they do. Safepoints in jiterp are somewhat expensive, so that could be responsible for the regression. i.e.:

ldc.i4.4 56
...
blt.i4.s 40 56

->

blt.i4.imm.sp 22
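
For context, here is a minimal sketch of the kind of loop where such a backbranch shows up. It is a hypothetical stand-in, not the repro code from the gist: the class and method names are made up for illustration, and only `Vector128.LessThanOrEqualAny` comes from the benchmark name above. The loop condition is typically lowered to a conditional backward branch (the `blt` in the IR above), so a safepoint on that branch would be paid once per iteration.

```csharp
using System.Runtime.Intrinsics;

static class BackBranchSketch
{
    // Hypothetical helper (not from the gist): counts how many iterations
    // observe at least one lane where left <= right.
    public static int CountLessThanOrEqualAny(
        Vector128<ushort> left, Vector128<ushort> right, int iterations)
    {
        int hits = 0;

        // The `i < iterations` check is what usually becomes the backward
        // branch; a safepoint there runs on every trip around the loop.
        for (int i = 0; i < iterations; i++)
        {
            if (Vector128.LessThanOrEqualAny(left, right))
                hits++;
        }

        return hits;
    }
}
```

With a loop body this small, any fixed per-iteration cost on the backbranch would make up a large share of the measured time, which would be consistent with regressions in the 1.1x to 1.3x range.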

lewing commented 1 month ago

@kg is there work left to do here?

kg commented 1 month ago

I'm not sure whether we've made up for all the regressions in these areas.

lewing commented 1 month ago

Looking at the linked issue, most of the Vector regressions appear to be in good shape now. The regex stuff is a different story.

tannergooding commented 1 month ago

It would be nice if we could get the vector support for WASM interpreter in line with the LLVM/JIT support in .NET 10.

I'm going to continue iterating on some of it in my free time, but if we could schedule some top-down work here that'd be greatly beneficial -- CC. @jeffhandley

kg commented 1 month ago

> It would be nice if we could get the vector support for WASM interpreter in line with the LLVM/JIT support in .NET 10.
>
> I'm going to continue iterating on some of it in my free time, but if we could schedule some top-down work here that'd be greatly beneficial -- CC. @jeffhandley

For WASM that will be tricky due to the very constrained instruction set, but I'm happy to help brainstorm on what we can do there. Right now the jiterpreter and interp should both support all the operations exposed by WASM's baseline SIMD, but only via the platform-specific wasm static methods. Expanding the support to cover the whole Vector128/Vector4/etc namespace could be profitable but I'm not sure how we could do that efficiently.

tannergooding commented 1 month ago

> Expanding the support to cover the whole Vector128/Vector4/etc namespace could be profitable but I'm not sure how we could do that efficiently.

The xplat APIs are the more important ones to support, as they're the primary ones used by the BCL. We even recommend using things like x + y over PackedSimd.Add(x, y), when they are 1-to-1 equivalent in functionality.
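
To make that concrete, here is a small sketch of the two spellings for a lane-wise `int` add. The class and method names are hypothetical; the `IsSupported` guard with a fallback is only there so the sketch runs on non-WASM targets too.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Wasm;

static class AddSketch
{
    // Cross-platform form: the operator works on every target and is the
    // shape most BCL code uses.
    public static Vector128<int> AddXplat(Vector128<int> x, Vector128<int> y)
        => x + y;

    // Platform-specific form: only valid where PackedSimd is supported,
    // so it needs a guard and a fallback path.
    public static Vector128<int> AddWasmOnly(Vector128<int> x, Vector128<int> y)
        => PackedSimd.IsSupported ? PackedSimd.Add(x, y) : x + y;
}
```

Both methods should return the same lanes for the same inputs; the difference is only in which API surface the runtime has to recognize and accelerate.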

I have some ideas here based on how we do it for RyuJIT. I think we can set up something that works nicely and maximizes code sharing and I think some of that can be shared between the interpreter and Mono JIT (a lot of the code is very similar already).

We should definitely schedule a meeting to brainstorm a little after RC1 snaps and we all have a bit of time to do so.