dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Roadmap information on System.Numerics AVX-512 vectorization needed #92189

Closed. BeechHedge closed this issue 1 year ago.

BeechHedge commented 1 year ago

This is a suggestion. The communication around .NET 8 suggested that the runtime would become AVX-512 enabled, so I expected System.Numerics to be AVX-512 accelerated as well. However, experiments with .NET 8 RC1 show that this is not (yet?) the case. See: https://stackoverflow.com/questions/77118399/will-c-sharp-system-numerics-namespace-run-on-avx-512-in-net8-or-in-the-near-fu Enabling 8-fold vectorization would entail maintenance in our current code base, which tacitly assumes that vectorization is always 2-fold or 4-fold. It would therefore be much appreciated if the .NET team could publish roadmap information on this topic, so that we can schedule the necessary maintenance.
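
To illustrate the maintenance concern: code that hardcodes a 4-element (or 2-element) width breaks when `Vector<T>` widens. A width-agnostic loop queries `Vector<T>.Count` instead. A minimal sketch (the helper name and shape are illustrative, not from this issue):

```csharp
using System;
using System.Numerics;

static void Scale(Span<float> data, float factor)
{
    int i = 0;

    // Vector<float>.Count is 4 with SSE, 8 with AVX2, and would be 16 if
    // Vector<T> were ever widened to 512 bits -- so never hardcode it.
    for (; i <= data.Length - Vector<float>.Count; i += Vector<float>.Count)
    {
        Vector<float> v = new Vector<float>(data.Slice(i)) * factor;
        v.CopyTo(data.Slice(i));
    }

    // Scalar cleanup for the remaining elements.
    for (; i < data.Length; i++)
        data[i] *= factor;
}
```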

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-numerics. See info in area-owners.md if you want to be subscribed.

stephentoub commented 1 year ago

You can use the fixed-width Vector512<T> in .NET 8 (alongside the previously introduced Vector128<T> and Vector256<T>). In .NET 8, the variable-width Vector<T> will not automatically support widths greater than 256 bits. It's likely that in .NET 9 you'll be able to opt in to that, but at present it's not clear whether it will be enabled by default, in part because of breaking-change concerns (code that took an implicit dependency on the max width) and in part because it would be a deoptimization for code whose lengths would have been accelerated at 256 bits but are too short to benefit at 512.
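
For reference, a minimal sketch of using the fixed-width type directly in .NET 8, with a scalar fallback (the helper is illustrative, not from this thread):

```csharp
using System;
using System.Runtime.Intrinsics;

static void Add(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> dest)
{
    int i = 0;

    // Take the 512-bit path only where the hardware actually accelerates it.
    if (Vector512.IsHardwareAccelerated)
    {
        for (; i <= a.Length - Vector512<float>.Count; i += Vector512<float>.Count)
        {
            Vector512<float> v = Vector512.Create(a.Slice(i)) + Vector512.Create(b.Slice(i));
            v.CopyTo(dest.Slice(i));
        }
    }

    // Scalar remainder, and the whole loop when 512-bit isn't accelerated.
    for (; i < a.Length; i++)
        dest[i] = a[i] + b[i];
}
```

The same pattern can cascade through `Vector256`/`Vector128` checks before falling back to scalar code.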

BeechHedge commented 1 year ago

@stephentoub, that's clear. Thank you very much for the information!

pcordes commented 1 year ago

Related: C compilers like GCC and Clang default to -mprefer-vector-width=256 even when tuning for recent Intel CPUs like -mtune=icelake-server. But that's for auto-vectorization, which can include cold loops, unlike this case, where only manually vectorized code is affected.

But part of the reason applies even for some relatively "hot" code: the Stack Overflow Q&A SIMD instructions lowering CPU frequency has some of the details on how max turbo frequency can be limited. (Or CPUs might need to raise the voltage at the same frequency, after the usual period of limited throughput, even if, as on Ice Lake / Rocket Lake client CPUs, the max-turbo difference is often much smaller and sometimes nonexistent.) This was discussed on https://reviews.llvm.org/D111029, including Intel testing results which found that clang auto-vectorization of SPEC2017 actually got a 1% slowdown with -mprefer-vector-width=512 vs. 256. But again, that's auto-vectorization of scalar code, unlike C#, where this would only affect manually vectorized loops.

In a program that frequently wakes up for short bursts of computation, its AVX-512 usage will still lower turbo frequency for the core, affecting other programs.

Plus, for some code the gains aren't 2x since having any 512-bit uops in flight means the vector ALUs on port 1 are shut down (or are used as part of the 512-bit ALUs on port 0). Also, penalties for misaligned data are worse with 64-byte loads/stores on Intel. Even when the in/out streams miss in cache all the way to DRAM, misaligned is slower than aligned by maybe 15% with 512-bit vectors. (And I suspect that means 512-bit vectors are slower than 256, since 256-bit vectors should be pretty much maxing out single-core memory bandwidth.)

It certainly can be profitable to vectorize with 512-bit vectors in programs that spend a lot of their time running SIMD code, especially on CPUs with a second 512-bit FMA unit (like Xeon Gold / Platinum and some lower-end CPUs, but not Ice Lake "client" laptop CPUs), or for non-FP workloads where there's a big speedup. And the 256-bit versions of new AVX-512 instructions (like vpternlogd for bitwise booleans, or vpopcntq) can be very good for some problems, giving great speedups with no need for 512-bit vector width.
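
To make that last point concrete, here is a hedged sketch of using the 256-bit form of vpternlogd from C#, assuming .NET 8's `Avx512F.VL` intrinsics (the helper and the chosen boolean function are illustrative):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static Vector256<uint> XorAnd(Vector256<uint> a, Vector256<uint> b, Vector256<uint> c)
{
    // One vpternlogd can compute any 3-input boolean function; the
    // truth-table immediate 0x78 encodes f(a, b, c) = a ^ (b & c).
    if (Avx512F.VL.IsSupported)
        return Avx512F.VL.TernaryLogic(a, b, c, 0x78);

    // Plain AVX2 fallback: two instructions instead of one.
    return a ^ (b & c);
}
```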


Things are different on Zen 4: it handles 512-bit vectors by taking extra cycles in the execution units, so as long as 512-bit vectors don't require extra shuffling work or some other overhead, they're a good win for front-end throughput and for how far ahead out-of-order exec can see in terms of elements or scalar iterations (since a 512-bit uop is still only one uop for the front-end). GCC and Clang default to -mprefer-vector-width=512 for -march=znver4.

There's no turbo penalty or other inherent downsides to 512-bit vectors on Zen 4 (AFAIK; I don't know how misaligned loads perform). It's just a matter of whether software can use them efficiently (without needing more bloated code for loop prologues / epilogues, e.g. scalar cleanup if a masked final iteration doesn't Just Work.) AVX-512 masked stores are efficient on Zen 4, despite the fact that AVX1/2 vmaskmovps / vpmaskmovd aren't. (https://uops.info/)

For code where you have exactly 32 bytes of something, if 32-byte vectors are no longer an option then that's a loss; C#'s scalable vector-length model isn't ideal for those cases. ARM SVE and the RISC-V Vector extension are ISAs designed around a variable vector length, with masking to handle data shorter than the hardware's native length, but doing the same thing for C#'s Vector<T> probably wouldn't work well because lots of hardware (x86 with AVX2, or AArch64 without SVE) can't efficiently support masking for arbitrary-length data.
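
On hardware without cheap masking, one common way to avoid a scalar epilogue is a final overlapping vector, recomputing a few elements rather than looping over them one at a time. A hedged sketch with 256-bit vectors (illustrative, not from this thread):

```csharp
using System;
using System.Runtime.Intrinsics;

// Scales src into dst. Requires src.Length >= Vector256<float>.Count and
// dst not aliasing src, because the tail recomputes overlapping elements.
static void Scale(ReadOnlySpan<float> src, Span<float> dst, float factor)
{
    Vector256<float> f = Vector256.Create(factor);
    int step = Vector256<float>.Count;
    int i = 0;

    for (; i <= src.Length - step; i += step)
        (Vector256.Create(src.Slice(i)) * f).CopyTo(dst.Slice(i));

    if (i != src.Length)
    {
        // One overlapping vector covering the last 'step' elements; the
        // overlap is harmless because each output depends only on its input.
        int last = src.Length - step;
        (Vector256.Create(src.Slice(last)) * f).CopyTo(dst.Slice(last));
    }
}
```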

BeechHedge commented 1 year ago

@pcordes, thank you for the detailed information on SIMD performance. From stephentoub's answer it is clear that we will (likely) have several options once .NET 9 is ready. From your comment it is clear that we should carefully run performance experiments before choosing which way to go.

gfoidl commented 1 year ago

https://github.com/dotnet/runtime/pull/85551 adds an (undocumented) config switch that allows setting the bit width of Vector<T>.

stephentoub commented 1 year ago

> config switch that allows setting the bit width

But currently not beyond 256.
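
For anyone experimenting with that switch, the effective width is easy to confirm at runtime. (The `DOTNET_MaxVectorTBitWidth` variable name below is my reading of the linked PR; treat it as an assumption and verify against #85551.)

```csharp
// Run with e.g.: DOTNET_MaxVectorTBitWidth=256 dotnet run
// (the env var name is an assumption; see dotnet/runtime#85551 for the knob)
using System;
using System.Numerics;

Console.WriteLine($"Vector<T> width: {Vector<byte>.Count * 8} bits");
Console.WriteLine($"Hardware accelerated: {Vector.IsHardwareAccelerated}");
```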