dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

SSE/AVX byte multiplication could be improved #109775

Open IJzerbaard opened 2 days ago

IJzerbaard commented 2 days ago

SSE/AVX don't have a built-in multiplication of bytes, so some trick is needed. .NET 9 uses a strategy like this, based on widening the vector to twice the width, using 16-bit multiplication, and then narrowing the result:

   vpmovzxbw ymm0, qword ptr [rdx]
   vpmovzxbw ymm1, qword ptr [r8]
   vpmullw  ymm0, ymm0, ymm1
   vpand    ymm0, ymm0, ymmword ptr [reloc @RWD00]
   vpackuswb ymm0, ymm0, ymm0
   vpermq   ymm0, ymm0, -40

This isn't bad, but we can do a little better. The central issue in the code above is that shuffles are expensive and there are four of them (vpmovzxbw × 2, vpackuswb, vpermq). On, for example, Intel Skylake through Rocket Lake, those four shuffles all compete for port 5 and limit the throughput to at best one of these multiplications per 4 cycles.
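
Roughly, the current approach corresponds to the following C# (a functional sketch using the public Vector128 API, not the JIT's actual expansion; the type and method names are illustrative, and with AVX2 the JIT widens the whole vector into a single YMM register rather than two halves):

```csharp
using System.Runtime.Intrinsics;

static class ByteMulCurrentSketch
{
    public static Vector128<byte> MultiplyWidenNarrow(Vector128<byte> a, Vector128<byte> b)
    {
        // widen each operand to 16-bit lanes
        (Vector128<ushort> aLo, Vector128<ushort> aHi) = Vector128.Widen(a);
        (Vector128<ushort> bLo, Vector128<ushort> bHi) = Vector128.Widen(b);

        // multiply in 16-bit lanes, then narrow back to bytes (keeping the low byte of each product)
        return Vector128.Narrow(aLo * bLo, aHi * bHi);
    }
}
```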

I suggest two alternative strategies.

Strategy 1

The first uses an odd/even split and two vpmullw, based on (with modifications) this Q&A on Stack Overflow. In C++ it would look like this:

    __m128i mullo_epi8(__m128i a, __m128i b)
    {
        // unpack and multiply
        __m128i dst_even = _mm_mullo_epi16(a, b);
        __m128i dst_odd = _mm_mullo_epi16(_mm_srli_epi16(a, 8),_mm_srli_epi16(b, 8));
        // repack
        return _mm_or_si128(_mm_slli_epi16(dst_odd, 8), _mm_and_si128(dst_even, _mm_set1_epi16(0xFF)));
    }

(the original answer tries to avoid the _mm_set1_epi16 when AVX2's VPBROADCASTW is unavailable, but it would turn into a vector constant anyway, so that is not important)

By avoiding pmovzxbw this strategy is also applicable to SSE2 rather than requiring SSE4.1, and it's not a bad strategy for the higher x64 ISA levels either.
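
In C#, the same odd/even split can be sketched with the cross-platform Vector128 API (an illustrative sketch only, with hypothetical names; the mask and shifts mirror the C++ above):

```csharp
using System.Runtime.Intrinsics;

static class ByteMulOddEvenSketch
{
    public static Vector128<byte> MulOddEven(Vector128<byte> a, Vector128<byte> b)
    {
        Vector128<ushort> a16 = a.AsUInt16();
        Vector128<ushort> b16 = b.AsUInt16();

        // even-indexed bytes: multiply in 16-bit lanes, keep only the low byte of each product
        Vector128<ushort> even = (a16 * b16) & Vector128.Create((ushort)0xFF);

        // odd-indexed bytes: shift them down, multiply, then shift the low byte of the product back up
        Vector128<ushort> odd = ((a16 >> 8) * (b16 >> 8)) << 8;

        return (even | odd).AsByte();
    }
}
```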

In a situation comparable to the cited "current implementation" (loading a and b from memory and ignoring for now where the result goes):

    vmovdqa xmm0, xmmword ptr [rdi]
    vmovdqa xmm1, xmmword ptr [rsi]
    vpmullw xmm2, xmm1, xmm0
    vpsrlw  xmm0, xmm0, 8
    vpsrlw  xmm1, xmm1, 8
    vpmullw xmm0, xmm1, xmm0
    vpsllw  xmm0, xmm0, 8
    vpand   xmm1, xmm2, xmmword ptr [rip + .LCPI1_0]
    vpor    xmm0, xmm1, xmm0

This is more instructions, but they are cheaper instructions. On Skylake or Rocket Lake, UICA estimates a throughput for the above sequence (in a loop) of once per 2.5 cycles (the bottleneck spread evenly across p0 and p1), compared to once per 4 cycles for the original (bottlenecked on p5 by the shuffles).

Strategy 2

The second strategy is based on pmaddubsw (requiring SSSE3). pmaddubsw multiplies unsigned bytes by signed bytes, which is odd but turns out to be harmless (the lowest byte of the product is not affected, and the upper byte will be discarded). It also horizontally adds adjacent pairs, but that can be made harmless by zeroing half of the pairs. The saturation does not come into play in that case either: neither 255 * -128 = -32640 nor 255 * 127 = 32385 saturates by itself, and the other element in each pair is zero.

    __m128i mullo_epi8(__m128i a, __m128i b)
    {
        __m128i dst_even = _mm_maddubs_epi16(a, _mm_and_si128(b, _mm_set1_epi16(0xFF)));
        __m128i dst_odd = _mm_slli_epi16(_mm_maddubs_epi16(a, _mm_andnot_si128(_mm_set1_epi16(0xFF), b)), 8);
        return _mm_or_si128(_mm_and_si128(dst_even, _mm_set1_epi16(0xFF)), dst_odd);
    }

Representative code (the andnot was compiled into an and with an inverted mask; for older processors it may be better to reduce the number of loads):

    vmovdqa xmm0, xmmword ptr [rdi]
    vmovdqa xmm1, xmmword ptr [rsi]
    vmovdqa xmm2, xmmword ptr [rip + .LCPI2_0]
    vpand   xmm3, xmm1, xmm2
    vpmaddubsw      xmm3, xmm0, xmm3
    vpand   xmm1, xmm1, xmmword ptr [rip + .LCPI2_1]
    vpmaddubsw      xmm0, xmm0, xmm1
    vpsllw  xmm0, xmm0, 8
    vpand   xmm1, xmm3, xmm2
    vpor    xmm0, xmm1, xmm0

For this code (when put in a loop), UICA estimates a slightly worse throughput of once per 2.75 cycles on Skylake (because it executes from the uop cache rather than the loop stream buffer; I don't know why that happens here), but a slightly better throughput of once per 2.33 cycles on Rocket Lake.
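
In C#, the same approach can be sketched with the cross-platform intrinsics (Ssse3.MultiplyAddAdjacent maps to pmaddubsw; this is an illustrative sketch with hypothetical names, not the exact codegen):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class ByteMulMaddubsSketch
{
    public static Vector128<byte> MulMaddubs(Vector128<byte> a, Vector128<byte> b)
    {
        Vector128<ushort> lowByteMask = Vector128.Create((ushort)0xFF);

        // pmaddubsw with the odd bytes of b zeroed: each 16-bit lane holds a_even * b_even
        Vector128<short> even = Ssse3.MultiplyAddAdjacent(a, (b.AsUInt16() & lowByteMask).AsSByte());

        // pmaddubsw with the even bytes of b zeroed: each 16-bit lane holds a_odd * b_odd
        Vector128<short> odd = Ssse3.MultiplyAddAdjacent(a, (b.AsUInt16() & ~lowByteMask).AsSByte());

        // keep the low byte of the even products; move the low byte of the odd products into the high byte
        return ((even & lowByteMask.AsInt16()) | (odd << 8)).AsByte();
    }
}
```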

This strategy is essentially what Clang 19 does when autovectorizing a byte multiplication; earlier versions of Clang did it differently (with more shuffles).

I could not evaluate either of these strategies (nor the original) on AMD Zen since I don't have access to one and UICA does not include it.

dotnet-policy-service[bot] commented 2 days ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

huoyaoyuan commented 2 days ago

Simple unrolled benchmark on Raptor Lake:

| Method         | Mean     | Error   | StdDev  | Code Size |
|--------------- |---------:|--------:|--------:|----------:|
| SimpleMultiply | 739.7 ns | 8.74 ns | 8.18 ns |     340 B |
| Stratergy1     | 577.5 ns | 2.70 ns | 2.40 ns |     393 B |
| Stratergy2     | 584.8 ns | 3.84 ns | 3.59 ns |     413 B |

Code:

```csharp
private Vector128<byte>[] a = new Vector128<byte>[1000], b = new Vector128<byte>[1000];

[Benchmark]
public Vector128<byte> SimpleMultiply()
{
    Vector128<byte> result = default;
    for (int i = 0; i < a.Length; i += 4)
    {
        result ^= SimpleMultiply(a[i], b[i]);
        result ^= SimpleMultiply(a[i + 1], b[i + 1]);
        result ^= SimpleMultiply(a[i + 2], b[i + 2]);
        result ^= SimpleMultiply(a[i + 3], b[i + 3]);
    }
    return result;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector128<byte> SimpleMultiply(Vector128<byte> left, Vector128<byte> right) => left * right;

[Benchmark]
public Vector128<byte> Stratergy1()
{
    Vector128<byte> result = default;
    for (int i = 0; i < a.Length; i += 4)
    {
        result ^= Stratergy1(a[i], b[i]);
        result ^= Stratergy1(a[i + 1], b[i + 1]);
        result ^= Stratergy1(a[i + 2], b[i + 2]);
        result ^= Stratergy1(a[i + 3], b[i + 3]);
    }
    return result;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector128<byte> Stratergy1(Vector128<byte> left, Vector128<byte> right)
{
    Vector128<ushort> dst_even = left.AsUInt16() * right.AsUInt16();
    Vector128<ushort> dst_odd = (left.AsUInt16() >> 8) * (right.AsUInt16() >> 8);
    return ((dst_odd << 8) | (dst_even & Vector128.Create<ushort>(0xFF))).AsByte();
}

[Benchmark]
public Vector128<byte> Stratergy2()
{
    Vector128<byte> result = default;
    for (int i = 0; i < a.Length; i += 4)
    {
        result ^= Stratergy2(a[i], b[i]);
        result ^= Stratergy2(a[i + 1], b[i + 1]);
        result ^= Stratergy2(a[i + 2], b[i + 2]);
        result ^= Stratergy2(a[i + 3], b[i + 3]);
    }
    return result;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector128<byte> Stratergy2(Vector128<byte> left, Vector128<byte> right)
{
    Vector128<short> dst_even = Ssse3.MultiplyAddAdjacent(left, (right.AsUInt16() & Vector128.Create<ushort>(0xFF)).AsSByte());
    Vector128<short> dst_odd = Ssse3.MultiplyAddAdjacent(left, Vector128.AndNot(Vector128.Create<ushort>(0xFF), right.AsUInt16()).AsSByte()) << 8;
    return ((dst_odd << 8) | (dst_even & Vector128.Create<short>(0xFF))).AsByte();
}
```

tannergooding commented 2 days ago

There are a lot of broader considerations than simply whatever is fastest in an isolated benchmark.

Other considerations include factoring in a broad range of hardware, how the code impacts the loops (alignment, micro-op cache size, etc), simplicity, what other code exists around the code in question, etc.

There are also some notable differences here depending on whether the code is XMM or YMM based.


For example, on Intel vpmovzxbw takes 1 cycle for XMM but 3 for YMM, and while only port 5 is available pre-IceLake, on IceLake or later XMM has port 1 or 5 available. Alder Lake-E similarly reduced the cost of YMM to 1 cycle. AMD Zen 2 and later likewise has 2 ports available (1 or 2) and takes 1 cycle for both XMM and YMM (Zen 1 is similar to IceLake).

vpmullw takes 5 cycles and has 2 ports available (0 or 1). It improved to 3-4 cycles on Alder Lake-E. On AMD Zen, it is 3 cycles and has a single port available (0), but this increased to 2 ports on Zen 3.

vpand takes 1 cycle and has 3 ports (0, 1, or 5) on Intel. On Zen it similarly takes 1 cycle and has 4 ports available (0, 1, 2, 3).

vpackuswb is 1 cycle on Skylake but slowed to 3 cycles on IceLake+, only ever having port 5 available. On Zen it takes 1 cycle and has 2 ports available (1 or 2).

vpermq takes 3 cycles and is only available on port 5. On Zen it is 4 cycles and has 2 ports (1 or 2).

For the other strategies, vpor has the same throughput as vpand, while vpsllw and vpsrlw take 1 cycle and have 2 ports available (0 or 1 on Intel, 1 or 2 on AMD). Finally, vpmaddubsw matches the cost of vpmullw.


Given the above, the raw throughput potential (not factoring in memory accesses) is roughly:

| Strategy | AMD | Intel                    |
|--------- |-----|--------------------------|
| Current  | 11  | 11-12 (XMM), 14-18 (YMM) |
| 1        | 15  | 15                       |
| 2        | 15  | 15                       |

The cost of a load is then approximately 3-7 cycles in typical cases and can impact normal pipelining and other considerations. Strategy 2 needs four loads (two operands plus two constants), so it adds roughly 12-28 cycles, while the current strategy and strategy 1 need three loads each and add roughly 9-21 cycles.

And despite pipelining being possible in some cases, the practical reality is that there are several longer dependency chains that will prevent the code from executing in parallel. So while strategy 1 can theoretically get the two vpmullw dispatched with a 1 cycle delay from each other (causing them to take 6 cycles, not 10), this isn't a guarantee and may be infeasible in many scenarios.

So what we're left with is that strategy 1 is theoretically the same speed as the current strategy for XMM, but potentially slower while taking up more uop cache, producing less dense code, etc. For YMM it is likely faster on older hardware and hits the same considerations as XMM on newer hardware.

And while strategy 2 is theoretically the same speed as strategy 1, in practice it will always end up worse due to the necessary memory accesses and more competition for ports.

In the end, I don't think it's worth making the current pattern, which is fairly idiomatic, more complex for a theoretical 20% perf increase that is unlikely to pan out in many real-world scenarios, especially on newer hardware.

tannergooding commented 2 days ago

As an example, here is the same benchmark @huoyaoyuan gave above running on some more hardware:


BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2314)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

| Method         | Mean     | Error     | StdDev    |
|--------------- |---------:|----------:|----------:|
| SimpleMultiply | 1.006 us | 0.0067 us | 0.0052 us |
| Stratergy1     | 1.142 us | 0.0159 us | 0.0149 us |
| Stratergy2     | 1.215 us | 0.0074 us | 0.0062 us |

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4460/23H2/2023Update/SunValley3)
11th Gen Intel Core i9-11900H 2.50GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

| Method         | Mean     | Error    | StdDev   |
|--------------- |---------:|---------:|---------:|
| SimpleMultiply | 792.1 ns | 15.77 ns | 19.94 ns |
| Stratergy1     | 859.9 ns | 17.22 ns | 20.50 ns |
| Stratergy2     | 922.2 ns | 17.64 ns | 18.11 ns |

-- Will run on my AMD boxes when I get home tonight, and various other older hardware I have as well.

IJzerbaard commented 2 days ago

OK, those benchmark results are damning, but while I agree that this is context and hardware dependent, I cannot agree with the details of that analysis. "Throughput potential" is not a familiar measure to me: it's not the reciprocal throughput, because that would have been lower, and it's not latency, because that also would have been lower (and probably shouldn't be the main target). I would suggest considering the execution port pressures on some µarchs.

If we're considering a variety of hypothetical contexts, the fact that strategies 1 and 2 have µops that spread better over the execution ports (as opposed to piling up on p5, which is bad news in contexts where p5 is already a problem, a common situation) should be a good thing. The extra code size is of course a downside.

Personally I consider an odd/even split more idiomatic than widening/narrowing (which has traditionally always been something to avoid) but that's more subjective.

tannergooding commented 2 days ago

> it's not latency because that also would have been lower

The numbers given there are raw latency, not factoring in pipelining, decoding, or other considerations. For example, the current strategy (not factoring in memory loads) is 1+1+5+1+1+3 on Skylake (12 cycles) for XMM and 3+3+5+1+1+3 (16) for YMM. On IceLake YMM is 3+3+5+1+3+3 (18).

For strategy 1, you similarly have 5+1+1+5+1+1+1 (15). With pipelining, you can theoretically dispatch the first vpmullw on port 0 and a vpsrlw on port 1 at the same time, then wait a cycle and reuse port 1 for the second vpsrlw, then wait a cycle and dispatch the second vpmullw. You then need to wait 5 cycles before the vpsllw can execute (by which point the first vpmullw is done). This gives you potentially 1+1+5+1+1+1, getting you to 10 cycles in the best case.

There is, however, the cost of the loads as well and those are also impactful. In strategy 1, the first vpmullw is dependent on 2 loads being completed before any work can start. These two loads are likely pipelined but still have a 3-7 cycle latency, so we have something like 5+1+1+5+1+5+1 (19) practical cycles.

For the current strategy, by comparison, we have 5+5+5+1+1+3 (20) practical cycles, with newer hardware able to pipeline the two vpmovzxbw with embedded loads, giving us 15 practical cycles instead.

This then repros in real-world benchmarks between the two when you factor in a broad range of hardware. On pre-IceLake hardware you're going to see strategy 1 being 5-20% faster, depending on several factors, while on IceLake and later hardware you're going to find almost the inverse, with the current strategy being 5-20% faster.

Beyond the raw numbers, there is also the real impact to the instruction decoder, micro-op cache, how it impacts code windows for loops or other tight bodies of code, and how it impacts surrounding code. While saturating the shuffle port isn't ideal (particularly on older hardware where only 1 exists), saturating the arithmetic ports (where there are typically 2) is also bad.

When you factor in all this information across a broad range of scenarios, the current strategy ends up winning out.

> (as opposed to piling up on p5 - bad news for contexts in which p5 is already a problem, which is common) should be a good thing. The extra code size is of course a down side.

There isn't really piling up here. The bottleneck only comes when there is pipelining potential, which means you need two non-dependent shuffles that are within the same decode window, without any code between them that would block reordering. This is highly unlikely to occur in practice for any of the 3 strategies given. The more practical case for pipelining will come from the arithmetic ports, which the alternative strategies would saturate instead; and their increased code size reduces the chance that surrounding code will live in a window that allows it.

> Personally I consider an odd/even split more idiomatic than widening/narrowing (which has traditionally always been something to avoid) but that's more subjective.

Shuffling is a fundamental part of SIMD, and it's why newer CPUs have been increasing the number of shuffle ports available. The even/odd split has slowly been falling out of favor due to how it works with masking, embedded loads, and other scenarios; correspondingly, many of the "built-in" even/odd and pairwise instructions were never included in the upgrade to EVEX and don't have ZMM support, so you have to emulate them (often with shuffling) in those cases.

saucecontrol commented 2 days ago

More benchmark results:

Zen 5

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2314)
Unknown processor
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

| Method         | Mean     | Error   | StdDev  |
|--------------- |---------:|--------:|--------:|
| SimpleMultiply | 435.8 ns | 0.17 ns | 0.16 ns |
| Stratergy1     | 425.1 ns | 0.36 ns | 0.30 ns |
| Stratergy2     | 457.1 ns | 0.60 ns | 0.47 ns |

Zen 4

BenchmarkDotNet v0.14.0, Windows 10 (10.0.20348.2849)
AMD Ryzen 7 PRO 8700GE w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

| Method         | Mean     | Error   | StdDev  |
|--------------- |---------:|--------:|--------:|
| SimpleMultiply | 442.0 ns | 0.61 ns | 0.57 ns |
| Stratergy1     | 564.2 ns | 0.38 ns | 0.35 ns |
| Stratergy2     | 591.3 ns | 0.39 ns | 0.37 ns |

Meteor Lake

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4317/23H2/2023Update/SunValley3)
Intel Core Ultra 7 155H, 1 CPU, 22 logical and 16 physical cores
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2

| Method         | Mean       | Error    | StdDev   |
|--------------- |-----------:|---------:|---------:|
| SimpleMultiply | 1,022.6 ns | 19.96 ns | 20.49 ns |
| Stratergy1     |   829.8 ns | 13.59 ns | 12.71 ns |
| Stratergy2     |   846.9 ns | 16.74 ns | 17.19 ns |

saucecontrol commented 2 days ago

Worth noting, the byte multiply codegen is different for AVX-512:

vpmovzxbw ymm1, xmmword ptr [r11+rdi+0x10]
vpmovzxbw ymm2, xmmword ptr [r10+rdi+0x10]
vpmullw  ymm1, ymm2, ymm1
vpmovwb  ymm1, ymm1

Similarly, the and+or combo in the alternate versions is combined into a vpternlog on AVX-512.

Zen 5 results with only AVX2:

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2314)
Unknown processor
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2

| Method         | Mean     | Error   | StdDev  |
|--------------- |---------:|--------:|--------:|
| SimpleMultiply | 447.6 ns | 0.46 ns | 0.41 ns |
| Stratergy1     | 447.4 ns | 0.86 ns | 0.81 ns |
| Stratergy2     | 496.3 ns | 1.73 ns | 1.54 ns |