IJzerbaard opened 2 days ago
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
Simple unrolled benchmark on Raptor Lake:
Method | Mean | Error | StdDev | Code Size |
---|---|---|---|---|
SimpleMultiply | 739.7 ns | 8.74 ns | 8.18 ns | 340 B |
Stratergy1 | 577.5 ns | 2.70 ns | 2.40 ns | 393 B |
Stratergy2 | 584.8 ns | 3.84 ns | 3.59 ns | 413 B |
Code:
There are a lot of broader considerations than simply whatever is fastest in an isolated benchmark.
Other considerations include factoring in a broad range of hardware, how the code impacts the loops (alignment, micro-op cache size, etc), simplicity, what other code exists around the code in question, etc.
There are also some notable differences here based on whether it's XMM or YMM based. For example:

- `vpmovzxbw` on Intel takes 1 cycle for XMM but 3 for YMM, and while only port 5 is available pre-IceLake, on IceLake or later XMM has port 1 or 5 available. Alder Lake-E reduced the cost of YMM to 1 cycle as well. AMD Zen2 and later similarly have 2 ports available (1 or 2) and take 1 cycle for both XMM and YMM (Zen1 is similar to IceLake).
- `vpmullw` takes 5 cycles and has 2 ports available (0 or 1). It improved to 3-4 cycles on Alder Lake-E. On AMD Zen it is 3 cycles with a single port available (0), but this increased to 2 ports on Zen3.
- `vpand` takes 1 cycle and has 3 ports (0, 1, or 5) on Intel. On Zen it similarly takes 1 cycle and has 4 ports available (0, 1, 2, or 3).
- `vpackuswb` is 1 cycle on Skylake but slowed to 3 cycles on IceLake and later, only ever having port 5 available. On Zen it takes 1 cycle and has 2 ports available (1 or 2).
- `vpermq` takes 3 cycles and is only available on port 5. On Zen it is 4 cycles and has 2 ports (1 or 2).

From the other strategies, `vpor` has the same throughput as `vpand`, while `vpsllw` and `vpsrlw` take 1 cycle and have 2 ports available (0 or 1 on Intel, 1 or 2 on AMD). Finally, `vpmaddubsw` matches the cost of `vpmullw`.
Given the above, the raw throughput potential (not factoring in memory accesses) is roughly:
Strategy | AMD | Intel |
---|---|---|
Current | 11 | 11-12 (XMM), 14-18 (YMM) |
1 | 15 | 15 |
2 | 15 | 15 |
The cost of a load is then approximately 3-7 cycles in typical cases and can impact normal pipelining and other considerations. Strategy 2 needs four loads, so it adds roughly 12-28 cycles, while the current strategy and strategy 1 need three loads and add roughly 9-21 cycles.
And despite pipelining being possible in some cases, the practical reality is that there are several longer dependency chains that will prevent code from executing in parallel. So while theoretically strategy 1 can get the two `vpmullw` dispatched with a 1 cycle delay from each other (causing them to take 6 cycles, not 10), this isn't a guarantee and may be infeasible in many scenarios.
So what we're left with is that strategy 1 is theoretically the same speed as the current strategy for XMM, but potentially slower while taking up more uop cache, producing less dense code, etc. For YMM it's likely faster on older hardware and hits the same considerations as XMM on newer hardware.
And while strategy 2 is theoretically the same speed as strategy 1, in practice it will always end up worse due to the necessary memory accesses and greater competition for ports.
In the end, I don't think it's worth making the current pattern, which is fairly idiomatic, more complex for a theoretical 20% perf increase that is unlikely to pan out in many real-world scenarios, especially on newer hardware.
As an example, here is the same benchmark @huoyaoyuan gave above running on some more hardware:
BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2314)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100
[Host] : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Method | Mean | Error | StdDev |
---|---|---|---|
SimpleMultiply | 1.006 us | 0.0067 us | 0.0052 us |
Stratergy1 | 1.142 us | 0.0159 us | 0.0149 us |
Stratergy2 | 1.215 us | 0.0074 us | 0.0062 us |
BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4460/23H2/2023Update/SunValley3)
11th Gen Intel Core i9-11900H 2.50GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100
[Host] : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Method | Mean | Error | StdDev |
---|---|---|---|
SimpleMultiply | 792.1 ns | 15.77 ns | 19.94 ns |
Stratergy1 | 859.9 ns | 17.22 ns | 20.50 ns |
Stratergy2 | 922.2 ns | 17.64 ns | 18.11 ns |
Will run on my AMD boxes when I get home tonight, and various other older hardware I have as well.
OK, those benchmark results are damning, but while I agree that this is context and hardware dependent, I cannot agree with the details of that analysis. "Throughput potential" is not a familiar measure to me; it's not the reciprocal throughput because that would have been lower; it's not latency because that also would have been lower (and probably shouldn't be the main target). I would suggest considering the execution port pressures on some µarchs.
If we're considering a variety of hypothetical contexts, strategies 1 and 2 having µops that spread better over the execution ports (as opposed to piling up on p5 - bad news for contexts in which p5 is already a problem, which is common) should be a good thing. The extra code size is of course a downside.
Personally I consider an odd/even split more idiomatic than widening/narrowing (which has traditionally always been something to avoid) but that's more subjective.
> it's not latency because that also would have been lower
The numbers given there are raw latency, not factoring in pipelining, decoding, or other considerations. For example, the current strategy (not factoring in memory loads) is `1+1+5+1+1+3` on Skylake (12 cycles) for XMM and `3+3+5+1+1+3` (15) for YMM. On IceLake, YMM is `3+3+5+1+3+3` (18).
For strategy 1, you similarly have `5+1+1+5+1+1+1` (15). With pipelining, you can theoretically dispatch `vpmullw` on port 0 and `vpsrlw` on port 1 at the same time, then wait a cycle and reuse port 1 for the second `vpsrlw`, then wait a cycle and dispatch the second `vpmullw`. You then need to wait 5 cycles before the `vpsllw` can execute (by which point the first `vpmullw` is done). This gives you potentially `1+1+5+1+1+1`, getting you to 10 cycles in the best case.
There is, however, the cost of the loads as well, and those are also impactful. In strategy 1, the first `vpmullw` is dependent on 2 loads being completed before any work can start. These two loads are likely pipelined but still have a 3-7 cycle latency, so we have something like `5+1+1+5+1+5+1` (19) practical cycles.
For the current strategy we have `5+5+5+1+1+3` (20) practical cycles, with newer hardware able to pipeline the two `vpmovzxbw` w/ embedded load, giving us 15 practical cycles instead.
This then shows up in real-world benchmarks between the two when you factor in a broad range of hardware. On pre-IceLake hardware you're going to see strategy 1 being 5-20% faster, depending on several factors, while on IceLake and later hardware you're going to find almost the inverse, with the current strategy being 5-20% faster.
Beyond the raw numbers, there is also the real impact to the instruction decoder, micro-op cache, how it impacts code windows for loops or other tight bodies of code, and how it impacts surrounding code. While saturating the shuffle port isn't ideal (particularly on older hardware where only 1 exists), saturating the arithmetic ports (where there are typically 2) is also bad.
When you factor in all this information across a broad range of scenarios, the current strategy ends up winning out.
> (as opposed to piling up on p5 - bad news for contexts in which p5 is already a problem, which is common) should be a good thing. The extra code size is of course a downside.
There isn't really any piling up here. The bottleneck only comes when there is pipelining potential, which means you need two non-dependent shuffles that are within the same decode window and without any code between them that would block reordering. This is highly unlikely to occur in practice for any of the 3 strategies given. The more practical case for pipelining will come from the arithmetic ports, which the alternative strategies would instead saturate; the increased code size also reduces the chance that surrounding code will live in a window that allows it.
> Personally I consider an odd/even split more idiomatic than widening/narrowing (which has traditionally always been something to avoid) but that's more subjective.
Shuffling is a fundamental part of SIMD, and it's why newer CPUs have been increasing the number of shuffle ports available. The even/odd split has slowly been falling out of favor due to how it works with masking, embedded loads, and other scenarios; correspondingly, many of the "built-in" even/odd and pairwise instructions were never included in the upgrade to EVEX and don't have ZMM support, so you have to emulate them (often with shuffling) in those cases.
More benchmark results:
Zen 5
BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2314)
Unknown processor
.NET SDK 9.0.100
[Host] : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
| Method | Mean | Error | StdDev |
|--------------- |---------:|--------:|--------:|
| SimpleMultiply | 435.8 ns | 0.17 ns | 0.16 ns |
| Stratergy1 | 425.1 ns | 0.36 ns | 0.30 ns |
| Stratergy2 | 457.1 ns | 0.60 ns | 0.47 ns |
Zen 4
BenchmarkDotNet v0.14.0, Windows 10 (10.0.20348.2849)
AMD Ryzen 7 PRO 8700GE w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100
[Host] : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
| Method | Mean | Error | StdDev |
|--------------- |---------:|--------:|--------:|
| SimpleMultiply | 442.0 ns | 0.61 ns | 0.57 ns |
| Stratergy1 | 564.2 ns | 0.38 ns | 0.35 ns |
| Stratergy2 | 591.3 ns | 0.39 ns | 0.37 ns |
Meteor Lake
BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4317/23H2/2023Update/SunValley3)
Intel Core Ultra 7 155H, 1 CPU, 22 logical and 16 physical cores
.NET SDK 9.0.100
[Host] : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
| Method | Mean | Error | StdDev |
|--------------- |-----------:|---------:|---------:|
| SimpleMultiply | 1,022.6 ns | 19.96 ns | 20.49 ns |
| Stratergy1 | 829.8 ns | 13.59 ns | 12.71 ns |
| Stratergy2 | 846.9 ns | 16.74 ns | 17.19 ns |
Worth noting, the byte multiply codegen is different for AVX-512:

```asm
vpmovzxbw ymm1, xmmword ptr [r11+rdi+0x10]
vpmovzxbw ymm2, xmmword ptr [r10+rdi+0x10]
vpmullw ymm1, ymm2, ymm1
vpmovwb ymm1, ymm1
```

Similarly, the and+or combo in the alternate versions is combined into a `vpternlog` on AVX-512.
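To illustrate what that fold means at the intrinsics level, here is a small sketch (the helper name and exact form are mine, not the JIT's output): the tail of the odd/even recombination, `(even & mask) | odd`, becomes a single ternary-logic operation whose immediate 0xEC encodes `(A & C) | B`.

```cpp
#include <immintrin.h>

// AVX-512VL sketch: the final vpand + vpor of the odd/even recombination can be
// expressed as one vpternlogd. The immediate is derived from the standard
// identities A = 0xF0, B = 0xCC, C = 0xAA: (0xF0 & 0xAA) | 0xCC = 0xEC.
__m128i combine_even_odd_avx512(__m128i even, __m128i odd_shifted)
{
    const __m128i lowmask = _mm_set1_epi16(0x00FF);
    return _mm_ternarylogic_epi32(even, odd_shifted, lowmask, 0xEC); // vpternlogd
}
```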
Zen 5 results with only AVX2:
BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2314)
Unknown processor
.NET SDK 9.0.100
[Host] : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
| Method | Mean | Error | StdDev |
|--------------- |---------:|--------:|--------:|
| SimpleMultiply | 447.6 ns | 0.46 ns | 0.41 ns |
| Stratergy1 | 447.4 ns | 0.86 ns | 0.81 ns |
| Stratergy2 | 496.3 ns | 1.73 ns | 1.54 ns |
SSE/AVX don't have a built-in multiplication of bytes, so some trick is needed. .NET 9 uses a strategy like this, based on widening the vector to twice the width, doing a 16-bit multiplication, and narrowing the result:
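(Sketched below in C++ intrinsics for illustration; this is an approximation of the shape of that code, not the JIT's actual output.)

```cpp
#include <immintrin.h>

// Widen/multiply/narrow for 16 unsigned bytes using AVX2.
__m128i mul_u8_widen_narrow(__m128i a, __m128i b)
{
    __m256i wa   = _mm256_cvtepu8_epi16(a);                      // vpmovzxbw
    __m256i wb   = _mm256_cvtepu8_epi16(b);                      // vpmovzxbw
    __m256i prod = _mm256_mullo_epi16(wa, wb);                   // vpmullw
    prod = _mm256_and_si256(prod, _mm256_set1_epi16(0x00FF));    // vpand: keep the low byte of each product
    __m256i packed = _mm256_packus_epi16(prod, prod);            // vpackuswb (packs within each 128-bit lane)
    return _mm256_castsi256_si128(
        _mm256_permute4x64_epi64(packed, 0xD8));                 // vpermq: gather the two useful qwords
}
```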
Which isn't bad, but we can do a little bit better. The central issue in the code above is that shuffles are expensive and there are 4 of them (`vpmovzxbw` × 2, `vpackuswb`, `vpermq`). On, for example, an Intel Skylake or Rocket Lake, or anything in between, those 4 shuffles limit the throughput to at best one of these multiplications per 4 cycles.
I suggest two alternative strategies.
Strategy 1
First, an odd/even split using two `vpmullw`, based on (but modified from) this Q&A on Stack Overflow. In C++ it would look something like this:
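(A sketch reconstructed from that description; the original, adapted from the Stack Overflow answer, may differ in details.)

```cpp
#include <immintrin.h>

// Odd/even split: multiply the even-index and odd-index bytes separately as
// 16-bit lanes, then recombine.
__m128i mul_u8_oddeven(__m128i a, __m128i b)
{
    // Even bytes: a 16-bit multiply gives the right low byte in each lane;
    // mask off the garbage in the high byte.
    __m128i even = _mm_and_si128(_mm_mullo_epi16(a, b), _mm_set1_epi16(0x00FF));

    // Odd bytes: shift them down into the low byte of each lane, multiply,
    // then shift the result back up into the odd position.
    __m128i odd = _mm_mullo_epi16(_mm_srli_epi16(a, 8), _mm_srli_epi16(b, 8));
    odd = _mm_slli_epi16(odd, 8);

    return _mm_or_si128(even, odd);
}
```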
(The original answer tries to avoid the `_mm_set1_epi16` if there is no AVX2 for `VPBROADCASTW`, but it would turn into a vector constant anyway, so that is not important.) By avoiding `pmovzxbw`, this strategy is also applicable to SSE2 rather than requiring SSE4.1, and it's not a bad strategy for the higher x64 ISA levels either.
In a situation comparable to the cited "current implementation" (loading a and b from memory and ignoring for now where the result goes), it looks like this:
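(Again a sketch in intrinsics; the throughput estimates below refer to the compiled instruction sequence of two loads, two `vpmullw`, two `vpsrlw`, one `vpsllw`, a `vpand`, and a `vpor`. The store is only there to complete the loop.)

```cpp
#include <immintrin.h>
#include <cstddef>

void mul_u8_oddeven_loop(const unsigned char* a, const unsigned char* b,
                         unsigned char* dst, size_t n)
{
    const __m128i lowmask = _mm_set1_epi16(0x00FF);
    for (size_t i = 0; i + 16 <= n; i += 16)
    {
        __m128i va   = _mm_loadu_si128((const __m128i*)(a + i));        // vmovdqu
        __m128i vb   = _mm_loadu_si128((const __m128i*)(b + i));        // vmovdqu
        __m128i even = _mm_and_si128(_mm_mullo_epi16(va, vb), lowmask); // vpmullw + vpand
        __m128i odd  = _mm_slli_epi16(
            _mm_mullo_epi16(_mm_srli_epi16(va, 8), _mm_srli_epi16(vb, 8)), 8); // vpsrlw x2 + vpmullw + vpsllw
        _mm_storeu_si128((__m128i*)(dst + i), _mm_or_si128(even, odd)); // vpor + store
    }
}
```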
This is more instructions, but they're cheaper instructions. On Skylake or Rocket Lake, UICA estimates a throughput for the above (in a loop) of once per 2.5 cycles (the bottleneck spread evenly across p0 and p1), compared to the original's one per 4 cycles (bottlenecked by p5, due to the shuffles).
Strategy 2
Secondly, a strategy based on `pmaddubsw` (requiring SSSE3). `pmaddubsw` multiplies unsigned by signed bytes, which is odd but turns out to be harmless (the lowest byte of the product is not affected, and the upper byte will be discarded). It also horizontally adds adjacent pairs, but that can be made harmless by making half of the pairs zero. The saturation does not come into play in that case either: neither 255 × -128 nor 255 × 127 would saturate by themselves, and the other element in each pair will be zero.
Representative code (the andnot was compiled into an and with an inverted mask; for older processors it may be better to reduce the number of loads?):
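(A sketch reconstructed from that description; which operand gets masked and how the constants are materialized may differ from the original.)

```cpp
#include <immintrin.h>

// Per-byte multiply via pmaddubsw (SSSE3). One byte of each 16-bit pair is
// zeroed so the horizontal add is harmless and saturation cannot trigger;
// the low byte of each 16-bit product is the wanted result.
__m128i mul_u8_maddubs(__m128i a, __m128i b)
{
    const __m128i lowmask = _mm_set1_epi16(0x00FF);

    // Even bytes: zero the odd bytes of b, so each word is just a_even * b_even.
    __m128i even = _mm_maddubs_epi16(a, _mm_and_si128(b, lowmask));    // vpmaddubsw
    even = _mm_and_si128(even, lowmask);                               // keep the low byte

    // Odd bytes: zero the even bytes of b (the andnot), multiply, then shift
    // the low byte of each product up into the odd position.
    __m128i odd = _mm_maddubs_epi16(a, _mm_andnot_si128(lowmask, b));  // vpmaddubsw
    odd = _mm_slli_epi16(odd, 8);                                      // vpsllw

    return _mm_or_si128(even, odd);                                    // vpor
}
```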
For this code (when put in a loop), UICA estimates a slightly worse throughput of once per 2.75 cycles on Skylake (due to not executing from the loop stream buffer but from the uop cache; I don't know why that happens here), but a slightly better throughput of once per 2.33 cycles on Rocket Lake.
This strategy is essentially what Clang 19 does when autovectorizing a byte multiplication; earlier versions of Clang did it differently (with more shuffles).
I could not evaluate either of these strategies (nor the original) on AMD Zen since I don't have access to one and UICA does not include it.