dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.

Lower `Vector128/256/512.Shuffle` to SHUFPS, VSHUFPS, VSHUFPD, VPERMILPS, ... when possible #105908

Open sfiruch opened 1 month ago

sfiruch commented 1 month ago

The following applies to Vector128, Vector256, and Vector512, but I will discuss only the Vector256 case.

Let's consider this example, which permutes a Vector256:

Vector256<float> Test(Vector256<float> v) => Vector256.Shuffle(v, Vector256.Create(3, 2, 1, 0, 7, 6, 5, 4));

Actual Behavior

    L0000: vzeroupper
    L0003: vmovups ymm0, [0x7ff7ea7d0180]
    L000b: vpermps ymm0, ymm0, [rdx]
    L0010: vmovups [rcx], ymm0
    L0014: mov rax, rcx
    L0017: vzeroupper
    L001a: ret

The runtime used the VPERMPS instruction to permute the elements, which takes a vector of indices as a parameter. This means we have to load the constant from memory.
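For reference, here is a sketch of the hand-written equivalent (Avx2.PermuteVar8x32 is the intrinsic behind VPERMPS; TestExplicit is just an illustrative name):

```cs
// Hand-written form of what the JIT emits above: VPERMPS with an explicit
// index vector, which has to be materialized from memory (or a register).
Vector256<float> TestExplicit(Vector256<float> v) =>
    Avx2.PermuteVar8x32(v, Vector256.Create(3, 2, 1, 0, 7, 6, 5, 4));
```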

Expected Behavior

There are other, faster instructions (SHUFPD, SHUFPS, PSHUFD, VPSHUFD, VSHUFPS, VSHUFPD, VPERMILPS, VPERMILPD, VPERM2I128, VPERM2F128, VSHUFF64x2, VSHUFF32x4, VSHUFI32x4, VSHUFI64x2) which don't offer quite the same flexibility as VPERMPS, but should be emitted when they are applicable. These instructions take an imm8 control byte to describe the permutation. It's easy to detect if one of these cheaper functions is useable. In the case shown above, VSHUFPS is applicable because the permutation of the first four elements is the same as that of the last four elements.

When I manually rewrite the function to use VSHUFPS

Vector256<float> Test(Vector256<float> v) => Avx.Shuffle(v, v, 0b00_01_10_11);

the runtime produces this faster and smaller code:

    L0000: vzeroupper
    L0003: vmovups ymm0, [rdx]
    L0007: vshufps ymm0, ymm0, ymm0, 0x1b
    L000c: vmovups [rcx], ymm0
    L0010: mov rax, rcx
    L0013: vzeroupper
    L0016: ret

In other cases, the .Shuffle call can be replaced with Avx512F.Permute4x64, Avx512F.Permute4x32, Sse.Shuffle, ...
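For example, a constant lane-crossing shuffle of 64-bit elements can fold the whole pattern into the imm8 of VPERMPD (a sketch; Reverse64 is just an illustrative name):

```cs
// Equivalent to Vector256.Shuffle(v, Vector256.Create(3L, 2L, 1L, 0L)), but
// with the permutation encoded in the imm8 instead of an index vector.
Vector256<double> Reverse64(Vector256<double> v) => Avx2.Permute4x64(v, 0b00_01_10_11);
```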

Stretch Goal

Reversing a vector with Vector256.Shuffle(v, Vector256.Create(7, 6, 5, 4, 3, 2, 1, 0)); could be special-cased to

v = Avx.Permute(v, 0b_00_01_10_11);         // VPERMILPS: reverse within each 128-bit lane
v = Avx.Permute2x128(v, v, 0b_00_00_00_01); // VPERM2F128: swap the two lanes

This runs approximately twice as fast on my machine, and the same trick is used in some places like SpanHelpers.Reverse.

Effects

In my implementation of a SIMD-accelerated bitonic sorting network, I saw a performance uplift of approximately 20% and a significant code size reduction.


BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22631.3880)
11th Gen Intel Core i5-1135G7 2.40GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK=8.0.303
  [Host]   : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
  ShortRun : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2

Job=ShortRun  IterationCount=3  LaunchCount=1  
WarmupCount=3  
| Method           |     Mean |    Error |    StdDev | Ratio | RatioSD | Code Size |
|----------------- |---------:|---------:|----------:|------:|--------:|----------:|
| OptimizedShuffle | 5.241 ns | 4.301 ns | 0.2358 ns |  0.82 |    0.05 |     387 B |
| DefaultShuffle   | 6.373 ns | 5.072 ns | 0.2780 ns |  1.00 |    0.00 |     488 B |
Benchmark code:

```cs
[DisassemblyDiagnoser]
[ShortRunJob]
public class BitonicSortTests
{
    [Benchmark]
    public (Vector256<float>, Vector256<int>) OptimizedShuffle() => OptimizedShuffleImpl(Vector256<float>.Zero);

    [Benchmark(Baseline = true)]
    public (Vector256<float>, Vector256<int>) DefaultShuffle() => DefaultShuffleImpl(Vector256<float>.Zero);

    //        Sort2     Sort4Rev       Sort2     Sort8Rev        Sort4          Sort2
    // 0 -------#----------#-------------#--------#---------------#---------------#--------
    //          |          |             |        |               |               |
    // 1 -------#----------+-#-----------#--------+-#-------------+-#-------------#--------
    //                     | |                    | |             | |
    // 2 -------#----------+-#-----------#--------+-+-#-----------#-+-------------#--------
    //          |          |             |        | | |             |             |
    // 3 -------#----------#-------------#--------+-+-+-#-----------#-------------#--------
    //                                             | | | |
    // 4 -------#----------#-------------#--------+-+-+-#---------#---------------#--------
    //          |          |             |        | | |           |               |
    // 5 -------#----------|-#-----------#--------+-+-#-----------+-#-------------#--------
    //                     | |                    | |             | |
    // 6 -------#----------|-#-----------#--------+-#-------------#-+-------------#--------
    //          |          |             |        |                 |             |
    // 7 -------#----------#-------------#--------#-----------------#-------------#--------

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static Vector256<float> Reverse32(Vector256<float> _v)
    {
        _v = Avx.Permute(_v, 0b_00_01_10_11); //3210 7654
        return Avx.Permute2x128(_v, _v, 0b_00_00_00_01);
    }

    static (Vector256<float>, Vector256<int>) OptimizedShuffleImpl(Vector256<float> lambda)
    {
        var indices = Vector256.Create(0, 1, 2, 3, 4, 5, 6, 7).AsSingle();

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        void Sort8Rev()
        {
            var lambdaShuf = Reverse32(lambda);
            var comp = Vector128.LessThan(lambda.GetLower(), lambdaShuf.GetLower());
            var mask = Vector256.Create(comp, Sse.Shuffle(comp, comp, 0b_00_01_10_11));
            lambda = Vector256.ConditionalSelect(mask, lambda, lambdaShuf);
            indices = Vector256.ConditionalSelect(mask, indices, Reverse32(indices));
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        void Sort4Rev()
        {
            var lambdaShuf = Avx.Shuffle(lambda, lambda, 0b00_01_10_11);
            var mask = Vector256.LessThan(lambda, lambdaShuf);
            mask = Avx.Shuffle(mask, mask, 0b00_01_01_00);
            lambda = Vector256.ConditionalSelect(mask, lambda, lambdaShuf);
            indices = Vector256.ConditionalSelect(mask, indices, Avx.Shuffle(indices, indices, 0b00_01_10_11));
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        void Sort4()
        {
            var lambdaPerm = Avx.Shuffle(lambda, lambda, 0b01_00_11_10);
            var mask = Vector256.LessThan(lambda, lambdaPerm);
            mask = Avx.Shuffle(mask, mask, 0b01_00_01_00);
            lambda = Vector256.ConditionalSelect(mask, lambda, lambdaPerm);
            indices = Vector256.ConditionalSelect(mask, indices, Avx.Shuffle(indices, indices, 0b01_00_11_10));
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        void Sort2()
        {
            var lambdaShuf = Avx.Shuffle(lambda, lambda, 0b10_11_00_01);
            var mask = Vector256.LessThan(lambda, lambdaShuf);
            mask = Avx.Shuffle(mask, mask, 0b10_10_00_00);
            lambda = Vector256.ConditionalSelect(mask, lambda, lambdaShuf);
            indices = Vector256.ConditionalSelect(mask, indices, Avx.Shuffle(indices, indices, 0b10110001));
        }

        Sort2(); Sort4Rev(); Sort2(); Sort8Rev(); Sort4(); Sort2();
        return (lambda, indices.AsInt32());
    }

    static (Vector256<float>, Vector256<int>) DefaultShuffleImpl(Vector256<float> lambda)
    {
        var indices = Vector256.Create(0, 1, 2, 3, 4, 5, 6, 7).AsSingle();

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        void Sort8Rev()
        {
            var lambdaShuf = Vector256.Shuffle(lambda, Vector256.Create(7, 6, 5, 4, 3, 2, 1, 0));
            var comp = Vector128.LessThan(lambda.GetLower(), lambdaShuf.GetLower());
            var mask = Vector256.Create(comp, Vector128.Shuffle(comp, Vector128.Create(3, 2, 1, 0)));
            lambda = Vector256.ConditionalSelect(mask, lambda, lambdaShuf);
            indices = Vector256.ConditionalSelect(mask, indices, Vector256.Shuffle(indices, Vector256.Create(7, 6, 5, 4, 3, 2, 1, 0)));
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        void Sort4Rev()
        {
            var lambdaShuf = Vector256.Shuffle(lambda, Vector256.Create(3, 2, 1, 0, 7, 6, 5, 4));
            var mask = Vector256.LessThan(lambda, lambdaShuf);
            mask = Vector256.Shuffle(mask, Vector256.Create(0, 1, 1, 0, 4, 5, 5, 4));
            lambda = Vector256.ConditionalSelect(mask, lambda, lambdaShuf);
            indices = Vector256.ConditionalSelect(mask, indices, Vector256.Shuffle(indices, Vector256.Create(3, 2, 1, 0, 7, 6, 5, 4)));
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        void Sort4()
        {
            var lambdaPerm = Vector256.Shuffle(lambda, Vector256.Create(2, 3, 0, 1, 6, 7, 4, 5));
            var mask = Vector256.LessThan(lambda, lambdaPerm);
            mask = Vector256.Shuffle(mask, Vector256.Create(0, 1, 0, 1, 4, 5, 4, 5));
            lambda = Vector256.ConditionalSelect(mask, lambda, lambdaPerm);
            indices = Vector256.ConditionalSelect(mask, indices, Vector256.Shuffle(indices, Vector256.Create(2, 3, 0, 1, 6, 7, 4, 5)));
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        void Sort2()
        {
            var lambdaShuf = Vector256.Shuffle(lambda, Vector256.Create(1, 0, 3, 2, 5, 4, 7, 6));
            var mask = Vector256.LessThan(lambda, lambdaShuf);
            mask = Vector256.Shuffle(mask, Vector256.Create(0, 0, 2, 2, 4, 4, 6, 6));
            lambda = Vector256.ConditionalSelect(mask, lambda, lambdaShuf);
            indices = Vector256.ConditionalSelect(mask, indices, Vector256.Shuffle(indices, Vector256.Create(1, 0, 3, 2, 5, 4, 7, 6)));
        }

        Sort2(); Sort4Rev(); Sort2(); Sort8Rev(); Sort4(); Sort2();
        return (lambda, indices.AsInt32());
    }
}
```

Context

dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch. See info in area-owners.md if you want to be subscribed.

tannergooding commented 1 month ago

Thanks for opening the issue here, I've assigned it to myself as this is an area I'm already working in/around.

Some things to note...

Vector.Create has to be called explicitly for each Vector.Shuffle call because otherwise suboptimal code is generated

This is no longer true; it is fixed in .NET 9.
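In other words, a cached index vector like the sketch below should no longer generate worse code than writing the Create call inline at each Shuffle site (ReverseIndices is just an illustrative name):

```cs
// Sketch: indices cached in a field rather than created inline per call site.
// On .NET 9 this should lower the same way as the inline Create form.
static readonly Vector256<int> ReverseIndices = Vector256.Create(3, 2, 1, 0, 7, 6, 5, 4);

Vector256<float> Test(Vector256<float> v) => Vector256.Shuffle(v, ReverseIndices);
```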

This means we have to load the constant from memory.

This is only true for very simplistic cases. If you're doing this in a loop, for example, the indices parameter will be hoisted and come from register instead. The CPU is also likely to prefetch and cache these RIP relative constants itself as well, resulting in the memory load being effectively erased for typical code.
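For example, in a loop like this sketch (ReverseLanesInPlace is just an illustrative name), the index constant is loop-invariant and can stay in a register across iterations:

```cs
// Sketch: the Vector256.Create constant is loop-invariant, so the JIT can
// hoist its RIP-relative load out of the loop and reuse the register.
static void ReverseLanesInPlace(Span<Vector256<float>> data)
{
    for (int i = 0; i < data.Length; i++)
        data[i] = Vector256.Shuffle(data[i], Vector256.Create(3, 2, 1, 0, 7, 6, 5, 4));
}
```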

It's easy to detect if one of these cheaper functions is useable.

It's not as easy as one might think. There are many aspects that come into play including register selection, neighboring instructions/ports, dependency chains, whether a memory load can be hoisted, whether new encoding semantics relevant to EVEX are possible, whether it crosses lanes or not, how it impacts loop and general code alignment, etc.

There are some cases where we can produce better instruction selection and those are already known/being tracked, they just couldn't be completed in .NET 9. But, in general, this will be done by taking into account the right balance between complexity, maintainability, code size, and code perf across a range of microarchitectures; it may not always be what appears to have the fewest theoretical cycles.

sfiruch commented 1 month ago

Thanks for opening the issue here, I've assigned it to myself as this is an area I'm already working in/around.

Thank you for working on this, and improving things! :)

This means we have to load the constant from memory.

This is only true for very simplistic cases. If you're doing this in a loop, for example, the indices parameter will be hoisted and come from register instead. The CPU is also likely to prefetch and cache these RIP relative constants itself as well, resulting in the memory load being effectively erased for typical code.

Just for context: In my case, I'm not using it in a loop.

(I also saw many cases where smaller code is actually better. I encountered many scenarios where -Os outperformed -O2.)

It's easy to detect if one of these cheaper functions is useable.

It's not as easy as one might think. There are many aspects that come into play including register selection, neighboring instructions/ports, dependency chains, whether a memory load can be hoisted, whether new encoding semantics relevant to EVEX are possible, whether it crosses lanes or not, how it impacts loop and general code alignment, etc.

I think the shown case is clear. VSHUFPS uses one fewer register, performs as well as or better than VPERMPS on every Intel CPU since Skylake and every AMD CPU since Zen, runs on more execution ports, needs fewer uops, and has lower latency.

VSHUFPS can be used for Vector256.Shuffle(Vector256<Single>, mask) iff mask[0]+4==mask[4] && mask[1]+4==mask[5] && mask[2]+4==mask[6] && mask[3]+4==mask[7]. Other uses have similarly complex rules. I suspect that finding the mask values in the JIT might be the complex part. The rest should be "easy" :)
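A sketch of that check (TryGetShufpsControl is a hypothetical helper operating on the constant index values):

```cs
// Hypothetical helper: given the eight constant shuffle indices, decide
// whether single-source VSHUFPS applies and compute its imm8 control byte.
static bool TryGetShufpsControl(ReadOnlySpan<int> mask, out byte control)
{
    control = 0;
    for (int i = 0; i < 4; i++)
    {
        // Both 128-bit lanes must apply the same in-lane permutation.
        if (mask[i] < 0 || mask[i] > 3 || mask[i + 4] != mask[i] + 4)
            return false;
        control |= (byte)(mask[i] << (2 * i));
    }
    return true;
}
```

For the example at the top of the issue, mask = {3, 2, 1, 0, 7, 6, 5, 4} yields control = 0x1B, matching the vshufps encoding in the hand-written version.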

There are some cases where we can produce better instruction selection and those are already known/being tracked, they just couldn't be completed in .NET 9. But, in general, this will be done by taking into account the right balance between complexity, maintainability, code size, and code perf across a range of microarchitectures; it may not always be what appears to have the fewest theoretical cycles.

This is why I provided benchmarks.