Vector256.Shuffle does not produce optimal codegen

adamsitnik commented 2 years ago

Detected in https://github.com/dotnet/runtime/pull/72788

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]   : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT

| Method                |      Mean | Ratio | Code Size |
|---------------------- |----------:|------:|----------:|
| Vector256ShuffleConst | 384.42 ns | 27.29 |     230 B |
| Vector256ShuffleLocal | 382.85 ns | 27.18 |     230 B |
| AvxShuffleLocal       |  14.09 ns |  1.00 |     127 B |

Repro

```cs
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
using System;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

namespace ShufflePerf
{
    internal class Program
    {
        static void Main() => BenchmarkRunner.Run<Repro>(
            DefaultConfig.Instance
                .AddJob(Job.ShortRun)
                .AddDiagnoser(new DisassemblyDiagnoser(new DisassemblyDiagnoserConfig())));
    }

    public class Repro
    {
        private const uint Vector256ByteCount = 32;
        private const int BitsPerInt32 = 32;
        private int[] m_array = Enumerable.Range(0, 512).ToArray();
        private bool[] boolArray = new bool[512 * 32];
        private int m_length = 512;

        [Benchmark]
        public void Vector256ShuffleConst()
        {
            Vector256<byte> bitMask = Vector256.Create(0x80402010_08040201).AsByte();
            Vector256<byte> ones = Vector256.Create((byte)1);
            ref byte destination = ref Unsafe.As<bool, byte>(ref MemoryMarshal.GetArrayDataReference(boolArray));

            for (uint i = 0; (i + Vector256ByteCount) <= (uint)m_length; i += Vector256ByteCount)
            {
                int bits = m_array[i / (uint)BitsPerInt32];
                Vector256<int> scalar = Vector256.Create(bits);
                // The shuffle mask is created inline, as a constant argument.
                Vector256<byte> shuffled = Vector256.Shuffle(scalar.AsByte(),
                    Vector256.Create(0, 0x01010101_01010101, 0x02020202_02020202, 0x03030303_03030303).AsByte());
                Vector256<byte> extracted = shuffled & bitMask;
                Vector256<byte> normalized = Vector256.Min(extracted, ones);
                normalized.StoreUnsafe(ref destination, new UIntPtr(i));
            }
        }

        [Benchmark]
        public void Vector256ShuffleLocal()
        {
            Vector128<byte> lowerShuffleMask_CopyToBoolArray = Vector128.Create(0, 0x01010101_01010101).AsByte();
            Vector128<byte> upperShuffleMask_CopyToBoolArray = Vector128.Create(0x02020202_02020202, 0x03030303_03030303).AsByte();
            // The shuffle mask is held in a local.
            Vector256<byte> shuffleMask = Vector256.Create(lowerShuffleMask_CopyToBoolArray, upperShuffleMask_CopyToBoolArray);
            Vector256<byte> bitMask = Vector256.Create(0x80402010_08040201).AsByte();
            Vector256<byte> ones = Vector256.Create((byte)1);
            ref byte destination = ref Unsafe.As<bool, byte>(ref MemoryMarshal.GetArrayDataReference(boolArray));

            for (uint i = 0; (i + Vector256ByteCount) <= (uint)m_length; i += Vector256ByteCount)
            {
                int bits = m_array[i / (uint)BitsPerInt32];
                Vector256<int> scalar = Vector256.Create(bits);
                Vector256<byte> shuffled = Vector256.Shuffle(scalar.AsByte(), shuffleMask);
                Vector256<byte> extracted = shuffled & bitMask;
                Vector256<byte> normalized = Vector256.Min(extracted, ones);
                normalized.StoreUnsafe(ref destination, new UIntPtr(i));
            }
        }

        [Benchmark(Baseline = true)]
        public void AvxShuffleLocal()
        {
            Vector128<byte> lowerShuffleMask_CopyToBoolArray = Vector128.Create(0, 0x01010101_01010101).AsByte();
            Vector128<byte> upperShuffleMask_CopyToBoolArray = Vector128.Create(0x02020202_02020202, 0x03030303_03030303).AsByte();
            Vector256<byte> shuffleMask = Vector256.Create(lowerShuffleMask_CopyToBoolArray, upperShuffleMask_CopyToBoolArray);
            Vector256<byte> bitMask = Vector256.Create(0x80402010_08040201).AsByte();
            Vector256<byte> ones = Vector256.Create((byte)1);
            ref byte destination = ref Unsafe.As<bool, byte>(ref MemoryMarshal.GetArrayDataReference(boolArray));

            for (uint i = 0; (i + Vector256ByteCount) <= (uint)m_length; i += Vector256ByteCount)
            {
                int bits = m_array[i / (uint)BitsPerInt32];
                Vector256<int> scalar = Vector256.Create(bits);
                // Baseline: the hardware-specific intrinsic (vpshufb) instead of the cross-platform API.
                Vector256<byte> shuffled = Avx2.Shuffle(scalar.AsByte(), shuffleMask);
                Vector256<byte> extracted = shuffled & bitMask;
                Vector256<byte> normalized = Vector256.Min(extracted, ones);
                normalized.StoreUnsafe(ref destination, new UIntPtr(i));
            }
        }
    }
}
```

```xml
<OutputType>Exe</OutputType>
<TargetFramework>net7.0</TargetFramework>
```

Disassembly

## .NET 7.0.0 (7.0.22.32404), X64 RyuJIT

```assembly
; ShufflePerf.Repro.Vector256ShuffleConst()
       sub rsp,98
       vzeroupper
       vxorps xmm4,xmm4,xmm4
       vmovdqa xmmword ptr [rsp+60],xmm4
       vmovdqa xmmword ptr [rsp+70],xmm4
       mov rax,[rcx+10]
       cmp [rax],al
       add rax,10
       xor edx,edx
       cmp dword ptr [rcx+18],20
       jb near ptr M00_L03
       vmovupd ymm0,[7FF9BDA43AE0]
M00_L00:
       mov r8,[rcx+8]
       mov r9d,edx
       shr r9d,5
       cmp r9d,[r8+8]
       jae near ptr M00_L04
       mov r9d,r9d
       vpbroadcastd ymm1,dword ptr [r8+r9*4+10]
       vmovupd [rsp+20],ymm1
       vmovupd [rsp+40],ymm0
       xor r8d,r8d
       nop dword ptr [rax]
M00_L01:
       lea r9,[rsp+40]
       movsxd r10,r8d
       movzx r9d,byte ptr [r9+r10]
       xor r11d,r11d
       cmp r9d,20
       jge short M00_L02
       lea r11,[rsp+20]
       mov r9d,r9d
       movzx r11d,byte ptr [r11+r9]
M00_L02:
       lea r9,[rsp+60]
       mov [r9+r10],r11b
       inc r8d
       cmp r8d,20
       jl short M00_L01
       vmovupd ymm1,[rsp+60]
       mov r8d,edx
       vpand ymm1,ymm1,[7FF9BDA43B00]
       vpminub ymm1,ymm1,[7FF9BDA43B20]
       vmovdqu ymmword ptr [rax+r8],ymm1
       add edx,20
       lea r8d,[rdx+20]
       cmp r8d,[rcx+18]
       jbe near ptr M00_L00
M00_L03:
       vzeroupper
       add rsp,98
       ret
M00_L04:
       call CORINFO_HELP_RNGCHKFAIL
       int 3
; Total bytes of code 230
```

## .NET 7.0.0 (7.0.22.32404), X64 RyuJIT

```assembly
; ShufflePerf.Repro.Vector256ShuffleLocal()
       sub rsp,98
       vzeroupper
       vxorps xmm4,xmm4,xmm4
       vmovdqa xmmword ptr [rsp+60],xmm4
       vmovdqa xmmword ptr [rsp+70],xmm4
       vmovupd xmm0,[7FF9BDA53B40]
       vinserti128 ymm0,ymm0,xmmword ptr [7FF9BDA53B50],1
       mov rax,[rcx+10]
       cmp [rax],al
       add rax,10
       xor edx,edx
       cmp dword ptr [rcx+18],20
       jb near ptr M00_L03
M00_L00:
       mov r8,[rcx+8]
       mov r9d,edx
       shr r9d,5
       cmp r9d,[r8+8]
       jae near ptr M00_L04
       mov r9d,r9d
       vpbroadcastd ymm1,dword ptr [r8+r9*4+10]
       vmovupd [rsp+20],ymm1
       vmovupd [rsp+40],ymm0
       xor r8d,r8d
M00_L01:
       lea r9,[rsp+40]
       movsxd r10,r8d
       movzx r9d,byte ptr [r9+r10]
       xor r11d,r11d
       cmp r9d,20
       jge short M00_L02
       lea r11,[rsp+20]
       mov r9d,r9d
       movzx r11d,byte ptr [r11+r9]
M00_L02:
       lea r9,[rsp+60]
       mov [r9+r10],r11b
       inc r8d
       cmp r8d,20
       jl short M00_L01
       vmovupd ymm1,[rsp+60]
       mov r8d,edx
       vpand ymm1,ymm1,[7FF9BDA53B60]
       vpminub ymm1,ymm1,[7FF9BDA53B80]
       vmovdqu ymmword ptr [rax+r8],ymm1
       add edx,20
       lea r8d,[rdx+20]
       cmp r8d,[rcx+18]
       jbe near ptr M00_L00
M00_L03:
       vzeroupper
       add rsp,98
       ret
M00_L04:
       call CORINFO_HELP_RNGCHKFAIL
       int 3
; Total bytes of code 230
```

## .NET 7.0.0 (7.0.22.32404), X64 RyuJIT

```assembly
; ShufflePerf.Repro.AvxShuffleLocal()
       sub rsp,28
       vzeroupper
       vmovupd xmm0,[7FF9BDA53BA0]
       vinserti128 ymm0,ymm0,xmmword ptr [7FF9BDA53BB0],1
       mov rax,[rcx+10]
       cmp [rax],al
       add rax,10
       xor edx,edx
       cmp dword ptr [rcx+18],20
       jb short M00_L01
M00_L00:
       mov r8,[rcx+8]
       mov r9d,edx
       shr r9d,5
       cmp r9d,[r8+8]
       jae short M00_L02
       mov r9d,r9d
       vpbroadcastd ymm1,dword ptr [r8+r9*4+10]
       vpshufb ymm1,ymm1,ymm0
       vpand ymm1,ymm1,[7FF9BDA53BC0]
       vpminub ymm1,ymm1,[7FF9BDA53BE0]
       mov r8d,edx
       vmovdqu ymmword ptr [rax+r8],ymm1
       add edx,20
       lea r8d,[rdx+20]
       cmp r8d,[rcx+18]
       jbe short M00_L00
M00_L01:
       vzeroupper
       add rsp,28
       ret
M00_L02:
       call CORINFO_HELP_RNGCHKFAIL
       int 3
; Total bytes of code 127
```

cc @tannergooding @EgorBo

category:cq theme:vector-codegen skill-level:intermediate cost:medium impact:small

ghost commented 2 years ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.


Author:	adamsitnik
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

tannergooding commented 2 years ago

Due to limitations in the .NET 7 JIT (timing issues), the mask needs to be a "constant"; that is, it needs to be:

Vector128.Shuffle(value, Vector128.Create(cns, ..., cns));
Vector256.Shuffle(value, Vector256.Create(cns, ..., cns));

Using locals or other indirections may currently break the detection that the mask is a "constant" due to where the check occurs (importation). In .NET 8, we need to ensure the "fallback" is handled by the JIT and that the detection of "is this a constant" occurs later in morph so that we've had the ability to convert LCL_VAR to CNS_VEC and do other constant propagation.

EgorBo commented 2 years ago

Vector256ShuffleConst does use a constant mask as the argument.

EgorBo commented 2 years ago

Yeah, Vector256.Shuffle is not accelerated for byte, no matter what the input is. @adamsitnik, if it works for your algorithm, you may try other Shuffle overloads such as int, which do work (or shuffle the bytes with two V128).
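The "two V128" suggestion could look roughly like the following. This is a minimal sketch, assuming .NET 7 or later and a little-endian target; the class and method names (`TwoHalvesShuffle`, `ShuffleHalves`, `SpreadBytes`) are hypothetical and do not come from the thread, and no claim is made here about the code the JIT emits for it.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class TwoHalvesShuffle
{
    // Hypothetical helper: shuffle each 128-bit half on its own, so every
    // index refers to its own half and nothing crosses a 128-bit lane.
    public static Vector256<byte> ShuffleHalves(
        Vector256<byte> value, Vector128<byte> lowerMask, Vector128<byte> upperMask)
    {
        Vector128<byte> lower = Vector128.Shuffle(value.GetLower(), lowerMask);
        Vector128<byte> upper = Vector128.Shuffle(value.GetUpper(), upperMask);
        return Vector256.Create(lower, upper);
    }

    // Mirrors the repro: broadcast one int, then repeat each of its four
    // bytes eight times across the 32-byte result.
    public static Vector256<byte> SpreadBytes(int bits)
    {
        Vector256<byte> broadcast = Vector256.Create(bits).AsByte();
        Vector128<byte> lowerMask = Vector128.Create(0L, 0x01010101_01010101L).AsByte();
        Vector128<byte> upperMask = Vector128.Create(0x02020202_02020202L, 0x03030303_03030303L).AsByte();
        return ShuffleHalves(broadcast, lowerMask, upperMask);
    }
}
```

For example, `TwoHalvesShuffle.SpreadBytes(0x04030201)` yields eight bytes of 1, then eight of 2, eight of 3, and eight of 4. Because the input is a broadcast, the upper half holds the same four bytes as the lower half, which is why indices 2 and 3 in the upper mask are enough.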

tannergooding commented 2 years ago

> Vector256.Shuffle is not accelerated no matter what input is for byte

That doesn't seem right and a quick check shows that isn't the case. This is likely an issue with treating Vector256.Shuffle as "identical to" Avx2.Shuffle when it isn't.

Avx2.Shuffle is effectively 2x128-bit ops, so if you do Vector256.Shuffle(value, Vector256.Create(0L, 1L, 0L, 1L)), it's going to think you want value[0], value[1], value[0], value[1], whereas Avx2.Shuffle treats this as value[0], value[1], value[2], value[3].

That is, Avx2.Shuffle splits the mask in half and effectively does:

Vector128<long> lower = Vector128.Shuffle(value.GetLower(), mask.GetLower());
Vector128<long> upper = Vector128.Shuffle(value.GetUpper(), mask.GetUpper() + Vector128.Create((long)Vector128<long>.Count));
return Vector256.Create(lower, upper);

While Vector256.Shuffle treats it as a "single 256-bit vector" (rather than "2x128-bit vectors"). This was done for consistency and to better map to a cross-platform mentality where AVX-512 and SVE all operate on "full width".

Vector256.Shuffle on AVX2 therefore has to pessimize for byte when you "cross lanes". This is something that should likewise be improved in .NET 8.
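The difference between the two semantics for byte elements can be modelled in plain scalar code. The sketch below is only an illustration: `ShuffleModel`, `FullWidth` and `PerLane` are hypothetical names, and the arrays stand in for 32-byte vectors. `FullWidth` follows the fallback loop visible in the disassembly above (indices of 32 or more produce zero), and `PerLane` follows the documented behaviour of `vpshufb`.

```csharp
using System;

public static class ShuffleModel
{
    // Full-width model (Vector256.Shuffle as described above): each index
    // addresses the whole 32-byte vector; out-of-range indices give zero.
    public static byte[] FullWidth(byte[] value, byte[] mask)
    {
        var result = new byte[32];
        for (int i = 0; i < 32; i++)
        {
            int index = mask[i];
            result[i] = index < 32 ? value[index] : (byte)0;
        }
        return result;
    }

    // Per-lane model (vpshufb / Avx2.Shuffle): each 16-byte lane is shuffled
    // independently. Only the low 4 bits of an index are used, relative to
    // the start of the lane, and a set high bit zeroes the output byte.
    public static byte[] PerLane(byte[] value, byte[] mask)
    {
        var result = new byte[32];
        for (int i = 0; i < 32; i++)
        {
            int laneBase = i & ~15; // 0 for the lower lane, 16 for the upper
            byte m = mask[i];
            result[i] = (m & 0x80) != 0 ? (byte)0 : value[laneBase + (m & 0x0F)];
        }
        return result;
    }
}
```

With `value` holding the bytes 0 through 31 and the repro's mask (eight 0s, eight 1s, eight 2s, eight 3s), `FullWidth` puts `value[2]` at position 16 while `PerLane` puts `value[18]` there. The repro hides this difference only because its input is a broadcast, where both positions contain the same byte.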

EgorBo commented 2 years ago

> That doesn't seem right and a quick check shows that isn't the case.

public Vector256<byte> Vector256ShuffleConst(Vector256<byte> vec)
{
    return Vector256.Shuffle(vec, 
        Vector256.Create((byte)0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3));
}


G_M6489_IG01:              ;; offset=0000H
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4881EC98000000       sub      rsp, 152
       C5F877               vzeroupper 
       C5D857E4             vxorps   xmm4, xmm4
       C5F97F642460         vmovdqa  xmmword ptr [rsp+60H], xmm4
       C5F97F642470         vmovdqa  xmmword ptr [rsp+70H], xmm4
       488BF2               mov      rsi, rdx
                        ;; size=33 bbWeight=1    PerfScore 9.83
G_M6489_IG02:              ;; offset=0021H
       C4C17D1000           vmovupd  ymm0, ymmword ptr[r8]
       C5FD11442420         vmovupd  ymmword ptr[rsp+20H], ymm0
       C5FD1005EC000000     vmovupd  ymm0, ymmword ptr[reloc @RWD00]
       C5FD11442440         vmovupd  ymmword ptr[rsp+40H], ymm0
       33FF                 xor      edi, edi
                        ;; size=27 bbWeight=1    PerfScore 11.25
G_M6489_IG03:              ;; offset=003CH
       85FF                 test     edi, edi
       7C0C                 jl       SHORT G_M6489_IG05
                        ;; size=4 bbWeight=4    PerfScore 5.00
G_M6489_IG04:              ;; offset=0040H
       33C9                 xor      ecx, ecx
       83FF20               cmp      edi, 32
       0F9CC1               setl     cl
       84C9                 test     cl, cl
       7516                 jne      SHORT G_M6489_IG06
                        ;; size=12 bbWeight=2    PerfScore 5.50
G_M6489_IG05:              ;; offset=004CH
       48B92820805B59020000 mov      rcx, 0x2595B802028      ; ""
       488B11               mov      rdx, gword ptr [rcx]
       488BCA               mov      rcx, rdx
       FF154E580F00         call     [System.Diagnostics.Debug:Fail(System.String,System.String)]
                        ;; size=22 bbWeight=2    PerfScore 11.00
G_M6489_IG06:              ;; offset=0062H
       488D4C2440           lea      rcx, bword ptr [rsp+40H]
       4863D7               movsxd   rdx, edi
       0FB61C11             movzx    rbx, byte  ptr [rcx+rdx]
       33ED                 xor      ebp, ebp
       83FB20               cmp      ebx, 32
       7D2E                 jge      SHORT G_M6489_IG09
                        ;; size=19 bbWeight=4    PerfScore 17.00
G_M6489_IG07:              ;; offset=0075H
       33C9                 xor      ecx, ecx
       83FB20               cmp      ebx, 32
       0F9CC1               setl     cl
       84C9                 test     cl, cl
       7516                 jne      SHORT G_M6489_IG08
       48B92820805B59020000 mov      rcx, 0x2595B802028      ; ""
       488B11               mov      rdx, gword ptr [rcx]
       488BCA               mov      rcx, rdx
       FF1519580F00         call     [System.Diagnostics.Debug:Fail(System.String,System.String)]
                        ;; size=34 bbWeight=2    PerfScore 16.50
G_M6489_IG08:              ;; offset=0097H
       488D4C2420           lea      rcx, bword ptr [rsp+20H]
       8BD3                 mov      edx, ebx
       400FB62C11           movzx    rbp, byte  ptr [rcx+rdx]
                        ;; size=12 bbWeight=2    PerfScore 5.50
G_M6489_IG09:              ;; offset=00A3H
       85FF                 test     edi, edi
       7C0C                 jl       SHORT G_M6489_IG11
                        ;; size=4 bbWeight=4    PerfScore 5.00
G_M6489_IG10:              ;; offset=00A7H
       33C9                 xor      ecx, ecx
       83FF20               cmp      edi, 32
       0F9CC1               setl     cl
       84C9                 test     cl, cl
       7516                 jne      SHORT G_M6489_IG12
                        ;; size=12 bbWeight=2    PerfScore 5.50
G_M6489_IG11:              ;; offset=00B3H
       48B92820805B59020000 mov      rcx, 0x2595B802028      ; ""
       488B11               mov      rdx, gword ptr [rcx]
       488BCA               mov      rcx, rdx
       FF15E7570F00         call     [System.Diagnostics.Debug:Fail(System.String,System.String)]
                        ;; size=22 bbWeight=2    PerfScore 11.00
G_M6489_IG12:              ;; offset=00C9H
       488D442460           lea      rax, bword ptr [rsp+60H]
       4863D7               movsxd   rdx, edi
       40882C10             mov      byte  ptr [rax+rdx], bpl
       FFC7                 inc      edi
       83FF20               cmp      edi, 32
       0F8C5CFFFFFF         jl       G_M6489_IG03
                        ;; size=23 bbWeight=4    PerfScore 13.00
G_M6489_IG13:              ;; offset=00E0H
       C5FD10442460         vmovupd  ymm0, ymmword ptr[rsp+60H]
       C5FD1106             vmovupd  ymmword ptr[rsi], ymm0
       488BC6               mov      rax, rsi
                        ;; size=13 bbWeight=1    PerfScore 6.25
G_M6489_IG14:              ;; offset=00EDH
       C5F877               vzeroupper 
       4881C498000000       add      rsp, 152
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret      
                        ;; size=15 bbWeight=1    PerfScore 4.25
RWD00   dq  0000000000000000h, 0101010101010101h, 0202020202020202h, 0303030303030303h

; Total bytes of code 252

This is a Checked build, so the asserts are there, but I assume it still does not use the intrinsified path.

tannergooding commented 2 years ago

As explained, this is because you're crossing lanes. For upper, you're selecting lower[2] and lower[3].

Change it to:

return Vector256.Shuffle(vec, 
        Vector256.Create((byte)0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19));

That way it's selecting lower[0], lower[1] and upper[18 - 16], upper[19 - 16].
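One way to picture the "crossing lanes" condition is a check that every full-width index points into the same 16-byte lane as the position it is written to. The sketch below is an illustration only: `LaneCheck`, `StaysInLane` and `BuildMask` are hypothetical helpers, and this is not a description of the check the JIT actually performs.

```csharp
using System;
using System.Linq;

public static class LaneCheck
{
    // True when every index is in range and selects from the same 16-byte
    // lane as its destination. Out-of-range indices (which zero the output
    // in full-width semantics) are conservatively treated as not lane-local.
    public static bool StaysInLane(byte[] mask)
    {
        for (int i = 0; i < mask.Length; i++)
        {
            if (mask[i] >= 32 || mask[i] / 16 != i / 16)
            {
                return false;
            }
        }
        return true;
    }

    // Builds a 32-byte mask made of eight copies of each of the four indices.
    public static byte[] BuildMask(byte a, byte b, byte c, byte d)
    {
        return new[] { a, b, c, d }
            .SelectMany(x => Enumerable.Repeat(x, 8))
            .ToArray();
    }
}
```

Under this model, `BuildMask(0, 1, 2, 3)` crosses lanes, because positions 16 through 31 read from the lower lane, while `BuildMask(0, 1, 18, 19)` stays in lane.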

AndyAyersMS commented 2 years ago

@tannergooding can you set a milestone and update area path here as appropriate?

tannergooding commented 1 year ago

Various cases were improved for AVX-512 capable hardware where newer instructions are available.

Other cases still expect constant inputs.

tannergooding commented 3 months ago

This was resolved with https://github.com/dotnet/runtime/pull/102702

Vector256.Shuffle does not produce optimal codegen #72793