Closed adamsitnik closed 3 months ago
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
Author: | adamsitnik |
---|---|
Assignees: | - |
Labels: | `area-CodeGen-coreclr` |
Milestone: | - |
Due to limitations in the .NET 7 JIT (timing issues), the mask needs to be a "constant", that is, it needs to be:
Vector128.Shuffle(value, Vector128.Create(cns, ..., cns));
Vector256.Shuffle(value, Vector256.Create(cns, ..., cns));
Using locals or other indirections may currently break the detection that the mask is a "constant" due to where the check occurs (importation). In .NET 8, we need to ensure the "fallback" is handled by the JIT and that the detection of "is this a constant" occurs later, in morph, so that we've had the chance to convert `LCL_VAR` to `CNS_VEC` and do other constant propagation.
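As a minimal sketch of the two forms being contrasted above (the class and method names are hypothetical, chosen only for illustration): both methods return the same value, and the only difference is whether the mask reaches `Shuffle` directly as a `Vector128.Create(cns, ..., cns)` call or through a local. Whether the second form is accelerated depends on the JIT version, as described above; this example makes no claim about the generated code.

```csharp
using System.Runtime.Intrinsics;

public static class ShuffleMaskForms
{
    // Mask written inline as Vector128.Create(cns, ..., cns):
    // the shape the comment above says the .NET 7 importer can recognize.
    public static Vector128<int> InlineMask(Vector128<int> value)
        => Vector128.Shuffle(value, Vector128.Create(3, 2, 1, 0));

    // Same mask routed through a local. Semantically identical, but per the
    // comment above this indirection may defeat the "is constant" detection.
    public static Vector128<int> LocalMask(Vector128<int> value)
    {
        Vector128<int> mask = Vector128.Create(3, 2, 1, 0);
        return Vector128.Shuffle(value, mask);
    }
}
```

Comparing the disassembly of the two methods on a given runtime is one way to see whether the detection fired for each form.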
`Vector256ShuffleConst` does use a constant mask as the argument.
Yeah, `Vector256.Shuffle` is not accelerated for `byte`, no matter what the input is.
@adamsitnik if it works for your algorithm you may try other overloads of `Shuffle`, like `int`; those work (or shuffle bytes with two V128).
> `Vector256.Shuffle` is not accelerated no matter what input is for byte
That doesn't seem right and a quick check shows that isn't the case. This is likely an issue with treating `Vector256.Shuffle` as "identical to" `Avx2.Shuffle` when it isn't.
`Avx2.Shuffle` is effectively 2x128-bit ops, and so if you do `Vector256.Shuffle(value, Vector256.Create(0L, 1L, 0L, 1L))` it's going to think you want `value[0], value[1], value[0], value[1]`, whereas `Avx2.Shuffle` treats this as `value[0], value[1], value[2], value[3]`.
That is, `Avx2.Shuffle` splits the mask in half and effectively does:
Vector128<long> lower = Vector128.Shuffle(value.GetLower(), mask.GetLower());
Vector128<long> upper = Vector128.Shuffle(value.GetUpper(), mask.GetUpper() + Vector128.Create((long)Vector128<long>.Count));
return Vector256.Create(lower, upper);
While `Vector256.Shuffle` treats it as a "single 256-bit vector" (rather than "2x128-bit vectors"). This was done for consistency and to better map to a cross-platform mentality where `AVX-512` and `SVE` all operate on "full width".
`Vector256.Shuffle` on AVX2 therefore has to pessimize for `byte` when you "cross lanes", which is something that should likewise be improved in .NET 8.
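To make the "single 256-bit vector" versus "2x128-bit vectors" distinction concrete for `byte`, here is a scalar sketch of the two behaviours over plain arrays. `ShuffleModel`, `FullWidth` and `PerLane` are hypothetical names for illustration, not runtime APIs; `FullWidth` follows the element-by-element loop visible in the fallback path, and `PerLane` follows the documented behaviour of the `vpshufb` instruction that `Avx2.Shuffle` maps to for bytes.

```csharp
using System;

public static class ShuffleModel
{
    // Full-width model: each index selects from all 32 elements;
    // an out-of-range index produces 0.
    public static byte[] FullWidth(byte[] value, byte[] indices)
    {
        var result = new byte[32];
        for (int i = 0; i < 32; i++)
        {
            int idx = indices[i];
            result[i] = idx < 32 ? value[idx] : (byte)0;
        }
        return result;
    }

    // Per-lane model (vpshufb-style): each 16-byte lane is shuffled on its own.
    // Only the low 4 bits of the index are used, relative to the element's lane;
    // a set high bit in the index produces 0.
    public static byte[] PerLane(byte[] value, byte[] indices)
    {
        var result = new byte[32];
        for (int i = 0; i < 32; i++)
        {
            int laneBase = (i / 16) * 16;
            byte idx = indices[i];
            result[i] = (idx & 0x80) != 0 ? (byte)0 : value[laneBase + (idx & 0x0F)];
        }
        return result;
    }
}
```

With `value[i] = i` and a mask whose upper half holds indices 2 and 3, `FullWidth` reads elements 2 and 3 while `PerLane` reads elements 18 and 19, so the two disagree. With 18 and 19 in the upper half instead, both read elements 18 and 19.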
> That doesn't seem right and a quick check shows that isn't the case.
public Vector256<byte> Vector256ShuffleConst(Vector256<byte> vec)
{
    return Vector256.Shuffle(vec,
        Vector256.Create((byte)0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3));
}
G_M6489_IG01: ;; offset=0000H
57 push rdi
56 push rsi
55 push rbp
53 push rbx
4881EC98000000 sub rsp, 152
C5F877 vzeroupper
C5D857E4 vxorps xmm4, xmm4
C5F97F642460 vmovdqa xmmword ptr [rsp+60H], xmm4
C5F97F642470 vmovdqa xmmword ptr [rsp+70H], xmm4
488BF2 mov rsi, rdx
;; size=33 bbWeight=1 PerfScore 9.83
G_M6489_IG02: ;; offset=0021H
C4C17D1000 vmovupd ymm0, ymmword ptr[r8]
C5FD11442420 vmovupd ymmword ptr[rsp+20H], ymm0
C5FD1005EC000000 vmovupd ymm0, ymmword ptr[reloc @RWD00]
C5FD11442440 vmovupd ymmword ptr[rsp+40H], ymm0
33FF xor edi, edi
;; size=27 bbWeight=1 PerfScore 11.25
G_M6489_IG03: ;; offset=003CH
85FF test edi, edi
7C0C jl SHORT G_M6489_IG05
;; size=4 bbWeight=4 PerfScore 5.00
G_M6489_IG04: ;; offset=0040H
33C9 xor ecx, ecx
83FF20 cmp edi, 32
0F9CC1 setl cl
84C9 test cl, cl
7516 jne SHORT G_M6489_IG06
;; size=12 bbWeight=2 PerfScore 5.50
G_M6489_IG05: ;; offset=004CH
48B92820805B59020000 mov rcx, 0x2595B802028 ; ""
488B11 mov rdx, gword ptr [rcx]
488BCA mov rcx, rdx
FF154E580F00 call [System.Diagnostics.Debug:Fail(System.String,System.String)]
;; size=22 bbWeight=2 PerfScore 11.00
G_M6489_IG06: ;; offset=0062H
488D4C2440 lea rcx, bword ptr [rsp+40H]
4863D7 movsxd rdx, edi
0FB61C11 movzx rbx, byte ptr [rcx+rdx]
33ED xor ebp, ebp
83FB20 cmp ebx, 32
7D2E jge SHORT G_M6489_IG09
;; size=19 bbWeight=4 PerfScore 17.00
G_M6489_IG07: ;; offset=0075H
33C9 xor ecx, ecx
83FB20 cmp ebx, 32
0F9CC1 setl cl
84C9 test cl, cl
7516 jne SHORT G_M6489_IG08
48B92820805B59020000 mov rcx, 0x2595B802028 ; ""
488B11 mov rdx, gword ptr [rcx]
488BCA mov rcx, rdx
FF1519580F00 call [System.Diagnostics.Debug:Fail(System.String,System.String)]
;; size=34 bbWeight=2 PerfScore 16.50
G_M6489_IG08: ;; offset=0097H
488D4C2420 lea rcx, bword ptr [rsp+20H]
8BD3 mov edx, ebx
400FB62C11 movzx rbp, byte ptr [rcx+rdx]
;; size=12 bbWeight=2 PerfScore 5.50
G_M6489_IG09: ;; offset=00A3H
85FF test edi, edi
7C0C jl SHORT G_M6489_IG11
;; size=4 bbWeight=4 PerfScore 5.00
G_M6489_IG10: ;; offset=00A7H
33C9 xor ecx, ecx
83FF20 cmp edi, 32
0F9CC1 setl cl
84C9 test cl, cl
7516 jne SHORT G_M6489_IG12
;; size=12 bbWeight=2 PerfScore 5.50
G_M6489_IG11: ;; offset=00B3H
48B92820805B59020000 mov rcx, 0x2595B802028 ; ""
488B11 mov rdx, gword ptr [rcx]
488BCA mov rcx, rdx
FF15E7570F00 call [System.Diagnostics.Debug:Fail(System.String,System.String)]
;; size=22 bbWeight=2 PerfScore 11.00
G_M6489_IG12: ;; offset=00C9H
488D442460 lea rax, bword ptr [rsp+60H]
4863D7 movsxd rdx, edi
40882C10 mov byte ptr [rax+rdx], bpl
FFC7 inc edi
83FF20 cmp edi, 32
0F8C5CFFFFFF jl G_M6489_IG03
;; size=23 bbWeight=4 PerfScore 13.00
G_M6489_IG13: ;; offset=00E0H
C5FD10442460 vmovupd ymm0, ymmword ptr[rsp+60H]
C5FD1106 vmovupd ymmword ptr[rsi], ymm0
488BC6 mov rax, rsi
;; size=13 bbWeight=1 PerfScore 6.25
G_M6489_IG14: ;; offset=00EDH
C5F877 vzeroupper
4881C498000000 add rsp, 152
5B pop rbx
5D pop rbp
5E pop rsi
5F pop rdi
C3 ret
;; size=15 bbWeight=1 PerfScore 4.25
RWD00 dq 0000000000000000h, 0101010101010101h, 0202020202020202h, 0303030303030303h
; Total bytes of code 252
It's a Checked build, so the asserts are there, but I assume it still does not use the intrinsified path.
As explained, this is because you're crossing lanes. For `upper`, you're selecting `lower[2]` and `lower[3]`.
Change it to:
return Vector256.Shuffle(vec,
Vector256.Create((byte)0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19));
That way it's selecting `lower[0]`, `lower[1]` and `upper[18 - 16]`, `upper[19 - 16]`.
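A quick way to check a byte mask for this condition before handing it to `Vector256.Shuffle` is sketched below. `LaneCheck.CrossesLanes` is a hypothetical helper, not a runtime API and not the JIT's actual check; it assumes every index is in range (below 32) and simply tests whether each index points into the same 16-byte lane as the element it fills.

```csharp
public static class LaneCheck
{
    // Returns true if any index selects a byte from the other 128-bit lane.
    // Assumes a 32-element mask with all indices in the range 0..31.
    public static bool CrossesLanes(byte[] indices)
    {
        for (int i = 0; i < indices.Length; i++)
        {
            if (indices[i] / 16 != i / 16)
            {
                return true;
            }
        }
        return false;
    }
}
```

For the two masks in this thread, the original one (0, 1, 2, 3 repeated eight times each) crosses lanes in its upper half, while the adjusted one (0, 1, 18, 19) does not.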
@tannergooding can you set a milestone and update area path here as appropriate?
Various cases were improved for AVX-512 capable hardware where newer instructions are available.
Other cases still expect constant inputs.
This was resolved with https://github.com/dotnet/runtime/pull/102702
Detected in https://github.com/dotnet/runtime/pull/72788
Repro
Disassembly
cc @tannergooding @EgorBo
category:cq theme:vector-codegen skill-level:intermediate cost:medium impact:small