Closed kg closed 4 days ago
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
Do you have a self-contained example? For
public static unsafe int Foo(Vector128<byte> bucket, byte needle)
{
var haystack = (byte*)Unsafe.AsPointer(ref bucket);
var result = 32;
Lane(haystack, needle, ref result, 13);
Lane(haystack, needle, ref result, 12);
Lane(haystack, needle, ref result, 11);
Lane(haystack, needle, ref result, 10);
Lane(haystack, needle, ref result, 9);
Lane(haystack, needle, ref result, 8);
Lane(haystack, needle, ref result, 7);
Lane(haystack, needle, ref result, 6);
Lane(haystack, needle, ref result, 5);
Lane(haystack, needle, ref result, 4);
Lane(haystack, needle, ref result, 3);
Lane(haystack, needle, ref result, 2);
Lane(haystack, needle, ref result, 1);
Lane(haystack, needle, ref result, 0);
return result;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static unsafe void Lane(byte* haystack, byte needle, ref int result, int index)
{
if (haystack[index] == needle)
result = index;
}
I get
; Method Program:Foo(System.Runtime.Intrinsics.Vector128`1[ubyte],ubyte):int (FullOpts)
G_M56088_IG01: ;; offset=0x0000
;; size=0 bbWeight=1 PerfScore 0.00
G_M56088_IG02: ;; offset=0x0000
mov rax, rcx
mov ecx, 32
movzx r8, byte ptr [rax+0x0D]
movzx rdx, dl
mov r10d, 13
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x0C]
mov r10d, 12
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x0B]
mov r10d, 11
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x0A]
mov r10d, 10
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x09]
mov r10d, 9
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x08]
mov r10d, 8
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x07]
mov r10d, 7
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x06]
mov r10d, 6
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x05]
mov r10d, 5
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x04]
mov r10d, 4
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x03]
mov r10d, 3
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x02]
mov r10d, 2
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x01]
mov r10d, 1
cmp r8d, edx
cmove ecx, r10d
movzx rax, byte ptr [rax]
xor r8d, r8d
cmp eax, edx
cmove ecx, r8d
mov eax, ecx
;; size=259 bbWeight=1 PerfScore 39.50
G_M56088_IG03: ;; offset=0x0103
ret
;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 260
on main.
Note that if your code is inside a loop then it is expected that we do not emit conditional selects, see https://github.com/dotnet/runtime/issues/96380#issuecomment-1873264530.
It's due to a loop, thank you. We can close this if that limitation has a tracking issue.
Description
I'm attempting to unroll a fixed-size scalar search loop so that it will generate a chain of cmovs. It's possible to get this to happen in C compiled by clang and msvc, but not C# on .NET 9. Instead, the best-case is a chain of
movzx; cmp; jne; mov
opcode clusters, i.e.(For reference, this is the same method as https://github.com/dotnet/runtime/issues/108092 but with the vectorization mostly disabled.)
Configuration
.NET 9 Windows 11 Pro build 22631 x64 (Ryzen 7950X) Disasmo spew:
Regression?
No, this didn't work on 8 either.
Reproduction code
Here's the simplest/easiest-to-fiddle-with version of the repro I can come up with that's 'real'. the C version in simdhash is similar and generates a chain of cmovs the last time I checked. The idea here is to fully pipeline the fixed-width (14 elements) search loop. There's no guarantee it would be faster (it was in C), but I can't actually measure it to see whether it's faster because I can't get it to generate the cmovs :)
Additional notes
I've tried maybe a dozen different formulations/structures for this code and they all generate similar or worse code, none of them generate cmovs. I've verified that cmovs are generated in other parts of my code - this method even has one further up:
I've also tried reducing the number of Lane calls to see if that simplification would light up the cmov intrinsic but it didn't. I've also tried manually inlining Lane to no effect.