dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.44k stars 4.76k forks source link

cmov not being used in nested hand-unrolled loop #109940

Closed kg closed 4 days ago

kg commented 1 week ago

Description

I'm attempting to unroll a fixed-size scalar search loop so that it will generate a chain of cmovs. It's possible to get this to happen in C compiled by clang and msvc, but not C# on .NET 9. Instead, the best-case is a chain of movzx; cmp; jne; mov opcode clusters, i.e.

       movzx    r11, byte  ptr [rcx]
       cmp      r11d, r8d
       jne      SHORT G_M000_IG08

G_M000_IG07:                ;; offset=0x00C0
       xor      edx, edx

G_M000_IG08:                ;; offset=0x00C2
       movzx    r11, byte  ptr [rcx+0x01]
       cmp      r11d, r8d
       jne      SHORT G_M000_IG10

G_M000_IG09:                ;; offset=0x00CC
       mov      edx, 1

(For reference, this is the same method as https://github.com/dotnet/runtime/issues/108092 but with the vectorization mostly disabled.)

Configuration

.NET 9 Windows 11 Pro build 22631 x64 (Ryzen 7950X) Disasmo spew:

; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; FullOpts code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 10 single block inlinees; 22 inlinees without PGO data

Regression?

No, this didn't work on 8 either.

Reproduction code

Here's the simplest/easiest-to-fiddle-with version of the repro I can come up with that's 'real'. the C version in simdhash is similar and generates a chain of cmovs the last time I checked. The idea here is to fully pipeline the fixed-width (14 elements) search loop. There's no guarantee it would be faster (it was in C), but I can't actually measure it to see whether it's faster because I can't get it to generate the cmovs :)

Vector128<byte> bucket = ...;
byte needle = ...;
var haystack = (byte*)Unsafe.AsPointer(ref bucket);
var result = 32;
Lane(haystack, needle, ref result, 13);
Lane(haystack, needle, ref result, 12);
Lane(haystack, needle, ref result, 11);
Lane(haystack, needle, ref result, 10);
Lane(haystack, needle, ref result, 9);
Lane(haystack, needle, ref result, 8);
Lane(haystack, needle, ref result, 7);
Lane(haystack, needle, ref result, 6);
Lane(haystack, needle, ref result, 5);
Lane(haystack, needle, ref result, 4);
Lane(haystack, needle, ref result, 3);
Lane(haystack, needle, ref result, 2);
Lane(haystack, needle, ref result, 1);
Lane(haystack, needle, ref result, 0);
return result;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
static void Lane (byte* haystack, byte needle, ref int result, int index) {
    if (haystack[index] == needle)
        result = index;
}

Additional notes

I've tried maybe a dozen different formulations/structures for this code and they all generate similar or worse code, none of them generate cmovs. I've verified that cmovs are generated in other parts of my code - this method even has one further up:

       movzx    r8, al
       mov      ecx, 255
       test     r8d, r8d
       cmovne   ecx, r8d

I've also tried reducing the number of Lane calls to see if that simplification would light up the cmov intrinsic but it didn't. I've also tried manually inlining Lane to no effect.

dotnet-policy-service[bot] commented 1 week ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

jakobbotsch commented 6 days ago

Do you have a self-contained example? For

public static unsafe int Foo(Vector128<byte> bucket, byte needle)
{
    var haystack = (byte*)Unsafe.AsPointer(ref bucket);
    var result = 32;
    Lane(haystack, needle, ref result, 13);
    Lane(haystack, needle, ref result, 12);
    Lane(haystack, needle, ref result, 11);
    Lane(haystack, needle, ref result, 10);
    Lane(haystack, needle, ref result, 9);
    Lane(haystack, needle, ref result, 8);
    Lane(haystack, needle, ref result, 7);
    Lane(haystack, needle, ref result, 6);
    Lane(haystack, needle, ref result, 5);
    Lane(haystack, needle, ref result, 4);
    Lane(haystack, needle, ref result, 3);
    Lane(haystack, needle, ref result, 2);
    Lane(haystack, needle, ref result, 1);
    Lane(haystack, needle, ref result, 0);
    return result;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static unsafe void Lane(byte* haystack, byte needle, ref int result, int index)
{
    if (haystack[index] == needle)
        result = index;
}

I get

; Method Program:Foo(System.Runtime.Intrinsics.Vector128`1[ubyte],ubyte):int (FullOpts)
G_M56088_IG01:  ;; offset=0x0000
                        ;; size=0 bbWeight=1 PerfScore 0.00

G_M56088_IG02:  ;; offset=0x0000
       mov      rax, rcx
       mov      ecx, 32
       movzx    r8, byte  ptr [rax+0x0D]
       movzx    rdx, dl
       mov      r10d, 13
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x0C]
       mov      r10d, 12
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x0B]
       mov      r10d, 11
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x0A]
       mov      r10d, 10
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x09]
       mov      r10d, 9
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x08]
       mov      r10d, 8
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x07]
       mov      r10d, 7
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x06]
       mov      r10d, 6
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x05]
       mov      r10d, 5
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x04]
       mov      r10d, 4
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x03]
       mov      r10d, 3
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x02]
       mov      r10d, 2
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x01]
       mov      r10d, 1
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    rax, byte  ptr [rax]
       xor      r8d, r8d
       cmp      eax, edx
       cmove    ecx, r8d
       mov      eax, ecx
                        ;; size=259 bbWeight=1 PerfScore 39.50

G_M56088_IG03:  ;; offset=0x0103
       ret      
                        ;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 260

on main.

Note that if your code is inside a loop then it is expected that we do not emit conditional selects, see https://github.com/dotnet/runtime/issues/96380#issuecomment-1873264530.

kg commented 6 days ago

It's due to a loop, thank you. We can close this if that limitation has a tracking issue.