dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/

RyuJit: avoid conditional jumps using cmov and similar instructions #6749

Closed svick closed 1 year ago

svick commented 7 years ago

Conditional jumps, especially those that are hard to predict, are fairly expensive, so they should be avoided where possible. One way to avoid them is to use conditional moves and similar instructions (like sete). As far as I can tell, RyuJit never does this, and I think it should.

For example, take these two methods:

[MethodImpl(MethodImplOptions.NoInlining)]
static long sete_or_mov(bool cond) {
    return cond ? 4 : 0;
}

[MethodImpl(MethodImplOptions.NoInlining)]
static long cmov(long longValue) {
    long tmp1 = longValue & 0x00000000ffffffff;
    return tmp1 == 0 ? longValue : tmp1;
}

For both of them, RyuJit generates a conditional jump:

; Assembly listing for method Program:sete_or_mov(bool):long
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,   3  )    bool  ->  rcx
;  V01 tmp0         [V01,T01] (  3,   2  )     int  ->  rax
;# V02 OutArgs      [V02    ] (  1,   1  )  lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 0

G_M60330_IG01:

G_M60330_IG02:
       84C9                 test     cl, cl
       7504                 jne      SHORT G_M60330_IG03
       33C0                 xor      eax, eax
       EB05                 jmp      SHORT G_M60330_IG04

G_M60330_IG03:
       B804000000           mov      eax, 4

G_M60330_IG04:
       4863C0               movsxd   rax, eax

G_M60330_IG05:
       C3                   ret

; Total bytes of code 17, prolog size 0 for method Program:sete_or_mov(bool):long
; ============================================================
; Assembly listing for method Program:cmov(long):long
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,   3.5)    long  ->  rcx
;  V01 loc0         [V01,T01] (  3,   2.5)    long  ->  rax
;# V02 OutArgs      [V02    ] (  1,   1  )  lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 0

G_M53075_IG01:

G_M53075_IG02:
       B8FFFFFFFF           mov      eax, 0xFFFFFFFF
       4823C1               and      rax, rcx
       4885C0               test     rax, rax
       7401                 je       SHORT G_M53075_IG04

G_M53075_IG03:
       C3                   ret

G_M53075_IG04:
       488BC1               mov      rax, rcx

G_M53075_IG05:
       C3                   ret

; Total bytes of code 18, prolog size 0 for method Program:cmov(long):long
; ============================================================

For comparison, here are the same methods compiled with Clang and GCC at -O1 (via Compiler Explorer):

GCC 6.2:

sete_or_mov(bool):
        test    dil, dil
        setne   al
        movzx   eax, al
        sal     rax, 2
        ret
cmov(unsigned long):
        mov     eax, edi
        test    rax, rax
        cmove   rax, rdi
        ret

Clang 3.9.0:

sete_or_mov(bool):                       # @sete_or_mov(bool)
        movzx   eax, dil
        shl     rax, 2
        ret

cmov(unsigned long):                               # @cmov(unsigned long)
        mov     eax, edi
        mov     ecx, 4294967295
        and     rcx, rdi
        cmove   rax, rdi
        ret
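
The C++ source behind these listings is not included above. As a rough sketch, it was presumably a direct port of the two C# methods along the following lines. The parameter type unsigned long is taken from the cmov(unsigned long) signature in the listings; the return types are an assumption, and the code assumes a 64-bit unsigned long (as on x86-64 Linux, which the rdi/edi argument registers suggest).

```cpp
// Hypothetical C++ port of the two C# methods, for reproducing the
// GCC/Clang output. Not the original Compiler Explorer input, which
// is not shown in this issue.

// Mirrors the C# sete_or_mov: yields 4 when cond is true, otherwise 0.
long sete_or_mov(bool cond) {
    return cond ? 4 : 0;
}

// Mirrors the C# cmov: keeps the low 32 bits unless they are all zero,
// in which case the full 64-bit value is returned unchanged.
// Assumes unsigned long is 64 bits wide (LP64 targets).
unsigned long cmov(unsigned long longValue) {
    unsigned long tmp1 = longValue & 0x00000000ffffffff;
    return tmp1 == 0 ? longValue : tmp1;
}
```

Both functions are a single select between two already-computed values, which is the shape a compiler can lower to setcc or cmovcc instead of a branch.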

category:cq theme:basic-cq skill-level:expert cost:large impact:small

JosephTremoulet commented 7 years ago

@tannergooding, translating "caller" and "M" to your example, that's saying that (if both are relaxed) a fault in MyMethod may result in some arbitrary subset of Select's side-effects being suppressed. The fault in MyMethod still must be made visible.

EgorBo commented 4 years ago

HW_INTRINSIC-based implementation for CMOVnn: https://github.com/EgorBo/runtime-1/commit/1271fe536cc3274867a7306424d45c8db76be8ca

static int Test1(int x)
{
    return x == 42 ? 1000 : 2000;
}

static int Test2(int x, int a, int b)
{
    return x == 42 ? a : b;
}

; Method Tests:Test1(int):int
G_M56601_IG01:
G_M56601_IG02:
       83F92A               cmp      ecx, 42
       B8E8030000           mov      eax, 0x3E8
       BAD0070000           mov      edx, 0x7D0
       0F45C2               cmovne   eax, edx
G_M56601_IG03:
       C3                   ret      
; Total bytes of code: 17

; Method Tests:Test2(int,int,int):int
G_M50938_IG01:
G_M50938_IG02:
       83F92A               cmp      ecx, 42
       8BC2                 mov      eax, edx
       410F45C0             cmovne   eax, r8d
G_M50938_IG03:
       C3                   ret      
; Total bytes of code: 10

Works better with PGO (COMPlus_TieredPGO=1) 🙂 [image]