dotnet / runtime


`CINC`/`CSEL` not emitted inside loops; jumps over single-instruction blocks are emitted instead #96380

Open neon-sunset opened 10 months ago

neon-sunset commented 10 months ago

Description

It appears that for a straightforward compare-then-increment, .NET emits a branch instead of `cinc` when a method containing such code is inlined inside a loop body.

Analysis

Given a method RuneLength:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int RuneLength(in byte value)
{
    var lzcnt = (uint)BitOperations.LeadingZeroCount(~((uint)value << 24));
    if (lzcnt is 0) lzcnt++;

    return (int)lzcnt;
}

This compiles to:

G_M000_IG01:                ;; offset=0x0000
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp

G_M000_IG02:                ;; offset=0x0008
            ldrb    w0, [x0]
            lsl     w0, w0, #24
            mvn     w0, w0
            clz     w0, w0
            mov     w1, #1
            cmp     w0, #0
            csel    w0, w0, w1, ne ;; <-- Could have been just cinc with mov 1 elided - we're incrementing zero

G_M000_IG03:                ;; offset=0x0024
            ldp     fp, lr, [sp], #0x10
            ret     lr

; Total bytes of code 44

We can see that .NET emits `mov` + `cmp` + `csel` here. It's not `cinc`, but it's a good start.
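
For comparison, a branch- and mov-free form using cinc could look roughly like this (a hand-written sketch of the hoped-for codegen, not actual JIT output):

            ldrb    w0, [x0]
            lsl     w0, w0, #24
            mvn     w0, w0
            clz     w0, w0
            cmp     w0, #0
            cinc    w0, w0, eq   ;; w0 = (w0 == 0) ? w0 + 1 : w0, so the mov #1 disappears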

However, if the method is inlined inside a loop body, the codegen is different. Consider the below method:

static int Iterate(ref byte ptr, ref byte end)
{
    var acc = 0;
    while (Unsafe.IsAddressLessThan(ref ptr, ref end))
    {
        var length = RuneLength(in ptr);
        acc += length;
        ptr = ref Unsafe.Add(ref ptr, length);
    }

    return acc;
}

Instead of the pattern above, the codegen changes to a cbnz over a mov #1 block. This is more compact but also more expensive: the branch goes through branch execution units, which have lower per-cycle throughput than the integer units that execute csel (on modern ARM cores like Firestorm or Cortex-X1):

G_M000_IG01:                ;; offset=0x0000
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp

G_M000_IG02:                ;; offset=0x0008
            mov     w2, wzr
            cmp     x0, x1
            bhs     G_M000_IG06
            align   [0 bytes for IG03]
            align   [0 bytes]
            align   [0 bytes]
            align   [0 bytes]

G_M000_IG03:                ;; offset=0x0014
            ldrb    w3, [x0]
            lsl     w3, w3, #24
            mvn     w3, w3
            clz     w3, w3
            cbnz    w3, G_M000_IG05 ;; <-- Could have been cmp + cinc

G_M000_IG04:                ;; offset=0x0028
            mov     w3, #1

G_M000_IG05:                ;; offset=0x002C
            add     w2, w2, w3
            sxtw    x3, w3
            add     x0, x0, x3
            cmp     x0, x1
            blo     G_M000_IG03

G_M000_IG06:                ;; offset=0x0040
            mov     w0, w2

G_M000_IG07:                ;; offset=0x0044
            ldp     fp, lr, [sp], #0x10
            ret     lr

; Total bytes of code 76
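
For reference, the hoped-for loop body would replace the cbnz + mov pair with a compare and a conditional increment, along these lines (a hand-written sketch, not actual JIT output; it trades the branch for one extra ALU instruction):

loop:
            ldrb    w3, [x0]
            lsl     w3, w3, #24
            mvn     w3, w3
            clz     w3, w3
            cmp     w3, #0
            cinc    w3, w3, eq   ;; w3 = (w3 == 0) ? 1 : w3, replaces cbnz + mov
            add     w2, w2, w3
            sxtw    x3, w3
            add     x0, x0, x3
            cmp     x0, x1
            blo     loop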

Configuration

.NET SDK:
 Version:           8.0.100
 Commit:            57efcf1350
 Workload version:  8.0.100-manifests.71b9f198

Runtime Environment:
 OS Name:     Mac OS X
 OS Version:  14.1
 OS Platform: Darwin
 RID:         osx-arm64
 Base Path:   /usr/local/share/dotnet/sdk/8.0.100/

Regression?

No

Happy New Year!

ghost commented 10 months ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch. See info in area-owners.md if you want to be subscribed.

jakobbotsch commented 10 months ago

We do not emit conditional select instructions inside loops today because we do not have the necessary framework in place to make a proper prediction on whether that is profitable or not. Branches can (and do in real world examples) outperform conditional select instructions due to branch prediction. See https://github.com/dotnet/runtime/issues/55364#issuecomment-1092894196 for more details.

neon-sunset commented 10 months ago

This might be true for older ARM cores, but not necessarily for newer ones. For example, Firestorm (2020) and later cores used for osx-arm64 have a reciprocal throughput of 0.33 for cinc (3 per cycle), which is better than branches at 0.5 for not-taken and 1 for taken. In this instance it should therefore, in theory, be unconditionally cheaper at equal codegen size. https://dougallj.github.io/applecpu/measurements/firestorm/CINC_64.html

The Cortex-X1 software optimization guide likewise lists a throughput of 4 conditional selects/compares per cycle versus 2 branch instructions. https://documentation-service.arm.com/static/6238b0f88804d00769e9dee6

Reading through the linked comment, the advice given by ARM may not necessarily apply (or perhaps something was lost in communication?). Given similar codegen size, branches are likely to be more profitable under the following conditions:

  • Blocks under conditionals span multiple (not trivially cheap) instructions
  • Loop body has only a single taken or two not-taken branches
  • Condition rarely or never changes, or changes in a predictable pattern, so that the misprediction rate is minimal

Therefore, there may be a performance win to be had by having a simple heuristic that expands the instances where csinc/cinc/csel/etc. kick in.

jakobbotsch commented 10 months ago

Reading through the linked comment, the advice given by ARM may not necessarily apply (or perhaps something was lost in communication?). Given similar codegen size, branches are likely to be more profitable under the following conditions:

  • Blocks under conditionals span multiple (not trivially cheap) instructions
  • Loop body has only a single taken or two not-taken branches
  • Condition rarely or never changes, or changes in a predictable pattern, so that the misprediction rate is minimal

Another important consideration is the dependency chain the conditional instruction is involved in and whether or not it is on the critical path of the loop. We do not currently have something in place to evaluate that.
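
For this particular loop that chain is visible in the listing above; the annotations below are an illustrative reading added here, not part of the JIT dump:

            ldrb    w3, [x0]         ;; load address depends on the previous iteration's x0
            lsl     w3, w3, #24
            mvn     w3, w3
            clz     w3, w3
            cbnz    w3, G_M000_IG05  ;; predicted branch: on the fall-through path,
            mov     w3, #1           ;; w3 = 1 no longer depends on the clz result
G_M000_IG05:
            add     w2, w2, w3       ;; accumulator, not on the address chain
            sxtw    x3, w3
            add     x0, x0, x3       ;; next iteration's x0 depends on w3; a csel here
                                     ;; would keep the full clz -> select latency on this
                                     ;; loop-carried chain every iteration
            cmp     x0, x1
            blo     G_M000_IG03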

Therefore, there may be a performance win to be had by having a simple heuristic that expands the instances where csinc/cinc/csel/etc. kick in.

Yes, performance will benefit in many cases from using conditional instructions within loops. I think getting the heuristic right is not going to be simple -- in particular around the dependency chain and around ensuring the quality of the PGO data we want to base the decisions on. And if we get the decision wrong, it can have very significant negative impact on performance (#82106 was a 40% regression in a core span function from a single misplaced if-conversion).

Note that @a74nh works for Arm and originally implemented if-conversion within RyuJIT. He also did some experiments in https://github.com/dotnet/performance/pull/3078.

neon-sunset commented 10 months ago

Yes, performance will benefit in many cases from using conditional instructions within loops. I think getting the heuristic right is not going to be simple -- in particular around the dependency chain and around ensuring the quality of the PGO data we want to base the decisions on. And if we get the decision wrong, it can have very significant negative impact on performance (https://github.com/dotnet/runtime/issues/82106 was a 40% regression in a core span function from a single misplaced if-conversion).

That is true, thanks. This issue also includes a note about emitting mov, cmp, csel where cmp, cinc would suffice. Do you want me to submit a separate issue for that, or does this one suffice?

jakobbotsch commented 10 months ago

Feel free to open a separate issue for it. Hopefully I can take a look soon. We should be able to emit csinc since #91262, so I'm not sure what goes wrong here. Even the following does not produce csinc:

[MethodImpl(MethodImplOptions.NoInlining)]
public static int RuneLength(in byte value)
{
    int lzcnt = BitOperations.LeadingZeroCount(~((uint)value << 24));
    return lzcnt == 0 ? lzcnt + 1 : lzcnt;
}

yet this (incorrect) rewrite does:

[MethodImpl(MethodImplOptions.NoInlining)]
public static int RuneLength(in byte value)
{
    int lzcnt = BitOperations.LeadingZeroCount(~((uint)value << 24));
    return lzcnt == 0 ? lzcnt : lzcnt + 1;
}

I think it should be a simple fix.
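
For reference, both ternaries should map onto a single csinc, differing only in the condition (hand-written sketch, not JIT output):

            ;; lzcnt == 0 ? lzcnt + 1 : lzcnt
            cmp     w0, #0
            csinc   w0, w0, w0, ne   ;; ne: keep w0, otherwise w0 + 1

            ;; lzcnt == 0 ? lzcnt : lzcnt + 1
            cmp     w0, #0
            csinc   w0, w0, w0, eq   ;; eq: keep w0, otherwise w0 + 1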

neon-sunset commented 10 months ago

@jakobbotsch I have updated the issue to reflect the possible improvement inside the loop, versus a separate issue for the cinc/csel form. Thank you for looking into this!

jakobbotsch commented 3 months ago

Won't get to this one in 9.0.