dotnet / runtime

Block layout of partition loop not ideal #108794

Open a74nh opened 1 week ago

a74nh commented 1 week ago

Consider:

    public static unsafe ulong PartitionScalar(ref uint* input, ref uint* left, ref uint* right, int length)
    {
      long i = 0;

      // Position within the output arrays.
      ulong index_left = 0;
      ulong index_right = 0;

      uint first = input[0];

      for (i = 0; i < length; i++)
      {
        if (input[i] < first)
        {
          left[index_left] = input[i];
          index_left++;
        }
        else
        {
          right[index_right] = input[i];
          index_right++;
        }
      }

      return index_right;
    }

All entries less than input[0] go into left; everything else goes into right. This is the main routine inside a quicksort.
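
For context, a minimal sketch of a caller (the Demo wrapper and the array contents are illustrative, not from the report). Note that fixed-statement pointers are readonly, so they have to be copied into locals before being passed by ref:

    static unsafe void Demo()
    {
      uint[] data = { 5, 1, 9, 3, 7 };
      uint[] left = new uint[data.Length];
      uint[] right = new uint[data.Length];

      fixed (uint* pIn = data, pLeft = left, pRight = right)
      {
        // fixed-statement pointers are readonly, so copy them into locals
        // before passing them by ref.
        uint* input = pIn, l = pLeft, r = pRight;
        ulong nRight = PartitionScalar(ref input, ref l, ref r, data.Length);

        // left[0 .. length - nRight) now holds the values < data[0];
        // right[0 .. nRight) holds the rest, including data[0] itself.
      }
    }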

On Arm64 this runs at half the performance of the equivalent C++ version compiled by GCC. I suspect x86 has similar issues, but I haven't tried it yet.
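
A rough BenchmarkDotNet harness for reproducing this kind of measurement might look like the following sketch (the class name, buffer size, and seed are illustrative assumptions; the half-speed figure above is from separate runs, not this sketch):

    using System;
    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;

    public unsafe class PartitionBench
    {
      private uint[] _input, _left, _right;

      [Params(1 << 16)]
      public int Length;

      [GlobalSetup]
      public void Setup()
      {
        var rng = new Random(42);
        _input = new uint[Length];
        _left = new uint[Length];
        _right = new uint[Length];
        for (int i = 0; i < Length; i++)
          _input[i] = (uint)rng.Next();
      }

      [Benchmark]
      public ulong Scalar()
      {
        // Assumes the PartitionScalar method above is accessible here.
        fixed (uint* pIn = _input, pL = _left, pR = _right)
        {
          uint* input = pIn, l = pL, r = pR;
          return PartitionScalar(ref input, ref l, ref r, Length);
        }
      }

      public static void Main() => BenchmarkRunner.Run<PartitionBench>();
    }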

The C# disassembly looks reasonable:

; Assembly listing for method sveincsharp.Partition:TestPartitionScalar(System.Span`1[uint],System.Span`1[uint],System.Span`1[uint],int):ulong (FullOpts)
; Emitting BLENDED_CODE for generic ARM64 - Unix
; FullOpts code
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 3 single block inlinees; 1 inlinees without PGO data

G_M000_IG01:                ;; offset=0x0000
            stp     fp, lr, [sp, #-0x30]!
            mov     fp, sp
            stp     xzr, xzr, [fp, #0x20]
            str     xzr, [fp, #0x18]

G_M000_IG02:                ;; offset=0x0010
            stp     x2, x0, [fp, #0x20]
            str     x4, [fp, #0x18]
            mov     x1, xzr
            mov     x3, xzr
            ldr     w5, [x0]
            sxtw    x6, w6
            cmp     x6, #0
            ble     G_M000_IG08

G_M000_IG03:                ;; offset=0x0030
            align   [0 bytes for IG04]
            align   [0 bytes]
            align   [0 bytes]
            align   [0 bytes]

G_M000_IG04:                ;; offset=0x0030
            ldr     w7, [x0]
            cmp     w7, w5
            blo     G_M000_IG06

G_M000_IG05:                ;; offset=0x003C
            str     w7, [x4, x3, LSL #2]
            add     x3, x3, #1
            b       G_M000_IG07

G_M000_IG06:                ;; offset=0x0048
            str     w7, [x2, x1, LSL #2]
            add     x1, x1, #1

G_M000_IG07:                ;; offset=0x0050
            add     x0, x0, #4
            sub     x6, x6, #1
            cbnz    x6, G_M000_IG04

G_M000_IG08:                ;; offset=0x005C
            mov     x0, x3

G_M000_IG09:                ;; offset=0x0060
            ldp     fp, lr, [sp], #0x30
            ret     lr

; Total bytes of code 104

However, the C++ version is a little more cunning:

0000000000000ba0 <_Z15partitionScalarPjS_S_j>:
 ba0:   b9400007    ldr w7, [x0]
 ba4:   aa0003e6    mov x6, x0
 ba8:   2a0303e8    mov w8, w3
 bac:   34000283    cbz w3, bfc <_Z15partitionScalarPjS_S_j+0x5c>
 bb0:   2a0703e4    mov w4, w7
 bb4:   d2800000    mov x0, #0x0                    // #0
 bb8:   d2800005    mov x5, #0x0                    // #0
 bbc:   d2800003    mov x3, #0x0                    // #0
 bc0:   14000007    b   bdc <_Z15partitionScalarPjS_S_j+0x3c>

# "Then" case
 bc4:   91000463    add x3, x3, #0x1
 bc8:   b8257824    str w4, [x1, x5, lsl #2]
 bcc:   910004a5    add x5, x5, #0x1
 bd0:   eb08007f    cmp x3, x8
 bd4:   54000120    b.eq    bf8 <_Z15partitionScalarPjS_S_j+0x58>  // b.none

# Start of the loop
 bd8:   b86378c4    ldr w4, [x6, x3, lsl #2]
 bdc:   6b0400ff    cmp w7, w4
 be0:   54ffff28    b.hi    bc4 <_Z15partitionScalarPjS_S_j+0x24>  // b.pmore

# "Else" case
 be4:   91000463    add x3, x3, #0x1
 be8:   b8207844    str w4, [x2, x0, lsl #2]
 bec:   91000400    add x0, x0, #0x1
 bf0:   eb08007f    cmp x3, x8
 bf4:   54ffff21    b.ne    bd8 <_Z15partitionScalarPjS_S_j+0x38>  // b.any

 bf8:   d65f03c0    ret
 bfc:   d2800000    mov x0, #0x0                    // #0
 c00:   d65f03c0    ret

Due to the block ordering, the GCC version takes fewer branches.

In coreclr, there are always two taken jumps per loop iteration: either the conditional branch into G_M000_IG06 or the unconditional jump over it, plus the loop-back cbnz.

In GCC's layout, there are zero to two jumps per loop iteration: each arm either falls through into the load/compare or branches straight back to it, so a steady-state iteration takes only one.

This difference in block order has a huge impact, halving the performance of the coreclr version.
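
To make the layout difference concrete, here is an illustration (not from the issue): a C# rendition of GCC's block order, using gotos to mirror the assembly labels above. Each arm either falls through into the load/compare or jumps straight back to it:

    static unsafe ulong PartitionShaped(uint* input, uint* left, uint* right, int length)
    {
      ulong iLeft = 0, iRight = 0;
      long i = 0;
      if (length == 0) return 0;
      uint first = input[0];
      uint v;
      goto loopHead;

    thenCase:             // bc4: store into 'left', then check for the end
      left[iLeft++] = v;
      if (++i == length) goto done;
      // falls straight into the load/compare: no extra jump

    loopHead:             // bd8/bdc: load the next element and compare
      v = input[i];
      if (v < first) goto thenCase;

      // else arm (be4): reached by falling through the compare
      right[iRight++] = v;
      if (++i != length) goto loopHead;

    done:                 // bf8
      return iRight;
    }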

dotnet-policy-service[bot] commented 1 week ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch. See info in area-owners.md if you want to be subscribed.

kunalspathak commented 1 week ago

You are right about the code produced when tiering is disabled. However, if we let tiering run and the JIT collect profile data, we make the better decision of pushing the cold block out of the loop.

As you can see below, the profile tells us that BB07 is less likely to execute, so we push it towards the bottom, removing the extra unconditional branch.

(image: Tier 1 flowgraph, with the cold BB07 pushed to the bottom of the layout)

Diffs comparison of FullOpts vs. Tier 1.
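
For reference, a minimal warm-up sketch for observing the Tier 1 code locally (the call count and sleep are rough margins, not exact thresholds):

    static unsafe void WarmUpToTier1()
    {
      uint[] buf = new uint[1024], l = new uint[1024], r = new uint[1024];

      // Tier 0 promotes a method after roughly 30 calls; go well past that.
      for (int call = 0; call < 200; call++)
      {
        fixed (uint* pIn = buf, pL = l, pR = r)
        {
          uint* input = pIn, left = pL, right = pR;
          PartitionScalar(ref input, ref left, ref right, buf.Length);
        }
      }

      // Tier 1 compilation happens on a background thread; give it a moment.
      System.Threading.Thread.Sleep(500);

      // Later calls run the Tier 1 (profile-informed) code; dump it with
      // e.g. DOTNET_JitDisasm=PartitionScalar on a recent runtime.
    }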

As for the compactness/reuse portion of the loop, that is something we can investigate further.

neon-sunset commented 1 week ago

when tiering is disabled

Unfortunately, I've been noticing this and many similar block ordering issues too (with .NET 9 regressing some further, to the point where careful ordering of the code is no longer sufficient), and tiering does not apply to NativeAOT, which is an important target. I hope that compiler gaps like these will not be relegated to DPGO cleaning them up.

Thanks

kunalspathak commented 1 week ago

@dotnet/jit-contrib

amanasifkhalid commented 1 week ago

Unfortunately, I've been noticing this and many similar block ordering issues too (with .NET 9 regressing some further, to the point where careful ordering of the code is no longer sufficient), and tiering does not apply to NativeAOT, which is an important target. I hope that compiler gaps like these will not be relegated to DPGO cleaning them up.

If you have any examples handy, I'd like to take a look. We have plans to address block layout more aggressively in .NET 10 (#107749), and it would be nice to have candidates like the above to improve upon. Our block layout plans are quite reliant on profile data, though the JIT frontend has some tricks to determine which blocks are important in the absence of profile data (for example, see Compiler::optSetBlockWeights, though this phase needs quite a bit of work, too).

neon-sunset commented 1 week ago

If you have any examples handy, I'd like to take a look.

Don't have anything on hand right now but will keep an eye out next time I open Ghidra 🙂. Some of them look like this one: https://github.com/dotnet/runtime/issues/93536 except torn loops or "why does this even have jump threading" cases appear under different conditions now. In any case, thanks for looking into this.

a74nh commented 1 week ago

You are right about the code produced when tiering is disabled. However, if we let tiering run and the JIT collect profile data, we make the better decision of pushing the cold block out of the loop.

This is good to see - I didn't realise profile data was being used this way.

However, this shouldn't need profile data to optimise.

We have plans to address block layout more aggressively in .NET 10 (#107749), and it would be nice to have candidates like the above to improve upon.

Ok, great. If I see anything else, I'll make sure to raise issues or comment here.