[AArch64] Unnecessary spills in fully unrolled loop (SLP Vectorizer does not kick in due to cost modelling) #43563

Open Quuxplusone opened 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR44593
Status NEW
Importance P enhancement
Reported by Cesar Eduardo Barros (cesarb@cesarb.eti.br)
Reported on 2020-01-19 17:14:54 -0800
Last modified on 2021-05-25 01:16:20 -0700
Version trunk
Hardware PC Linux
CC florian_hahn@apple.com, llvm-bugs@lists.llvm.org, sjoerd.meijer@arm.com
Fixed by commit(s)
Attachments slp-repor.ll (10858 bytes, text/x-matlab)
Blocks
Blocked by
See also PR42987
The following code has several unnecessary spills and reloads:

    #define LEN 32

    #include <stddef.h>
    #include <stdint.h>

    struct buf {
        uint8_t buf[LEN];
    };

    uint8_t fn(struct buf *a, struct buf *b) {
        uint8_t tmp = 0;
        for (size_t i = 0; i < LEN; i++)
            tmp |= a->buf[i] ^ b->buf[i];
        return tmp;
    }

(The original code, in Rust, is at
https://github.com/cesarb/constant_time_eq/blob/0.1.5/src/lib.rs)

The generated code on clang 9.0.0 (from godbolt.org with -O3 --target=aarch64-unknown-linux-gnu) is:

fn:                                     // @fn
        sub     sp, sp, #128            // =128
        stp     x28, x27, [sp, #32]     // 16-byte Folded Spill
        stp     x26, x25, [sp, #48]     // 16-byte Folded Spill
        stp     x24, x23, [sp, #64]     // 16-byte Folded Spill
        stp     x22, x21, [sp, #80]     // 16-byte Folded Spill
        stp     x20, x19, [sp, #96]     // 16-byte Folded Spill
        stp     x29, x30, [sp, #112]    // 16-byte Folded Spill
        ldrb    w8, [x0]
        ldrb    w9, [x1]
        ldrb    w10, [x0, #1]
        ldrb    w11, [x1, #1]
        ldrb    w12, [x0, #2]
        ldrb    w13, [x1, #2]
        ldrb    w14, [x0, #3]
        ldrb    w15, [x1, #3]
        ldrb    w16, [x0, #4]
        ldrb    w17, [x1, #4]
        ldrb    w18, [x0, #5]
        ldrb    w2, [x1, #5]
        eor     w8, w9, w8
        eor     w9, w11, w10
        ldrb    w30, [x0, #14]
        stp     w9, w8, [sp, #24]       // 8-byte Folded Spill
        ldrb    w8, [x1, #14]
        ldrb    w10, [x0, #15]
        eor     w9, w13, w12
        ldrb    w12, [x1, #15]
        str     w9, [sp, #20]           // 4-byte Folded Spill
        eor     w9, w15, w14
        str     w9, [sp, #16]           // 4-byte Folded Spill
        eor     w9, w17, w16
        str     w9, [sp, #12]           // 4-byte Folded Spill
        eor     w9, w2, w18
        str     w9, [sp, #8]            // 4-byte Folded Spill
        eor     w17, w8, w30
        eor     w15, w12, w10
        ldrb    w11, [x0, #29]
        ldrb    w10, [x1, #29]
        ldrb    w9, [x0, #30]
        ldrb    w8, [x1, #30]
        ldrb    w3, [x0, #6]
        eor     w10, w10, w11
        ldrb    w4, [x1, #6]
        eor     w8, w8, w9
        ldp     w11, w9, [sp, #24]      // 8-byte Folded Reload
        ldrb    w5, [x0, #7]
        ldrb    w6, [x1, #7]
        ldrb    w7, [x0, #8]
        orr     w9, w11, w9
        ldr     w11, [sp, #20]          // 4-byte Folded Reload
        ldrb    w19, [x1, #8]
        ldrb    w20, [x0, #9]
        ldrb    w21, [x1, #9]
        orr     w9, w11, w9
        ldr     w11, [sp, #16]          // 4-byte Folded Reload
        ldrb    w22, [x0, #10]
        ldrb    w23, [x1, #10]
        eor     w2, w4, w3
        orr     w9, w11, w9
        ldr     w11, [sp, #12]          // 4-byte Folded Reload
        ldrb    w24, [x0, #11]
        ldrb    w25, [x1, #11]
        eor     w4, w6, w5
        orr     w9, w11, w9
        ldr     w11, [sp, #8]           // 4-byte Folded Reload
        ldrb    w26, [x0, #12]
        ldrb    w27, [x1, #12]
        eor     w6, w19, w7
        orr     w9, w11, w9
        orr     w9, w2, w9
        orr     w9, w4, w9
        ldrb    w28, [x0, #13]
        ldrb    w29, [x1, #13]
        eor     w19, w21, w20
        orr     w9, w6, w9
        eor     w21, w23, w22
        orr     w9, w19, w9
        ldrb    w14, [x0, #16]
        ldrb    w16, [x1, #16]
        eor     w23, w25, w24
        orr     w9, w21, w9
        ldrb    w18, [x0, #17]
        ldrb    w3, [x1, #17]
        ldrb    w5, [x0, #18]
        ldrb    w7, [x1, #18]
        eor     w25, w27, w26
        orr     w9, w23, w9
        eor     w27, w29, w28
        orr     w9, w25, w9
        orr     w9, w27, w9
        ldrb    w20, [x0, #19]
        ldrb    w22, [x1, #19]
        ldrb    w24, [x0, #20]
        ldrb    w26, [x1, #20]
        ldrb    w28, [x0, #21]
        ldrb    w29, [x1, #21]
        ldrb    w12, [x0, #22]
        eor     w14, w16, w14
        ldrb    w16, [x1, #22]
        orr     w9, w17, w9
        eor     w18, w3, w18
        ldrb    w3, [x0, #23]
        eor     w5, w7, w5
        ldrb    w7, [x1, #23]
        orr     w9, w15, w9
        orr     w9, w14, w9
        orr     w9, w18, w9
        eor     w20, w22, w20
        ldrb    w22, [x0, #24]
        eor     w24, w26, w24
        ldrb    w26, [x1, #24]
        eor     w28, w29, w28
        ldrb    w29, [x0, #25]
        eor     w13, w16, w12
        ldrb    w16, [x1, #25]
        orr     w9, w5, w9
        eor     w3, w7, w3
        ldrb    w7, [x0, #26]
        ldrb    w30, [x1, #26]
        orr     w9, w20, w9
        orr     w9, w24, w9
        orr     w9, w28, w9
        eor     w22, w26, w22
        eor     w16, w16, w29
        ldrb    w26, [x0, #27]
        ldrb    w29, [x1, #27]
        orr     w9, w13, w9
        eor     w7, w30, w7
        ldrb    w30, [x0, #28]
        ldrb    w12, [x1, #28]
        orr     w9, w3, w9
        orr     w9, w22, w9
        orr     w9, w16, w9
        eor     w26, w29, w26
        orr     w9, w7, w9
        ldrb    w11, [x0, #31]
        ldrb    w13, [x1, #31]
        eor     w12, w12, w30
        orr     w9, w26, w9
        orr     w9, w12, w9
        ldp     x29, x30, [sp, #112]    // 16-byte Folded Reload
        ldp     x20, x19, [sp, #96]     // 16-byte Folded Reload
        ldp     x22, x21, [sp, #80]     // 16-byte Folded Reload
        ldp     x24, x23, [sp, #64]     // 16-byte Folded Reload
        ldp     x26, x25, [sp, #48]     // 16-byte Folded Reload
        ldp     x28, x27, [sp, #32]     // 16-byte Folded Reload
        orr     w9, w10, w9
        orr     w8, w8, w9
        eor     w9, w13, w11
        orr     w0, w9, w8
        add     sp, sp, #128            // =128
        ret

It should be possible to avoid all these spills in the middle. The Rust code also has 16-element and 64-element versions. With 16 elements (#define LEN 16 in the C code) the spills do not happen, but only because there are enough registers; with 64 elements (#define LEN 64 in the C code) they do not happen because the code uses the NEON registers instead. GCC uses the NEON registers for all three versions of the code.

The expected result would be to either also vectorize the 32-element and 16-element versions (as both GCC and x86-64 clang do), or at least to reorder the operations in the fully unrolled loop so that there are no unnecessary spills.
Quuxplusone commented 4 years ago
This looks like a cost-modelling issue in the SLP vectoriser on AArch64. The
attached slp-repro.ll highlights the issue. On AArch64, the SLP vectoriser does
not kick in.

`bin/opt -slp-vectorizer -instcombine slp-repor.ll -S -mtriple=arm64-apple-iphoneos | bin/llc -o - -mtriple=arm64-apple-iphoneos` -> gives terrible code.

Now if we pretend to do SLP vectorization for X86, we get very nice code:

`bin/opt -slp-vectorizer -instcombine ~/Desktop/slp-repor.ll -S -mtriple=x86_64-apple-macos | bin/llc -o - -mtriple=arm64-apple-iphoneos` produces:

    ldp q0, q1, [x0]
    ldp q3, q2, [x1]
    eor v1.16b, v2.16b, v1.16b
    eor v0.16b, v3.16b, v0.16b
    orr v0.16b, v0.16b, v1.16b
    ext v1.16b, v0.16b, v0.16b, #8
    orr v0.16b, v0.16b, v1.16b
    ext v1.16b, v0.16b, v0.16b, #4
    orr v0.16b, v0.16b, v1.16b
    ext v1.16b, v0.16b, v0.16b, #2
    orr v0.16b, v0.16b, v1.16b
    dup v1.16b, v0.b[1]
    orr v0.16b, v0.16b, v1.16b
    umov    w0, v0.b[0]
    ret

Alternatively, we could avoid unrolling early and let the loop vectoriser deal with vectorization (e.g. pass -fno-unroll-loops to clang and we get reasonable code). The reason we vectorise with LEN = 64 is that we *do not* unroll before vectorization: the vectoriser kicks in and the vectorized loop is unrolled afterwards. There's already an issue discussing excessive unrolling (https://bugs.llvm.org/show_bug.cgi?id=42987) and I think this issue is a good example of harmful unrolling.
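For reference, a possible invocation along those lines (the report does not give an exact command; the source file name fn.c is hypothetical) would be `clang -O3 --target=aarch64-unknown-linux-gnu -fno-unroll-loops -S fn.c`, which, per the comment above, keeps the loop intact long enough for the loop vectoriser to handle it.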
Quuxplusone commented 4 years ago

Attached slp-repor.ll (10858 bytes, text/x-matlab): slp-repro.ll

Quuxplusone commented 3 years ago
Yep, looks like a cost-modeling issue. For example:

    SLP: Adding cost 349 for reduction that starts with   %xor11.15 = xor i8 %31, %30 (It is a splitting reduction)

which is a very high cost. We are looking into this now.
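(A note on reproducing diagnostics of this form: they come from the SLP vectorizer's debug output, so something along the lines of `bin/opt -slp-vectorizer -S slp-repor.ll -mtriple=arm64-apple-iphoneos -debug-only=SLP` should print them, assuming a build of opt with debug output enabled; the report itself does not give the exact invocation.)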