Quuxplusone / LLVMBugzillaTest

USRA is replaced with USHR+ORR which results in poor codegen #48546

Open Quuxplusone opened 3 years ago

Quuxplusone commented 3 years ago
Bugzilla Link: PR49577
Status: NEW
Importance: P enhancement
Reported by: Danila Kutenin (kutdanila@yandex.ru)
Reported on: 2021-03-13 05:12:19 -0800
Last modified on: 2021-05-18 16:18:47 -0700
Version: trunk
Hardware: PC Linux
CC: arnaud.degrandmaison@arm.com, david.green@arm.com, llvm-bugs@lists.llvm.org, smithp352@googlemail.com, Ties.Stuij@arm.com
#include <arm_neon.h>

// NEON "movemask": gather the top bit of each of the 16 input bytes into
// the low 16 bits of the result, by repeatedly shifting and accumulating.
int MoveMask(uint8x16_t input)
{
    // Each byte becomes 0 or 1 (its former top bit).
    uint16x8_t high_bits = vreinterpretq_u16_u8(vshrq_n_u8(input, 7));
    // Each vsraq_n step folds the upper half of every lane onto the lower half.
    uint32x4_t paired16 =
        vreinterpretq_u32_u16(vsraq_n_u16(high_bits, high_bits, 7));
    uint64x2_t paired32 =
        vreinterpretq_u64_u32(vsraq_n_u32(paired16, paired16, 14));
    uint8x16_t paired64 =
        vreinterpretq_u8_u64(vsraq_n_u64(paired32, paired32, 28));
    // Bytes 0 and 8 now hold the low and high 8 mask bits, respectively.
    return vgetq_lane_u8(paired64, 0) | ((int) vgetq_lane_u8(paired64, 8) << 8);
}

For vsraq_n_u16 and vsraq_n_u32, Clang generates USHR+ORR instead of the single USRA that GCC emits:

https://gcc.godbolt.org/z/MxP63x
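
For context, USRA accumulates a right-shifted value into its destination, per lane, while the emitted USHR+ORR computes a bitwise or instead. A minimal scalar sketch of the per-lane semantics (plain C-style code, not from the report) shows why the two happen to agree for this algorithm: the accumulator and the shifted value never share a set bit, which is exactly the property that lets LLVM canonicalize the add into an or in the first place.

#include <stdint.h>

/* What vsraq_n_u16 does per 16-bit lane: accumulate the shifted value. */
static inline uint16_t usra_lane(uint16_t acc, uint16_t x, unsigned n)
{
    return (uint16_t)(acc + (x >> n));
}

/* What the emitted USHR+ORR does per 16-bit lane. */
static inline uint16_t ushr_orr_lane(uint16_t acc, uint16_t x, unsigned n)
{
    return (uint16_t)(acc | (x >> n));
}

/* In MoveMask both operands are the same vector, and acc and (x >> n)
   have disjoint set bits, so the two forms give the same result; the
   or form, however, no longer matches the USRA selection pattern. */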

Also, in the Match function there are two redundant ANDs with 0xff:

Match(unsigned char):                              // @Match(unsigned char)
        adrp    x8, ctrl
        ldr     q0, [x8, :lo12:ctrl]
        dup     v1.16b, w0
        cmeq    v0.16b, v1.16b, v0.16b
        movi    v1.16b, #1
        and     v0.16b, v0.16b, v1.16b
        ushr    v1.8h, v0.8h, #7
        orr     v0.16b, v1.16b, v0.16b
        ushr    v1.4s, v0.4s, #14
        orr     v0.16b, v1.16b, v0.16b
        usra    v0.2d, v0.2d, #28
        umov    w8, v0.b[0]
        umov    w9, v0.b[8]
        and     x0, x8, #0xff // Not needed?
        and     x8, x9, #0xff // Not needed?
        bfi     x0, x8, #8, #8
        ret
Quuxplusone commented 3 years ago

Those instructions do indeed look redundant. I've put up https://reviews.llvm.org/D98599 to improve the lowering.

I haven't looked at the ushr+orr in any depth, but LLVM is likely trying to be helpful by turning the Add into an Or. That needs to be turned back using something like IsOrAdd or AddLikeOrOp as used in other backends (although known bits may be needed for the AArch64 shifts to make that work).
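
For reference, a minimal sketch of the kind of check being referred to, using only SelectionDAG helpers that DAGCombiner itself relies on (the name isAddLikeOr here is made up, not an existing LLVM helper): an ISD::OR whose operands provably share no set bits behaves exactly like an ISD::ADD, so the USRA patterns could accept it as well.

#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Hypothetical helper: an OR whose operands have no common set bits is
// equivalent to an ADD, so shift-accumulate patterns could match it.
static bool isAddLikeOr(SDValue Op, SelectionDAG &DAG) {
  return Op.getOpcode() == ISD::OR &&
         DAG.haveNoCommonBitsSet(Op.getOperand(0), Op.getOperand(1));
}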

Quuxplusone commented 3 years ago

Let me bump this once again: the new zstd 1.5.0 release lost performance because of this issue:

https://github.com/facebook/zstd/pull/2494#issuecomment-829631487

I can try to fix it on my own; I probably just need some guidance.

Quuxplusone commented 3 years ago

Okay, it gets even more interesting.

Removing the first vshrq_n_u8 turns the codegen into the good version:

int MoveMask(uint8x16_t input)
{
    uint16x8_t high_bits = vreinterpretq_u16_u8(input);
    uint32x4_t paired16 =
        vreinterpretq_u32_u16(vsraq_n_u16(high_bits, high_bits, 7));
    uint64x2_t paired32 =
        vreinterpretq_u64_u32(vsraq_n_u32(paired16, paired16, 14));
    uint8x16_t paired64 =
        vreinterpretq_u8_u64(vsraq_n_u64(paired32, paired32, 28));
    return vgetq_lane_u8(paired64, 0) | ((int) vgetq_lane_u8(paired64, 8) << 8);
}

        usra    v0.8h, v0.8h, #7
        usra    v0.4s, v0.4s, #14
        usra    v0.2d, v0.2d, #28
        umov    w0, v0.b[0]
        umov    w8, v0.b[8]
        bfi     w0, w8, #8, #8
        ret

And there are actually tests for this in AArch64/arm64-vsra.ll. I believe the problem comes from DAGCombiner::visitAdd, which can prove that the + is the same as an or here:

if ((!LegalOperations || TLI.isOperationLegal(ISD::OR, VT)) &&
    DAG.haveNoCommonBitsSet(N0, N1))
  return DAG.getNode(ISD::OR, DL, VT, N0, N1);

Yeah, we probably need something like AddLikeOrOp.
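
In case it helps whoever picks this up, here is a rough sketch of the shape such a combine might take on the AArch64 side. The function name performOrToAddForUSRA is hypothetical, and, as noted above, the real NEON shift may show up as a target-specific node (e.g. AArch64ISD::VLSHR) rather than ISD::SRL, which is where the known-bits caveat comes in.

#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Hypothetical combine: rewrite (or x, (srl y, c)) back into
// (add x, (srl y, c)) when the operands share no set bits, so the
// existing USRA/SSRA selection patterns can fire again.
static SDValue performOrToAddForUSRA(SDNode *N, SelectionDAG &DAG) {
  SDValue N0 = N->getOperand(0);
  SDValue N1 = N->getOperand(1);
  // Only interesting when one side is a logical shift right, i.e. the
  // shape the shift-accumulate patterns expect.
  if (N0.getOpcode() != ISD::SRL && N1.getOpcode() != ISD::SRL)
    return SDValue();
  if (!DAG.haveNoCommonBitsSet(N0, N1))
    return SDValue();
  return DAG.getNode(ISD::ADD, SDLoc(N), N->getValueType(0), N0, N1);
}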