uchar4 vector element-by-element LUT handled worse than 3.4.

Quuxplusone commented 10 years ago


Bugzilla Link	PR20006
Status	NEW
Importance	P normal
Reported by	Simon Hosie (simon.hosie@arm.com)
Reported on	2014-06-11 15:05:13 -0700
Last modified on	2018-10-25 20:12:07 -0700
Version	trunk
Hardware	PC Linux
CC	llvm-bugs@lists.llvm.org, richard-llvm@metafoo.co.uk, srhines@google.com
Fixed by commit(s)
Attachments	`isolate.c` (438 bytes, text/x-csrc)
Blocks
Blocked by
See also

Created attachment 12640
Function demonstrating regression.

Given a simple LUT loop which operates independently on each element of a
uchar4, trunk presses ahead with a bunch of inserts and extracts through a
temporary vector, along with a flurry of type conversion operations.

In contrast, Clang 3.4 appears to abandon the vector pretense at the outset and
takes the scalar values directly from the source pointer -- completing the work
in half the time.

GCC 4.8 appears to behave like Clang 3.4, but additionally packs the output
into a single 32-bit scalar register before writing.

This affects both amd64 and ARM.  To reproduce, compile with -Ofast on amd64;
on ARM, additionally pass -mfpu=neon to ensure that SIMD operations are available.
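
For illustration, the scalar form described above might look roughly like the following C sketch. It assumes the attached function has the shape quoted later in this thread (four 256-entry tables stored back to back in `tab`). The name `dut0_scalar` and the byte-pointer signature are hypothetical, and this is hand-written code, not output from any of the compilers mentioned.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar rewrite: same table layout as the reported function,
   but operating on plain bytes instead of a uchar4 vector type. */
void dut0_scalar(uint8_t *restrict out, uint8_t const *restrict in,
                 int count, uint8_t const *restrict tab) {
  uint8_t const *t0 = tab, *t1 = tab + 256, *t2 = tab + 512, *t3 = tab + 768;
  while (--count >= 0) {
    /* Take each element straight from the source pointer and look it up. */
    uint8_t r[4] = { t0[in[0]], t1[in[1]], t2[in[2]], t3[in[3]] };
    /* Pack the four results into one 32-bit value and store it once,
       similar to what GCC 4.8 is described as doing. */
    uint32_t packed;
    memcpy(&packed, r, sizeof packed);
    memcpy(out, &packed, sizeof packed);
    in += 4;
    out += 4;
  }
}
```

Here `count` is the number of 4-byte groups, matching the `uchar4` element count in the original loop.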

Quuxplusone commented 8 years ago

This appears to be fixed for 32-bit ARM, but this code (copied from attachment):

    #include <stdint.h>

    typedef uint8_t uchar4 __attribute__((__vector_size__(4)));

    void dut0(uchar4 * restrict out, uchar4 const * restrict in, int count, uint8_t const * restrict tab) {
      uint8_t const *t0 = tab, *t1 = tab + 256, *t2 = tab + 512, *t3 = tab + 768;
      while (--count >= 0) {
        uchar4 tmp = *in++;
        *out++ = (uchar4){ t0[tmp[0]], t1[tmp[1]], t2[tmp[2]], t3[tmp[3]] };
      }
    }

Compiled with:

    clang --target=aarch64-linux-gnu -Ofast -S foo.c -o-

Gives this loop body:

        ldrb    w12, [x1]
        ldrb    w13, [x1, #1]
        ldrb    w14, [x1, #2]
        ldrb    w15, [x1, #3]
        ins     v0.h[0], w12
        ins     v0.h[1], w13
        ins     v0.h[2], w14
        ins     v0.h[3], w15
        umov    w12, v0.h[0]
        umov    w13, v0.h[1]
        umov    w14, v0.h[2]
        umov    w15, v0.h[3]
        and     x12, x12, #0xff
        and     x13, x13, #0xff
        and     x14, x14, #0xff
        and     x15, x15, #0xff
        ldrb    w15, [x10, x15]
        ldrb    w14, [x9, x14]
        ldrb    w13, [x8, x13]
        ldrb    w12, [x3, x12]
        add     x1, x1, #4              // =4
        strb    w15, [x0, #3]
        strb    w14, [x0, #2]
        strb    w13, [x0, #1]
        strb    w12, [x0], #4

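Twelve of those instructions (the four ins, the four umov, and the four and)
could simply be deleted, since each ldrb already leaves a zero-extended byte in
its register.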
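
To see why those instructions are redundant, here is a rough C model of what happens to one element between the first `ldrb` and the table lookup. The function names are hypothetical, and the model only mirrors the arithmetic effect of each instruction as I understand it. It is a sketch, not a statement about how the backend reasons.

```c
#include <stdint.h>

/* Models one lane of the quoted loop body: load, insert into a 16-bit
   vector lane, move back out, then mask to 8 bits. */
uint64_t lane_index_roundtrip(uint8_t loaded) {
  uint32_t w = loaded;                /* ldrb w, [x1, #n]: zero-extending byte load */
  uint16_t lane = (uint16_t)w;        /* ins v0.h[n], w: low 16 bits go into the lane */
  uint32_t back = lane;               /* umov w, v0.h[n]: lane is zero-extended again */
  uint64_t x = (uint64_t)back & 0xff; /* and x, x, #0xff */
  return x;
}

/* What remains if the ins/umov/and triple is dropped: the loaded byte
   is used directly as the table index. */
uint64_t lane_index_direct(uint8_t loaded) {
  return loaded;
}
```

Both functions return the same index for every possible byte value, which is the sense in which the twelve instructions do no useful work.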

uchar4 vector element-by-element LUT handled worse than 3.4. #20005