Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Pointless loop unroll / vectorization #37253

Open Quuxplusone opened 6 years ago

Quuxplusone commented 6 years ago
Bugzilla Link PR38280
Status NEW
Importance P enhancement
Reported by Fabian Giesen (fabian.giesen@epicgames.com)
Reported on 2018-07-23 14:15:50 -0700
Last modified on 2019-10-05 01:50:20 -0700
Version 6.0
Hardware PC Windows NT
CC david.bolvansky@gmail.com, david.green@arm.com, florian_hahn@apple.com, hfinkel@anl.gov, lebedev.ri@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by PR40878
See also
Example C++ code for x86, simplified from a more complex use case:

// ---- begin

#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>

// neg_offs <= -8 required
void apply_delta(uint8_t *dst, const uint8_t *src, ptrdiff_t neg_offs, size_t
count)
{
    // Just provided for context
    while (count >= 8)
    {
        __m128i src_bytes = _mm_loadl_epi64((const __m128i *) src);
        __m128i pred_bytes = _mm_loadl_epi64((const __m128i *) (dst + neg_offs));
        __m128i sum = _mm_add_epi8(src_bytes, pred_bytes);
        _mm_storel_epi64((__m128i *) dst, sum);

        dst += 8;
        src += 8;
        count -= 8;
    }

    // This is the loop in question
    while (count--)
    {
        *dst = *src + dst[neg_offs];
        dst++;
        src++;
    }
}

// ---- end

The bottom (tail) loop gets expanded into a giant monstrosity that attempts to
process 64 bytes at once, with various special-case paths for tail processing,
to handle cases where neg_offs > -64 (which means the obvious 64-elements-at-a-
time loop would not work), etc.

The full code can be viewed at https://godbolt.org/g/yRThcs, I won't post it
here. :)

All of which is completely pointless because the tail loop will (as is easy to
see) only ever see count <= 7.

This is an extreme example, but I'm seeing this general pattern (a scalar tail
loop for a manually vectorized loop getting pointlessly auto-vectorized) a lot.
Quuxplusone commented 6 years ago
On a perhaps related note, the very simple example
(https://godbolt.org/g/MToVLm):
unsigned func(unsigned count)
{
    while (count >= 32)
        count -= 32;
    return count;
}

Currently produces something like (thanks to indvarsimplify):

define i32 @test(i32 %size) {
entry:
  %0 = xor i32 %size, -1
  %1 = icmp ugt i32 %0, -32
  %umax = select i1 %1, i32 %0, i32 -32
  %2 = add i32 %umax, %size
  %3 = add i32 %2, 32
  %4 = and i32 %3, -32
  %5 = sub i32 %size, %4
  ret i32 %5
}

Which is really just:
https://rise4fun.com/Alive/lKL

define i32 @test(i32 %size) {
  %0 = and i32 %size, 31
  ret i32 %0
}
Quuxplusone commented 5 years ago

Maybe instcombine could fold it to just AND?