Quuxplusone / LLVMBugzillaTest

256-bit AVX vectorization sometimes forgets about 128-bit packed instructions #28456

Open Quuxplusone opened 8 years ago

Quuxplusone commented 8 years ago
Bugzilla Link PR28457
Status NEW
Importance P normal
Reported by Adam Nowacki (nowak-llvm@tepeserwery.pl)
Reported on 2016-07-07 11:44:13 -0700
Last modified on 2019-10-02 10:08:01 -0700
Version trunk
Hardware PC All
CC a.bataev@hotmail.com, craig.topper@gmail.com, efriedma@quicinc.com, hfinkel@anl.gov, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com, v.porpodas@gmail.com, Vasileios.porpodas@intel.com
Compiled with: -O3 -ffast-math -Rpass=loop- -fslp-vectorize-aggressive -msse -msse2 -msse3 -mssse3 -msse4 -msse4.1 -mavx

f1 is correctly vectorized with both 256-bit and 128-bit packed instructions.
f2 is correctly vectorized with 256-bit packed instructions, but the remaining
4 iterations are handled with scalar instructions instead of a single 128-bit
packed operation.

void f1(float *__restrict__ q __attribute__((align_value(32))),
        float *__restrict__ p __attribute__((align_value(32)))) {
  for (unsigned int i = 0; i < (8 * 1 + 4); ++i) {
    // remark: completely unrolled loop with 12 iterations
    q[i] = p[i] + 1.0f;
  }
}

void f2(float *__restrict__ q __attribute__((align_value(32))),
        float *__restrict__ p __attribute__((align_value(32)))) {
  for (unsigned int i = 0; i < (8 * 5 + 4); ++i) {
    // remark: vectorized loop (vectorization width: 8, interleaved count: 1)
    // remark: completely unrolled loop with 5 iterations
    // remark: completely unrolled loop with 4 iterations
    q[i] = p[i] + 1.0f;
  }
}

f1(float*, float*):                              # @f1(float*, float*)
        vmovaps (%rsi), %ymm0
        vaddps  .LCPI0_0(%rip), %ymm0, %ymm0
        vmovaps %ymm0, (%rdi)
        vmovaps 32(%rsi), %xmm0
        vaddps  .LCPI0_1(%rip), %xmm0, %xmm0
        vmovaps %xmm0, 32(%rdi)
        vzeroupper
        retq

f2(float*, float*):                              # @f2(float*, float*)
        vmovaps .LCPI1_0(%rip), %ymm0   # ymm0 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
        vaddps  (%rsi), %ymm0, %ymm1
        vmovaps %ymm1, (%rdi)
        vaddps  32(%rsi), %ymm0, %ymm1
        vmovaps %ymm1, 32(%rdi)
        vaddps  64(%rsi), %ymm0, %ymm1
        vmovaps %ymm1, 64(%rdi)
        vaddps  96(%rsi), %ymm0, %ymm1
        vmovaps %ymm1, 96(%rdi)
        vaddps  128(%rsi), %ymm0, %ymm0
        vmovaps %ymm0, 128(%rdi)
        vmovss  .LCPI1_1(%rip), %xmm0   # xmm0 = mem[0],zero,zero,zero
        vaddss  160(%rsi), %xmm0, %xmm1
        vmovss  %xmm1, 160(%rdi)
        vaddss  164(%rsi), %xmm0, %xmm1
        vmovss  %xmm1, 164(%rdi)
        vaddss  168(%rsi), %xmm0, %xmm1
        vmovss  %xmm1, 168(%rdi)
        vaddss  172(%rsi), %xmm0, %xmm0
        vmovss  %xmm0, 172(%rdi)
        vzeroupper
        retq
Quuxplusone commented 8 years ago

Probably some sort of pass ordering problem: the loop vectorizer splits off the last four iterations into a separate loop, then the loop unroller runs, and the SLP vectorizer doesn't run again after that.
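
If that diagnosis is right, hand-splitting the epilogue in the source should sidestep the ordering: a standalone 4-iteration loop gets fully unrolled early enough for the SLP vectorizer to see it, just as f1's 12-iteration loop does. A minimal, untested sketch (the name f2_split is illustrative, not from the report):

void f2_split(float *__restrict__ q __attribute__((align_value(32))),
              float *__restrict__ p __attribute__((align_value(32)))) {
  // Main body: 40 iterations, an exact multiple of the 8-wide vector
  // factor, so the loop vectorizer should emit five 256-bit
  // vaddps/vmovaps pairs with no scalar remainder.
  for (unsigned int i = 0; i < 8 * 5; ++i)
    q[i] = p[i] + 1.0f;
  // Hand-split tail: the unroller flattens these 4 iterations, and the
  // SLP vectorizer should then merge the four scalar adds into a single
  // 128-bit vaddps, matching the tail f1 already gets.
  for (unsigned int i = 8 * 5; i < 8 * 5 + 4; ++i)
    q[i] = p[i] + 1.0f;
}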

Quuxplusone commented 5 years ago

Current codegen: https://godbolt.org/z/b-d1LX (still as bad as reported)

Quuxplusone commented 5 years ago
At least the first example is purely an SLP shortcoming. Test added here:
https://reviews.llvm.org/rL373483

This might improve with:
https://reviews.llvm.org/D57059
...but I haven't checked it.

And I'm not sure what the current status is, but that example likely should be
changed if we start preferring 128-bit vector ops by default. This is the kind
of one-off ymm op that can cause frequency throttling and reduce overall
performance.
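
For anyone experimenting with that tradeoff today: clang exposes the preference as a flag, so the effect on this kind of one-off ymm op can be checked with an invocation along these lines (whether it changes this particular epilogue is untested):

  clang -O3 -ffast-math -mavx -mprefer-vector-width=128 -S test.c
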
Quuxplusone commented 5 years ago
(In reply to Sanjay Patel from comment #3)
> At least the first example is purely an SLP shortcoming. Test added here:

Sorry - I didn't notice that in the original description, the first example was
vectorizing as expected with the additional flag -fslp-vectorize-aggressive.

That changed somewhere between clang 4.0 and clang 5.0 (we dropped the
BBVectorize pass at that point?).