Quuxplusone / LLVMBugzillaTest


short fixed count loops are not vectorized without #pragma hints #28433

Open Quuxplusone opened 8 years ago

Quuxplusone commented 8 years ago
Bugzilla Link PR28434
Status NEW
Importance P normal
Reported by Adam Nowacki (nowak-llvm@tepeserwery.pl)
Reported on 2016-07-06 07:10:00 -0700
Last modified on 2019-09-01 12:38:40 -0700
Version 3.8
Hardware PC All
CC efriedma@quicinc.com, hfinkel@anl.gov, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, mkuper@google.com, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by PR28461
See also PR21356, PR28461
see https://godbolt.org/g/bLIzA1

void f1(float *__restrict__ q __attribute__((align_value(16))), int *__restrict__ p __attribute__((align_value(16)))) {
#pragma clang loop unroll(disable)
#pragma clang loop vectorize(enable)
  for (unsigned int i = 0; i < 4; ++i) {
    q[i] = p[i];
  }
}
void f2(float *__restrict__ q __attribute__((align_value(16))), int *__restrict__ p __attribute__((align_value(16)))) {
#pragma clang loop vectorize(enable)
  for (unsigned int i = 0; i < 4; ++i) {
    q[i] = p[i];
  }
}
void f3(float *__restrict__ q __attribute__((align_value(16))), int *__restrict__ p __attribute__((align_value(16)))) {
#pragma clang loop unroll(disable)
  for (unsigned int i = 0; i < 4; ++i) {
    q[i] = p[i];
  }
}
void f4(float *__restrict__ q __attribute__((align_value(16))), int *__restrict__ p __attribute__((align_value(16)))) {
  for (unsigned int i = 0; i < 4; ++i) {
    q[i] = p[i];
  }
}
void f5(float *__restrict__ q __attribute__((align_value(16))), int *__restrict__ p __attribute__((align_value(16)))) {
  for (unsigned int i = 0; i < 40; ++i) {
    q[i] = p[i];
  }
}

f1 is correctly vectorized when vectorization is forced and loop unrolling is
disabled:
        cvtdq2ps        (%rsi), %xmm0
        movaps  %xmm0, (%rdi)

f2 is not correctly vectorized even when vectorization is forced, perhaps
because the loop is fully unrolled before vectorization:
        cvtsi2ssl       (%rsi), %xmm0
        movss   %xmm0, (%rdi)
        xorps   %xmm0, %xmm0
        cvtsi2ssl       4(%rsi), %xmm0
        movss   %xmm0, 4(%rdi)
        xorps   %xmm0, %xmm0
        ...

f3 is not correctly vectorized when loop unrolling is disabled but
vectorization is not forced:
        xorl    %eax, %eax
.LBB2_1:                                # =>This Inner Loop Header: Depth=1
        xorps   %xmm0, %xmm0
        cvtsi2ssl       (%rsi,%rax,4), %xmm0
        movss   %xmm0, (%rdi,%rax,4)
        incq    %rax
        cmpq    $4, %rax
        jne     .LBB2_1

f4 without any #pragma is not correctly vectorized; it is unrolled and produces
the same code as f2.

f5 without any #pragma is correctly vectorized and unrolled, but only when the
loop count is ~40 or higher:
        cvtdq2ps        (%rsi), %xmm0
        movaps  %xmm0, (%rdi)
        cvtdq2ps        16(%rsi), %xmm0
        movaps  %xmm0, 16(%rdi)
        ...
Quuxplusone commented 8 years ago
Looks like an x86 cost model problem. f2() vectorizes for PowerPC:

$ ./clang -O2 28434.c -S -o - -target powerpc64 -emit-llvm
; ModuleID = '28434.c'
source_filename = "28434.c"
target datalayout = "E-m:e-i64:64-n32:64"
target triple = "powerpc64"

; Function Attrs: norecurse nounwind
define void @f2(float* noalias nocapture align 16 %q, i32* noalias nocapture
readonly align 16 %p) local_unnamed_addr #0 {
entry:
  %0 = bitcast i32* %p to <4 x i32>*
  %1 = load <4 x i32>, <4 x i32>* %0, align 16, !tbaa !1
  %2 = sitofp <4 x i32> %1 to <4 x float>
  %3 = bitcast float* %q to <4 x float>*
  store <4 x float> %2, <4 x float>* %3, align 16, !tbaa !5
  ret void
}

attributes #0 = { norecurse nounwind "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="ppc64" "target-features"="+altivec,-bpermd,-crypto,-direct-move,-extdiv,-power8-vector,-qpx,-vsx" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 3.9.0 (trunk 274633)"}
!1 = !{!2, !2, i64 0}
!2 = !{!"int", !3, i64 0}
!3 = !{!"omnipotent char", !4, i64 0}
!4 = !{!"Simple C/C++ TBAA"}
!5 = !{!6, !6, i64 0}
!6 = !{!"float", !3, i64 0}
Quuxplusone commented 8 years ago

See also http://reviews.llvm.org/D21530 .

Quuxplusone commented 8 years ago
Thanks, Eli.

I checked in the minimal cost model change in:
http://reviews.llvm.org/rL274658

...and that should fix cases f2() and f4().

I think f3() has some other problem; even with cost=1 for the sitofp:

28434.c:19:3: remark: loop not vectorized: vectorization is not beneficial and
is not explicitly forced [-Rpass-analysis=loop-vectorize]
Quuxplusone commented 8 years ago
Should we also, perhaps, start giving higher costs to scalar operations in
these cases?

E.g. we currently have:
"Found an estimated cost of 1 for VF 1 For instruction:   %conv = sitofp i32 %0
to float"

Regardless of whether the vector op is 5 or 15, is the scalar really a 1?
Quuxplusone commented 8 years ago
(In reply to comment #4)
> Regardless of whether the vector op is 5 or 15, is the scalar really a 1?

That's a good question.

The scalar version of sitofp (i32 --> float) for x86 would be cvtsi2ss. This
was apparently 1.5 reciprocal throughput on SandyBridge, but 3 on Merom,
Nehalem, Haswell, Skylake? Seems unlikely that it magically improved for one
generation and then fell back, but I haven't tested it. But based on those
numbers, I think the conservative answer would be that the cost for the scalar
op would be 3.

This brings up the larger question of what exactly are we trying to model with
these numbers: latency, throughput, something else?

It also brings up the question from bug 21356, comment 5: how do we distinguish
between micro-arches that may vary wildly in their cost for a given op?
Quuxplusone commented 8 years ago
(In reply to comment #5)
> (In reply to comment #4)
> > Regardless of whether the vector op is 5 or 15, is the scalar really a 1?
...
>
> This brings up the larger question of what exactly are we trying to model
> with these numbers: latency, throughput, something else?

In theory, the numbers represent throughput. The idea being that we handle
latency hiding with the interleaving factor (which is computed separately). The
"user" cost model used by the unroller, etc. is more subtle -- by which I mean
less well defined :(
Quuxplusone commented 8 years ago
Same issue with int to double conversion:
void f1(double *__restrict__ q __attribute__((align_value(16))), int *__restrict__ p __attribute__((align_value(16)))) {
#pragma clang loop unroll(disable)
#pragma clang loop vectorize(enable)
  for (unsigned int i = 0; i < 2; ++i) {
    q[i] = p[i];
  }
}
void f4(double *__restrict__ q __attribute__((align_value(16))), int *__restrict__ p __attribute__((align_value(16)))) {
  for (unsigned int i = 0; i < 2; ++i) {
    q[i] = p[i];
  }
}

f1(double*, int*):                              # @f1(double*, int*)
        movq    (%rsi), %xmm0           # xmm0 = mem[0],zero
        cvtdq2pd        %xmm0, %xmm0
        movaps  %xmm0, (%rdi)
f4(double*, int*):                              # @f4(double*, int*)
        cvtsi2sdl       (%rsi), %xmm0
        movsd   %xmm0, (%rdi)
        xorps   %xmm0, %xmm0
        cvtsi2sdl       4(%rsi), %xmm0
        movsd   %xmm0, 8(%rdi)

Also, the vectorized version should probably skip the register temporary:
        cvtdq2pd        (%rsi), %xmm0
        movaps  %xmm0, (%rdi)

double to int conversion:
void f1(int *__restrict__ q __attribute__((align_value(16))), double *__restrict__ p __attribute__((align_value(16)))) {
#pragma clang loop unroll(disable)
#pragma clang loop vectorize(enable)
  for (unsigned int i = 0; i < 2; ++i) {
    q[i] = p[i];
  }
}
void f4(int *__restrict__ q __attribute__((align_value(16))), double *__restrict__ p __attribute__((align_value(16)))) {
  for (unsigned int i = 0; i < 2; ++i) {
    q[i] = p[i];
  }
}

f1 produces horrible code:
f1(int*, double*):                              # @f1(int*, double*)
        movapd  (%rsi), %xmm0
        cvttsd2si       %xmm0, %rax
        movd    %rax, %xmm1
        shufpd  $1, %xmm0, %xmm0        # xmm0 = xmm0[1,0]
        cvttsd2si       %xmm0, %rax
        movd    %rax, %xmm0
        punpcklqdq      %xmm0, %xmm1    # xmm1 = xmm1[0],xmm0[0]
        pshufd  $232, %xmm1, %xmm0      # xmm0 = xmm1[0,2,2,3]
        movq    %xmm0, (%rdi)

f4 is better but not vectorized:
f4(int*, double*):                              # @f4(int*, double*)
        cvttsd2si       (%rsi), %eax
        movl    %eax, (%rdi)
        cvttsd2si       8(%rsi), %eax
        movl    %eax, 4(%rdi)

Correctly vectorized, it should be:
        cvttpd2dq       (%rsi), %xmm0
        movq    %xmm0, (%rdi)
Quuxplusone commented 8 years ago
First, we need to be careful with loop bounds of 2 or 4. The loop vectorizer,
unless forced with a pragma, ignores all loops with trip count < 16. So if we
get auto-vectorization for low bounds, it's from unrolling + SLP vectorizer.

Regarding the "double to int" cases, this is purely an SSE issue. If the loop
bound is high enough (>= 16), or with the pragma, we vectorize for AVX and above.

For:
void f1(int *__restrict__ q __attribute__((align_value(16))), double *__restrict__ p __attribute__((align_value(16)))) {
  for (unsigned int i = 0; i < 16; ++i) {
    q[i] = p[i];
  }
}

With -mavx -fno-unroll-loops we get:
.LBB0_1:                                # %vector.body
        vcvttpd2dqy     (%rsi,%rax,8), %xmm0
        vmovapd %xmm0, (%rdi,%rax,4)
        addq    $4, %rax
        cmpq    $16, %rax
        jne     .LBB0_1

For SSE4.2 and below, this is weirder:
1) We don't vectorize because the cost model considers the vector convert too
expensive. This is probably incorrect, and is probably, again, related to the
scalar convert being too cheap (cost of 1 for scalar convert, 6 for vector...)
2) In some sense the cost model is actually correct, because we don't generate
non-VEX cvttpd2dq even when the IR is vectorized, even though it exists since
SSE2. An ISel issue?
Quuxplusone commented 8 years ago
(In reply to comment #8)
> 2) In some sense the cost model is actually correct, because we don't
> generate non-VEX cvttpd2dq even when the IR is vectorized, even though it
> exists since SSE2. An ISel issue?

Yes, probably something to do with type legalization, since <2 x i32> isn't
legal.
Quuxplusone commented 8 years ago
(In reply to comment #9)
> (In reply to comment #8)
> > 2) In some sense the cost model is actually correct, because we don't
> > generate non-VEX cvttpd2dq even when the IR is vectorized, even though it
> > exists since SSE2. An ISel issue?
>
> Yes, probably something to do with type legalization, since <2 x i32> isn't
> legal.

Right. This would probably require custom legalization of some sort.
Quuxplusone commented 8 years ago
Also, I should learn to read - the bug is specifically about short fixed-count
loops. For that case, to clarify what I wrote before - we will not vectorize if
we have, like in f3(), the combination of:
1) #pragma clang loop unroll (disable)
2) No #pragma clang loop vectorize (enable)
3) A loop count < 16.
This will not be caught by either the loop vectorizer (due to (2) + (3)) or the
SLP vectorizer (due to (1)).

That is more or less by design, although I'm not 100% sure this is really the
desired behavior.