Open Quuxplusone opened 8 years ago
Bugzilla Link | PR28434
Status | NEW
Importance | P normal
Reported by | Adam Nowacki (nowak-llvm@tepeserwery.pl)
Reported on | 2016-07-06 07:10:00 -0700
Last modified on | 2019-09-01 12:38:40 -0700
Version | 3.8
Hardware | PC All
CC | efriedma@quicinc.com, hfinkel@anl.gov, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, mkuper@google.com, spatel+llvm@rotateright.com
Fixed by commit(s) |
Attachments |
Blocks |
Blocked by | PR28461
See also | PR21356, PR28461
Looks like an x86 cost model problem. f2() vectorizes for PowerPC:
$ ./clang -O2 28434.c -S -o - -target powerpc64 -emit-llvm
; ModuleID = '28434.c'
source_filename = "28434.c"
target datalayout = "E-m:e-i64:64-n32:64"
target triple = "powerpc64"
; Function Attrs: norecurse nounwind
define void @f2(float* noalias nocapture align 16 %q, i32* noalias nocapture readonly align 16 %p) local_unnamed_addr #0 {
entry:
%0 = bitcast i32* %p to <4 x i32>*
%1 = load <4 x i32>, <4 x i32>* %0, align 16, !tbaa !1
%2 = sitofp <4 x i32> %1 to <4 x float>
%3 = bitcast float* %q to <4 x float>*
store <4 x float> %2, <4 x float>* %3, align 16, !tbaa !5
ret void
}
attributes #0 = { norecurse nounwind "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="ppc64" "target-features"="+altivec,-bpermd,-crypto,-direct-move,-extdiv,-power8-vector,-qpx,-vsx" "unsafe-fp-math"="false" "use-soft-float"="false" }
!llvm.ident = !{!0}
!0 = !{!"clang version 3.9.0 (trunk 274633)"}
!1 = !{!2, !2, i64 0}
!2 = !{!"int", !3, i64 0}
!3 = !{!"omnipotent char", !4, i64 0}
!4 = !{!"Simple C/C++ TBAA"}
!5 = !{!6, !6, i64 0}
!6 = !{!"float", !3, i64 0}
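The C source for f2() isn't included in the migrated report; a minimal function consistent with the IR above would be the following sketch (the `align_value` attributes and the trip count of 4 are inferred from the `align 16` and `<4 x i32>` in the IR, so treat them as assumptions, not the original source):

```c
#include <assert.h>

/* Reconstruction (not taken from the report) of a loop whose IR matches
 * the module above: load four i32s, sitofp each to float, store four floats. */
void f2(float *__restrict__ q __attribute__((align_value(16))),
        int *__restrict__ p __attribute__((align_value(16)))) {
  for (unsigned int i = 0; i < 4; ++i)
    q[i] = p[i]; /* implicit int -> float conversion (sitofp in IR) */
}
```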
See also http://reviews.llvm.org/D21530.
Thanks, Eli.
I checked in the minimal cost model change in:
http://reviews.llvm.org/rL274658
...and that should fix cases f2() and f4().
I think f3() has some other problem; even with cost=1 for the sitofp:
28434.c:19:3: remark: loop not vectorized: vectorization is not beneficial and is not explicitly forced [-Rpass-analysis=loop-vectorize]
Should we also, perhaps, start giving higher costs to scalar operations in
these cases?
E.g. we currently have:
"Found an estimated cost of 1 for VF 1 For instruction: %conv = sitofp i32 %0 to float"
Regardless of whether the vector op is 5 or 15, is the scalar really a 1?
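The profitability question here reduces to simple arithmetic: vectorizing pays off only when the vector op is cheaper than the vectorization factor's worth of scalar ops. A toy sketch of that comparison (the function name and the exact form of the rule are illustrative, not LLVM's actual cost-model code):

```c
/* Toy model of the vectorizer's profitability check: a vector op is
 * worth it only if it beats VF copies of the scalar op.
 * Illustrative only -- not LLVM's real decision logic. */
int vectorization_profitable(int scalar_cost, int vector_cost, int vf) {
  return vector_cost < scalar_cost * vf;
}
```

With the numbers from the discussion: a vector sitofp costing 5 loses against four scalar ops at cost 1 each (5 >= 4), but wins if the scalar cost is the more realistic 3 (5 < 12).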
(In reply to comment #4)
> Regardless of whether the vector op is 5 or 15, is the scalar really a 1?
That's a good question.
The scalar version of sitofp (i32 --> float) for x86 would be cvtsi2ss. This was apparently 1.5 reciprocal throughput on Sandy Bridge, but 3 on Merom, Nehalem, Haswell, and Skylake? It seems unlikely that it magically improved for one generation and then fell back, but I haven't tested it. Based on those numbers, I think the conservative answer is that the cost of the scalar op should be 3.
This brings up the larger question of what exactly are we trying to model with
these numbers: latency, throughput, something else?
It also brings up the question from bug 21356, comment 5: how do we distinguish
between micro-arches that may vary wildly in their cost for a given op?
(In reply to comment #5)
> (In reply to comment #4)
> > Regardless of whether the vector op is 5 or 15, is the scalar really a 1?
...
>
> This brings up the larger question of what exactly are we trying to model
> with these numbers: latency, throughput, something else?
In theory, the numbers represent throughput. The idea being that we handle
latency hiding with the interleaving factor (which is computed separately). The
"user" cost model used by the unroller, etc. is more subtle -- by which I mean
less well defined :(
Same issue with int to double conversion:
void f1(double *__restrict__ q __attribute__((align_value(16))), int *__restrict__ p __attribute__((align_value(16)))) {
#pragma clang loop unroll(disable)
#pragma clang loop vectorize(enable)
for (unsigned int i = 0; i < 2; ++i) {
q[i] = p[i];
}
}
void f4(double *__restrict__ q __attribute__((align_value(16))), int *__restrict__ p __attribute__((align_value(16)))) {
for (unsigned int i = 0; i < 2; ++i) {
q[i] = p[i];
}
}
f1(double*, int*): # @f1(double*, int*)
movq (%rsi), %xmm0 # xmm0 = mem[0],zero
cvtdq2pd %xmm0, %xmm0
movaps %xmm0, (%rdi)
f4(double*, int*): # @f4(double*, int*)
cvtsi2sdl (%rsi), %xmm0
movsd %xmm0, (%rdi)
xorps %xmm0, %xmm0
cvtsi2sdl 4(%rsi), %xmm0
movsd %xmm0, 8(%rdi)
Also, the vectorized version should probably skip the register temporary:
cvtdq2pd (%rsi), %xmm0
movaps %xmm0, (%rdi)
Double-to-int conversion:
void f1(int *__restrict__ q __attribute__((align_value(16))), double *__restrict__ p __attribute__((align_value(16)))) {
#pragma clang loop unroll(disable)
#pragma clang loop vectorize(enable)
for (unsigned int i = 0; i < 2; ++i) {
q[i] = p[i];
}
}
void f4(int *__restrict__ q __attribute__((align_value(16))), double *__restrict__ p __attribute__((align_value(16)))) {
for (unsigned int i = 0; i < 2; ++i) {
q[i] = p[i];
}
}
f1 produces horrible code:
f1(int*, double*): # @f1(int*, double*)
movapd (%rsi), %xmm0
cvttsd2si %xmm0, %rax
movd %rax, %xmm1
shufpd $1, %xmm0, %xmm0 # xmm0 = xmm0[1,0]
cvttsd2si %xmm0, %rax
movd %rax, %xmm0
punpcklqdq %xmm0, %xmm1 # xmm1 = xmm1[0],xmm0[0]
pshufd $232, %xmm1, %xmm0 # xmm0 = xmm1[0,2,2,3]
movq %xmm0, (%rdi)
f4 is better but not vectorized:
f4(int*, double*): # @f4(int*, double*)
cvttsd2si (%rsi), %eax
movl %eax, (%rdi)
cvttsd2si 8(%rsi), %eax
movl %eax, 4(%rdi)
correctly vectorized should be:
cvttpd2dq (%rsi), %xmm0
movq %xmm0, (%rdi)
First, we need to be careful with loop bounds of 2 or 4. The loop vectorizer, unless forced with a pragma, ignores all loops with a trip count < 16. So if we get auto-vectorization for low bounds, it comes from unrolling + the SLP vectorizer.
Regarding the "double to int" cases, this is purely an SSE issue. If the loop bound is high enough (>= 16), or with the pragma, we vectorize for AVX and above.
For:
void f1(int *__restrict__ q __attribute__((align_value(16))), double *__restrict__ p __attribute__((align_value(16)))) {
for (unsigned int i = 0; i < 16; ++i) {
q[i] = p[i];
}
}
With -mavx -fno-unroll-loops we get:
.LBB0_1: # %vector.body
vcvttpd2dqy (%rsi,%rax,8), %xmm0
vmovapd %xmm0, (%rdi,%rax,4)
addq $4, %rax
cmpq $16, %rax
jne .LBB0_1
For SSE4.2 and below, this is weirder:
1) We don't vectorize because the cost model considers the vector convert too expensive. This is probably incorrect, and is probably, again, related to the scalar convert being too cheap (a cost of 1 for the scalar convert vs. 6 for the vector...).
2) In some sense the cost model is actually correct, because we don't generate
non-VEX cvttpd2dq even when the IR is vectorized, even though it exists since
SSE2. An ISel issue?
(In reply to comment #8)
> 2) In some sense the cost model is actually correct, because we don't
> generate non-VEX cvttpd2dq even when the IR is vectorized, even though it
> exists since SSE2. An ISel issue?
Yes, probably something to do with type legalization, since <2 x i32> isn't
legal.
(In reply to comment #9)
> (In reply to comment #8)
> > 2) In some sense the cost model is actually correct, because we don't
> > generate non-VEX cvttpd2dq even when the IR is vectorized, even though it
> > exists since SSE2. An ISel issue?
>
> Yes, probably something to do with type legalization, since <2 x i32> isn't
> legal.
Right. This would probably require custom legalization of some sort.
Also, I should learn to read - the bug is specifically about short fixed-count loops. For that case, to clarify what I wrote before - we will not vectorize if we have, like in f3(), the combination of:
1) #pragma clang loop unroll(disable)
2) No #pragma clang loop vectorize(enable)
3) A trip count < 16.
This will not be caught by either the loop vectorizer (due to (2) + (3)) or the SLP vectorizer (due to (1)).
That is more or less by design, although I'm not 100% sure this is really the
desired behavior.
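For reference, a function matching the three conditions above would look like the following sketch (f3() itself isn't shown in the migrated report, so the body and trip count here are a reconstruction, not the original source):

```c
#include <assert.h>

/* Hypothetical reconstruction of f3(): unrolling disabled by pragma,
 * no vectorize(enable) pragma, and a trip count below 16 -- so neither
 * the loop vectorizer nor the SLP vectorizer fires. */
void f3(float *__restrict__ q __attribute__((align_value(16))),
        int *__restrict__ p __attribute__((align_value(16)))) {
#pragma clang loop unroll(disable)
  for (unsigned int i = 0; i < 4; ++i)
    q[i] = p[i];
}
```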