Quuxplusone opened 7 years ago
|  |  |
| --- | --- |
| Bugzilla Link | PR34724 |
| Status | NEW |
| Importance | P normal |
| Reported by | Douglas Yung (douglas_yung@playstation.sony.com) |
| Reported on | 2017-09-25 17:06:58 -0700 |
| Last modified on | 2020-07-14 14:26:32 -0700 |
| Version | trunk |
| Hardware | PC Linux |
| CC | a.bataev@hotmail.com, efriedma@quicinc.com, hfinkel@anl.gov, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com, zvirack@gmail.com |
| Fixed by commit(s) | |
| Attachments | |
| Blocks | |
| Blocked by | |
| See also | PR34725, PR34730, PR34716, PR41813, PR45747 |
The IR for this example is identical to bug 34725, so if we view this as a backend problem, one of these would be a duplicate.
For reference: https://reviews.llvm.org/rL310260 (SLP)
And here's the IR that we currently (r314201) produce:
$ ./clang -O2 34724.c -S -o - -mavx -emit-llvm
define <4 x float> @add_ps_001(<4 x float> %a, <4 x float> %b) {
entry:
%0 = shufflevector <4 x float> %a, <4 x float> %b, <2 x i32> <i32 2, i32 4>
%1 = shufflevector <4 x float> %a, <4 x float> %b, <2 x i32> <i32 3, i32 5>
%2 = fadd <2 x float> %0, %1
%3 = extractelement <2 x float> %2, i32 0
%vecinit5 = insertelement <4 x float> undef, float %3, i32 1
%4 = extractelement <2 x float> %2, i32 1
%vecinit9 = insertelement <4 x float> %vecinit5, float %4, i32 2
%vecext10 = extractelement <4 x float> %b, i32 2
%vecext11 = extractelement <4 x float> %b, i32 3
%add12 = fadd float %vecext10, %vecext11
%vecinit13 = insertelement <4 x float> %vecinit9, float %add12, i32 3
ret <4 x float> %vecinit13
}
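The original 34724.c isn't reproduced in this report, but a minimal source consistent with this IR (lane 0 is never written, three scalar adds) would be something like:

#include <immintrin.h>

/* Hypothetical reconstruction of 34724.c - lane 0 of the result is left
   uninitialized, which is why it appears as undef in the IR above. */
__m128 add_ps_001(__m128 a, __m128 b) {
  __m128 r;
  r[1] = a[2] + a[3];
  r[2] = b[0] + b[1];
  r[3] = b[2] + b[3];
  return r;
}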
There's a theme that runs through this bug, bug 34730, and bug 34716:
%ext = extractelement <2 x float> %x, i32 0
%ins = insertelement <4 x float> undef, float %ext, i32 1
-->
%shuf = shufflevector <2 x float> %x, <2 x float> undef, <4 x i32> <i32 undef, i32 0, i32 undef, i32 undef>
Can we canonicalize to shuffles more aggressively in instcombine now? If we can
recognize insert_subvector, extract_subvector, identity (this is already true I
think), single-element-repositioning (better name for this example?) as special
case masks, it should reduce complexity in the backend.
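For concreteness, here is roughly what those special-case masks look like in IR (the function names and widths are illustrative, not taken from this report):

; identity: every lane passes through unchanged
define <4 x float> @identity(<4 x float> %x) {
  %r = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ret <4 x float> %r
}

; extract_subvector: the low half of %x
define <2 x float> @extract_lo(<4 x float> %x) {
  %r = shufflevector <4 x float> %x, <4 x float> undef, <2 x i32> <i32 0, i32 1>
  ret <2 x float> %r
}

; insert_subvector: the low half of %y replaces the high half of %x
define <4 x float> @insert_hi(<4 x float> %x, <4 x float> %y) {
  %r = shufflevector <4 x float> %x, <4 x float> %y, <4 x i32> <i32 0, i32 1, i32 4, i32 5>
  ret <4 x float> %r
}

; single-element repositioning: lane 0 of %x lands in lane 1, all else undef
define <4 x float> @reposition(<4 x float> %x) {
  %r = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 undef, i32 0, i32 undef, i32 undef>
  ret <4 x float> %r
}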
(In reply to Sanjay Patel from comment #1)
> The IR for this example is identical to bug 34725, so if we view this as a
> backend problem, one of these would be a duplicate.
That was wrong - the instructions are similar, but the element indexes for the
shuffles and insert/extract are not the same.
We could probably make the list of shuffles instcombine recognizes a bit longer, if that helps... we just want to guard against making shuffles the backend won't recognize.
Looking at this report again now, and nothing's changed since the last
comments. Here's a better view of the code going into SLP:
define <4 x float> @add_ps_001(<4 x float> %a, <4 x float> %b) {
%a0 = extractelement <4 x float> %a, i32 0
%a1 = extractelement <4 x float> %a, i32 1
%a2 = extractelement <4 x float> %a, i32 2
%a3 = extractelement <4 x float> %a, i32 3
%b0 = extractelement <4 x float> %b, i32 0
%b1 = extractelement <4 x float> %b, i32 1
%b2 = extractelement <4 x float> %b, i32 2
%b3 = extractelement <4 x float> %b, i32 3
%a23 = fadd float %a2, %a3
%b01 = fadd float %b0, %b1
%b23 = fadd float %b2, %b3
%v1 = insertelement <4 x float> undef, float %a23, i32 1
%v2 = insertelement <4 x float> %v1, float %b01, i32 2
%v3 = insertelement <4 x float> %v2, float %b23, i32 3
ret <4 x float> %v3
}
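To experiment, the function above can be fed straight to SLP (assuming the legacy pass-manager syntax of this era; the filename is a placeholder):

$ opt -slp-vectorizer -S 34724.ll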
------------------------------------------------------------------------------
There are multiple potential problems contained in this example:
1. The cost model calculation claims that the extract/insert changes + <2 x
float> math op are profitable. This may need adjustment either in the generic
formulation or x86 cost model values.
2. SLP chooses to round down to vectorize to <2 x float> rather than rounding
up a potential <3 x float> op to <4 x float>. This is a limitation in
SLPVectorizerPass::tryToVectorizeList().
3. SLP chooses to vectorize the more complicated pair of scalar ops rather than
the simpler pair; i.e., we could vectorize this pair with elements all from %b:
%b01 = fadd float %b0, %b1
%b23 = fadd float %b2, %b3
instead of this pair:
%a23 = fadd float %a2, %a3
%b01 = fadd float %b0, %b1
4. As noted in comment 3, instcombine should be able to turn more
extract+insert sequences into shuffles. We should probably adjust that
independent of this bug, but I don't think it will change the codegen outcome
on this example.
If we are viewing this as a backend pattern matching problem, I don't know how we'd solve it after SLP partially vectorizes the IR.
Matching a mix of scalar/vector code to try to find a horizontal op is a potentially limitless exercise.
I think we either need to get SLP to stop vectorizing on this example or get it to vectorize to a <4 x float> fadd. The backend should recognize either of those possibilities and turn it into 'vhaddps' for x86 AVX.
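For reference, here is a sketch of the <4 x float> form that the backend already matches as a horizontal add - the even/odd-lane formulation (element 0 would just be a don't-care lane for this bug); this is not what any pass currently emits here:

define <4 x float> @hadd_ps(<4 x float> %a, <4 x float> %b) {
  ; even lanes: a0, a2, b0, b2
  %even = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  ; odd lanes: a1, a3, b1, b3
  %odd = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  ; lane-wise sum = [a0+a1, a2+a3, b0+b1, b2+b3] = exactly the vhaddps result
  %r = fadd <4 x float> %even, %odd
  ret <4 x float> %r
}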
Something else that is odd: the codegen differs wildly depending on which
element you shuffle as undefined:
https://gcc.godbolt.org/z/e-6XUz
Several of these generate the haddps, but with a load of extra code around it as well!
__m128 add_ps_u123(__m128 a, __m128 b) {
__m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
return __builtin_shufflevector(r, a, -1, 1, 2, 3);
}
__m128 add_ps_0u22(__m128 a, __m128 b) {
__m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
return __builtin_shufflevector(r, a, 0, -1, 2, 3);
}
__m128 add_ps_01u3(__m128 a, __m128 b) {
__m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
return __builtin_shufflevector(r, a, 0, 1, -1, 3);
}
__m128 add_ps_012u(__m128 a, __m128 b) {
__m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
return __builtin_shufflevector(r, a, 0, 1, 2, -1);
}
_Z11add_ps_u123Dv4_fS_: # @_Z11add_ps_u123Dv4_fS_
vpermilpd $1, %xmm0, %xmm2 # xmm2 = xmm0[1,0]
vinsertps $204, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[3],xmm1[1],zero,zero
vinsertps $28, %xmm1, %xmm2, %xmm2 # xmm2 = xmm2[0],xmm1[0],zero,zero
vaddps %xmm0, %xmm2, %xmm0
vpermilpd $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0]
vpermilps $231, %xmm1, %xmm1 # xmm1 = xmm1[3,1,2,3]
vaddss %xmm1, %xmm2, %xmm1
vpermilps $208, %xmm0, %xmm0 # xmm0 = xmm0[0,0,1,3]
vinsertps $48, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
retq
_Z11add_ps_0u22Dv4_fS_: # @_Z11add_ps_0u22Dv4_fS_
vhaddps %xmm1, %xmm0, %xmm0
retq
_Z11add_ps_01u3Dv4_fS_: # @_Z11add_ps_01u3Dv4_fS_
vpermilpd $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0]
vpermilps $231, %xmm1, %xmm1 # xmm1 = xmm1[3,1,2,3]
vhaddps %xmm0, %xmm0, %xmm0
vaddss %xmm1, %xmm2, %xmm1
vinsertps $48, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
retq
_Z11add_ps_012uDv4_fS_: # @_Z11add_ps_012uDv4_fS_
vmovshdup %xmm1, %xmm2 # xmm2 = xmm1[1,1,3,3]
vhaddps %xmm0, %xmm0, %xmm0
vaddss %xmm2, %xmm1, %xmm1
vinsertps $32, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1],xmm1[0],xmm0[3]
retq
(In reply to Sanjay Patel from comment #6)
> Looking at this report again now, and nothing's changed since the last
> comments. Here's a better view of the code going into SLP:
> [...]
I have a very preliminary patch for non-pow2 support in the SLP vectorizer, which
will extend <3 x float> to <4 x float>. It is too optimistic about the cost of the
vectorization, but generally speaking it looks good. I hope to fix all the
problems very soon.
(In reply to Alexey Bataev from comment #9)
> I have a very preliminary patch for non-pow2 support in the SLP vectorizer,
> which will extend <3 x float> to <4 x float>. It is too optimistic about the
> cost of the vectorization, but generally speaking it looks good. I hope to
> fix all the problems very soon.
Great! If we get a <4 x float> math op from SLP, then that should be no problem
for the backend to match as a horizontal op (if that's good for the target).
I'm still trying to make instcombine produce code that SLP can handle more
easily here:
https://reviews.llvm.org/D53507
@spatel Do you think this is a candidate for the VectorCombine pass?
(In reply to Simon Pilgrim from comment #11)
> @spatel Do you think this is a candidate for the VectorCombine pass?
We already get the extractelement transforms in -vector-combine that I would
expect given the IR in comment 6:
%0 = shufflevector <4 x float> %a, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
%1 = fadd <4 x float> %0, %a
%2 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
%3 = fadd <4 x float> %2, %b
%4 = shufflevector <4 x float> %1, <4 x float> %3, <4 x i32> <i32 undef, i32 2, i32 4, i32 undef>
%5 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
%6 = fadd <4 x float> %5, %b
%7 = shufflevector <4 x float> %4, <4 x float> %6, <4 x i32> <i32 undef, i32 1, i32 2, i32 6>
ret <4 x float> %7
------------------------------------------------------------------------------
So SLP is out of the picture currently. The given IR does result in a
horizontal add, but we don't have a way to combine these into a single hadd yet:
vmovshdup %xmm0, %xmm2 ## xmm2 = xmm0[1,1,3,3]
vaddps %xmm0, %xmm2, %xmm0
vhaddps %xmm1, %xmm1, %xmm2
vshufps $200, %xmm2, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm2[0,3]
vmovshdup %xmm1, %xmm2 ## xmm2 = xmm1[1,1,3,3]
vaddps %xmm1, %xmm2, %xmm1
vinsertps $177, %xmm1, %xmm0, %xmm0 ## xmm0 = zero,xmm0[1,2],xmm1[2]
A possible transform for vector-combine that would help:
%5 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
%6 = fadd <4 x float> %5, %b
%7 = shufflevector <4 x float> %4, <4 x float> %6, <4 x i32> <i32 undef, i32 1, i32 2, i32 6>
We chose to shuffle element 3 of %b over to element 2 in the 1st shuffle, and
that then gets pushed back over to element 3 in the 2nd shuffle. We want to
adjust the 1st shuffle so that we're calculating the binop in element 3 because
that creates a select ("blend") shuffle instead:
%5 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 undef, i32 2>
%6 = fadd <4 x float> %5, %b
%7 = shufflevector <4 x float> %4, <4 x float> %6, <4 x i32> <i32 undef, i32 1, i32 2, i32 7>
Small hack proposal:
https://reviews.llvm.org/D76623
Further backend improvements: https://reviews.llvm.org/D83789