Quuxplusone opened 7 years ago
|  |  |
| --- | --- |
| Bugzilla Link | PR34724 |
| Status | NEW |
| Importance | P normal |
| Reported by | Douglas Yung (douglas_yung@playstation.sony.com) |
| Reported on | 2017-09-25 17:06:58 -0700 |
| Last modified on | 2020-07-14 14:26:32 -0700 |
| Version | trunk |
| Hardware | PC Linux |
| CC | a.bataev@hotmail.com, efriedma@quicinc.com, hfinkel@anl.gov, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com, zvirack@gmail.com |
| Fixed by commit(s) | |
| Attachments | |
| Blocks | |
| Blocked by | |
| See also | PR34725, PR34730, PR34716, PR41813, PR45747 |
The IR for this example is identical to bug 34725, so if we view this as a backend problem, one of these would be a duplicate.
For reference: https://reviews.llvm.org/rL310260 (SLP)
And here's the IR that we currently (r314201) produce:
$ ./clang -O2 34724.c -S -o - -mavx -emit-llvm
define <4 x float> @add_ps_001(<4 x float> %a, <4 x float> %b) {
entry:
%0 = shufflevector <4 x float> %a, <4 x float> %b, <2 x i32> <i32 2, i32 4>
%1 = shufflevector <4 x float> %a, <4 x float> %b, <2 x i32> <i32 3, i32 5>
%2 = fadd <2 x float> %0, %1
%3 = extractelement <2 x float> %2, i32 0
%vecinit5 = insertelement <4 x float> undef, float %3, i32 1
%4 = extractelement <2 x float> %2, i32 1
%vecinit9 = insertelement <4 x float> %vecinit5, float %4, i32 2
%vecext10 = extractelement <4 x float> %b, i32 2
%vecext11 = extractelement <4 x float> %b, i32 3
%add12 = fadd float %vecext10, %vecext11
%vecinit13 = insertelement <4 x float> %vecinit9, float %add12, i32 3
ret <4 x float> %vecinit13
}
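The original 34724.c isn't reproduced in this report, but a minimal source consistent with this IR (lane 0 is never written, three scalar adds) would be something like:

#include <immintrin.h>

/* Hypothetical reconstruction of 34724.c - lane 0 of the result is left
   uninitialized, which is why it appears as undef in the IR above. */
__m128 add_ps_001(__m128 a, __m128 b) {
  __m128 r;
  r[1] = a[2] + a[3];
  r[2] = b[0] + b[1];
  r[3] = b[2] + b[3];
  return r;
}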
There's a theme that runs through this bug, bug 34730, and bug 34716:
%ext = extractelement <2 x float> %x, i32 0
%ins = insertelement <4 x float> undef, float %ext, i32 1
-->
%shuf = shufflevector <2 x float> %x, <2 x float> undef, <4 x i32> <i32 undef, i32 0, i32 undef, i32 undef>
Can we canonicalize to shuffles more aggressively in instcombine now? If we can
recognize insert_subvector, extract_subvector, identity (this is already true I
think), single-element-repositioning (better name for this example?) as special
case masks, it should reduce complexity in the backend.
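For concreteness, here is roughly what those special-case masks look like in IR (the function names and widths are illustrative, not taken from this report):

; identity: every lane passes through unchanged
define <4 x float> @identity(<4 x float> %x) {
  %r = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ret <4 x float> %r
}

; extract_subvector: the low half of %x
define <2 x float> @extract_lo(<4 x float> %x) {
  %r = shufflevector <4 x float> %x, <4 x float> undef, <2 x i32> <i32 0, i32 1>
  ret <2 x float> %r
}

; insert_subvector: the low half of %y replaces the high half of %x
define <4 x float> @insert_hi(<4 x float> %x, <4 x float> %y) {
  %r = shufflevector <4 x float> %x, <4 x float> %y, <4 x i32> <i32 0, i32 1, i32 4, i32 5>
  ret <4 x float> %r
}

; single-element repositioning: lane 0 of %x lands in lane 1, all else undef
define <4 x float> @reposition(<4 x float> %x) {
  %r = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 undef, i32 0, i32 undef, i32 undef>
  ret <4 x float> %r
}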
(In reply to Sanjay Patel from comment #1)
> The IR for this example is identical to bug 34725, so if we view this as a
> backend problem, one of these would be a duplicate.
That was wrong - the instructions are similar, but the element indexes for the
shuffles and insert/extract are not the same.
We could probably make the list of shuffles instcombine recognizes a bit longer, if that helps... we just want to guard against making shuffles the backend won't recognize.
Looking at this report again now, and nothing's changed since the last
comments. Here's a better view of the code going into SLP:
define <4 x float> @add_ps_001(<4 x float> %a, <4 x float> %b) {
%a0 = extractelement <4 x float> %a, i32 0
%a1 = extractelement <4 x float> %a, i32 1
%a2 = extractelement <4 x float> %a, i32 2
%a3 = extractelement <4 x float> %a, i32 3
%b0 = extractelement <4 x float> %b, i32 0
%b1 = extractelement <4 x float> %b, i32 1
%b2 = extractelement <4 x float> %b, i32 2
%b3 = extractelement <4 x float> %b, i32 3
%a23 = fadd float %a2, %a3
%b01 = fadd float %b0, %b1
%b23 = fadd float %b2, %b3
%v1 = insertelement <4 x float> undef, float %a23, i32 1
%v2 = insertelement <4 x float> %v1, float %b01, i32 2
%v3 = insertelement <4 x float> %v2, float %b23, i32 3
ret <4 x float> %v3
}
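To experiment, the function above can be fed straight to SLP (assuming the legacy pass-manager syntax of this era; the filename is a placeholder):

$ opt -slp-vectorizer -S 34724.ll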
------------------------------------------------------------------------------
There are multiple potential problems contained in this example:
1. The cost model calculation claims that the extract/insert changes + <2 x
float> math op are profitable. This may need adjustment either in the generic
formulation or x86 cost model values.
2. SLP chooses to round down to vectorize to <2 x float> rather than rounding
up a potential <3 x float> op to <4 x float>. This is a limitation in
SLPVectorizerPass::tryToVectorizeList().
3. SLP chooses to vectorize the more complicated pair of scalar ops rather than
the simpler pair; i.e., we could vectorize this pair with elements all from %b:
%b01 = fadd float %b0, %b1
%b23 = fadd float %b2, %b3
instead of this pair:
%a23 = fadd float %a2, %a3
%b01 = fadd float %b0, %b1
4. As noted in comment 3, instcombine should be able to turn more
extract+insert sequences into shuffles. We should probably adjust that
independent of this bug, but I don't think it will change the codegen outcome
on this example.
If we are viewing this as a backend pattern matching problem, I don't know how we'd solve it after SLP partially vectorizes the IR.
Matching a mix of scalar/vector code to try to find a horizontal op is a potentially limitless exercise.
I think we either need to get SLP to stop vectorizing on this example or get it to vectorize to a <4 x float> fadd. The backend should recognize either of those possibilities and turn it into 'vhaddps' for x86 AVX.
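For reference, here is a sketch of the <4 x float> form that the backend already matches as a horizontal add - the even/odd-lane formulation (element 0 would just be a don't-care lane for this bug); this is not what any pass currently emits here:

define <4 x float> @hadd_ps(<4 x float> %a, <4 x float> %b) {
  ; even lanes: a0, a2, b0, b2
  %even = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  ; odd lanes: a1, a3, b1, b3
  %odd = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  ; lane-wise sum = [a0+a1, a2+a3, b0+b1, b2+b3] = exactly the vhaddps result
  %r = fadd <4 x float> %even, %odd
  ret <4 x float> %r
}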
Something else that is odd: the codegen differs wildly depending on which
element you shuffle as undefined:
https://gcc.godbolt.org/z/e-6XUz
Several of these generate the haddps, but with a load of extra code around it as well!
__m128 add_ps_u123(__m128 a, __m128 b) {
__m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
return __builtin_shufflevector(r, a, -1, 1, 2, 3);
}
__m128 add_ps_0u22(__m128 a, __m128 b) {
__m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
return __builtin_shufflevector(r, a, 0, -1, 2, 3);
}
__m128 add_ps_01u3(__m128 a, __m128 b) {
__m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
return __builtin_shufflevector(r, a, 0, 1, -1, 3);
}
__m128 add_ps_012u(__m128 a, __m128 b) {
__m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
return __builtin_shufflevector(r, a, 0, 1, 2, -1);
}
_Z11add_ps_u123Dv4_fS_: # @_Z11add_ps_u123Dv4_fS_
vpermilpd $1, %xmm0, %xmm2 # xmm2 = xmm0[1,0]
vinsertps $204, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[3],xmm1[1],zero,zero
vinsertps $28, %xmm1, %xmm2, %xmm2 # xmm2 = xmm2[0],xmm1[0],zero,zero
vaddps %xmm0, %xmm2, %xmm0
vpermilpd $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0]
vpermilps $231, %xmm1, %xmm1 # xmm1 = xmm1[3,1,2,3]
vaddss %xmm1, %xmm2, %xmm1
vpermilps $208, %xmm0, %xmm0 # xmm0 = xmm0[0,0,1,3]
vinsertps $48, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
retq
_Z11add_ps_0u22Dv4_fS_: # @_Z11add_ps_0u22Dv4_fS_
vhaddps %xmm1, %xmm0, %xmm0
retq
_Z11add_ps_01u3Dv4_fS_: # @_Z11add_ps_01u3Dv4_fS_
vpermilpd $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0]
vpermilps $231, %xmm1, %xmm1 # xmm1 = xmm1[3,1,2,3]
vhaddps %xmm0, %xmm0, %xmm0
vaddss %xmm1, %xmm2, %xmm1
vinsertps $48, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
retq
_Z11add_ps_012uDv4_fS_: # @_Z11add_ps_012uDv4_fS_
vmovshdup %xmm1, %xmm2 # xmm2 = xmm1[1,1,3,3]
vhaddps %xmm0, %xmm0, %xmm0
vaddss %xmm2, %xmm1, %xmm1
vinsertps $32, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1],xmm1[0],xmm0[3]
retq
(In reply to Sanjay Patel from comment #6)
> Looking at this report again now, and nothing's changed since the last
> comments. Here's a better view of the code going into SLP:
> [...]
I have a very preliminary patch for non-pow2 support in the SLP vectorizer, which
will extend <3 x float> to <4 x float>. It is too optimistic about the cost of the
vectorization, but generally speaking it looks good. I hope to fix all the
problems very soon.
(In reply to Alexey Bataev from comment #9)
> I have a very preliminary patch for non-pow2 support in the SLP vectorizer,
> which will extend <3 x float> to <4 x float>. It is too optimistic about the
> cost of the vectorization, but generally speaking it looks good. I hope to
> fix all the problems very soon.
Great! If we get a <4 x float> math op from SLP, then that should be no problem
for the backend to match as a horizontal op (if that's good for the target).
I'm still trying to make instcombine produce code that SLP can handle more
easily here:
https://reviews.llvm.org/D53507
@spatel Do you think this is a candidate for the VectorCombine pass?
(In reply to Simon Pilgrim from comment #11)
> @spatel Do you think this is a candidate for the VectorCombine pass?
We already get the extractelement transforms in -vector-combine that I would
expect given the IR in comment 6:
%0 = shufflevector <4 x float> %a, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
%1 = fadd <4 x float> %0, %a
%2 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
%3 = fadd <4 x float> %2, %b
%4 = shufflevector <4 x float> %1, <4 x float> %3, <4 x i32> <i32 undef, i32 2, i32 4, i32 undef>
%5 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
%6 = fadd <4 x float> %5, %b
%7 = shufflevector <4 x float> %4, <4 x float> %6, <4 x i32> <i32 undef, i32 1, i32 2, i32 6>
ret <4 x float> %7
------------------------------------------------------------------------------
So SLP is out of the picture currently. The given IR does result in a
horizontal add, but we don't have a way to combine these into a single hadd yet:
vmovshdup %xmm0, %xmm2 ## xmm2 = xmm0[1,1,3,3]
vaddps %xmm0, %xmm2, %xmm0
vhaddps %xmm1, %xmm1, %xmm2
vshufps $200, %xmm2, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm2[0,3]
vmovshdup %xmm1, %xmm2 ## xmm2 = xmm1[1,1,3,3]
vaddps %xmm1, %xmm2, %xmm1
vinsertps $177, %xmm1, %xmm0, %xmm0 ## xmm0 = zero,xmm0[1,2],xmm1[2]
A possible transform for vector-combine that would help:
%5 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
%6 = fadd <4 x float> %5, %b
%7 = shufflevector <4 x float> %4, <4 x float> %6, <4 x i32> <i32 undef, i32 1, i32 2, i32 6>
We chose to shuffle element 3 of %b over to element 2 in the 1st shuffle, and
that then gets pushed back over to element 3 in the 2nd shuffle. We want to
adjust the 1st shuffle so that we're calculating the binop in element 3 because
that creates a select ("blend") shuffle instead:
%5 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 undef, i32 2>
%6 = fadd <4 x float> %5, %b
%7 = shufflevector <4 x float> %4, <4 x float> %6, <4 x i32> <i32 undef, i32 1, i32 2, i32 7>
Small hack proposal:
https://reviews.llvm.org/D76623
Further backend improvements: https://reviews.llvm.org/D83789