[x86, AVX2] inconsistent usage of vpbroadcast #28504

Open Quuxplusone opened 8 years ago

Quuxplusone commented 8 years ago
Bugzilla Link PR28505
Status NEW
Importance P normal
Reported by Sanjay Patel (spatel+llvm@rotateright.com)
Reported on 2016-07-11 11:48:06 -0700
Last modified on 2017-06-03 08:20:34 -0700
Version trunk
Hardware PC All
CC andrea.dibiagio@gmail.com, ayman.musa@intel.com, david.l.kreitzer@intel.com, elad2.cohen@intel.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, zvirack@gmail.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also PR31331, PR32564

Noticed this while writing tests for bug 28484: we only use a splat load of the
constant for the i32 'and' cases. With AVX2, vpbroadcast should work for all of
m8, m16, m32, and m64, and for both 128-bit and 256-bit vectors. Missing
patterns?

Possibly surprising mapping of bool vectors to xmm/ymm is a separate issue.

$ cat vpbroadcast.ll
define <16 x i8> @and16x8(<16 x i1> %a, <16 x i8> %b) {
  %and = and <16 x i1> %a, <i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1>
  %zext = zext <16 x i1> %and to <16 x i8>
  %add = add <16 x i8> %zext, %b ; force domain
  ret <16 x i8> %add
}

define <8 x i16> @and8x16(<8 x i1> %a, <8 x i16> %b) {
  %and = and <8 x i1> %a, <i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1>
  %zext = zext <8 x i1> %and to <8 x i16>
  %add = add <8 x i16> %zext, %b ; force domain
  ret <8 x i16> %add
}

define <4 x i32> @and4x32(<4 x i1> %a, <4 x i32> %b) {
  %and = and <4 x i1> %a, <i1 1, i1 1, i1 1, i1 1>
  %zext = zext <4 x i1> %and to <4 x i32>
  %add = add <4 x i32> %zext, %b ; force domain
  ret <4 x i32> %add
}

define <2 x i64> @and2x64(<2 x i1> %a, <2 x i64> %b) {
  %and = and <2 x i1> %a, <i1 1, i1 1>
  %zext = zext <2 x i1> %and to <2 x i64>
  %add = add <2 x i64> %zext, %b ; force domain
  ret <2 x i64> %add
}

define <32 x i8> @and32x8(<32 x i1> %a, <32 x i8> %b) {
  %and = and <32 x i1> %a, <i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1>
  %zext = zext <32 x i1> %and to <32 x i8>
  %add = add <32 x i8> %zext, %b ; force domain
  ret <32 x i8> %add
}

define <16 x i16> @and16x16(<16 x i1> %a, <16 x i16> %b) {
  %and = and <16 x i1> %a, <i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1>
  %zext = zext <16 x i1> %and to <16 x i16>
  %add = add <16 x i16> %zext, %b ; force domain
  ret <16 x i16> %add
}

define <8 x i32> @and8x32(<8 x i1> %a, <8 x i32> %b) {
  %and = and <8 x i1> %a, <i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1>
  %zext = zext <8 x i1> %and to <8 x i32>
  %add = add <8 x i32> %zext, %b ; force domain
  ret <8 x i32> %add
}

define <4 x i64> @and4x64(<4 x i1> %a, <4 x i64> %b) {
  %and = and <4 x i1> %a, <i1 1, i1 1, i1 1, i1 1>
  %zext = zext <4 x i1> %and to <4 x i64>
  %add = add <4 x i64> %zext, %b ; force domain
  ret <4 x i64> %add
}

$ ./llc -o - vpbroadcast.ll -mattr=avx2 | grep vp
    vpand   LCPI0_0(%rip), %xmm0, %xmm0
    vpaddb  %xmm1, %xmm0, %xmm0

    vpand   LCPI1_0(%rip), %xmm0, %xmm0
    vpaddw  %xmm1, %xmm0, %xmm0

    vpbroadcastd    LCPI2_0(%rip), %xmm2
    vpand   %xmm2, %xmm0, %xmm0
    vpaddd  %xmm1, %xmm0, %xmm0

    vpand   LCPI3_0(%rip), %xmm0, %xmm0
    vpaddq  %xmm1, %xmm0, %xmm0

    vpand   LCPI4_0(%rip), %ymm0, %ymm0
    vpaddb  %ymm1, %ymm0, %ymm0

    vpand   LCPI5_0(%rip), %xmm0, %xmm0
    vpmovzxbw   %xmm0, %ymm0
    vpaddw  %ymm1, %ymm0, %ymm0

    vpand   LCPI6_0(%rip), %xmm0, %xmm0
    vpmovzxwd   %xmm0, %ymm0
    vpaddd  %ymm1, %ymm0, %ymm0

    vpbroadcastd    LCPI7_0(%rip), %xmm2
    vpand   %xmm2, %xmm0, %xmm0
    vpmovzxdq   %xmm0, %ymm0
    vpaddq  %ymm1, %ymm0, %ymm0
Quuxplusone commented 7 years ago
Hi Sanjay,

IIUC it looks like the code controlling which vector will get a splat constant
load is:

    // Splat f32, i32, v4f64, v4i64 in all cases with AVX2.
    // For size optimization, also splat v2f64 and v2i64, and for size opt
    // with AVX2, also splat i8 and i16.
    // With pattern matching, the VBROADCAST node may become a VMOVDDUP.
    if (ScalarSize == 32 || (IsGE256 && ScalarSize == 64) ||
        (OptForSize && (ScalarSize == 64 || Subtarget.hasAVX2()))) {

committed by you in https://reviews.llvm.org/rL218263

I took your test case and added optforsize, and indeed all the cases got a
broadcast.

Do you recall any specific reason why we should limit this to f32, i32,
v4f64, and v4i64 only? (When using AVX2 and no optforsize, I mean.)
Quuxplusone commented 7 years ago
(In reply to comment #1)
> Hi Sanjay,
>
> IIUC it looks like the code controlling which vector will get a splat
> constant load is:
>
>     // Splat f32, i32, v4f64, v4i64 in all cases with AVX2.
>     // For size optimization, also splat v2f64 and v2i64, and for size opt
>     // with AVX2, also splat i8 and i16.
>     // With pattern matching, the VBROADCAST node may become a VMOVDDUP.
>     if (ScalarSize == 32 || (IsGE256 && ScalarSize == 64) ||
>         (OptForSize && (ScalarSize == 64 || Subtarget.hasAVX2()))) {
>
> committed by you in https://reviews.llvm.org/rL218263
>
> I took your test case and added optforsize, and indeed all the cases got a
> broadcast.
>
> Do you recall any specific reason why we should limit this to f32, i32,
> v4f64, and v4i64 only? (When using AVX2 and no optforsize, I mean.)

Wow...I almost remember that patch. :)
But no, I have no idea why the AVX2 path was limited to only 32/64-bit elements
and 256-bit vectors. It's probably just a leftover restriction from AVX1 that
needs to be loosened?
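
For concreteness, a minimal sketch of what loosening it might look like, assuming
the code in X86ISelLowering.cpp still has the shape of the snippet quoted above
(untested; this just flips the switch and ignores whether the broadcast is
actually profitable on a given CPU):

    // Hypothetical relaxation of the check quoted in comment #1: with AVX2,
    // also splat 8-, 16-, and 64-bit elements unconditionally, not just
    // under optsize.
    if (ScalarSize == 32 || (IsGE256 && ScalarSize == 64) ||
        (Subtarget.hasAVX2() &&
         (ScalarSize == 8 || ScalarSize == 16 || ScalarSize == 64)) ||
        (OptForSize && ScalarSize == 64)) {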
Quuxplusone commented 7 years ago

I tried to measure the impact of enabling this for AVX2 and I didn't see any gains. What I did see is some regressions. I didn't get to analyze everything, but in general it looks like some loads from the constant pool which were changed into broadcasts got spilled instead of being rematerialized - so we might want to revisit this after https://bugs.llvm.org//show_bug.cgi?id=31331 is fixed.

Quuxplusone commented 7 years ago

Given Andrea and Zvi's comments in bug 31331, I'm also now wondering why we would use broadcast for any of these cases.

I may have been misguided in r218263: if most CPUs have lower throughput and/or longer latency for broadcasts, then we should favor regular loads unless we are optimizing for size.

Quuxplusone commented 7 years ago
(In reply to Sanjay Patel from comment #4)
> Given Andrea and Zvi's comments in bug 31331, I'm also now wondering why we
> would use broadcast for any of these cases.
>
> I may have been misguided in r218263: if most CPUs have lower throughput
> and/or longer latency for broadcasts, then we should favor regular loads
> unless we are optimizing for size.

I am still convinced that we should probably always select a single load from
the constant pool instead of a scalar load plus broadcast.

If we are worried about code size, then we can implement a post-RA fixup pass
that:
 1) Identifies constant pool values which are splat vectors.
 2) For each constant identified in step 1), checks whether it is profitable to materialize a broadcast before every user. Note that this would introduce new definitions, so this step depends on the availability of a vector register.
 3) If all the users of a constant could be fixed up, replaces the vector constant in the pool with its scalar splat value.

We would go through those steps only if we are optimizing for size and the CPU
requested that pass (for example, this may become a check on target features).

That being said, this is just an idea.
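
To make those three steps concrete, here is a rough, hypothetical skeleton of
such a pass (the name X86SplatConstantFixup and the overall structure are
invented for illustration, not an existing LLVM pass; only step 1 is shown, with
steps 2 and 3 left as comments):

    // X86SplatConstantFixup (hypothetical): post-RA pass sketching the three
    // steps above. Names and structure are invented; registration is omitted.
    #include "llvm/CodeGen/MachineConstantPool.h"
    #include "llvm/CodeGen/MachineFunction.h"
    #include "llvm/CodeGen/MachineFunctionPass.h"
    #include "llvm/IR/Constants.h"
    using namespace llvm;

    namespace {
    struct X86SplatConstantFixup : public MachineFunctionPass {
      static char ID;
      X86SplatConstantFixup() : MachineFunctionPass(ID) {}

      bool runOnMachineFunction(MachineFunction &MF) override {
        const MachineConstantPool &MCP = *MF.getConstantPool();
        bool Changed = false;
        for (unsigned CPI = 0, E = MCP.getConstants().size(); CPI != E; ++CPI) {
          const MachineConstantPoolEntry &Entry = MCP.getConstants()[CPI];
          if (Entry.isMachineConstantPoolEntry())
            continue; // Target-specific entry; not a plain IR constant.
          const Constant *C = Entry.Val.ConstVal;
          if (!C->getType()->isVectorTy())
            continue; // Scalar constants are already as small as they get.
          // Step 1: only constant pool entries that are splat vectors qualify.
          const Constant *Splat = C->getSplatValue();
          if (!Splat)
            continue;
          (void)Splat; // Steps 2 and 3 below would use it.
          // Step 2: for every instruction referencing entry CPI, check whether
          // materializing a broadcast of Splat right before it is profitable
          // and whether a vector register is available to hold it.
          // Step 3: if *all* users were rewritten, shrink the pool entry from
          // the vector constant down to the scalar Splat and set Changed.
        }
        return Changed;
      }
    };
    } // end anonymous namespace

    char X86SplatConstantFixup::ID = 0;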
Quuxplusone commented 7 years ago

What I am trying to say is that we can defer the expansion of a constant vector load into a scalar load+broadcast until after regalloc.

If we prematurely materialize a broadcast during ISel, then we (slightly) increase register pressure. If we are unlucky, then this may lead to suboptimal code like in bug 31331.

We can probably be a bit more aggressive during ISel (and, in general, before regalloc) provided that we fix up the code after regalloc by materializing broadcasts.

Quuxplusone commented 7 years ago
With AVX-512, in some cases we can fold broadcasts into broadcast memory ops.
For example,
 vmulps    (%r12){1to16}, %zmm4, %zmm5

So we may want the flexibility to move broadcasts in and out of loops by
folding and unfolding memory ops, similar to what we do with full vector
moves. This is just another consideration.

Andrea's comments make sense and I agree we should prefer a simple and robust
design over marginal code size savings. So if deferring the decision about
generating broadcasts until after RA will give us these benefits, then I
agree.