| | |
| --- | --- |
| Bugzilla Link | PR28505 |
| Status | NEW |
| Importance | P normal |
| Reported by | Sanjay Patel (spatel+llvm@rotateright.com) |
| Reported on | 2016-07-11 11:48:06 -0700 |
| Last modified on | 2017-06-03 08:20:34 -0700 |
| Version | trunk |
| Hardware | PC All |
| CC | andrea.dibiagio@gmail.com, ayman.musa@intel.com, david.l.kreitzer@intel.com, elad2.cohen@intel.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, zvirack@gmail.com |
| Fixed by commit(s) | |
| Attachments | |
| Blocks | |
| Blocked by | |
| See also | PR31331, PR32564 |
Hi Sanjay,
IIUC it looks like the code controlling which vector will get a splat constant
load is:
// Splat f32, i32, v4f64, v4i64 in all cases with AVX2.
// For size optimization, also splat v2f64 and v2i64, and for size opt
// with AVX2, also splat i8 and i16.
// With pattern matching, the VBROADCAST node may become a VMOVDDUP.
if (ScalarSize == 32 || (IsGE256 && ScalarSize == 64) ||
(OptForSize && (ScalarSize == 64 || Subtarget.hasAVX2()))) {
committed by you in https://reviews.llvm.org/rL218263
I took your test case and added optforsize and indeed all the cases got a
broadcast.
Do you recall any specific reason why we should limit this to f32, i32,
v4f64, v4i64 only? (When using AVX2 and no optforsize, I mean.)
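For reference, a minimal pair of examples (not the original test case from this report, just an illustration using intrinsics) that hit the two sides of that condition:

```c++
#include <immintrin.h>

// Multiply by a splat f32 constant: ScalarSize == 32, so the condition
// quoted above allows a broadcast load of the constant even without optsize.
__m256 scale(__m256 x) {
  return _mm256_mul_ps(x, _mm256_set1_ps(42.0f));
}

// Add a splat i16 constant: ScalarSize == 16, which only passes the quoted
// condition under OptForSize together with AVX2, so without optsize this
// stays a full 32-byte constant-pool load.
__m256i bias(__m256i x) {
  return _mm256_add_epi16(x, _mm256_set1_epi16(7));
}
```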
(In reply to comment #1)
> Hi Sanjay,
>
> IIUC it looks like the code controlling which vector will get a splat
> constant load is:
>
> // Splat f32, i32, v4f64, v4i64 in all cases with AVX2.
> // For size optimization, also splat v2f64 and v2i64, and for size opt
> // with AVX2, also splat i8 and i16.
> // With pattern matching, the VBROADCAST node may become a VMOVDDUP.
> if (ScalarSize == 32 || (IsGE256 && ScalarSize == 64) ||
> (OptForSize && (ScalarSize == 64 || Subtarget.hasAVX2()))) {
>
> committed by you in https://reviews.llvm.org/rL218263
>
> I took your test case and added optforsize and indeed all the cases got a
> broadcast.
>
> Do you recall any specific reason why we should limit this to f32, i32,
> v4f64, v4i64 only? (When using AVX2 and no optforsize, I mean.)
Wow...I almost remember that patch. :)
But no, I have no idea why the AVX2 path was limited to only 32/64-bit elements
and 256-bit vectors. It's probably just a leftover restriction from AVX1 that
needs to be loosened?
I tried to measure the impact of enabling this for AVX2 and I didn't see gains. What I did see is some regressions. I didn't get to analyze everything, but in general it looks like some loads from the constant pool which were changed into broadcasts got spilled instead of being rematerialized - so we might want to consider revisiting this after https://bugs.llvm.org//show_bug.cgi?id=31331 is fixed.
Given Andrea and Zvi's comments in bug 31331, I'm also now wondering why we would use broadcast for any of these cases.
I may have been misguided in r218263: if most CPUs have lower throughput and/or longer latency for broadcasts, then we should favor regular loads unless we are optimizing for size.
(In reply to Sanjay Patel from comment #4)
> Given Andrea and Zvi's comments in bug 31331, I'm also now wondering why we
> would use broadcast for any of these cases.
>
> I may have been misguided in r218263: if most CPUs have lower throughput
> and/or longer latency for broadcasts, then we should favor regular loads
> unless we are optimizing for size.
I am still convinced that we should probably always select a single load from the
constant pool instead of a scalar load plus broadcast.
If we are worried about code size, then we can implement a post-ra fixup pass
(see the rough sketch at the end of this comment) that:
1) Identifies constant pool values which are splat vectors.
2) For each constant identified during step 1), see if it is profitable to materialize a broadcast before every user. Note that this would introduce new definitions, so this step depends on the availability of a vector register.
3) If we were able to fix up all the users of a constant, then we can replace the vector constant with a scalar constant.
We would go through those steps only if we are optimizing for size and the CPU
requested that pass (for example, this may become a check on target features).
That being said, this is just an idea.
What I am trying to say is that we can defer the expansion of a constant vector load into a scalar load+broadcast until after regalloc.
If we prematurely materialize a broadcast during ISel, then we (slightly) increase register pressure. If we are unlucky, then this may lead to suboptimal code like in bug 31331.
We can probably be a bit more aggressive during ISel (and, in general, before regalloc) provided that we fixup the code after regalloc by materializing broadcasts.
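To make the three steps above a bit more concrete, here is a rough toy model of the proposed fixup. None of these types or helpers exist in LLVM (CPEntry, UserInst, canMaterializeBroadcastBefore are invented for this sketch); the real pass would operate on MachineFunction, the MachineConstantPool, and post-RA register availability:

```c++
#include <vector>

// Toy model only: these stand in for constant-pool entries and the
// instructions that use them; none of this is real LLVM API.
struct UserInst {
  bool FreeVecRegAvailable; // is a scratch vector register free at this point?
};

struct CPEntry {
  bool IsSplat = false;           // step 1: consider splat vector constants only
  std::vector<UserInst *> Users;  // instructions that load this constant
  bool NarrowedToScalar = false;  // step 3: pool entry shrunk to its scalar element
};

// Step 2: materializing a broadcast in front of a user introduces a new
// definition, so it is only possible if a vector register is available there.
static bool canMaterializeBroadcastBefore(const UserInst &U) {
  return U.FreeVecRegAvailable;
}

static void fixupConstantPool(std::vector<CPEntry> &Pool, bool OptForSize) {
  // The whole pass only runs when optimizing for size (and, per the comment
  // above, only if the subtarget asks for it).
  if (!OptForSize)
    return;
  for (CPEntry &E : Pool) {
    if (!E.IsSplat)
      continue; // step 1
    // Step 2: check that a broadcast can be placed before every user.
    bool AllUsersOK = !E.Users.empty();
    for (const UserInst *U : E.Users)
      AllUsersOK &= canMaterializeBroadcastBefore(*U);
    // Step 3: only if every user can be fixed up do we rewrite the users and
    // replace the vector constant with a scalar constant in the pool.
    if (AllUsersOK)
      E.NarrowedToScalar = true;
  }
}
```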
With AVX-512, in some cases we can fold broadcasts into broadcast memory ops.
For example,
vmulps (%r12){1to16}, %zmm4, %zmm5
So we may want the flexibility to have broadcast motion in and out of loops
by folding and unfolding memory ops, similar to what we do with full vector
moves. This is just another consideration.
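For illustration (again, not from this report), a source pattern along these lines can end up either as a separate broadcast that gets hoisted out of a loop or folded into the memory operand of the multiply, which is the flexibility being described:

```c++
#include <immintrin.h>

// Multiply a 512-bit vector by a broadcast of a scalar loaded from memory.
// With AVX-512 the backend can keep the broadcast as its own instruction
// (e.g. hoisted out of a loop) or fold it into the multiply as an embedded
// broadcast, i.e. vmulps (%mem){1to16}, %zmmN, %zmmM.
__m512 scale_by(const float *p, __m512 v) {
  return _mm512_mul_ps(v, _mm512_set1_ps(*p));
}
```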
Andrea's comments make sense and I agree we should prefer a simple and robust
design over marginal code size savings. So if deferring the decision about
generation of broadcasts until after RA will give us these benefits, then I
agree.