Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

SLP Vectorizer: some issues with getSpillCost() #26541

Open Quuxplusone opened 8 years ago

Quuxplusone commented 8 years ago
Bugzilla Link PR26542
Status NEW
Importance P normal
Reported by ayal.zaks@intel.com
Reported on 2016-02-09 05:06:45 -0800
Last modified on 2016-02-16 13:21:55 -0800
Version trunk
Hardware PC Windows NT
CC ayal.zaks@intel.com, hfinkel@anl.gov, llvm-bugs@lists.llvm.org, nadav.rotem@me.com
Fixed by commit(s)
Attachments slpMin.ll (2634 bytes, application/octet-stream)
Blocks
Blocked by
See also
Created attachment 15861
ll version of the example

BoUpSLP::getSpillCost() is said to do the following:
  // Walk from the bottom of the tree to the top, tracking which values are
  // live. When we see a call instruction that is not part of our tree,
  // query TTI to see if there is a cost to keeping values live over it
  // (for example, if spills and fills are required).

Some issues (with the implementation, not the documentation ;-):

1. The first sentence above assumes that a traversal of the VectorizableTree[]
array from its first entry (0, the root) to its last entry, corresponds to a
bottom-up traversal of the associated instructions in program order, thereby
supporting liveness data-flow analysis. However, the entries in this array are
not guaranteed to appear in program order. In the example below involving a
subtraction the minuend appears closer to the root of the VectorizableTree than
the subtrahend, contrary to the order in which they appear inside the basic
block. This messes up the liveness scan. Thanks to Michael Kuperstein for a
lively discussion on this.

2. The second sentence above indicates that we accumulate a spill/fill cost per
call instruction. However, if the associated calling conventions provide callee-
saved vector registers, we may be able to pay this cost once. Furthermore, even
if all associated vector registers are caller-saved, if a live range between
two VectorizableTree entries contains several calls, as in the example below,
we can spill/fill once for all.

3. In order to support PR17745 (SLP Vectorizer: Partial Tree Vectorization), we
should attribute an individual spill cost to each VectorizableTree entry.

4. A live range may span more than two basic blocks. If we want to keep
checking only two (at both ends of the live range) in the interest of compile
time, add a comment.

5. Finally, targets that do not override the default
getCostOfKeepingLiveOverCall() which returns 0, should avoid searching for
calls throughout the live ranges of all values in the SLP chain, just to
accumulate zero costs.

Consider the following example, SLP vectorizing to VF=4:

void slp (int, int);

void slp_call (int *a, int * b, int n)
{
  int a0 = a[1];
  int a1 = a[2];
  int a2 = a[3];
  int a3 = a[4];
  slp(n, *a);
  slp(n, *a);
  slp(n, *a);
  int b0 = b[2];
  int b1 = b[3];
  int b2 = b[4];
  int b3 = b[5];
  a0 = a0 - b0;
  a1 = a1 - b1;
  a2 = a2 - b2;
  a3 = a3 - b3;
  b[2] = a0;
  b[3] = a1;
  b[4] = a2;
  b[5] = a3;
}

Clearly only one vector register needs to be spilled and filled before and
after the 3 calls to slp() (if at all, considering remat, but that's another
story). But as we start to traverse up the tree, we first account for the
single live value feeding the store, then the 2 live values feeding the
subtract (erasing the former single value); next we will reach the loads from
a[1:4] (not the loads from b[2:5]), and before doing so we encounter the 3
calls and accumulate a cost of spilling/filling the above 2 live values each.

This order is evident by running -slp-vectorizer -dump:
SLP: #LV: 1 sub, Looking at   %sub = sub nsw i32 %0, %7
SLP: #LV: 2  , Looking at   %0 = load i32, i32* %arrayidx, align 4, !tbaa !1
SLP: #LV: 1 , Looking at   %7 = load i32, i32* %arrayidx4, align 4, !tbaa !1
Quuxplusone commented 8 years ago

Attached slpMin.ll (2634 bytes, application/octet-stream): ll version of the example

Quuxplusone commented 8 years ago

Eyal, thank you for looking at this. I think that modeling spills at the SLP-vectorizer level is a really bad idea. The SLP vectorizer usually reduces register pressure because it enables the use of vector registers in addition to scalar registers. Attempting to predict register pressure is pointless. It's very difficult to guess what ISel and the scheduler would do, especially when vectorizing across basic blocks. This whole thing feels like a hack that was inserted to handle one specific workload and I suspect that if we simply remove this code we won't see any regressions in the LLVM test suite.

I suggest that we rip out getSpillCost.

Quuxplusone commented 8 years ago
Indeed, remat or e.g. scheduling calls above loads can mitigate spill costs, as
hinted in the attached example. And using vector in addition to scalar
registers should be an advantage, if they are available. However, consider the
3rd issue - in order to best support TSLP, if we encounter calls whose
conventions have no callee saved vector registers, this should affect our
throttling process, right?
As for ripping getSpillCost out, this should best be championed by its authors.
This PR suggests improvements and implicitly that its tests should cover a
wider range of tree(s).