SLP regression on SystemZ #31378

Closed Quuxplusone closed 7 years ago

Quuxplusone commented 7 years ago
Bugzilla Link PR32406
Status RESOLVED FIXED
Importance P normal
Reported by Jonas Paulsson (paulsson@linux.vnet.ibm.com)
Reported on 2017-03-24 06:33:24 -0700
Last modified on 2017-04-10 07:13:02 -0700
Version trunk
Hardware PC Linux
CC llvm-bugs@lists.llvm.org, mssimpso@codeaurora.org, spatel+llvm@rotateright.com, tstellar@redhat.com
Fixed by commit(s)
Attachments SetupFastFullPelSearch.ll (78108 bytes, text/plain)
Blocks
Blocked by
See also
Created attachment 18165
Function that the loop lives in -- not able to reduce further.

I have come across a major regression caused by SLP vectorization (+18%
on SystemZ, just from enabling SLP). It all comes down to one particular very
hot loop.

Scalar code:
  %conv252 = zext i16 %110 to i64
  %conv254 = zext i16 %111 to i64
  %sub255 = sub nsw i64 %conv252, %conv254
  ... repeated

SLP output:
  %101 = zext <16 x i16> %100 to <16 x i64>
  %104 = zext <16 x i16> %103 to <16 x i64>
  %105 = sub nsw <16 x i64> %101, %104
  %106 = trunc <16 x i64> %105 to <16 x i32>
  for each element e 0:15
   %107 = extractelement <16 x i32> %106, i32 e
   %108 = sext i32 %107 to i64

The vectorized code should in this case only have to be

  %101 = zext <16 x i16> %100 to <16 x i64>
  %104 = zext <16 x i16> %103 to <16 x i64>
  %105 = sub nsw <16 x i64> %101, %104
  for each element e 0:15
   %107 = extractelement <16 x i64> %105, i32 e

but this is not handled, so for all 16 elements both extracts *and
extends* are done.
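The reason the trunc and per-element sext are redundant can be checked with plain integer arithmetic. The following is a minimal sketch in Python, not LLVM code; the helper names (`zext16`, `trunc_to_i32`, `sext_i32_to_i64`, `roundtrip`, `direct`) are made up for illustration. The difference of two zero-extended i16 values always lies in [-65535, 65535], so truncating it to i32 and sign-extending back to i64 returns the original value.

```python
def zext16(x):
    """Model `zext i16 -> i64`: keep the low 16 bits as a non-negative value."""
    return x & 0xFFFF

def trunc_to_i32(x):
    """Model `trunc i64 -> i32`: keep the low 32 bits as a raw bit pattern."""
    return x & 0xFFFFFFFF

def sext_i32_to_i64(bits):
    """Model `sext i32 -> i64`: reinterpret a 32-bit pattern as a signed value."""
    return bits - (1 << 32) if bits & 0x80000000 else bits

def roundtrip(a, b):
    """Current SLP output per lane: sub, trunc to i32, extract, sext to i64."""
    wide = zext16(a) - zext16(b)
    return sext_i32_to_i64(trunc_to_i32(wide))

def direct(a, b):
    """Desired output per lane: sub in i64, then extract the i64 element."""
    return zext16(a) - zext16(b)

# Boundary values of i16; the extreme differences are +/-65535.
samples = [0, 1, 2, 0x7FFF, 0x8000, 0xFFFE, 0xFFFF]
mismatches = [(a, b) for a in samples for b in samples
              if roundtrip(a, b) != direct(a, b)]
```

With these inputs `mismatches` stays empty, which matches the claim that the narrowing is value-preserving here; whether it is profitable is a separate, target-specific question.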

I see that there is a special function in the SLP vectorizer that does this
truncation and extract+extend whenever possible. Is that the place to fix this?

Or would it be better to rely on InstCombiner?

Is this truncation done by SLP with the assumption that it is free to extend an
extracted element? On SystemZ, this is not true.

/Jonas

Run:
  bin/opt -O3 -S -o out.opt.ll -mtriple=s390x-linux-gnu -mcpu=z13 SetupFastFullPelSearch.ll

The bad code lives in for.body249 after SLP.
Quuxplusone commented 7 years ago

Attached SetupFastFullPelSearch.ll (78108 bytes, text/plain): function that the loop lives in -- not able to reduce further.

Quuxplusone commented 7 years ago

This particular case was handled by improving the cost functions: they now take into account that the target has an extending scalar load, which is cheaper.

See https://reviews.llvm.org/D29631 (BasicTTIImpl.h)
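
As a rough illustration of why a cheaper scalar extension changes the outcome, here is a toy cost comparison in Python. It is a sketch under stated assumptions, not LLVM's cost tables or the code from the review: all unit costs are hypothetical, the function names are invented, and the decision rule is assumed to have the form "vectorize when vector cost minus scalar cost is below the negated threshold".

```python
LANES = 16  # vector width in the report: <16 x i16> operands

def scalar_cost(zext_cost, sub_cost=1):
    """Per lane: two zero-extensions and one subtraction."""
    return LANES * (2 * zext_cost + sub_cost)

def vector_cost(vec_zext_cost, vec_sub_cost, extract_cost):
    """Two vector zexts, one vector sub, plus one extract per lane."""
    return 2 * vec_zext_cost + vec_sub_cost + LANES * extract_cost

def slp_profitable(vec, scalar, threshold=0):
    """Assumed rule: vectorize only if the cost delta is below -threshold."""
    return (vec - scalar) < -threshold

# Hypothetical unit costs, chosen only to illustrate the effect.
vec = vector_cost(vec_zext_cost=6, vec_sub_cost=8, extract_cost=1)

# Scalar zext modeled as a separate instruction (cost 1 each).
separate_zext = scalar_cost(zext_cost=1)

# Scalar zext modeled as folded into an extending load (cost 0).
folded_zext = scalar_cost(zext_cost=0)

decision_before = slp_profitable(vec, separate_zext)
decision_after = slp_profitable(vec, folded_zext)
```

Under these made-up numbers the vectorized tree looks profitable while the scalar extensions are charged separately, and stops looking profitable once they are treated as part of the load.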