Quuxplusone / LLVMBugzillaTest

[ppc] Missed vectorization #29963

Open Quuxplusone opened 7 years ago

Quuxplusone commented 7 years ago
Bugzilla Link PR30990
Status NEW
Importance P normal
Reported by Carrot (carrot@google.com)
Reported on 2016-11-11 17:32:53 -0800
Last modified on 2019-09-26 01:50:15 -0700
Version trunk
Hardware PC Linux
CC ehsanamiri@gmail.com, fwage73@gmail.com, hfinkel@anl.gov, kit.barton@gmail.com, llvm-bugs@lists.llvm.org, mkuper@google.com, nemanja.i.ibm@gmail.com, spatel+llvm@rotateright.com
The source code is:

int foo(char* ptr, int l) {
  const char* const end = ptr + l;
  int count = 0;
  while (ptr < end) {
    count += ((signed char)(*ptr) < -0x40) ? 1 : 0;
    ptr++;
  }
  return count;
}

When compiled with -m64 -O2, LLVM unrolls the loop 12 times but still uses scalar instructions:

        ...
.LBB0_4:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        lbzu 0, 12(3)
        ld 5, -160(1)                   # 8-byte Folded Reload
        addi 7, 7, -12
        lbz 20, 1(3)
        lbz 19, 2(3)
        lbz 18, 3(3)
        lbz 14, 4(3)
        cmpld    5, 7
        extsb 0, 0
        lbz 5, 5(3)
        lbz 6, 7(3)
        cmpwi 1, 0, -64
        lbz 8, 9(3)
        extsb 2, 20
        extsb 19, 19
        extsb 18, 18
        extsb 20, 14
        cmpwi 6, 2, -64
        lbz 2, 6(3)
        cmpwi 7, 19, -64
        extsb 0, 5
        lbz 5, 8(3)
        cmpwi 5, 18, -64
        isel 16, 10, 9, 24
        extsb 18, 6
        cmpwi 6, 20, -64
        extsb 19, 2
        extsb 20, 5
        add 12, 16, 12
        isel 14, 10, 9, 20
        cmpwi 5, 19, -64
        lbz 19, 10(3)
        isel 2, 10, 9, 24
        cmpwi 6, 18, -64
        lbz 18, 11(3)
        add 29, 14, 29
        isel 15, 10, 9, 28
        cmpwi 7, 0, -64
        extsb 0, 8
        add 28, 2, 28
        isel 6, 10, 9, 28
        cmpwi 7, 20, -64
        extsb 20, 19
        add 30, 15, 30
        isel 5, 10, 9, 20
        cmpwi 5, 0, -64
        extsb 0, 18
        add 27, 6, 27
        isel 8, 10, 9, 24
        cmpwi 6, 20, -64
        add 25, 5, 25
        isel 19, 10, 9, 28
        cmpwi 7, 0, -64
        add 24, 8, 24
        isel 17, 10, 9, 4
        add 22, 19, 22
        isel 18, 10, 9, 20
        add 11, 17, 11
        isel 20, 10, 9, 24
        add 26, 18, 26
        isel 0, 10, 9, 28
        add 23, 20, 23
        add 21, 0, 21
        bne      0, .LBB0_4
        ...
// the remaining iterations

GCC can vectorize the loop:
         ...
.L4:
        sldi 5,7,4
        addi 7,7,1
        lxvd2x 33,8,5
        xxpermdi 33,33,33,2
        vcmpgtsb 1,2,1
        xxsel 33,35,36,33
        vperm 12,1,5,7
        vperm 1,1,5,8
        vperm 6,12,0,9
        vperm 12,12,0,10
        vperm 11,1,0,9
        vadduwm 6,6,13
        vperm 13,1,0,10
        vadduwm 12,12,6
        vadduwm 12,11,12
        vadduwm 13,13,12
        bdnz .L4
        ...
// the remaining iterations

In one of our internal test cases, the LLVM version is 2.7x slower than the GCC
version on POWER8.
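
To make the difference between the two listings concrete, here is a portable sketch of the 16-bytes-per-iteration structure that the GCC loop has. It is only a model: the names count_scalar and count_blocked are hypothetical, the inner lane loop stands in for what a single vcmpgtsb does across a whole register, and the per-lane accumulator layout is an assumption rather than a transcription of the GCC output.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar reference: the loop from the report, unchanged in behaviour.
int count_scalar(const char* ptr, int l) {
  const char* const end = ptr + l;
  int count = 0;
  while (ptr < end) {
    count += (static_cast<signed char>(*ptr) < -0x40) ? 1 : 0;
    ptr++;
  }
  return count;
}

// Blocked model of a 16-lane vector loop (hypothetical helper).
int count_blocked(const char* ptr, int l) {
  constexpr int kLanes = 16;       // bytes in one 128-bit vector register
  std::uint32_t acc[kLanes] = {};  // 32-bit partial sums, one per byte lane
  int i = 0;
  for (; i + kLanes <= l; i += kLanes) {
    for (int lane = 0; lane < kLanes; ++lane) {
      // A vector compare would produce all 16 of these results at once.
      const bool hit = static_cast<signed char>(ptr[i + lane]) < -0x40;
      acc[lane] += hit ? 1u : 0u;
    }
  }
  // Horizontal reduction of the partial sums after the main loop.
  int count = 0;
  for (int lane = 0; lane < kLanes; ++lane) {
    count += static_cast<int>(acc[lane]);
  }
  // Scalar epilogue for the remaining iterations.
  for (; i < l; ++i) {
    count += (static_cast<signed char>(ptr[i]) < -0x40) ? 1 : 0;
  }
  return count;
}
```

Both functions return the same count for any input length; the blocked form just shows where the compare, the widening to 32-bit sums, and the epilogue fit.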
Quuxplusone commented 7 years ago
This looks like a problem in PPCTTIImpl::getMemoryOpCost: when loading a
<4 x i8>, the cost is 41.
Quuxplusone commented 7 years ago
When VF == 4, MemVT == MVT::v4i8 and ValVT == MVT::v16i8, so
getLoadExtAction returns Expand. getMemoryOpCost then computes the cost of
building the vector from scalar values, which is too high.
However, VSX has lxsiwax, which can load a 32-bit value into a VSX register, so
I added the following code to PPCTargetLowering::PPCTargetLowering():

setLoadExtAction(ISD::EXTLOAD, MVT::v16i8, MVT::v4i8, Custom);

With this change LLVM vectorizes the loop. With the -fno-unroll-loops option,
it now generates:

.LBB0_4:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        lxsiwax 37, 0, 3
        lxvd2x 0, 0, 9
        addi 10, 10, -4
        addi 3, 3, 4
        cmpldi   10, 0
        xxspltw 37, 37, 1
        xxswapd  32, 0
        vcmpgtsb 5, 2, 5
        vperm 5, 5, 5, 0
        xxland 37, 37, 36
        vadduwm 3, 5, 3
        bne      0, .LBB0_4
        ...

There is still a problem: each VSX instruction can handle 16 bytes, so why
doesn't the vectorizer try a vectorization factor of 16?
Quuxplusone commented 7 years ago
Because the widest type in the loop is 32 bits wide.
Passing -vectorizer-maximize-bandwidth will vectorize by 16.

I've started working on making -vectorizer-maximize-bandwidth the default, but
it keeps getting pushed towards the bottom of my todo list.
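
The reply above can be illustrated with a deliberately simplified model of how the vectorization factor follows from the type widths in the loop. This is a sketch, not LLVM code: pickVF is a hypothetical helper, the 128-bit register width is an assumption for VSX, and the real vectorizer also consults its cost model before settling on a factor.

```cpp
#include <algorithm>
#include <initializer_list>

// Hypothetical helper, not an LLVM API. typeBits lists the bit widths of the
// scalar types used in the loop and must not be empty.
unsigned pickVF(unsigned registerBits,
                std::initializer_list<unsigned> typeBits,
                bool maximizeBandwidth) {
  const unsigned widest = std::max(typeBits);
  const unsigned smallest = std::min(typeBits);
  // By default the widest type has to fit in one register; maximizing
  // bandwidth sizes the factor for the smallest type instead.
  return registerBits / (maximizeBandwidth ? smallest : widest);
}
```

For the loop in this report, which loads i8 values and accumulates into an i32, pickVF(128, {8, 32}, false) gives 4 and pickVF(128, {8, 32}, true) gives 16, matching the two factors discussed in this thread.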