[ppc] redundant memory store and load for vector::push_back

Quuxplusone commented 8 years ago


Bugzilla Link	PR30556
Status	NEW
Importance	P normal
Reported by	Carrot (carrot@google.com)
Reported on	2016-09-28 17:07:41 -0700
Last modified on	2017-03-29 23:43:28 -0700
Version	trunk
Hardware	PC Linux
CC	echristo@gmail.com, ehsanamiri@gmail.com, hfinkel@anl.gov, kit.barton@gmail.com, llvm-bugs@lists.llvm.org, nemanja.i.ibm@gmail.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

When compile following source code with options

-m64 -O2 -mvsx -mcpu=power8 -std=c++11

#include <cstdint>
#include <vector>
#include <algorithm>

using std::vector;

typedef int64_t int64;
typedef uint64_t uint64;

struct O {
  int64 f1;
  int f2;
  int f3;
};

class B {
 public:
  bool Add(int64 d, vector<uint64> p) {
    const int s = f5.size();
    O o;
    o.f1 = d;               // *
    o.f2 = s;               // *
    o.f3 = p.size();        // *
    f4.push_back(o);        // *
    return true;
  }

  int size() {
    return f4.size();
  }

  vector<O> f4;
  vector<uint64> f5;
};

vector<int64> foo();

int bar(vector<uint64>* y) {
  auto h = foo();
  B b;
  std::for_each(h.rbegin(), h.rend(), [&](const int64& hit) {
    b.Add(hit, *y);
  });

  return b.size();
}

LLVM generates following instructions for statements marked with *

         ...
.LBB0_8:                                # %_ZNSt6vectorImSaImEEC2ERKS1_.exit.i.i
                                        #   in Loop: Header=BB0_2 Depth=1
        ld 3, 128(31)
        ld 4, 120(31)
        ld 5, 112(31)
        rldicl 6, 26, 61, 3
        std 23, 168(31)              // o.f1 = d;
        stw 6, 180(31)               // o.f3 = p.size();
        sub      4, 3, 4
        ld 3, 104(31)
        rldicl 4, 4, 61, 3
        stw 4, 176(31)               // o.f2 = s;
        cmpld    3, 5                // vector full?
        beq      0, .LBB0_10
# BB#9:                                 # %if.then.i.i.i.i1
                                        #   in Loop: Header=BB0_2 Depth=1
        lxvd2x 0, 0, 28              // copy object o
        stxvd2x 0, 0, 3              // to vector f4
        ori 2, 2, 0
        ld 3, 104(31)
        addi 3, 3, 16
        std 3, 104(31)
        b .LBB0_11
        ...

Note that the several st instructions construct the local object o, lxvd2x
immediately load it into register, it is very slow due to store forwarding. At
the same time, the value of o are still in register r23,r6,r4, we can directly
store these registers to vector.
Or even better, it is not needed to construct the local object o in BB8, it is
only required in slow path and before calling
_ZNSt6vectorI1OSaIS0_EE19_M_emplace_back_auxIJRKS0_EEEvDpOT_.

Quuxplusone commented 7 years ago

This is the relevant part of  IR before lowering (I reduced the test case a
little more, so there might be irrelevant differences)

invoke.cont3:
  store i64 %3, i64* %f1.i49, align 8, !tbaa !13
  store i32 %conv.i, i32* %f2.i, align 8, !tbaa !16
  store i32 %conv3.i, i32* %f3.i, align 4, !tbaa !17
   %cmp.i.i = icmp eq %struct.O* %21, %22
  br i1 %cmp.i.i, label %if.else.i.i, label %if.then.i.i

if.then.i.i:                                      ; preds = %invoke.cont3
  %23 = bitcast %struct.O* %o.i to i8*
  %24 = bitcast %struct.O* %21 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %24, i8* nonnull %23, i64 16, i32 8, i1 false) #6, !tbaa.struct !22

if.else.i.i:                                      ; preds = %invoke.cont3
  %f4.i51 = bitcast %class.B* %b to %"class.std::vector.5"*
  invoke void @_ZNSt6vectorI1OSaIS0_EE19_M_emplace_back_auxIJRKS0_EEEvDpOT_(%"class.std::vector.5"* nonnull %f4.i51, %struct.O* nonnull dereferenceable(16) %o.i)
          to label %invoke.cont5 unwind label %lpad4

So if the vector is full, we go to the else branch where we pass a pointer to
o.i to _M_emplace_back_aux. In this case, storing to memory is needed.

We need to
(1) recognize that the input to memcpy is stored in the predecessor block, and
it is avaialble in SSA variables, so memcpy can be replaced by store to avoid
extra load.
(2) Once we do this, we need a forward store motion optimization, to move
stores to the else branch.

Memcpy  is from a function void __gnu_cxx::new_allocator<O>::construct<O, O
const&>(O*, O const&), and ends up in function bar after multiple (I think 4)
levels of inlining
Given that memcpy is not generated by the compiler,  The first question is
which pass and when should handle (1).

Quuxplusone commented 7 years ago

Carrot,

The reduced test case that you have provided, is a very good use case for vector::emplace_back. Is this an option for you to change the source code, and use emplace_back instead of push_back?

I have not checked the PPC code generated for emplace_back, but given the functionality, I expect it to solve exactly this problem. I thought to check with you if source code change is an option or not. If it is, I can double check to ensure we generate good code for emplace_back.

(Also I don't think the problem here is limited to PPC. I believe if we want to fix this, it should be done entirely in the mid-end. Maybe GVN? I hope we can avoid fixing the problem in the compiler though)

Quuxplusone commented 7 years ago

from a quick look at the assembly generated, It seems that using emplace_back,
fixes the problem (1) that I mentioned in comment (1). The problem (2) is
probably still there. ( see my comments marked with  *****)

Code with emplace_back:
.LBB0_6:                                # %invoke.cont3
        ld 3, 128(31)
        ld 4, 120(31)
        ld 6, 112(31)
        std 28, 184(31)
        sub      4, 3, 4
        ld 3, 104(31)
        rldicl 5, 4, 61, 3
        sradi 4, 29, 3
        stw 5, 180(31)
        std 4, 168(31)
        cmpld    3, 6
        beq      0, .LBB0_8
# BB#7:                                 # %if.then.i.i
        addi 29, 3, 16
        stw 5, 8(3)
        std 28, 0(3)
        stw 4, 12(3)
        std 29, 104(31)
        b .LBB0_9
.LBB0_8:                                # %if.else.i.i
.Ltmp5:
        addi 3, 31, 96
        addi 4, 31, 184
        addi 5, 31, 180
        addi 6, 31, 168
        bl _ZNSt6vectorI1OSaIS0_EE19_M_emplace_back_auxIJRlRKimEEEvDpOT_
        nop

.LBB0_6:       // This block still contains all the stores *****
        ld 3, 128(31)
        ld 4, 120(31)
        ld 5, 112(31)
        rldicl 6, 29, 61, 3
        std 28, 176(31)
        stw 6, 188(31)
        sub      4, 3, 4
        ld 3, 104(31)
        rldicl 4, 4, 61, 3
        stw 4, 184(31)
        cmpld    3, 5
        beq      0, .LBB0_8
# BB#7:                                 # %if.then.i.i
        addi 4, 31, 176
        lxvd2x 0, 0, 4   // This load is gone. *****
        stxvd2x 0, 0, 3
        ori 2, 2, 0
        ld 3, 104(31)   // This load is gone *****
        addi 3, 3, 16
        std 3, 104(31)
        b .LBB0_9
.LBB0_8:                                # %if.else.i.i
.Ltmp5:
        addi 3, 31, 96
        addi 4, 31, 176
        bl _ZNSt6vectorI1OSaIS0_EE19_M_emplace_back_auxIJRKS0_EEEvDpOT_
        nop
.Ltmp6:

Quuxplusone commented 7 years ago

In comment 3 I forgot to indicate where the code with push_back starts. Look at the basic block label .LBB0_6: which is the starting point of each code segment.

Quuxplusone commented 7 years ago

(In reply to comment #2)
> Carrot,
>
> The reduced test case that you have provided, is a very good use case for
> vector::emplace_back. Is this an option for you to change the source code,
> and use emplace_back instead of push_back?
>
> I have not checked the PPC code generated for emplace_back, but given the
> functionality, I expect it to solve exactly this problem. I thought to check
> with you if source code change is an option or not. If it is, I can double
> check to ensure we generate good code for emplace_back.
>
> (Also I don't think the problem here is limited to PPC. I believe if we want
> to fix this, it should be done entirely in the mid-end. Maybe GVN? I hope we
> can avoid fixing the problem in the compiler though)

I agree with you this is a mid-end problem. But unfortunately I can't modify
the source code.

Quuxplusone commented 7 years ago

Definitely different code generation now, haven't looked at the performance implications though.

Quuxplusone / LLVMBugzillaTest

[ppc] redundant memory store and load for vector::push_back #29529