Quuxplusone / LLVMBugzillaTest


[ppc] bad scheduling for special register access instructions #25684

Open Quuxplusone opened 8 years ago

Quuxplusone commented 8 years ago
Bugzilla Link PR25685
Status CONFIRMED
Importance P normal
Reported by Carrot (carrot@google.com)
Reported on 2015-11-30 18:30:09 -0800
Last modified on 2021-03-30 01:12:30 -0700
Version trunk
Hardware PC Linux
CC cycheng@multicorewareinc.com, echristo@gmail.com, fwage73@gmail.com, hfinkel@anl.gov, kit.barton@gmail.com, llvm-bugs@lists.llvm.org, tjablin@gmail.com
Fixed by commit(s)
Attachments ppc-prolog-epilog-inst-schedule.patch (16301 bytes, text/plain)
Blocks
Blocked by
See also
This is a very common instruction sequence.

One example is Perl_sv_setsv_flags from perlbench. On POWER8, the LLVM-generated
Perl_sv_setsv_flags consumes 4.26% of run time, while the GCC-generated function
consumes 3.5%.

perf annotate shows the following instructions and cycle percentages for the LLVM-generated code:

      │      00000000100c67d0 <Perl_sv_setsv_flags>:
       │        addis  r2,r12,26
       │        addi   r2,r2,31576
  4.95 │        mfcr   r12
  4.18 │        mflr   r0
  0.07 │        std    r31,-8(r1)
       │        std    r0,16(r1)
  0.02 │        stw    r12,8(r1)
                ...

The two instructions mfcr and mflr account for much of the time. These are slow
instructions, and their results are used immediately, which causes stalls.

For comparison, the following is the GCC-generated code:

       │      00000000100d02f0 <Perl_sv_setsv_flags>:
       │        addis  r2,r12,24
       │        addi   r2,r2,21112
  1.63 │        mflr   r0
       │        cmpld  cr7,r4,r3
  1.57 │        std    r30,-16(r1)
  0.01 │        std    r31,-8(r1)
  1.66 │        mfocrf r12,8
       │        std    r26,-48(r1)
       │        std    r27,-40(r1)
  0.02 │        mr     r31,r3
       │        mr     r30,r4
       │        std    r28,-32(r1)
  2.64 │        std    r29,-24(r1)
       │        stw    r12,8(r1)
       │        std    r0,16(r1)
                ...

It schedules mflr/mfocrf much better relative to the instructions that use
their results.
Quuxplusone commented 8 years ago

We've discussed similar code sequences, but in the epilogue code (i.e., using mtlr). Have you seen any examples of the move-to-SPR instructions also causing stalls?

Either way, I think any solution should account for the scheduling of both the move-from and move-to SPR instructions.

Quuxplusone commented 8 years ago

For the move-to case, I have noticed only the LR register: it is used immediately for the return address. CR is usually not used immediately, and I can't think of any other SPR that is used so frequently.

Quuxplusone commented 8 years ago
There are two issues in this LLVM-generated code:
1. We should use "mfocrf" instead of "mfcr". In our test pattern, replacing
   mfcr with mfocrf yields an *overall 10%* performance improvement.
2. We should separate mfcr and mflr by 2-4 instructions; if we use mfocrf,
   2 instructions should suffice.

PPCFrameLowering.cpp said:
 "// FIXME: In the ELFv2 ABI, we are not required to save all CR fields.
  // If only one or two CR fields are clobbered, it could be more
  // efficient to use mfocrf to selectively save just those fields."

So we are going to fix it.

For issue 2, we are trying to use the LLVM scheduling mechanism. We are still
investigating how to do this; if you have any recommendations, please let us know.
Quuxplusone commented 8 years ago
Status update:

[SU0] mfcr 12
[SU1] mflr 0
[SU2] std 0, 16(1)
[SU3] stw 12, 8(1)
[SU4] stdu 1, -176(1)

We are able to reorder these instructions into a better sequence. Our current
approach:

1. Move the stack allocation [SU4] to the end of the basic block (before the
terminator instruction), because we need more room to reorder the other
instructions.

2. Add 'mflr' scheduling data to PPCScheduleP8.td: we tell the scheduler that
'mflr' and 'mfcr' use the same resource and each occupies the execution pipe
for 4 cycles, so post-RA-sched will try to separate them.

3. Give 'mflr' the "no side effects" property, so the scheduler can reorder it
freely.

4. Add memory operand information for [SU2] and [SU3], so the scheduler can
reorder them without worrying about memory dependences.

By the way, happy new year 2016 @@ ya
Quuxplusone commented 8 years ago
A typical function prologue/epilogue on ppc64 looks like:

mflr r0                  // P1
std    r0,16(r1)         // P2
...
ld     r0, 16(r1)        // E1
mtlr  r0                 // E2
blr                      // E3

P2 depends on P1, E2 depends on E1, and E3 depends on E2. The following code
sequence completely removes the dependences (P2, P1) and (E2, E1):

std   r14, 16(r1)
mflr  r14
...
mtlr  r14
ld     r14, 16(r1)
blr

One problem with this code sequence is that the stack frame layout is changed,
which can impact unwinding (and exceptions). Correct DWARF information solves
this problem (at least in SPEC2006).

I tested the following patch on the SPEC2006 C/C++ applications. Unfortunately
it doesn't bring an obvious performance improvement: only a 3-4s improvement
for astar (less than 1%), and no change for the other applications.

Index: lib/Target/PowerPC/PPCFrameLowering.cpp
===================================================================
--- lib/Target/PowerPC/PPCFrameLowering.cpp (revision 262420)
+++ lib/Target/PowerPC/PPCFrameLowering.cpp (working copy)
@@ -851,8 +851,13 @@
       .addReg(SPReg);
   }

+  int SavedLRReg = isPPC64 ? PPC::X14 : PPC::R14;
   if (MustSaveLR)
-    BuildMI(MBB, MBBI, dl, MFLRInst, ScratchReg);
+    //BuildMI(MBB, MBBI, dl, MFLRInst, ScratchReg);
+    BuildMI(MBB, MBBI, dl, StoreInst)
+      .addReg(SavedLRReg)
+      .addImm(LROffset)
+      .addReg(SPReg);

   if (MustSaveCR &&
       !(SingleScratchReg && MustSaveLR)) { // will only occur for PPC64
@@ -888,10 +893,11 @@

   if (MustSaveLR)
     // FIXME: On PPC32 SVR4, we must not spill before claiming the stackframe.
-    BuildMI(MBB, MBBI, dl, StoreInst)
-      .addReg(ScratchReg)
-      .addImm(LROffset)
-      .addReg(SPReg);
+    //BuildMI(MBB, MBBI, dl, StoreInst)
+    //  .addReg(ScratchReg)
+    //  .addImm(LROffset)
+    //  .addReg(SPReg);
+    BuildMI(MBB, MBBI, dl, MFLRInst, SavedLRReg);

   if (MustSaveCR &&
       !(SingleScratchReg && MustSaveLR)) // will only occur for PPC64
@@ -1014,10 +1020,20 @@
     if (MustSaveLR) {
       // Describe where LR was saved, at a fixed offset from CFA.
       unsigned Reg = MRI->getDwarfRegNum(LRReg, true);
-      CFIIndex = MMI.addFrameInst(
-          MCCFIInstruction::createOffset(nullptr, Reg, LROffset));
+      unsigned gpr = MRI->getDwarfRegNum(SavedLRReg, true);
+      unsigned CFIIndex1 = MMI.addFrameInst(
+          MCCFIInstruction::createOffset(nullptr, gpr, LROffset));
       BuildMI(MBB, MBBI, dl, TII.get(TargetOpcode::CFI_INSTRUCTION))
-          .addCFIIndex(CFIIndex);
+          .addCFIIndex(CFIIndex1);
+      unsigned CFIIndex2 = MMI.addFrameInst(
+          MCCFIInstruction::createRegister(nullptr, Reg, gpr));
+      BuildMI(MBB, MBBI, dl, TII.get(TargetOpcode::CFI_INSTRUCTION))
+          .addCFIIndex(CFIIndex2);
+
+      //CFIIndex = MMI.addFrameInst(
+        //  MCCFIInstruction::createOffset(nullptr, Reg, LROffset));
+      //BuildMI(MBB, MBBI, dl, TII.get(TargetOpcode::CFI_INSTRUCTION))
+        //  .addCFIIndex(CFIIndex);
     }
   }

@@ -1046,6 +1062,16 @@
     for (unsigned I = 0, E = CSI.size(); I != E; ++I) {
       unsigned Reg = CSI[I].getReg();
       if (Reg == PPC::LR || Reg == PPC::LR8 || Reg == PPC::RM) continue;
+      if ((Reg == SavedLRReg) && MustSaveLR)
+      {
+        unsigned lrReg = MRI->getDwarfRegNum(LRReg, true);
+        int Offset = MFI->getObjectOffset(CSI[I].getFrameIdx());
+        unsigned CFIIndex2 = MMI.addFrameInst(
+            MCCFIInstruction::createOffset(nullptr, lrReg, Offset));
+        BuildMI(MBB, MBBI, dl, TII.get(TargetOpcode::CFI_INSTRUCTION))
+            .addCFIIndex(CFIIndex2);
+        continue;
+      }

       // This is a bit of a hack: CR2LT, CR2GT, CR2EQ and CR2UN are just
       // subregisters of CR2. We just need to emit a move of CR2.
@@ -1254,10 +1280,12 @@
         .addReg(TempReg, getKillRegState(i == e-1));
   }

+  int SavedLRReg = isPPC64 ? PPC::X14 : PPC::R14;
   if (MustSaveLR)
-    BuildMI(MBB, MBBI, dl, LoadInst, ScratchReg)
-      .addImm(LROffset)
-      .addReg(SPReg);
+    BuildMI(MBB, MBBI, dl, MTLRInst).addReg(SavedLRReg);
+    //BuildMI(MBB, MBBI, dl, LoadInst, ScratchReg)
+    //  .addImm(LROffset)
+    //  .addReg(SPReg);

   if (MustSaveCR &&
       !(SingleScratchReg && MustSaveLR)) // will only occur for PPC64
@@ -1289,7 +1317,10 @@
         .addReg(TempReg, getKillRegState(i == e-1));

   if (MustSaveLR)
-    BuildMI(MBB, MBBI, dl, MTLRInst).addReg(ScratchReg);
+    //BuildMI(MBB, MBBI, dl, MTLRInst).addReg(ScratchReg);
+    BuildMI(MBB, MBBI, dl, LoadInst, SavedLRReg)
+      .addImm(LROffset)
+      .addReg(SPReg);

   // Callee pop calling convention. Pop parameter/linkage area. Used for tail
   // call optimization
Quuxplusone commented 8 years ago

Attached ppc-prolog-epilog-inst-schedule.patch (16301 bytes, text/plain): Let post-RA-scheduler have more space to order instructions

Quuxplusone commented 8 years ago
Hi Carrot,

The patch reflects our current work; we actually made it a few weeks ago.
It passed our full internal test suite, and we have also benchmarked it.

Below is one example and its result:

Original:            With the patch (Reordered):

mflr 0               stdu 1, -240(1)
mfcr 12              mflr 0
std 0, 16(1)         std 14, 96(1)
stw 12, 8(1)         std 15, 104(1)
stdu 1, -240(1)      std 16, 112(1)
std 14, 96(1)        std 17, 120(1)
std 15, 104(1)       std 18, 128(1)
std 16, 112(1)       mfcr 12
std 17, 120(1)       std 19, 136(1)
std 18, 128(1)       std 30, 224(1)
std 19, 136(1)       std 31, 232(1)
std 30, 224(1)       std 0, 256(1)
std 31, 232(1)       stw 12, 248(1)
#APP                 #APP

#NO_APP              #NO_APP
bl bar               bl bar
nop                  nop
ld 31, 232(1)        ld 0, 256(1)
ld 30, 224(1)        lwz 12, 248(1)
ld 19, 136(1)        ld 31, 232(1)
ld 18, 128(1)        ld 30, 224(1)
ld 17, 120(1)        ld 19, 136(1)
ld 16, 112(1)        ld 18, 128(1)
ld 15, 104(1)        ld 17, 120(1)
ld 14, 96(1)         ld 16, 112(1)
addi 1, 1, 240       ld 15, 104(1)
ld 0, 16(1)          ld 14, 96(1)
lwz 12, 8(1)         mtocrf 32, 12
mtocrf 32, 12        mtlr 0
mtlr 0               addi 1, 1, 240
blr                  blr

Our SPEC2006-INT/FP benchmark results didn't show an obvious performance
improvement, which is why we hesitated about whether to send the patch out.

It looks like an out-of-order CPU with a big instruction window, plenty of
execution pipes, and deep issue queues reduces the benefit of software
instruction scheduling.

http://users.elis.ugent.be/~leeckhou/papers/hipeac08-eyerman.pdf

Anyway, this patch at least relaxes the instruction-order constraints in the
prologue and epilogue; we will upload it later.

CY
Quuxplusone commented 8 years ago
Hi Carrot,

I uploaded the patch to phabricator:
http://reviews.llvm.org/D18030

I re-ran the benchmarks and found that it benefits performance. Could you
download the patch and test it on your machine? Thanks!

I also uploaded a patch that tries to use mfocrf in the prologue when
appropriate. mfocrf has shorter latency compared to mfcr, so we can get
additional benefit from using this instruction.
http://reviews.llvm.org/D17749

CY
Quuxplusone commented 8 years ago
Hi CY

I tested the patch http://reviews.llvm.org/D18030 on POWER8. The following are
the execution times of several runs of mcf:

    w/o      w/ the patch
   222.7s   216.8s
   219.5s   219.0s
   217.9s   221.8s
   220.0s   222.4s

It seems there is no difference.
Quuxplusone commented 8 years ago
(In reply to comment #9)
> Hi CY
>
> I tested the patch http://reviews.llvm.org/D18030 on power8, following is
> the execution time of several runs of mcf,
>
>     w/o      w/ the patch
>    222.7s   216.8s
>    219.5s   219.0s
>    217.9s   221.8s
>    220.0s   222.4s
>
> It seems there is no difference.

Hi Carrot,

Could you please give me your llvm/clang revision?

Also, how did you test performance? For example, we bind the run to
physical cpu0 and set cpu0 to the performance governor:

sudo cpufreq-set -c 0 -g performance
taskset -c 0 runspec ...

// after the test, reset cpu0 to ondemand governor
sudo cpufreq-set -c 0 -g ondemand

CY
Quuxplusone commented 8 years ago
(In reply to comment #10)
>
> Hi Carrot,
>
> Could you please give me your llvm/clang revision?
>
> Also, how did you test performance, e.g. for us, we will bind the running on
> physical cpu0, and set cpu0 to performance governor:
>
> sudo cpufreq-set -c 0 -g performance
> taskset -c 0 runspec ...
>
> // after the test, reset cpu0 to ondemand governor
> sudo cpufreq-set -c 0 -g ondemand
>
> CY

My llvm/clang revision is 263000.

Our production machine doesn't have cpufreq-set. So I can only run the
benchmarks multiple times to get a stable result.
Quuxplusone commented 8 years ago
(In reply to comment #11)
> My llvm/clang revision is 263000.
>
> Our production machine doesn't have cpufreq-set. So I can only run the
> benchmarks multiple times to get a stable result.

Ok, I will re-run experiments and check performance score again.
Quuxplusone commented 8 years ago
(In reply to comment #12)
> (In reply to comment #11)
> > My llvm/clang revision is 263000.
> >
> > Our production machine doesn't have cpufreq-set. So I can only run the
> > benchmarks multiple times to get a stable result.
>
> Ok, I will re-run experiments and check performance score again.

Hi Carrot,

Sorry, you are right! The performance is not improved as you mentioned. It's a
frustrating result. We need some internal discussion for our next step. I'm
afraid software instruction scheduling in prologue/epilogue is not helpful for
such kind of out-of-order machine :(

CY
Quuxplusone commented 8 years ago
> Hi Carrot,
>
> Sorry, you are right! The performance is not improved as you mentioned. It's
> a frustrating result. We need some internal discussion for our next step.
> I'm afraid software instruction scheduling in prologue/epilogue is not
> helpful for such kind of out-of-order machine :(
>
> CY

I guess the problem is that in a typical prologue/epilogue most of the
instructions are low-latency, except mflr. This means the instructions before
mflr can finish execution and retire quickly, but the instructions after mflr
must wait for mflr's retirement, since retirement must be in order. So the
latency of mflr can't be hidden by scheduling.
Quuxplusone commented 8 years ago
I agree with your comment.

I did some experiments on 400.perlbench, and I found that "-fno-strict-aliasing"
may be the main reason that current LLVM generates slower code than GCC.

In my case, I found there were extra flags (EXTRA_CFLAGS) used by 400.perlbench:
-fgnu89-inline -fno-strict-aliasing

I removed "-fno-strict-aliasing" when compiling with LLVM and got a 1%
performance improvement.

My experiments:
Base options: -m64 -O3 -mcpu=power8 -funroll-loops -mno-vsx -ffast-math
              -mrecip=!divd -fomit-frame-pointer

With -fno-strict-aliasing:
  clang-3.9.0 (r266153)   = 557.33 sec
  at9.0 (gcc5.2.1)        = 549.33 sec

w/o -fno-strict-aliasing:
  clang-3.9.0 (r266153)   = 551.33 sec
  at9.0 (gcc5.2.1)        = 563.67 sec

So it looks like "-fno-strict-aliasing" is the root cause.
Did you also use "-fno-strict-aliasing" in EXTRA_CFLAGS when compiling
400.perlbench?
Quuxplusone commented 8 years ago

Hi Carrot,

Just to let you know, we are still discussing this issue internally. It might take some time to reach a conclusion.

CY