Open Quuxplusone opened 8 years ago
We've discussed similar code sequences, but in the epilogue code (i.e., using mtlr). Have you seen any examples for the move to spr instructions causing stalls also?
Either way, I think any solution should take into account scheduling of both the move-from and move-to spr instructions.
For the move-to case, I have noticed only the LR register; it is used immediately for the return address, while CR is usually not used immediately. I can't think of any other SPR that is used so frequently.
There are two issues in this LLVM-generated code:
1. We should use "mfocrf" instead of "mfcr". In our test pattern, replacing mfcr
with mfocrf gives an *overall 10%* performance improvement.
2. We should separate mfcr and mflr by 2~4 instructions; if we use mfocrf,
2 instructions should be enough.
PPCFrameLowering.cpp said:
"// FIXME: In the ELFv2 ABI, we are not required to save all CR fields.
// If only one or two CR fields are clobbered, it could be more
// efficient to use mfocrf to selectively save just those fields."
So we are going to fix it.
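For reference, the difference between the two instructions (a sketch; the register and mask values here are illustrative): mfcr copies all eight CR fields, while mfocrf copies only the field(s) selected by the FXM mask, matching the mtocrf already emitted in the epilogue.

```asm
mfcr   12        # copy all eight CR fields into r12 (long latency)
mfocrf 12, 32    # copy only CR2 into r12 (FXM mask 0x20), shorter latency
```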
About issue 2, we are trying to use the LLVM scheduling mechanism. We are
investigating how to do this; if you have any recommendations, please let us know.
Update status:
[SU0] mfcr 12
[SU1] mflr 0
[SU2] std 0, 16(1)
[SU3] stw 12, 8(1)
[SU4] stdu 1, -176(1)
We are able to reorder these instructions into a better sequence. Our current
methods:
1. Move the stack allocation [SU4] to the end of the basic block (before the
terminator instruction), because we need more room to reorder the other
instructions.
2. Add 'mflr' scheduling data in PPCScheduleP8.td; we tell the scheduler that
'mflr' and 'mfcr' use the same resource, and each occupies the execution pipe
for 4 cycles, so post-RA-sched will try to separate them.
3. Mark 'mflr' with the "no side effects" property, so the scheduler can reorder it freely.
4. Add memory operand information for [SU2] and [SU3], so the scheduler can
reorder them without worrying about memory dependencies.
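A sketch of the kind of itinerary change in step 2 (the itinerary class and unit names are illustrative, not the exact PPCScheduleP8.td entry): by charging the SPR-move class four cycles on one shared unit, back-to-back mflr/mfcr compete for the same resource and the post-RA scheduler tries to keep them apart.

```tablegen
// Illustrative only: route the SPR-move class through a single shared
// unit for 4 cycles so consecutive mflr/mfcr occupy the same resource.
InstrItinData<IIC_SprMFSPR,
              [InstrStage<1, [P8_DU1], 0>,
               InstrStage<4, [P8_FXU1]>],
              [4, 1]>
```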
By the way, happy new year 2016!
A typical function prologue/epilogue on ppc64 looks like:
mflr r0 // P1
std r0,16(r1) // P2
...
ld r0, 16(r1) // E1
mtlr r0 // E2
blr // E3
P2 depends on P1, E2 depends on E1, and E3 depends on E2. The following code
sequence completely removes the dependences (P2, P1) and (E2, E1):
std r14, 16(r1)
mflr r14
...
mtlr r14
ld r14, 16(r1)
blr
One problem with this code sequence is that the stack frame layout changes, which
can affect unwinding (and exceptions). Correct DWARF information solves this
problem (at least in SPEC2006).
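A sketch of the unwind information such a sequence would need (DWARF register numbers and offsets here are illustrative): the unwinder must learn both where the caller's r14 was spilled and that the return address temporarily lives in r14.

```asm
std  14, 16(1)           # spill the caller's r14 into the old LR slot
.cfi_offset 14, 16       # caller's r14 is saved at CFA+16
mflr 14                  # return address now lives in r14
.cfi_register 65, 14     # DWARF reg 65 (LR) is currently in r14
...
mtlr 14
ld   14, 16(1)           # restore the caller's r14
blr
```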
I tested the following patch on the SPEC2006 C/C++ applications. Unfortunately it
doesn't bring an obvious performance improvement: only a 3~4s improvement for astar,
which is less than 1%, and no change in the other applications.
Index: lib/Target/PowerPC/PPCFrameLowering.cpp
===================================================================
--- lib/Target/PowerPC/PPCFrameLowering.cpp (revision 262420)
+++ lib/Target/PowerPC/PPCFrameLowering.cpp (working copy)
@@ -851,8 +851,13 @@
.addReg(SPReg);
}
+ int SavedLRReg = isPPC64 ? PPC::X14 : PPC::R14;
if (MustSaveLR)
- BuildMI(MBB, MBBI, dl, MFLRInst, ScratchReg);
+ //BuildMI(MBB, MBBI, dl, MFLRInst, ScratchReg);
+ BuildMI(MBB, MBBI, dl, StoreInst)
+ .addReg(SavedLRReg)
+ .addImm(LROffset)
+ .addReg(SPReg);
if (MustSaveCR &&
!(SingleScratchReg && MustSaveLR)) { // will only occur for PPC64
@@ -888,10 +893,11 @@
if (MustSaveLR)
// FIXME: On PPC32 SVR4, we must not spill before claiming the stackframe.
- BuildMI(MBB, MBBI, dl, StoreInst)
- .addReg(ScratchReg)
- .addImm(LROffset)
- .addReg(SPReg);
+ //BuildMI(MBB, MBBI, dl, StoreInst)
+ // .addReg(ScratchReg)
+ // .addImm(LROffset)
+ // .addReg(SPReg);
+ BuildMI(MBB, MBBI, dl, MFLRInst, SavedLRReg);
if (MustSaveCR &&
!(SingleScratchReg && MustSaveLR)) // will only occur for PPC64
@@ -1014,10 +1020,20 @@
if (MustSaveLR) {
// Describe where LR was saved, at a fixed offset from CFA.
unsigned Reg = MRI->getDwarfRegNum(LRReg, true);
- CFIIndex = MMI.addFrameInst(
- MCCFIInstruction::createOffset(nullptr, Reg, LROffset));
+ unsigned gpr = MRI->getDwarfRegNum(SavedLRReg, true);
+ unsigned CFIIndex1 = MMI.addFrameInst(
+ MCCFIInstruction::createOffset(nullptr, gpr, LROffset));
BuildMI(MBB, MBBI, dl, TII.get(TargetOpcode::CFI_INSTRUCTION))
- .addCFIIndex(CFIIndex);
+ .addCFIIndex(CFIIndex1);
+ unsigned CFIIndex2 = MMI.addFrameInst(
+ MCCFIInstruction::createRegister(nullptr, Reg, gpr));
+ BuildMI(MBB, MBBI, dl, TII.get(TargetOpcode::CFI_INSTRUCTION))
+ .addCFIIndex(CFIIndex2);
+
+ //CFIIndex = MMI.addFrameInst(
+ // MCCFIInstruction::createOffset(nullptr, Reg, LROffset));
+ //BuildMI(MBB, MBBI, dl, TII.get(TargetOpcode::CFI_INSTRUCTION))
+ // .addCFIIndex(CFIIndex);
}
}
@@ -1046,6 +1062,16 @@
for (unsigned I = 0, E = CSI.size(); I != E; ++I) {
unsigned Reg = CSI[I].getReg();
if (Reg == PPC::LR || Reg == PPC::LR8 || Reg == PPC::RM) continue;
+ if ((Reg == SavedLRReg) && MustSaveLR)
+ {
+ unsigned lrReg = MRI->getDwarfRegNum(LRReg, true);
+ int Offset = MFI->getObjectOffset(CSI[I].getFrameIdx());
+ unsigned CFIIndex2 = MMI.addFrameInst(
+ MCCFIInstruction::createOffset(nullptr, lrReg, Offset));
+ BuildMI(MBB, MBBI, dl, TII.get(TargetOpcode::CFI_INSTRUCTION))
+ .addCFIIndex(CFIIndex2);
+ continue;
+ }
// This is a bit of a hack: CR2LT, CR2GT, CR2EQ and CR2UN are just
// subregisters of CR2. We just need to emit a move of CR2.
@@ -1254,10 +1280,12 @@
.addReg(TempReg, getKillRegState(i == e-1));
}
+ int SavedLRReg = isPPC64 ? PPC::X14 : PPC::R14;
if (MustSaveLR)
- BuildMI(MBB, MBBI, dl, LoadInst, ScratchReg)
- .addImm(LROffset)
- .addReg(SPReg);
+ BuildMI(MBB, MBBI, dl, MTLRInst).addReg(SavedLRReg);
+ //BuildMI(MBB, MBBI, dl, LoadInst, ScratchReg)
+ // .addImm(LROffset)
+ // .addReg(SPReg);
if (MustSaveCR &&
!(SingleScratchReg && MustSaveLR)) // will only occur for PPC64
@@ -1289,7 +1317,10 @@
.addReg(TempReg, getKillRegState(i == e-1));
if (MustSaveLR)
- BuildMI(MBB, MBBI, dl, MTLRInst).addReg(ScratchReg);
+ //BuildMI(MBB, MBBI, dl, MTLRInst).addReg(ScratchReg);
+ BuildMI(MBB, MBBI, dl, LoadInst, SavedLRReg)
+ .addImm(LROffset)
+ .addReg(SPReg);
// Callee pop calling convention. Pop parameter/linkage area. Used for tail
// call optimization
Attached ppc-prolog-epilog-inst-schedule.patch
(16301 bytes, text/plain): Let post-RA-scheduler have more space to order instructions
Hi Carrot,
This patch is our current work in progress; actually, we made it a few weeks ago.
It passed our internal full test suite, and we have also benchmarked it.
Below is one example and its result:
Original: With the patch (Reordered):
mflr 0 stdu 1, -240(1)
mfcr 12 mflr 0
std 0, 16(1) std 14, 96(1)
stw 12, 8(1) std 15, 104(1)
stdu 1, -240(1) std 16, 112(1)
std 14, 96(1) std 17, 120(1)
std 15, 104(1) std 18, 128(1)
std 16, 112(1) mfcr 12
std 17, 120(1) std 19, 136(1)
std 18, 128(1) std 30, 224(1)
std 19, 136(1) std 31, 232(1)
std 30, 224(1) std 0, 256(1)
std 31, 232(1) stw 12, 248(1)
#APP #APP
#NO_APP #NO_APP
bl bar bl bar
nop nop
ld 31, 232(1) ld 0, 256(1)
ld 30, 224(1) lwz 12, 248(1)
ld 19, 136(1) ld 31, 232(1)
ld 18, 128(1) ld 30, 224(1)
ld 17, 120(1) ld 19, 136(1)
ld 16, 112(1) ld 18, 128(1)
ld 15, 104(1) ld 17, 120(1)
ld 14, 96(1) ld 16, 112(1)
addi 1, 1, 240 ld 15, 104(1)
ld 0, 16(1) ld 14, 96(1)
lwz 12, 8(1) mtocrf 32, 12
mtocrf 32, 12 mtlr 0
mtlr 0 addi 1, 1, 240
blr blr
Our SPEC2006-INT/FP benchmark results didn't show an obvious performance
improvement; that's why we hesitated about whether to send the patch.
It looks like an out-of-order CPU with a big instruction window, plenty of
execution pipes, and plenty of issue-queue entries reduces how much a software
instruction scheduler can help.
http://users.elis.ugent.be/~leeckhou/papers/hipeac08-eyerman.pdf
Anyway, this patch at least relaxes the instruction-order constraints in the
prologue and epilogue; we will upload it later.
CY
Hi Carrot,
I uploaded the patch to phabricator:
http://reviews.llvm.org/D18030
I re-benchmarked, and this time I found it benefits performance. Could you download
the patch and test it on your machine? Thanks!
I also uploaded a patch that tries to use mfocrf in the prologue when appropriate.
mfocrf has shorter latency compared to mfcr, so we can get an additional benefit from
using this instruction.
http://reviews.llvm.org/D17749
CY
Hi CY
I tested the patch http://reviews.llvm.org/D18030 on power8; the following are the
execution times of several runs of mcf,
w/o patch    w/ the patch
222.7s 216.8s
219.5s 219.0s
217.9s 221.8s
220.0s 222.4s
It seems there is no difference.
(In reply to comment #9)
> Hi CY
>
> I tested the patch http://reviews.llvm.org/D18030 on power8, following is
> the execution time of several runs of mcf,
>
> w/o w/ the patch
> 222.7s 216.8s
> 219.5s 219.0s
> 217.9s 221.8s
> 220.0s 222.4s
>
> It seems there is no difference.
Hi Carrot,
Could you please give me your llvm/clang revision?
Also, how did you test performance? E.g., for us, we bind the run to
physical cpu0 and set cpu0 to the performance governor:
sudo cpufreq-set -c 0 -g performance
taskset -c 0 runspec ...
// after the test, reset cpu0 to ondemand governor
sudo cpufreq-set -c 0 -g ondemand
CY
(In reply to comment #10)
>
> Hi Carrot,
>
> Could you please give me your llvm/clang revision?
>
> Also, how did you test performance, e.g. for us, we will bind the running on
> physical cpu0, and set cpu0 to performance governor:
>
> sudo cpufreq-set -c 0 -g performance
> taskset -c 0 runspec ...
>
> // after the test, reset cpu0 to ondemand governor
> sudo cpufreq-set -c 0 -g ondemand
>
> CY
My llvm/clang revision is 263000.
Our production machine doesn't have cpufreq-set, so I can only run the
benchmarks multiple times to get a stable result.
(In reply to comment #11)
> My llvm/clang revision is 263000.
>
> Our production machine doesn't have cpufreq-set. So I can only run the
> benchmarks multiple times to get a stable result.
Ok, I will re-run experiments and check performance score again.
(In reply to comment #12)
> (In reply to comment #11)
> > My llvm/clang revision is 263000.
> >
> > Our production machine doesn't have cpufreq-set. So I can only run the
> > benchmarks multiple times to get a stable result.
>
> Ok, I will re-run experiments and check performance score again.
Hi Carrot,
Sorry, you are right! The performance is not improved as you mentioned. It's a
frustrating result. We need some internal discussion for our next step. I'm
afraid software instruction scheduling in prologue/epilogue is not helpful for
such kind of out-of-order machine :(
CY
> Hi Carrot,
>
> Sorry, you are right! The performance is not improved as you mentioned. It's
> a frustrating result. We need some internal discussion for our next step.
> I'm afraid software instruction scheduling in prologue/epilogue is not
> helpful for such kind of out-of-order machine :(
>
> CY
I guess the problem is that in a typical prologue/epilogue most of the instructions
are low-latency, except mflr. That means the instructions before mflr can finish
execution and retire quickly, but the instructions after mflr must wait for mflr to
retire, since retirement must be in order. So the latency of mflr can't be hidden
by scheduling.
I agree with your comment.
I did some experiments on 400.perlbench, and I found that "-fno-strict-aliasing" may
be the main reason current LLVM generates slower code than GCC.
In my case, I found extra flags (EXTRA_CFLAGS) were used by 400.perlbench:
-fgnu89-inline -fno-strict-aliasing
I removed "-fno-strict-aliasing" when compiling with LLVM and got a 1%
performance improvement.
My experiments:
Base options: -m64 -O3 -mcpu=power8 -funroll-loops -mno-vsx -ffast-math
-mrecip=!divd -fomit-frame-pointer
With -fno-strict-aliasing:
clang-3.9.0 (r266153) = 557.33 sec
at9.0 (gcc5.2.1) = 549.33 sec
w/o -fno-strict-aliasing:
clang-3.9.0 (r266153) = 551.33 sec
at9.0 (gcc5.2.1) = 563.67 sec
So it looks like "-fno-strict-aliasing" is the root cause.
Did you also use "-fno-strict-aliasing" in EXTRA_CFLAGS when compiling
400.perlbench?
Hi Carrot,
Just letting you know that we are still discussing this issue internally. It might take some time to reach a conclusion.
CY