Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

[ppc] Slow code for rotate shift and add #31789

Open Quuxplusone opened 7 years ago

Quuxplusone commented 7 years ago
Bugzilla Link PR32817
Status NEW
Importance P enhancement
Reported by Eric Christopher (echristo@gmail.com)
Reported on 2017-04-26 12:46:00 -0700
Last modified on 2017-07-05 22:07:21 -0700
Version trunk
Hardware PC All
CC carrot@google.com, hfinkel@anl.gov, kit.barton@gmail.com, llvm-bugs@lists.llvm.org, nemanja.i.ibm@gmail.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
Seems to go about 25% slower than the gcc equivalent.

echristo@athyra ~/tmp> cat bar.c
unsigned int bar(int k, unsigned int x) {
  return ((1 + (x >> k)) + k);
}
echristo@athyra ~/tmp> clang -target powerpc64le-linux-gnu -S -O2 -o - bar.c
    .text
    .abiversion 2
    .file   "bar.c"
    .globl  bar
    .p2align    4
    .type   bar,@function
bar:                                    # @bar
.Lfunc_begin0:
# BB#0:
    srw 4, 4, 3
    add 3, 3, 4
    addi 3, 3, 1
    clrldi   3, 3, 32
    blr
    .long   0
    .quad   0
.Lfunc_end0:
    .size   bar, .Lfunc_end0-.Lfunc_begin0

    .ident  "clang version google3-trunk (trunk r301047)"
    .section    ".note.GNU-stack","",@progbits
echristo@athyra ~/tmp> powerpc64le-linux-gnu-gcc -S -O2 -o - bar.c
    .file   "bar.c"
    .abiversion 2
    .section    ".toc","aw"
    .section    ".text"
    .machine power8
    .align 2
    .p2align 4,,15
    .globl bar
    .type   bar, @function
bar:
    addi 9,3,1
    srw 4,4,3
    add 3,9,4
    rldicl 3,3,0,32
    blr
    .long 0
    .byte 0,0,0,0,0,0,0,0
    .size   bar,.-bar
    .ident  "GCC: (Google_crosstoolv18-gcc-4.9.x-powerpc64le-grtev4-linux-gnu)
4.9.x-google"
    .section    .note.GNU-stack,"",@progbits
Quuxplusone commented 7 years ago

It is expression reassociation problem. But in PPC backend:

1 PPCInstrInfo::getMachineCombinerPatterns enables it with -O3 only.

2 PPCInstrInfo::isAssociativeAndCommutative enables it for a very limited number of floating point operations.

Quuxplusone commented 7 years ago
Carrot, are you working on this? If so, can you please assign it to yourself?
If not, someone from our team can take a look at it. Thanks.
Quuxplusone commented 7 years ago

Feel free to take it.

Quuxplusone commented 7 years ago

We can certainly turn on reassociation to try to address this problem. But reassociation currently only considers instructions that have the same opcode. We'd need to make it understand the relation between reg/reg and reg/imm forms - for example PPC::ADD4/PPC::ADDI.

Quuxplusone commented 7 years ago
Actually MachineCombiner may have a very hard time with something like this due
to a couple of reasons:
- We only want to run it at -O3
- Different opcodes

It is actually reassociation on the SDAG (in the DAGCombiner) that messes this
up to begin with. It is done without regard to whether it will increase path
length.

I don't really see how this reduction in ILP is a gain on any processor.
This patch fixes this particular issue and if you think it is worth posting on
Phabricator, I can certainly do so (with this test case and possibly some for
the other affected nodes):

Index: lib/CodeGen/SelectionDAG/DAGCombiner.cpp
===================================================================
--- lib/CodeGen/SelectionDAG/DAGCombiner.cpp    (revision 304396)
+++ lib/CodeGen/SelectionDAG/DAGCombiner.cpp    (working copy)
@@ -880,7 +880,9 @@ SDValue DAGCombiner::ReassociateOps(unsigned Opc,
           return DAG.getNode(Opc, DL, VT, N0.getOperand(0), OpNode);
         return SDValue();
       }
-      if (N0.hasOneUse()) {
+      // Reassociation will increase the path length if N0 and N1 use the same
+      // value. Don't reassociate in this case.
+      if (N0.hasOneUse() && !N0.getOperand(0).isOperandOf(N1.getNode())) {
         // reassoc. (op (op x, c1), y) -> (op (op x, y), c1) iff x+c1 has one
         // use
         SDValue OpNode = DAG.getNode(Opc, SDLoc(N0), VT, N0.getOperand(0), N1);
@@ -900,7 +902,9 @@ SDValue DAGCombiner::ReassociateOps(unsigned Opc,
           return DAG.getNode(Opc, DL, VT, N1.getOperand(0), OpNode);
         return SDValue();
       }
-      if (N1.hasOneUse()) {
+      // Reassociation will increase the path length if N0 and N1 use the same
+      // value. Don't reassociate in this case.
+      if (N1.hasOneUse() && !N1.getOperand(0).isOperandOf(N0.getNode())) {
         // reassoc. (op x, (op y, c1)) -> (op (op x, y), c1) iff x+c1 has one
         // use
         SDValue OpNode = DAG.getNode(Opc, SDLoc(N0), VT, N0, N1.getOperand(0));

P.S. The patch does not cause any failures on a double bootstrap build.
Quuxplusone commented 7 years ago
(In reply to Nemanja Ivanovic from comment #5)
> Actually MachineCombiner may have a very hard time with something like this
> due to a couple of reasons:
> - We only want to run it at -O3
> - Different opcodes
>
> It is actually reassociation on the SDAG (in the DAGCombiner) that messes
> this up to begin with. It is done without regard to whether it will increase
> path length.
>
> I don't really see how this reduction in ILP is a gain on any processor.
> This patch fixes this particular issue and if you think it is worth posting
> on Phabricator, I can certainly do so (with this test case and possibly some
> for the other affected nodes):
>
> Index: lib/CodeGen/SelectionDAG/DAGCombiner.cpp
> ===================================================================
> --- lib/CodeGen/SelectionDAG/DAGCombiner.cpp    (revision 304396)
> +++ lib/CodeGen/SelectionDAG/DAGCombiner.cpp    (working copy)
> @@ -880,7 +880,9 @@ SDValue DAGCombiner::ReassociateOps(unsigned Opc,
>            return DAG.getNode(Opc, DL, VT, N0.getOperand(0), OpNode);
>          return SDValue();
>        }
> -      if (N0.hasOneUse()) {
> +      // Reassociation will increase the path length if N0 and N1 use the
> same
> +      // value. Don't reassociate in this case.
> +      if (N0.hasOneUse() && !N0.getOperand(0).isOperandOf(N1.getNode())) {
>          // reassoc. (op (op x, c1), y) -> (op (op x, y), c1) iff x+c1 has
> one
>          // use
>          SDValue OpNode = DAG.getNode(Opc, SDLoc(N0), VT, N0.getOperand(0),
> N1);
> @@ -900,7 +902,9 @@ SDValue DAGCombiner::ReassociateOps(unsigned Opc,
>            return DAG.getNode(Opc, DL, VT, N1.getOperand(0), OpNode);
>          return SDValue();
>        }
> -      if (N1.hasOneUse()) {
> +      // Reassociation will increase the path length if N0 and N1 use the
> same
> +      // value. Don't reassociate in this case.
> +      if (N1.hasOneUse() && !N1.getOperand(0).isOperandOf(N0.getNode())) {
>          // reassoc. (op x, (op y, c1)) -> (op (op x, y), c1) iff x+c1 has
> one
>          // use
>          SDValue OpNode = DAG.getNode(Opc, SDLoc(N0), VT, N0,
> N1.getOperand(0));
>
> P.S. The patch does not cause any failures on a double bootstrap build.

The gcc version is faster is not because of one of (op x, (op y, c1)) and (op
(op x, y), c1) is faster than another one, the reason is it has shorter
dependence chain.
Quuxplusone commented 7 years ago
OK, I called it "path length" you call it "dependence chain". They're the same
thing aren't they? With that DAG combine, you can't do the shift and the add-
immediate in parallel. Disabling the DAG combine allows that to happen. The
code LLVM emits with that patch is:

srw 4, 4, 3
addi 3, 3, 1
add 3, 3, 4
clrldi   3, 3, 32
Quuxplusone commented 7 years ago
I see, your patch specifically test following case
((k + 1) + (x >> k))

It looks too constrained. If I have following DAG

((x >> k) + k) + 1

Or a little different expression

((m + 1) + (x >> k))

LLVM still can't handle them.
Quuxplusone commented 7 years ago
(In reply to Carrot from comment #8)
> I see, your patch specifically test following case
> ((k + 1) + (x >> k))
>
> It looks too constrained. If I have following DAG
>
> ((x >> k) + k) + 1
>
> Or a little different expression
>
> ((m + 1) + (x >> k))
>
> LLVM still can't handle them.

Fair enough, the last one still has the problem. Of course, completely
disabling this reassociation in the SDAG makes the problem go away on all of
them.

So I guess we should decide how to proceed with this. I don't think the
MachineCombiner has the infrastructure to easily undo this. We have a few
options:
- Modify what we do in the MachineCombiner to allow us to handle this
- Add a target hook to DAGCombiner to disable this on PPC
- Add a PPC DAG Combine to undo this
- Add a peephole to "shorten dependency chains"
- Something else I haven't thought of
Quuxplusone commented 7 years ago

Anyone have an opinion on which approach we want to pursue to solve this issue?