Avoid generating branch in branchless code #40080

Open Quuxplusone opened 5 years ago

Quuxplusone commented 5 years ago
Bugzilla Link PR41110
Status NEW
Importance P enhancement
Reported by David Bolvansky (david.bolvansky@gmail.com)
Reported on 2019-03-17 03:46:18 -0700
Last modified on 2019-08-05 14:44:32 -0700
Version trunk
Hardware PC Linux
CC andrea.dibiagio@gmail.com, craig.topper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
#include <stddef.h>

struct edge {
    int from;
    int to;
};

int edge_to_rank(struct edge e, size_t n) {
    return e.from < e.to ? e.to - 1 : e.from - 1 + n / 2;
}

Clang -O3:
edge_to_rank(edge, unsigned long): # @edge_to_rank(edge, unsigned long)
  mov rax, rdi
  movabs rcx, -4294967296
  mov rdx, rdi
  shr rdx, 32
  cmp eax, edx
  jge .LBB0_2
  add rax, rcx
  sar rax, 32
  ret
.LBB0_2:
  shl rax, 32
  add rax, rcx
  sar rax, 32
  shr rsi
  add rax, rsi
  ret

GCC -O3:
edge_to_rank(edge, unsigned long):
  mov rax, rdi
  sar rax, 32
  cmp edi, eax
  jge .L2
  dec eax
  ret
.L2:
  shr rsi
  lea eax, [rdi-1+rsi]
  ret

ICC -O3:
edge_to_rank(edge, unsigned long):
        mov       rax, rdi                                      #9.21
        movsxd    rdx, edi                                      #9.48
        shr       rsi, 1                                        #9.56
        shr       rax, 32                                       #9.21
        movsxd    rcx, eax                                      #9.35
        dec       rcx                                           #9.35
        cmp       edi, eax                                      #9.21
        lea       r8, QWORD PTR [-1+rdx+rsi]                    #9.56
        cmovl     r8, rcx                                       #9.21
        mov       eax, r8d                                      #9.21
        ret

I think Clang trunk generates suboptimal code. GCC's output is small and nice
(though maybe the three-operand LEA should not be used), but it comes with a branch...

ICC generates the code I would expect: branchless, using cmov.

https://godbolt.org/z/z6uzsn
Quuxplusone commented 5 years ago

Yeah, we shouldn't be repeating code like that, although whether we end up with a branch or a cmov probably affects how/when we merge.

We have been finding cases where cmov isn't great when we have access to profile metadata hinting that the branch isn't 50:50, so GCC's branchy output isn't necessarily worse in those cases.
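For illustration, this is the kind of skewed profile metadata being referred to, written as plain LLVM IR (the function name and weights are made up): a !prof branch_weights annotation marking the branch as far from 50:50, which is the hint that can make a real branch preferable to a cmov.

define i32 @skewed_min(i32 %a, i32 %b) {
entry:
  %c = icmp slt i32 %a, %b
  br i1 %c, label %t, label %f, !prof !0   ; heavily biased toward %f
t:
  ret i32 %a
f:
  ret i32 %b
}

!0 = !{!"branch_weights", i32 1, i32 1000} ; 1:1000, nowhere near 50:50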

Quuxplusone commented 5 years ago

I think we had decent IR at one point that should have gotten us close to the gcc output, but then canEvaluateSExtd got a little over-excited and turned everything into 64-bit, inserting the shl-by-32 and sar-by-32 to do so.

I think canEvaluateSExtd decided that, because it found a truncate from i64 to i32 earlier in the computation, it was a good idea to promote everything to i64, even though the truncate had other users and we still had to use shifts to redo the sign extension in the i64 value.
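For reference, the shift-based sign extension mentioned here looks like this in IR (a standalone, made-up reduction, not the actual IR from this test case): a shl by 32 followed by an ashr by 32 sign-extends the low 32 bits of an i64 in place, which is what shows up as the shl/sar pair in the Clang output above.

define i64 @sext_low32_via_shifts(i64 %v) {
  %hi = shl i64 %v, 32
  %lo = ashr i64 %hi, 32   ; arithmetic shift restores the sign of the low half
  ret i64 %lo
}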

We might also need to teach InstCombine's visitTruncate to simplify (trunc (phi (sext X), Y)) to (phi X, (trunc Y)).
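A minimal before/after sketch of that suggested fold (block and value names are made up): the trunc of the phi is pushed into the incoming values, so the sext/trunc pair cancels on one edge and only the other incoming value still needs a trunc.

define i32 @trunc_of_phi_before(i32 %x, i64 %y, i1 %c) {
entry:
  br i1 %c, label %a, label %b
a:
  %sx = sext i32 %x to i64
  br label %m
b:
  br label %m
m:
  %p = phi i64 [ %sx, %a ], [ %y, %b ]
  %t = trunc i64 %p to i32
  ret i32 %t
}

define i32 @trunc_of_phi_after(i32 %x, i64 %y, i1 %c) {
entry:
  br i1 %c, label %a, label %b
a:
  br label %m
b:
  %ty = trunc i64 %y to i32
  br label %m
m:
  %p = phi i32 [ %x, %a ], [ %ty, %b ]
  ret i32 %p
}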

Quuxplusone commented 5 years ago
Probably doesn't do anything for the motivating problem, but this 64-bit
constant op:
  movabs rcx, -4294967296
  shl rax, 32
  add rax, rcx

...is bogus. We should be moving 'add' before 'shl' late in combining to get a
32-bit constant:
    leal    -1(%rdi), %eax
    shlq    $32, %rax

That likely requires carving out an exception using the
isDesirableToCommuteWithShift() hook. Otherwise, we'll have opposing transforms.

https://rise4fun.com/Alive/f9a
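The rewrite in question can be shown directly in IR (function names are made up). Since -4294967296 is -1 shifted left by 32, (add (shl x, 32), -4294967296) computes the same value as (shl (add x, -1), 32), and the second form only needs the 32-bit immediate -1 instead of a movabs-sized constant:

define i64 @shl_then_add(i64 %x) {
  %s = shl i64 %x, 32
  %a = add i64 %s, -4294967296   ; 64-bit immediate
  ret i64 %a
}

define i64 @add_then_shl(i64 %x) {
  %a = add i64 %x, -1            ; 32-bit immediate
  %s = shl i64 %a, 32
  ret i64 %s
}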
Quuxplusone commented 5 years ago
  // (add (shl X, C2), C1) -> shl (add X, C1 >> C2), C2
  // iff (C1 & ((1<<C2)-1)) == 0
  ConstantSDNode *N1C = isConstOrConstSplat(N1);
  if (N0.getOpcode() == ISD::SHL && N1C &&
      TLI.isDesirableToCommuteWithShift(N0.getNode(), Level)) {
    ConstantSDNode *ShiftAmt = isConstOrConstSplat(N0.getOperand(1));
    unsigned BitWidth = N1C->getAPIntValue().getBitWidth();
    if (ShiftAmt && ((N1C->getAPIntValue() &
         (APInt::getSignMask(ShiftAmt->getSExtValue()).zextOrTrunc(BitWidth) -
          1)) == 0)) {
      SDValue Shift = DAG.FoldConstantArithmetic(ISD::SRA, DL, VT, N1.getNode(),
                                                 N0.getOperand(1).getNode());
      SDValue Add = DAG.getNode(ISD::ADD, DL, VT, N0.getOperand(0), Shift);
      AddToWorklist(Shl0.getNode());
      return DAG.getNode(ISD::SHL, DL, VT, Add, N0.getOperand(1));
    }
  }

With:
define i64 @ok1(i64 %x)  {
  %shl = shl i64 %x, 32
  %add = add i64 %shl, -4294967296
  ret i64 %add
}

Clang produces:
    leal    -1(%rdi), %eax
    shlq    $32, %rax

Sadly, with:
define i64 @ok2(i64 %x)  {
  %shl = shl i64 %x, 5
  %add = add i64 %shl, 32
  ret i64 %add
}

Clang/llc goes into an infinite loop (each pattern keeps reversing the other one).

Any advice on how to avoid / break the infinite loop?
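To make the ping-pong concrete, these are the two shapes that the proposed add-of-shl fold and the opposing shl-of-add canonicalization keep converting @ok2 between (function names are made up); since 32 == 1 << 5, each rewrite exactly undoes the other:

define i64 @ok2_add_of_shl(i64 %x) {
  %s = shl i64 %x, 5
  %a = add i64 %s, 32
  ret i64 %a
}

define i64 @ok2_shl_of_add(i64 %x) {
  %a = add i64 %x, 1
  %s = shl i64 %a, 5
  ret i64 %s
}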
Quuxplusone commented 5 years ago
* Correction to the snippet above: the AddToWorklist call should read AddToWorklist(Add.getNode());
Quuxplusone commented 5 years ago
(In reply to David Bolvansky from comment #4)
> Any advice on how to avoid / break the infinite loop?

I don't see a way to do this in general (DAGCombiner or X86ISelLowering)
because reversing the order would break a lot of existing folds. It might be
possible to add a late and limited match for this pattern to
X86ISelDAGToDAG.cpp (X86DAGToDAGISel class).