Missed optimization: inefficient codegen for __builtin_addc

Quuxplusone commented 6 years ago


Bugzilla Link	PR36243
Status	CONFIRMED
Importance	P enhancement
Reported by	Marcin Kościelnicki (koriakin@0x04.net)
Reported on	2018-02-05 14:34:12 -0800
Last modified on	2019-12-27 06:59:01 -0800
Version	trunk
Hardware	PC All
CC	andyg1001@hotmail.co.uk, dave@znu.io, hfinkel@anl.gov, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, nemanja.i.ibm@gmail.com, qshanz@cn.ibm.com, spatel+llvm@rotateright.com
Fixed by commit(s)	rG257acbf6aee9
Attachments
Blocks
Blocked by
See also	PR39464

clang has __builtin_addc* functions, which are supposed to emit hardware add-
with-carry instructions.  However, there is no corresponding intrinsic on LLVM
side, so clang emits a sequence of instructions that is only recognized and
folded to a single hw instruction in two cases:

- carry input is 0, or
- carry output is unused

This means that any carry chains longer than 2 result in inefficient code:

void add3(
    unsigned long long *restrict a,
    unsigned long long *restrict b,
    unsigned long long *restrict c
) {
    unsigned long long cf = 0;
    c[0] = __builtin_addcll(a[0], b[0], cf, &cf);
    c[1] = __builtin_addcll(a[1], b[1], cf, &cf);
    c[2] = __builtin_addcll(a[2], b[2], cf, &cf);
}

Compiles to:

add3:                                   # @add3
        .cfi_startproc
# BB#0:
        movq    (%rdi), %rax
        movq    (%rsi), %r8
        leaq    (%rax,%r8), %rcx
        movq    %rcx, (%rdx)
        movq    8(%rdi), %rcx
        addq    8(%rsi), %rcx
        setb    %r9b
        addq    %r8, %rax
        adcq    $0, %rcx
        setb    %al
        orb     %r9b, %al
        movzbl  %al, %eax
        movq    %rcx, 8(%rdx)
        movq    16(%rsi), %rcx
        addq    16(%rdi), %rcx
        addq    %rax, %rcx
        movq    %rcx, 16(%rdx)
        retq

I suppose we're going to need a new target-independent generic intrinsic,
say { iX, i1 } @llvm.uadd.with.overflow.carry.iX(iX, iX, i1) (and a
corresponding one for subtraction as well) and map it to ISD::ADDE /
ISD::ADDCARRY.

Quuxplusone commented 6 years ago

I have also found this.  Here is the code I am using:

    #include <stdint.h>

    struct UInt96
      {
      uint32_t d2, d1, d0;
      };

    void ADD96_v1(const UInt96& s, UInt96& d)
      {
      uint64_t sum = uint64_t(d.d0) + uint64_t(s.d0);
      d.d0 = sum; sum >>= 32;
      sum += uint64_t(d.d1) + uint64_t(s.d1);
      d.d1 = sum; sum >>= 32;
      sum += uint64_t(d.d2) + uint64_t(s.d2);
      d.d2 = sum;
      }

    void ADD96_v2(const UInt96& s, UInt96& d)
      {
      uint32_t carry;

      d.d0 = __builtin_addc(d.d0, s.d0, 0, &carry);
      d.d1 = __builtin_addc(d.d1, s.d1, carry, &carry);
      d.d2 = __builtin_addc(d.d2, s.d2, carry, &carry);
      }

ADD96_v1 was my original function, and ADD96_v2 is my attempt to rewrite it
using __builtin_addc.  While the second version works, I am surprised that it
is less optimal than the first version: using the Godbolt Compiler Explorer,
the generated ARM code contains an additional three seemingly unnecessary
instructions; worse the PPC code utilises branch operations!

My compiler options are:

-target arm-unknown-linux-gnueabihf -mcpu=cortex-a9 -O2

and:

-target powerpc-unknown-linux-gnu -mcpu=603e -O2

It seems that the __builtin_addc* functions don't really have an advantage; is
that true?

Quuxplusone commented 5 years ago

FYI – https://reviews.llvm.org/D70079

Quuxplusone commented 5 years ago

Current Codegen: https://godbolt.org/z/cVSccY

Quuxplusone commented 4 years ago

rG257acbf6aee9 solves much of the poor codegen generically in DAGCombine

x86/arm code works as expected

ppc struggles as the target still needs to be updated to use ADDCARRY/SUBCARRY

Andy - maybe open a PPC specific bug?

Quuxplusone commented 4 years ago

Adding some PPC guys who might know whats best to do to improve PPC support and reduce the branching.

Do we need to extend the ISD::ADD/SUBCARRY combines (where possible) to better support ISD::ADD/SUBE or can we move PPC away from ISD::ADD/SUBC + ISD::ADD/SUBE?

Quuxplusone commented 4 years ago

We aren't currently working on moving away from the old carry setting/using nodes to the new ones, but we do not have any fundamental issue with doing so. I am not sure this statement is any help though.

Quuxplusone / LLVMBugzillaTest

Missed optimization: inefficient codegen for __builtin_addc #35216