__builtin_mul_overflow with mixed sign gives different results compared to GCC

Quuxplusone commented 6 years ago


Bugzilla Link	PR38665
Status	NEW
Importance	P enhancement
Reported by	Jörg Sonnenberger (joerg@NetBSD.org)
Reported on	2018-08-21 16:17:20 -0700
Last modified on	2018-08-22 16:34:43 -0700
Version	7.0
Hardware	PC Linux
CC	efriedma@quicinc.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, richard-llvm@metafoo.co.uk
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

Try to build:

#include <stddef.h>

int f(size_t a, size_t b) {
   ptrdiff_t x;
   return __builtin_mul_overflow(a, b, &x);
}

with clang and gcc for x86_64.

(1) GCC seems to interpret this as uint64_t x uint64_t -> uint64_t
multiplication
(2) Clang seems to widen the types and goes for int128_t instead (__muloti4).

This difference in interpretation seems to break the memory overflow checks in
GNU m4.

Quuxplusone commented 6 years ago

I don't see any evidence that Clang and GCC are giving different results. Can
you give an example?

The semantics of these builtins are:

 * assign the 2's complement result of the operation to the result
 * return 0 if the 2's complement result equals the mathematical result and 1 otherwise

Now, Clang happens to implement this with a 65-bit signed multiplication (with
an overflow check) followed by a check that the 65-bit result fits in a 64-bit
ptrdiff_t, and LLVM happens to not optimize that to something sensible and
instead emits a 128-bit multiply-with-overflow computation, which is why you
get a __muloti4, but the computation performed appears to be correct in both
compilers as far as I can see.

It does appear to be a bug in either Clang or LLVM (or maybe both) that Clang
is generating low-quality IR for this construct and LLVM is not able to replace
it with high-quality IR. We should generate something like:

  %0 = tail call { i64, i1 } @llvm.umul.with.overflow.i64(i64 %a, i64 %b)
  %1 = extractvalue { i64, i1 } %0, 0
  %2 = extractvalue { i64, i1 } %0, 1
  %cmp = icmp slt i64 %1, 0
  %3 = or i1 %2, %cmp

... for which the x86 backend emits good code.

Transforming

  smul.with.overflow.iN+1(A, B) -> zext(umul.with.overflow.iN(trunc(A), trunc(B)))

(when both A and B are known to have a clear high bit) would probably fix this.
(To fully clean up the IR currently coming out of the middle-end, we would also
need LLVM to optimize

  icmp slt (add nuw (zext i64 %A to i65), i65 0x8000000000000000), i65 0

to

  icmp slt i64 %A, 0

or similar. But cleaning up the i65 operations would probably avoid LLVM
generating the add in the first place.)

Quuxplusone commented 6 years ago

I'm pretty sure the intrinsics generate correct results; we did have a bug at one point, but it was fixed (see https://reviews.llvm.org/D41717).

clang has a special case for mixed-sign multiplication to generate more efficient code; see isSpecialMixedSignMultiply in clang/lib/CodeGen/CGBuiltin.cpp. But it only triggers in cases where the operands have different signs, not the case where both operands have a different sign from the result. Probably could be extended.

Regardless of what the IR input looks like, the backend shouldn't generate a call to __muloti4 in environments where it isn't available.

Quuxplusone commented 6 years ago

I tried changing Clang to produce the umul.with.overflow.i64 from comment#1 instead of the smul.with.overflow.i65 we currently produce. That removes the __muloti4 call, but as I feared, LLVM still produces bad code for the "trunc i63/sext i64/icmp eq" following the intrinsic.

I think it makes sense for Clang to try a little harder to avoid producing unnecessary i65 operations in the first place (especially at -O0), but it would also seem sensible to add the missing instcombines so that LLVM can clean this up by itself.

I believe we already have a separate bug open for either preventing the backend from emitting calls to __muloti4 or always linking against compiler-rt to pick up the builtins that libgcc lacks.

Quuxplusone / LLVMBugzillaTest

__builtin_mul_overflow with mixed sign gives different results compared to GCC #37638