Open Quuxplusone opened 6 years ago
| Field | Value |
|---|---|
| Bugzilla Link | PR38665 |
| Status | NEW |
| Importance | P enhancement |
| Reported by | Jörg Sonnenberger (joerg@NetBSD.org) |
| Reported on | 2018-08-21 16:17:20 -0700 |
| Last modified on | 2018-08-22 16:34:43 -0700 |
| Version | 7.0 |
| Hardware | PC Linux |
| CC | efriedma@quicinc.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, richard-llvm@metafoo.co.uk |
| Fixed by commit(s) | |
| Attachments | |
| Blocks | |
| Blocked by | |
| See also | |
I don't see any evidence that Clang and GCC are giving different results. Can
you give an example?
The semantics of these builtins are:
* assign the 2's complement result of the operation to the result
* return 0 if the 2's complement result equals the mathematical result and 1 otherwise
Now, Clang happens to implement this with a 65-bit signed multiplication (with an overflow check), followed by a check that the 65-bit result fits in a 64-bit ptrdiff_t. LLVM happens not to optimize that to something sensible and instead emits a 128-bit multiply-with-overflow computation, which is why you get a call to __muloti4, but as far as I can see the computation performed is correct in both compilers.
It does appear to be a bug in either Clang or LLVM (or maybe both) that Clang
is generating low-quality IR for this construct and LLVM is not able to replace
it with high-quality IR. We should generate something like:
```llvm
%0 = tail call { i64, i1 } @llvm.umul.with.overflow.i64(i64 %a, i64 %b)
%1 = extractvalue { i64, i1 } %0, 0
%2 = extractvalue { i64, i1 } %0, 1
%cmp = icmp slt i64 %1, 0
%3 = or i1 %2, %cmp
```
... for which the x86 backend emits good code.
Transforming

```
smul.with.overflow.iN+1(A, B) -> zext(umul.with.overflow.iN(trunc(A), trunc(B)))
```

(when both A and B are known to have a clear high bit) would probably fix this.
(To fully clean up the IR currently coming out of the middle-end, we would also
need LLVM to optimize
```llvm
icmp slt (add nuw (zext i64 %A to i65), i65 0x8000000000000000), i65 0
```

to

```llvm
icmp slt i64 %A, 0
```
or similar. But cleaning up the i65 operations would probably avoid LLVM
generating the add in the first place.)
I'm pretty sure the intrinsics generate correct results; we did have a bug at one point, but it was fixed (see https://reviews.llvm.org/D41717).
clang has a special case for mixed-sign multiplication to generate more efficient code; see `isSpecialMixedSignMultiply` in `clang/lib/CodeGen/CGBuiltin.cpp`. But it only triggers in cases where the operands have different signs, not the case where both operands have a different sign from the result. It could probably be extended.
Regardless of what the IR input looks like, the backend shouldn't generate a call to __muloti4 in environments where it isn't available.
I tried changing Clang to produce the umul.with.overflow.i64 from comment #1 instead of the smul.with.overflow.i65 we currently produce. That removes the __muloti4 call, but as I feared, LLVM still produces bad code for the "trunc i65/sext i64/icmp eq" following the intrinsic.
I think it makes sense for Clang to try a little harder to avoid producing unnecessary i65 operations in the first place (especially at -O0), but it would also seem sensible to add the missing instcombines so that LLVM can clean this up by itself.
I believe we already have a separate bug open for either preventing the backend from emitting calls to __muloti4 or always linking against compiler-rt to pick up the builtins that libgcc lacks.