Quuxplusone / LLVMBugzillaTest


Missed optimization when summing every second number in unsigned #51560

Open Quuxplusone opened 2 years ago

Quuxplusone commented 2 years ago
Bugzilla Link PR52593
Status NEW
Importance P enhancement
Reported by Alexey Dmitriev (llvm-bugs@admitriev.name)
Reported on 2021-11-23 12:39:59 -0800
Last modified on 2021-11-23 14:27:58 -0800
Version trunk
Hardware PC All
CC florian_hahn@apple.com, htmldeveloper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
Suppose we have a function

#include <cstdint>
#include <vector>

template<class T>
double foo(const std::vector<double>& v) {
    double sum = 0;
    T size = v.size();
    for (T i = 0; i < size; i += 2) {   // sum every second element
        sum += v[i];
    }
    return sum;
}

and call it as foo<uint64_t> and foo<int64_t>.
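For instance, a minimal driver exercising both instantiations might look like the following sketch (the vector size and contents here are placeholders I chose for illustration, not the original benchmark harness):

#include <cstdio>

int main() {
    // assumes the foo<T> template and its headers from above
    std::vector<double> v(1 << 20, 1.0);
    std::printf("%f\n", foo<std::uint64_t>(v)); // unsigned induction variable
    std::printf("%f\n", foo<std::int64_t>(v));  // signed induction variable
}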

The signed version is 1.7 times faster (with both libstdc++ and libc++):
https://quick-bench.com/q/EQxNHHtqi9497mG3ovtfG7fQa6s

(Note that gcc generates code for both the signed and unsigned versions that runs
approximately as fast as clang's signed version:
https://quick-bench.com/q/lDXyfset7rLDmOG2mGxjVb1h0Zw)

Indeed, we can see on godbolt that the code generated for the two types is
different: https://godbolt.org/z/Kjv6q7dsr
But seemingly there is no reason why it should be.
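One way to see this: once the trip count is computed up front, the induction variable can no longer wrap, so the signedness of the index stops mattering. A hand-written equivalent of that shape might look like the sketch below (my own illustration of what the optimizer could target, not anything clang is documented to produce):

#include <cstddef>
#include <vector>

double foo_canonical(const std::vector<double>& v) {
    double sum = 0;
    std::size_t n = v.size();
    std::size_t trips = n / 2 + n % 2;   // ceil(n/2); cannot overflow
    for (std::size_t t = 0; t < trips; ++t) {
        sum += v[2 * t];                 // 2*t <= n-1, so the access stays in bounds
    }
    return sum;
}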

One can argue that one possible difference is that the compiler may assume no
overflow in the signed version. But actually we can assume no overflow in the
unsigned version as well, for at least three reasons:
1) if the index overflows, the loop becomes infinite without any side effects,
which is UB;
2) the size is guaranteed to be less than v.max_size(), which is far less than
2**64;
3) the size is computed as the difference of two pointers to an 8-byte type, so
it is bounded by roughly 2**61.
Note that I tried a few ways to use __builtin_assume() to "prove" to the
compiler that there is no overflow, but it didn't help.
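For reference, an attempt of that kind might look roughly like the sketch below; the specific bound of 2**61 is my own choice, taken from reason 3 above, and is not necessarily the exact hint that was tried:

#include <cstdint>
#include <vector>

template<class T>
double foo_assume(const std::vector<double>& v) {
    double sum = 0;
    T size = v.size();
    __builtin_assume(size < (T(1) << 61)); // hint: size is far below the wrap-around point
    for (T i = 0; i < size; i += 2) {
        sum += v[i];
    }
    return sum;
}

Per the report, hints along these lines did not change the generated code.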