Missed optimization : Loop unrolling causes inefficient use of `adc` as compared to loop rolling


Bugzilla Link	PR44460
Status	NEW
Importance	P enhancement
Reported by	Madhur Chauhan (madhur4127@gmail.com)
Reported on	2020-01-05 00:01:49 -0800
Last modified on	2021-08-14 10:20:27 -0700
Version	9.0
Hardware	PC Linux
CC	craig.topper@gmail.com, florian_hahn@apple.com, htmldeveloper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, neeilans@live.com, richard-llvm@metafoo.co.uk, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

Consider emulating 192-bit integer using a 128-bit integer and a 64-bit integer. In the code sample this emulated integer is used to compute dot product of two uint64_t vectors of length N.

// function to compute dot product of two vectors
using u128 = unsigned __int128;
const int N = 2048;
uint64_t a[N], b[N];
u128 sum = 0;
uint64_t overflow = 0;
for(int i=0;i<N;++i){
    u128 prod = (u128) a[i] * (u128) b[i];
    sum += prod;
    // gcc branches, clang just uses: adc overflow, 0
    overflow += sum<prod;
}

To check for overflow in 128-bit and subsequently propagate the carry to overflow, adc can be used. This idiom works well when loops are rolled (no-unroll).

clang++ -O3 -Wall -Wextra -march=broadwell -fno-unroll-loops

.LBB0_1: # =>This Inner Loop Header: Depth=1 mov rax, qword ptr [rsi + 8rcx] mul qword ptr [rdi + 8rcx] add r10, rax adc r9, rdx

    adc     r11, 0                  # This is efficient form

    inc     rcx
    cmp     rcx, 2048
    jne     .LBB0_1
    mov     qword ptr [r8], r11
    mov     rax, r10
    mov     rdx, r9
    ret

But when loops are unrolled this efficient ASM degrades to mov; setb; movzx; add; Instead of just adc reg, 0.

clang++ -O3 -Wall -Wextra -march=broadwell # fno-unroll-loops is absent

.LBB0_1: # =>This Inner Loop Header: Depth=1 mov rax, qword ptr [rsi + 8rbx] mov r10, qword ptr [rsi + 8rbx + 8] mul qword ptr [rdi + 8rbx] mov r11, rdx mov r14, rax add r14, r9 adc r11, rcx setb bpl mov rax, r10 mul qword ptr [rdi + 8rbx + 8] mov rcx, rax mov r9, rdx movzx ebp, bpl add rcx, r14 adc r9, r11 adc rbp, r15 mov rax, qword ptr [rsi + 8rbx + 16] mul qword ptr [rdi + 8rbx + 16] mov r10, rdx mov r11, rax add r11, rcx adc r10, r9 setb cl mov rax, qword ptr [rsi + 8rbx + 24] mul qword ptr [rdi + 8rbx + 24] movzx r15d, cl mov r9, rax add r9, r11 mov rcx, rdx adc rcx, r10 adc r15, rbp add rbx, 4 cmp rbx, 2048 jne .LBB0_1

For complete source code, here is the godbolt link: https://godbolt.org/z/tT7Z2H

Source of this discussion is the stackoverflow Q&A: https://stackoverflow.com/q/59575408/8199790

Quuxplusone / LLVMBugzillaTest

Missed optimization : Loop unrolling causes inefficient use of `adc` as compared to loop rolling #43430