Quuxplusone / LLVMBugzillaTest


Nested loop unroll bug on skylake avx512 #43514

Open Quuxplusone opened 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR44544
Status NEW
Importance P normal
Reported by Jakob Schwarz (jakobschwarz@yahoo.com)
Reported on 2020-01-14 08:14:35 -0800
Last modified on 2020-02-05 18:04:10 -0800
Version trunk
Hardware PC Linux
CC blitzrakete@gmail.com, craig.topper@gmail.com, dgregor@apple.com, erik.pilkington@gmail.com, lebedev.ri@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, richard-llvm@metafoo.co.uk, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
I think I found a bug in Clang, tested on local machines and on Godbolt with
Clang 7, 8, and 9. It only occurs with -O3 optimization and
-march=skylake-avx512. With GCC and the Intel compiler the code produces
correct results.

Disabling loop unrolling in the example (e.g. via the commented-out pragma)
also makes Clang produce correct results. The code should print only zeros.

#include <cstdint>
#include <iostream>

int main(int argc, char *argv[])
{
    static constexpr uint32_t mult = 4u;
    static constexpr uint64_t MASK_H = 0x000000000000FFFFull;
    uint64_t arr2[16][4];
    for(auto i=0; i<16; i++) for(auto j=0; j<4; j++) arr2[i][j] = ~uint64_t(0);

    uint64_t* mm =&arr2[0][0];
    for(uint32_t zz=0; zz<16; zz++){
// #pragma clang loop unroll(disable)
        for(uint32_t yy=0; yy<16; yy++){
            const uint32_t ID   = yy+zz*16;
            const uint64_t mask = ~(MASK_H<<(ID%mult*16));
            mm[ID/mult] &= mask;
        }
    }
    for(auto i=0; i<16; i++) {
        for(auto j=0; j<4; j++) std::cout << arr2[i][j] << " ";
        std::cout << std::endl;
    }
    return 0;
}
Quuxplusone commented 4 years ago
Are you testing that on actual -march=skylake-avx512 hardware?
If not, I would say it is very likely that the compiler made use of AVX512
instructions, as you asked it to, but executing them on an AVX512-less machine
leads to this unexpected behavior.
Quuxplusone commented 4 years ago

Speculatively moving to x86 component (if there is a bug, it's most likely in vector codegen).

Quuxplusone commented 4 years ago

Yes, it first occurred with -march=native on my Skylake-AVX512 machine.

Quuxplusone commented 4 years ago

This appears to be a vectorizer issue. I'm seeing a wide 64 x i64 load followed by 6 gather/scatter pairs, and then a 64 x i64 store using data that originated from the 64 x i64 load with some masking applied. This store fully clobbered the updates from the 6 scatters because it used the load value from before the scatters. I'll try to put together some more information tomorrow.