avx512 - codegen regression

Quuxplusone commented 4 years ago


Bugzilla Link	PR46899
Status	NEW
Importance	P enhancement
Reported by	David Bolvansky (david.bolvansky@gmail.com)
Reported on	2020-07-29 13:13:35 -0700
Last modified on	2020-07-29 14:07:35 -0700
Version	trunk
Hardware	PC Linux
CC	craig.topper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

typedef unsigned char uint8_t;

static inline uint8_t x264_clip_uint8( int x )
{
  return x&(~63) ? (-x)>>7 : x;
}

void mc_weight( uint8_t *__restrict dst, uint8_t *__restrict src)
{
    for( int x = 0; x < 16; x++ )
        dst[x] = x264_clip_uint8(src[x]);
}

Clang 9 -Ofast -march=skylake-avx512

mc_weight(unsigned char*, unsigned char*):
        vmovdqu xmm0, xmmword ptr [rsi]
        vpmovzxbd       zmm1, xmm0
        vpcmpnleub      k1, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpxor   xmm0, xmm0, xmm0
        vpsubd  zmm0, zmm0, zmm1
        vpsrld  zmm1 {k1}, zmm0, 7
        vpmovdb xmmword ptr [rdi], zmm1
        vzeroupper
        ret

Clang trunk -Ofast -march=skylake-avx512

mc_weight(unsigned char*, unsigned char*):
        vmovdqu xmm0, xmmword ptr [rsi]
        vpbroadcastq    xmm1, qword ptr [rsi + 8]
        vpmovzxbd       ymm1, xmm1
        vpcmpltub       k1, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpxor   xmm2, xmm2, xmm2
        vpsubd  ymm3, ymm2, ymm1
        vpsrld  ymm3, ymm3, 7
        kshiftrw        k2, k1, 8
        vmovdqa32       ymm3 {k2}, ymm1
        vpmovzxbd       ymm0, xmm0
        vpsubd  ymm1, ymm2, ymm0
        vpsrld  ymm1, ymm1, 7
        vpmovdb xmm2, ymm3
        vmovdqa32       ymm1 {k1}, ymm0
        vpmovdb xmm0, ymm1
        vpunpcklqdq     xmm0, xmm0, xmm2
        vmovdqu xmmword ptr [rdi], xmm0
        vzeroupper
        ret

Godbolt: https://godbolt.org/z/P4bWYb

Quuxplusone commented 4 years ago

Slight regression for -Ofast -mavx512f too

Clang 9
...
        vpsubd  zmm1, zmm1, zmm0
        vpsrld  zmm0 {k1}, zmm1, 7
        vpmovdb xmmword ptr [rdi], zmm0
        vzeroupper
...

Clang trunk

...
        vpsubd  zmm0, zmm0, zmm1
        vpsrld  zmm0, zmm0, 7
        vmovdqa32       zmm0 {k1}, zmm1
        vpmovdb xmmword ptr [rdi], zmm0
        vzeroupper
...

https://godbolt.org/z/1jbTW5

Quuxplusone commented 4 years ago

ICC 19 -Ofast -mavx512f

mc_weight(unsigned char*, unsigned char*):
        vpmovzxbd zmm19, XMMWORD PTR [rsi]                      #12.31
        vpandd    zmm16, zmm19, ZMMWORD PTR .L_2il0floatpacket.0[rip] #5.12
        vptestmd  k1, zmm16, zmm16                              #5.12
        vpxord    zmm17, zmm17, zmm17                           #12.31
        vpsubd    zmm18, zmm17, zmm19                           #12.31
        vpsrad    zmm19{k1}, zmm18, 7                           #12.31
        vpmovdb   XMMWORD PTR [rdi], zmm19                      #12.6
        vzeroupper                                              #13.1
        ret

https://godbolt.org/z/6nT6se

Quuxplusone commented 4 years ago

This appears to be because -mprefer-vector-width=256 became the default for skylake-avx512 in 10.0, due to the frequency drop caused by AVX-512 instructions. Passing -mprefer-vector-width=512 restores the original code.
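
For anyone who wants to compare the two settings locally, here is a minimal sketch of the reduced test case from the report with a small self-check attached. The file name repro.c, the helpers expected_byte, count_mismatches and self_check, and the exact command lines in the comment are illustrative assumptions, not part of the original report. The expected values also assume that >> on a negative int is an arithmetic shift, which holds for Clang and GCC on x86 but is implementation-defined in C.

```c
/*
 * Reduced test case from this report plus a hypothetical self-check.
 * Possible builds to compare the generated assembly (repro.c is a
 * placeholder file name):
 *   clang -Ofast -march=skylake-avx512 -S repro.c
 *   clang -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -S repro.c
 */
#include <assert.h>
#include <stdint.h>

static inline uint8_t x264_clip_uint8(int x)
{
    /* Values 0..63 pass through; anything else becomes (-x) >> 7. */
    return x & (~63) ? (-x) >> 7 : x;
}

void mc_weight(uint8_t *__restrict dst, uint8_t *__restrict src)
{
    for (int x = 0; x < 16; x++)
        dst[x] = x264_clip_uint8(src[x]);
}

/*
 * Hypothetical reference, written without the shift.
 * Assumes arithmetic >> on negative ints:
 *   64..128  -> (-v) >> 7 == -1 -> 255 after truncation to uint8_t
 *   129..255 -> (-v) >> 7 == -2 -> 254 after truncation to uint8_t
 */
static uint8_t expected_byte(uint8_t v)
{
    if (v < 64)
        return v;
    return v <= 128 ? 255 : 254;
}

/* Runs mc_weight over every byte value and counts wrong output bytes. */
int count_mismatches(void)
{
    int bad = 0;
    for (int base = 0; base < 256; base += 16) {
        uint8_t src[16], dst[16];
        for (int i = 0; i < 16; i++)
            src[i] = (uint8_t)(base + i);
        mc_weight(dst, src);
        for (int i = 0; i < 16; i++)
            bad += dst[i] != expected_byte(src[i]);
    }
    return bad;
}

/* Hypothetical convenience wrapper: aborts if any output byte is wrong. */
void self_check(void)
{
    assert(count_mismatches() == 0);
}
```

Calling self_check() from a small main only confirms that both builds compute the same bytes; the regression itself is visible only by diffing the assembly that the two command lines produce.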

Quuxplusone commented 4 years ago

A random minor issue:

mc_weight(unsigned char*, unsigned char*):
        vmovdqu xmm0, xmmword ptr [rsi]
        vpbroadcastq    xmm1, qword ptr [rsi + 8]
        vpmovzxbd       ymm1, xmm1

The vpbroadcastq is superfluous and the vpmovzxbd should fold the load.

avx512 - codegen regression #45868