Quuxplusone / LLVMBugzillaTest

Suboptimal code for _mm256_zextsi128_si256(_mm_set1_epi8(-1)) #44778

Open Quuxplusone opened 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR45808
Status NEW
Importance P enhancement
Reported by Nemo Publius (nemo@self-evident.org)
Reported on 2020-05-05 13:03:37 -0700
Last modified on 2020-07-01 14:18:55 -0700
Version trunk
Hardware PC Linux
CC craig.topper@gmail.com, florian_hahn@apple.com, htmldeveloper@gmail.com, lebedev.ri@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
Fixed by commit(s) rGfe6f5ba0bffd, rGb8a725274c22, rG3521ecf1f8a3
Attachments
Blocks
Blocked by
See also PR42653
Related: Bug #45806 and https://stackoverflow.com/q/61601902/

I am trying to produce an AVX2 mask with all-ones in the lower lane and all-
zeroes in the upper lane of a YMM register. The code I am using is:

    __m256i mask = _mm256_zextsi128_si256(_mm_set1_epi8(-1));

This should produce a single instruction like `vpcmpeqd %xmm0,%xmm0,%xmm0`, but
Clang insists on putting the value into memory and loading it.

The behavior in context is even more odd:

    __m256i minmax(__m256i v1, __m256i v2)
    {
        __m256i comp = _mm256_cmpgt_epi64(v1, v2);
        __m256i mask = _mm256_zextsi128_si256(_mm_set1_epi8(-1));
        return _mm256_blendv_epi8(v2, v1, _mm256_xor_si256(comp, mask));
    }

This goes through a bunch of contortions with extracting, shifting, and
expanding 128-bit registers when I feel like the result I want is pretty
straightforward.

Godbolt example: https://gcc.godbolt.org/z/GPhJ6s
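
For clarity, a scalar model of what minmax computes per signed 64-bit element (this is my reading of the intrinsics above, not part of the original report): the low two elements, where the mask is all-ones, get min(v1, v2); the high two elements get max(v1, v2).

    #include <cstdint>

    // Scalar sketch of one element: the xor inverts the comparison only where
    // the mask is all-ones (the low two lanes), so the blend picks the minimum
    // there and the maximum in the high two lanes.
    int64_t minmax_scalar(int64_t v1, int64_t v2, bool lower_lane) {
        bool comp = v1 > v2;                  // _mm256_cmpgt_epi64
        bool sel = lower_lane ? !comp : comp; // xor with the all-ones/zero mask
        return sel ? v1 : v2;                 // _mm256_blendv_epi8 picks v1 where sel is set
    }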
Quuxplusone commented 4 years ago
define <4 x i64> @_Z6minmaxDv4_xS_(<4 x i64> %0, <4 x i64> %1) {
  %3 = icmp sgt <4 x i64> %0, %1
  %4 = xor <4 x i1> %3, <i1 true, i1 true, i1 false, i1 false>
  %5 = select <4 x i1> %4, <4 x i64> %0, <4 x i64> %1
  ret <4 x i64> %5
}
Quuxplusone commented 4 years ago
(In reply to Simon Pilgrim from comment #1)
> define <4 x i64> @_Z6minmaxDv4_xS_(<4 x i64> %0, <4 x i64> %1) {
>   %3 = icmp sgt <4 x i64> %0, %1
>   %4 = xor <4 x i1> %3, <i1 true, i1 true, i1 false, i1 false>
So basically the code that handles materialization of all-ones constants
as pcmpeq needs to be taught that if the lower portion is all-ones and the rest
is zeros, it might still be profitable, I'm guessing?

>   %5 = select <4 x i1> %4, <4 x i64> %0, <4 x i64> %1
>   ret <4 x i64> %5
> }
Quuxplusone commented 4 years ago

We might need to improve PromoteMaskArithmetic to better handle selects.

CC'ing Florian who did a load of improvements in D72524.

Quuxplusone commented 4 years ago
(In reply to Roman Lebedev from comment #2)
> So basically the code that handles materialization of all-ones constants
> as pcmpeq needs to be taught that if the lower portion is all-ones and the rest
> is zeros, it might still be profitable, I'm guessing?

[Bug #42653] discusses something similar for rematerializable lower 'allones'
subvector masks, once we avoid the unnecessary packss/pmovsx.
Quuxplusone commented 4 years ago
rGfe6f5ba0bffd - added test case
rGb8a725274c22 - fixed PACKSS promotion issues

Current AVX2 Codegen:

.LCPI0_0:
        .quad   1                       # 0x1
        .quad   1                       # 0x1
        .quad   0                       # 0x0
        .quad   0                       # 0x0
_Z6minmaxDv4_xS_:                       # @_Z6minmaxDv4_xS_
        vpcmpgtq        %ymm1, %ymm0, %ymm2
        vpxor   .LCPI0_0(%rip), %ymm2, %ymm2
        vpsllq  $63, %ymm2, %ymm2
        vblendvpd       %ymm2, %ymm0, %ymm1, %ymm0
        retq
Quuxplusone commented 4 years ago
(In reply to Simon Pilgrim from comment #5)
> rGfe6f5ba0bffd - added test case
> rGb8a725274c22 - fixed PACKSS promotion issues
>
> Current AVX2 Codegen:
>
> .LCPI0_0:
>         .quad   1                       # 0x1
>         .quad   1                       # 0x1
>         .quad   0                       # 0x0
>         .quad   0                       # 0x0
> _Z6minmaxDv4_xS_:                       # @_Z6minmaxDv4_xS_
>         vpcmpgtq        %ymm1, %ymm0, %ymm2
Produces all-ones (-1) or all-zeros (0) in each 64-bit lane
>         vpxor   .LCPI0_0(%rip), %ymm2, %ymm2
Inverts only the lowest bit, and only in the two low lanes where the constant is 1
>         vpsllq  $63, %ymm2, %ymm2
Moves the lowest bit into the highest (sign) bit of each lane
>         vblendvpd       %ymm2, %ymm0, %ymm1, %ymm0
Uses the highest bit of each lane to control the blend

Can't we get rid of the vpsllq by using -1 instead of 1 in xor?

>         retq
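
A scalar illustration of why xoring with -1 instead of 1 would make the vpsllq unnecessary (the values below are illustrative, not from the thread):

    #include <cstdint>
    #include <cstdio>

    int main() {
        // Each vpcmpgtq lane is either all-ones (true) or all-zeros (false).
        const uint64_t lane_true = ~0ULL, lane_false = 0ULL;

        // XOR with 1 flips only bit 0, so in the low lanes the sign bit that
        // vblendvpd reads is still the un-inverted comparison; vpsllq $63 is
        // needed to move the flipped bit 0 into the sign position.
        printf("xor  1: %016llx %016llx\n",
               (unsigned long long)(lane_true ^ 1ULL),
               (unsigned long long)(lane_false ^ 1ULL));

        // XOR with -1 inverts the whole lane, so the sign bit is already the
        // inverted comparison result and the shift can be dropped.
        printf("xor -1: %016llx %016llx\n",
               (unsigned long long)(lane_true ^ ~0ULL),
               (unsigned long long)(lane_false ^ ~0ULL));
        return 0;
    }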
Quuxplusone commented 4 years ago

That's the next step - there is plenty of code that tries to do that kind of thing - and nearly all of it ignores vectors :-)

Quuxplusone commented 4 years ago

Initial patch: https://reviews.llvm.org/D82257

This makes sure we're using -1/0 sign masks, but it doesn't yet materialize the constant using VPCMPEQ xmm (with implicit zeroing of the upper elements).
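
For reference, a sketch of what that materialization looks like at the source level. The expected instructions in the comments assume the standard AVX behavior that a VEX-encoded write to an XMM register zeroes bits 255:128 of the corresponding YMM register; they are my reading, not output from the patch:

    #include <immintrin.h>

    // The lower-all-ones / upper-all-zeros mask should not need a constant-pool
    // load: an all-ones XMM value can be materialized with a compare, and the
    // zero-extension to 256 bits is free because the XMM write already zeroes
    // the upper half of the YMM register.
    static inline __m256i lower_ones_mask(void) {
        __m128i ones = _mm_set1_epi8(-1);     // ideally: vpcmpeqd %xmm0, %xmm0, %xmm0
        return _mm256_zextsi128_si256(ones);  // ideally: no extra instruction needed
    }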

Quuxplusone commented 4 years ago
Current AVX2 Codegen:

.LCPI0_0:
        .quad   -1                      # 0xffffffffffffffff
        .quad   -1                      # 0xffffffffffffffff
        .quad   0                       # 0x0
        .quad   0                       # 0x0
_Z6minmaxDv4_xS_:                       # @_Z6minmaxDv4_xS_
        vpcmpgtq        %ymm1, %ymm0, %ymm2
        vpxor   .LCPI0_0(%rip), %ymm2, %ymm2
        vblendvpd       %ymm2, %ymm0, %ymm1, %ymm0
        retq