Quuxplusone / LLVMBugzillaTest


Missed simplifications for PTEST patterns #41005

Closed Quuxplusone closed 4 years ago

Quuxplusone commented 5 years ago
Bugzilla Link PR42035
Status RESOLVED FIXED
Importance P enhancement
Reported by David Bolvansky (david.bolvansky@gmail.com)
Reported on 2019-05-27 10:21:31 -0700
Last modified on 2020-05-28 06:13:24 -0700
Version trunk
Hardware PC Linux
CC craig.topper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
Fixed by commit(s) rG410667f1b74c
See also PR38788, PR38522
As mentioned in https://mattkretz.github.io/2019/05/27/vectorized-conversion-from-utf8-using-stdx-simd.html and the related issue filed on GCC's bugzilla, there are opportunities to simplify PTEST patterns that neither GCC nor Clang currently optimizes.

Test cases from GCC's bugzilla:
using __v16qu [[gnu::vector_size(16)]] = unsigned char;

// test sign bit [ptest 0x808080..., x]
bool bad(__v16qu x) {
  return __builtin_ia32_ptestz128(~__v16qu(), x > 0x7f);
}

bad(unsigned char __vector(16)): // test for the sign bit, we can optimize to
(with 0x808080... at LC0)
        vpxor   xmm1, xmm1, xmm1
        vpcmpgtb        xmm0, xmm1, xmm0
        vpcmpeqd        xmm1, xmm1, xmm1
        vptest  xmm1, xmm0
        sete    al
        ret

// test for zero [ptest x, x]
bool bad2(__v16qu x) {
  return __builtin_ia32_ptestz128(~__v16qu(), x == 0);
}

bad2(unsigned char __vector(16)): //  equivalent to testing scalars for 0
        vpxor   %xmm1, %xmm1, %xmm1
        vpcmpeqb        %xmm1, %xmm0, %xmm0
        vpcmpeqd        %xmm1, %xmm1, %xmm1
        vptest  %xmm0, %xmm1
        sete    %al
        ret

// test for certain bits [ptest x, k]
bool bad3(__v16qu x, __v16qu k) {
  return __builtin_ia32_ptestz128(~__v16qu(), (x & k) == 0);
}

bad3(unsigned char __vector(16), unsigned char __vector(16)):
        vpand   xmm0, xmm1, xmm0
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqb        xmm0, xmm0, xmm1
        vpcmpeqd        xmm1, xmm1, xmm1
        vptest  xmm1, xmm0
        sete    al
        ret

With the above transformation we already get PTEST(x&k, x&k), which can then be reduced to PTEST(x, k).
Quuxplusone commented 5 years ago

Current codegen https://godbolt.org/z/7x53rr

Quuxplusone commented 4 years ago
(In reply to David Bolvansky from comment #0)
> // test for zero [ptest x, x]
> bool bad2(__v16qu x) {
>   return __builtin_ia32_ptestz128(~__v16qu(), x == 0);
> }
>
> bad2(unsigned char __vector(16)): //  equivalent to testing scalars for 0
>         vpxor   %xmm1, %xmm1, %xmm1
>         vpcmpeqb        %xmm1, %xmm0, %xmm0
>         vpcmpeqd        %xmm1, %xmm1, %xmm1
>         vptest  %xmm0, %xmm1
>         sete    %al
>         ret

I'm not sure this proposal is correct.

__builtin_ia32_ptestz128(~__v16qu(), x == 0) performs:

if any of the elements of <16 x i8> x == 0 (elt result == -1) then return 0
if all of the elements of <16 x i8> x != 0 (elt result ==  0) then return 1

__builtin_ia32_ptestz128(x, x) performs:

if any of the 128 bits of x is set then return 0
if all of the 128 bits of x are clear then return 1
Quuxplusone commented 4 years ago
(In reply to David Bolvansky from comment #0)
> using __v16qu [[gnu::vector_size(16)]] = unsigned char;
>
> // test sign bit [ptest 0x808080..., x]
> bool bad(__v16qu x) {
>   return __builtin_ia32_ptestz128(~__v16qu(), x > 0x7f);
> }

For sign bit only testing we're probably better off using movmsk:

https://godbolt.org/z/ziJUst
Quuxplusone commented 4 years ago

Yeah, movmsk should be a better choice.

Quuxplusone commented 4 years ago

Resolving - converting the sign bit variant to movmsk landed at D80563/rG410667f1b74c

As discussed, nothing else seems to be viable.