Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Missed optimization opportunity: transform comparisons with constants to a mask test #39582

Open Quuxplusone opened 5 years ago

Quuxplusone commented 5 years ago
Bugzilla Link PR40611
Status NEW
Importance P enhancement
Reported by Nikita Kniazev (nok.raven@gmail.com)
Reported on 2019-02-05 08:58:36 -0800
Last modified on 2019-11-19 10:35:20 -0800
Version trunk
Hardware All All
CC david.green@arm.com, llvm-bugs@lists.llvm.org, neeilans@live.com, richard-llvm@metafoo.co.uk, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also PR38002, PR40750
LLVM does not recognize that:

bool zot(unsigned i) {
    return i == 0 || i == 2;
}

Can be rewritten as:

bool zot_handopt(unsigned i) {
    return !(~2u & i);
}

https://rise4fun.com/Alive/PXsF

With more values the optimization even more useful:

bool ztfos(unsigned i) {
    return i == 0 || i == 2 || i == 4 || i == 6;
}
bool ztfos_handopt(unsigned i) {
    return !(~6u & i);
}

More complex example that greatly benefits from such optimization:

bool is_sign(char c) {
    return c == '-' || c == '+';
}
bool is_sign_handopt(char c) {
    unsigned char i = c;
    i -= '+';
    return !(~2u & i);
}

https://rise4fun.com/Alive/nKIp

GCC and MSVC recognize and optimize all the examples above
https://godbolt.org/z/_Pmy0S
The problem is not target specific (checked x86, x86-64, ppc32, ppc64, arm, and
arm64).
Quuxplusone commented 5 years ago
We have an IR transform for the more specific power-of-2-minus-1 constant
version of this pattern, but we're missing these more general ones:

Name: cmp eq 0
  %a = or i32 %x, C1
  %r = icmp eq i32 %a, C1
=>
  %b = and i32 %x, ~C1
  %r = icmp eq i32 %b, 0

Name: cmp ne 0
  %a = or i32 %x, C1
  %r = icmp ne i32 %a, C1
=>
  %b = and i32 %x, ~C1
  %r = icmp ne i32 %b, 0

https://rise4fun.com/Alive/9sl
Quuxplusone commented 5 years ago
Should be partly fixed with:
https://reviews.llvm.org/rL353313

The other 2 problems are independent and not as simple at first look.
Quuxplusone commented 5 years ago
I am sorry if I am wrong (I have zero knowledge in writing compiler
optimizations), but it seems to me that such optimization will solve the third
case:

Pre: isPowerOf2(C1 ^ C2)
  %a = icmp eq i8 %x, C1
  %b = icmp eq i8 %x, C2
  %r = or i1 %a, %b
=>
  %y = add i8 %x, -umin(C1, C2)
  %z = and i8 %y, ~abs(C1 - C2)
  %r = icmp eq i8 %z, 0

https://rise4fun.com/Alive/q6N
Quuxplusone commented 5 years ago
(In reply to Nikita Kniazev from comment #3)
> I am sorry if I am wrong (I have zero knowledge in writing compiler
> optimizations), but it seems to me that such optimization will solve the
> third case:
>
> Pre: isPowerOf2(C1 ^ C2)
>   %a = icmp eq i8 %x, C1
>   %b = icmp eq i8 %x, C2
>   %r = or i1 %a, %b
> =>
>   %y = add i8 %x, -umin(C1, C2)
>   %z = and i8 %y, ~abs(C1 - C2)
>   %r = icmp eq i8 %z, 0
>
> https://rise4fun.com/Alive/q6N

Nice! Thanks for solving that precondition generally.

So the tricky part is: where should we make this transform in the compiler?
For IR canonicalization (not necessarily optimization), it depends on how we
think the IR will interact with other IR transforms. I am leaning towards the
original (2 icmp) form there because we have a huge number of
transforms/analysis centered on icmp.

If we agree on that, then the transform will go in the backend (codegen). But
there's another question about that: is the transform universally good?

We know it does well on x86. How about AArch64 or other targets? If it's only
useful for x86, we can add a transform for x86 only.

If we have evidence that it looks better on multiple targets, then we want to
add the transform generically in DAGCombiner.

But if there is any target where it does not look better, then we have to
provide an opt-out to avoid the transform or reverse the transform.
Quuxplusone commented 5 years ago
I am not an expert in any particular architecture, but it seems to me that
multiple targets will benefit from the optimization (I do not know if there any
architecture that allows you compare a value to multiple constants). From what
I can test myself: GCC does that kind of an optimization at least for x86(-64)
and AArch. Other targets I can judge mostly by reduced amount of instructions
(MIPS, SPARC) and eliminated branches (PowerPC).

AArch32/AArch64 (+Thumb)  https://godbolt.org/z/38Z5Ci
MIPS/MIPS64               https://godbolt.org/z/6DCl1b
PowerPC/PowerPC64         https://godbolt.org/z/T-hH2K
SPARC/SPARCv9             https://godbolt.org/z/TDclNV
Quuxplusone commented 5 years ago
(In reply to Nikita Kniazev from comment #5)
> I am not an expert in any particular architecture, but it seems to me that
> multiple targets will benefit from the optimization (I do not know if there
> any architecture that allows you compare a value to multiple constants).
> From what I can test myself: GCC does that kind of an optimization at least
> for x86(-64) and AArch. Other targets I can judge mostly by reduced amount
> of instructions (MIPS, SPARC) and eliminated branches (PowerPC).
>
> AArch32/AArch64 (+Thumb)  https://godbolt.org/z/38Z5Ci
> MIPS/MIPS64               https://godbolt.org/z/6DCl1b
> PowerPC/PowerPC64         https://godbolt.org/z/T-hH2K
> SPARC/SPARCv9             https://godbolt.org/z/TDclNV

Thanks - that suggests that we can do this in DAGCombiner without too much
trouble. I'll start a patch and see how it looks.
Quuxplusone commented 5 years ago

Note that for every one of these boolean bit hacks, there should be a 'not' (DeMorgan) pattern that we need to match too. Something like this:

https://godbolt.org/z/PuiCOM

I need to go back to the earlier patch and see if we are handling that correctly.

Quuxplusone commented 5 years ago
It turns out we already have a target hook to decide when to do similar
transforms (should've remembered because I'm the one that added it :) ), so we
just need to match the right pattern.

I don't think this is the right pre-condition:
isPowerOf2(C1 ^ C2)
That doesn't match the pattern from the description where the constants are
43/45.

https://rise4fun.com/Alive/XslKj ?
Quuxplusone commented 5 years ago

The precondition should be isPowerOf2(umax(C0, C1) - umin(C0, C1))

https://rise4fun.com/Alive/vMDI

Quuxplusone commented 5 years ago
(In reply to Nikita Kniazev from comment #9)
> The precondition should be `isPowerOf2(umax(C0, C1) - umin(C0, C1))`
>
> https://rise4fun.com/Alive/vMDI

I think that's the same as what I wrote except I forced C0 to be the larger
constant to simplify things.
Quuxplusone commented 5 years ago

I think that's the same as what I wrote except I forced C0 to be the larger constant to simplify things.

Yeah. I just showed what I actually wanted to achieve in the original post. (I tried isPowerOf2(abs(C1 - C2)), but it did not work well, because it looks like abs output can be treated as negative value and isPowerOf2 works for negative values too).

I found GCC optimizations for the cases above. Here is the commit if you want too look at it https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=203627 (I do not know if it is allowed to borrow something from GCC, or even to post descriptions from the code here on bug tracker).

Quuxplusone commented 5 years ago

https://reviews.llvm.org/rL353859

cc'ing Dave Green because this is guarded by a TLI hook that probably should be overridden for AArch64 (it's already overridden for ARM).

So if I got that right, we're down to just the 2nd example from the problem description...and that's a tough one because it's not a constant-time pattern match. We want to be able to squash an arbitrary number of compare constants into bit magic. That might have to go in AggressiveInstCombine.

Quuxplusone commented 5 years ago
As mentioned in bug 38002 -- those patterns in switches lack optimization:

define i1 @foo(i8 %0) {
  switch i8 %0, label %2 [
    i8 43, label %3
    i8 45, label %3
  ]

2:
  br label %3

3:
  %4 = phi i1 [ false, %2 ], [ true, %1 ], [ true, %1 ]
  ret i1 %4
}

https://godbolt.org/z/MbbB3d

This is very unfortunate, because they emerge from a more generic C++ code
where loops over static arrays are unrolled, like:
  bool foo(char x) {
    char const m[2] = {'+', '-'};
    for (auto c : m)
      if (x == c)
        return true;
    return false;
  }

(GCC is fine with them https://godbolt.org/z/_rhFhP)