Open Quuxplusone opened 5 years ago
Bugzilla Link | PR40611 |
Status | NEW |
Importance | P enhancement |
Reported by | Nikita Kniazev (nok.raven@gmail.com) |
Reported on | 2019-02-05 08:58:36 -0800 |
Last modified on | 2019-11-19 10:35:20 -0800 |
Version | trunk |
Hardware | All All |
CC | david.green@arm.com, llvm-bugs@lists.llvm.org, neeilans@live.com, richard-llvm@metafoo.co.uk, spatel+llvm@rotateright.com |
Fixed by commit(s) | |
Attachments | |
Blocks | |
Blocked by | |
See also | PR38002, PR40750 |
We have an IR transform for the more specific power-of-2-minus-1 constant
version of this pattern, but we're missing these more general ones:
Name: cmp eq 0
%a = or i32 %x, C1
%r = icmp eq i32 %a, C1
=>
%b = and i32 %x, ~C1
%r = icmp eq i32 %b, 0
Name: cmp ne 0
%a = or i32 %x, C1
%r = icmp ne i32 %a, C1
=>
%b = and i32 %x, ~C1
%r = icmp ne i32 %b, 0
https://rise4fun.com/Alive/9sl
Should be partly fixed with:
https://reviews.llvm.org/rL353313
The other 2 problems are independent and not as simple at first look.
I am sorry if I am wrong (I have zero knowledge in writing compiler
optimizations), but it seems to me that such optimization will solve the third
case:
Pre: isPowerOf2(C1 ^ C2)
%a = icmp eq i8 %x, C1
%b = icmp eq i8 %x, C2
%r = or i1 %a, %b
=>
%y = add i8 %x, -umin(C1, C2)
%z = and i8 %y, ~abs(C1 - C2)
%r = icmp eq i8 %z, 0
https://rise4fun.com/Alive/q6N
(In reply to Nikita Kniazev from comment #3)
> I am sorry if I am wrong (I have zero knowledge in writing compiler
> optimizations), but it seems to me that such optimization will solve the
> third case:
>
> Pre: isPowerOf2(C1 ^ C2)
> %a = icmp eq i8 %x, C1
> %b = icmp eq i8 %x, C2
> %r = or i1 %a, %b
> =>
> %y = add i8 %x, -umin(C1, C2)
> %z = and i8 %y, ~abs(C1 - C2)
> %r = icmp eq i8 %z, 0
>
> https://rise4fun.com/Alive/q6N
Nice! Thanks for solving that precondition generally.
So the tricky part is: where should we make this transform in the compiler?
For IR canonicalization (not necessarily optimization), it depends on how we
think the IR will interact with other IR transforms. I am leaning towards the
original (2 icmp) form there because we have a huge number of
transforms/analysis centered on icmp.
If we agree on that, then the transform will go in the backend (codegen). But
there's another question about that: is the transform universally good?
We know it does well on x86. How about AArch64 or other targets? If it's only
useful for x86, we can add a transform for x86 only.
If we have evidence that it looks better on multiple targets, then we want to
add the transform generically in DAGCombiner.
But if there is any target where it does not look better, then we have to
provide an opt-out to avoid the transform or reverse the transform.
I am not an expert in any particular architecture, but it seems to me that
multiple targets will benefit from the optimization (I do not know if there any
architecture that allows you compare a value to multiple constants). From what
I can test myself: GCC does that kind of an optimization at least for x86(-64)
and AArch. Other targets I can judge mostly by reduced amount of instructions
(MIPS, SPARC) and eliminated branches (PowerPC).
AArch32/AArch64 (+Thumb) https://godbolt.org/z/38Z5Ci
MIPS/MIPS64 https://godbolt.org/z/6DCl1b
PowerPC/PowerPC64 https://godbolt.org/z/T-hH2K
SPARC/SPARCv9 https://godbolt.org/z/TDclNV
(In reply to Nikita Kniazev from comment #5)
> I am not an expert in any particular architecture, but it seems to me that
> multiple targets will benefit from the optimization (I do not know if there
> any architecture that allows you compare a value to multiple constants).
> From what I can test myself: GCC does that kind of an optimization at least
> for x86(-64) and AArch. Other targets I can judge mostly by reduced amount
> of instructions (MIPS, SPARC) and eliminated branches (PowerPC).
>
> AArch32/AArch64 (+Thumb) https://godbolt.org/z/38Z5Ci
> MIPS/MIPS64 https://godbolt.org/z/6DCl1b
> PowerPC/PowerPC64 https://godbolt.org/z/T-hH2K
> SPARC/SPARCv9 https://godbolt.org/z/TDclNV
Thanks - that suggests that we can do this in DAGCombiner without too much
trouble. I'll start a patch and see how it looks.
Note that for every one of these boolean bit hacks, there should be a 'not' (DeMorgan) pattern that we need to match too. Something like this:
I need to go back to the earlier patch and see if we are handling that correctly.
It turns out we already have a target hook to decide when to do similar
transforms (should've remembered because I'm the one that added it :) ), so we
just need to match the right pattern.
I don't think this is the right pre-condition:
isPowerOf2(C1 ^ C2)
That doesn't match the pattern from the description where the constants are
43/45.
https://rise4fun.com/Alive/XslKj ?
The precondition should be isPowerOf2(umax(C0, C1) - umin(C0, C1))
(In reply to Nikita Kniazev from comment #9)
> The precondition should be `isPowerOf2(umax(C0, C1) - umin(C0, C1))`
>
> https://rise4fun.com/Alive/vMDI
I think that's the same as what I wrote except I forced C0 to be the larger
constant to simplify things.
I think that's the same as what I wrote except I forced C0 to be the larger constant to simplify things.
Yeah. I just showed what I actually wanted to achieve in the original post.
(I tried isPowerOf2(abs(C1 - C2))
, but it did not work well, because it
looks like abs
output can be treated as negative value and isPowerOf2
works for negative values too).
I found GCC optimizations for the cases above. Here is the commit if you want too look at it https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=203627 (I do not know if it is allowed to borrow something from GCC, or even to post descriptions from the code here on bug tracker).
https://reviews.llvm.org/rL353859
cc'ing Dave Green because this is guarded by a TLI hook that probably should be overridden for AArch64 (it's already overridden for ARM).
So if I got that right, we're down to just the 2nd example from the problem description...and that's a tough one because it's not a constant-time pattern match. We want to be able to squash an arbitrary number of compare constants into bit magic. That might have to go in AggressiveInstCombine.
As mentioned in bug 38002 -- those patterns in switches lack optimization:
define i1 @foo(i8 %0) {
switch i8 %0, label %2 [
i8 43, label %3
i8 45, label %3
]
2:
br label %3
3:
%4 = phi i1 [ false, %2 ], [ true, %1 ], [ true, %1 ]
ret i1 %4
}
https://godbolt.org/z/MbbB3d
This is very unfortunate, because they emerge from a more generic C++ code
where loops over static arrays are unrolled, like:
bool foo(char x) {
char const m[2] = {'+', '-'};
for (auto c : m)
if (x == c)
return true;
return false;
}
(GCC is fine with them https://godbolt.org/z/_rhFhP)