Open Quuxplusone opened 3 years ago
Bugzilla Link | PR51436 |
Status | NEW |
Importance | P enhancement |
Reported by | David Bolvansky (david.bolvansky@gmail.com) |
Reported on | 2021-08-10 13:12:01 -0700 |
Last modified on | 2021-10-20 06:10:52 -0700 |
Version | trunk |
Hardware | PC Linux |
CC | lebedev.ri@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com |
Fixed by commit(s) | rGc931d35216a320465947a5a46b158cf71024458d |
Attachments | |
Blocks | |
Blocked by | |
See also | PR47437 |
(In reply to David Bolvansky from comment #0)
>
> So looks like we have some issues in the cost model.
>
>
> https://godbolt.org/z/c39qr7nvP
Not sure its really the cost model's fault - we are slightly underestimating
the scalar i64 mul cost, but the vector costs are pretty good now (we might be
able to improve it slightly for that constant case as all the lower bits are
known zero).
I'm more sceptical of why we replaced a sub+shift with a mul+add.
(In reply to Simon Pilgrim from comment #1)
> I'm more sceptical of why we replaced a sub+shift with a mul+add.
This is the transform in question:
define i64 @_Z3fooy(i64 %0) {
%2 = shl i64 %0, 44
%3 = sub nuw nsw i64 -17592186044416, %2
ret i64 %3
}
define i64 @_Z3fooy(i64 %0) {
%2 = mul i64 %0, -17592186044416
%3 = add i64 %2, -17592186044416
ret i64 %3
}
It's part of InstCombine's Negator functionality. This is probably the commit
where it was enabled:
https://github.com/llvm/llvm-project/commit/f5df5cd5586ae9cfb2d9e53704dfc76f47aff149
IIRC, the reasoning was that we'd be better off canonicalizing more
aggressively to 'add' than 'sub', and we could reassociate/decompose our way
out of 'mul' problems.
But I'm not sure how we'd fix this vectorization example. The cost model is
telling the truth - we didn't undo the multiplication in codegen.
Even if we add that transform in the backend, it's too late for vectorization
(unless we have the cost model mimic whatever logic is needed to see that this
pattern is better as a sub+shift).
@lebedev.ri - any ideas?
Either/or:
* the cost model should say that multiplication by a negative power of two has
the cost of negation of left-shift
* vectorizers should know to expand and costmodel that pattern themselves
(In reply to Roman Lebedev from comment #3)
> Either/or:
> * the cost model should say that multiplication by a negative power of two
> has the cost of negation of left-shift
> * vectorizers should know to expand and costmodel that pattern themselves
OK, I'll deal with this next week when I'm back from PTO.
Not sure if it changes anything, but we are missing the canonicalization of shl
with negate:
https://alive2.llvm.org/ce/z/-WUDN5
Do we prefer to push the negation down or pull it up?
Also note that DAGCombiner does have a generic unfold of the multiply that
seems to have been here since the dawn of LLVM:
https://github.com/llvm/llvm-project/blob/b97afc9dc0e96116b1369e4dbbcf9ada698b8802/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L3839
...but it doesn't fire on the 64-bit example here for x86 because we made the
big constant opaque since it has multiple uses.
(In reply to Sanjay Patel from comment #5)
> Not sure if it changes anything, but we are missing the canonicalization of
> shl with negate:
> https://alive2.llvm.org/ce/z/-WUDN5
>
> Do we prefer to push the negation down or pull it up?
We should almost always hoist it, because Negator can only sink negations,
so if it could have been sunk, Negator would have taken care of it already,
but now that it has been hoisted, it might manage to tuck it away elsewhere.
It's pretty ugly that we'll have to smear such hoisting code all over
InstCombine, but generally we are not okay with looking upwards the use chain
(def->use), so i'm not sure if we'd be fine with going against the grain for
this?
> Also note that DAGCombiner does have a generic unfold of the multiply that
> seems to have been here since the dawn of LLVM:
> https://github.com/llvm/llvm-project/blob/
> b97afc9dc0e96116b1369e4dbbcf9ada698b8802/llvm/lib/CodeGen/SelectionDAG/
> DAGCombiner.cpp#L3839
>
> ...but it doesn't fire on the 64-bit example here for x86 because we made
> the big constant opaque since it has multiple uses.
rGc931d35216a increased the throughput cost of i64 multiplies to a more realistic 2cy (instead of 1cy) which fixes this on at least AVX1+ targets where the v2i64 mul cost is pretty low.
To support older targets we're going to either have to split some scalar arithmetic costs based on age (we currently only split on i386-vs-x64), or hint that we're multiplying by negative-pow2.
TTI::OperandValueProperties already hints if we have a pow2 value so that shouldn't be difficult, although we don't do a very good idea of feeding accurate OperandValueProperties values in most cost calls.
(In reply to Simon Pilgrim from comment #7)
>
> TTI::OperandValueProperties already hints if we have a pow2 value so that
> shouldn't be difficult, although we don't do a very good idea of feeding
> accurate OperandValueProperties values in most cost calls.
I've attempted to add negpow2 support: https://reviews.llvm.org/D111968
(In reply to Simon Pilgrim from comment #8)
> (In reply to Simon Pilgrim from comment #7)
> >
> > TTI::OperandValueProperties already hints if we have a pow2 value so that
> > shouldn't be difficult, although we don't do a very good idea of feeding
> > accurate OperandValueProperties values in most cost calls.
>
> I've attempted to add negpow2 support: https://reviews.llvm.org/D111968
This negpow2 patch has been expanded and now fixes this bug directly.