Missed vectorization due to recently added transformation

Quuxplusone commented 3 years ago


Bugzilla Link	PR51436
Status	NEW
Importance	P enhancement
Reported by	David Bolvansky (david.bolvansky@gmail.com)
Reported on	2021-08-10 13:12:01 -0700
Last modified on	2021-10-20 06:10:52 -0700
Version	trunk
Hardware	PC Linux
CC	lebedev.ri@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
Fixed by commit(s)	rGc931d35216a320465947a5a46b158cf71024458d
Attachments
Blocks
Blocked by
See also	PR47437

unsigned long long foo(unsigned long long a)
{
    return (1048575 - a) << 44;
}
void bar(unsigned long long *a)
{
    for (int i = 0; i < 8; i++)
        a[i] = foo(a[i]);
}

bar is no longer vectorized with LLVM 12+ since that pattern is now rewriten as
mulplication and addition.

LLVM 11:
foo(unsigned long long): # @foo(unsigned long long)
  shl rdi, 44
  movabs rax, -17592186044416
  sub rax, rdi
  ret

LLVM 12
foo(unsigned long long):                                # @foo(unsigned long
long)
        movabs  rax, -17592186044416
        imul    rdi, rax
        add     rax, rdi
        ret

LLVM 11:
remark: vectorized loop (vectorization width: 2, interleaved count: 2) [-
Rpass=loop-vectorize]
    for (int i = 0; i < N; i++)

LLVM 12:
remark: the cost-model indicates that vectorization is not beneficial [-Rpass-
missed=loop-vectorize]
    for (int i = 0; i < N; i++)

So looks like we have some issues in the cost model.

https://godbolt.org/z/c39qr7nvP

Quuxplusone commented 3 years ago

(In reply to David Bolvansky from comment #0)
>
> So looks like we have some issues in the cost model.
>
>
> https://godbolt.org/z/c39qr7nvP

Not sure its really the cost model's fault - we are slightly underestimating
the scalar i64 mul cost, but the vector costs are pretty good now (we might be
able to improve it slightly for that constant case as all the lower bits are
known zero).

I'm more sceptical of why we replaced a sub+shift with a mul+add.

Quuxplusone commented 3 years ago

(In reply to Simon Pilgrim from comment #1)
> I'm more sceptical of why we replaced a sub+shift with a mul+add.

This is the transform in question:

define i64 @_Z3fooy(i64 %0) {
  %2 = shl i64 %0, 44
  %3 = sub nuw nsw i64 -17592186044416, %2
  ret i64 %3
}

define i64 @_Z3fooy(i64 %0) {
  %2 = mul i64 %0, -17592186044416
  %3 = add i64 %2, -17592186044416
  ret i64 %3
}

It's part of InstCombine's Negator functionality. This is probably the commit
where it was enabled:
https://github.com/llvm/llvm-project/commit/f5df5cd5586ae9cfb2d9e53704dfc76f47aff149

IIRC, the reasoning was that we'd be better off canonicalizing more
aggressively to 'add' than 'sub', and we could reassociate/decompose our way
out of 'mul' problems.

But I'm not sure how we'd fix this vectorization example. The cost model is
telling the truth - we didn't undo the multiplication in codegen.

Even if we add that transform in the backend, it's too late for vectorization
(unless we have the cost model mimic whatever logic is needed to see that this
pattern is better as a sub+shift).

@lebedev.ri - any ideas?

Quuxplusone commented 3 years ago

Either/or:
* the cost model should say that multiplication by a negative power of two has
the cost of negation of left-shift
* vectorizers should know to expand and costmodel that pattern themselves

Quuxplusone commented 3 years ago

(In reply to Roman Lebedev from comment #3)
> Either/or:
> * the cost model should say that multiplication by a negative power of two
> has the cost of negation of left-shift
> * vectorizers should know to expand and costmodel that pattern themselves

OK, I'll deal with this next week when I'm back from PTO.

Quuxplusone commented 3 years ago

Not sure if it changes anything, but we are missing the canonicalization of shl
with negate:
https://alive2.llvm.org/ce/z/-WUDN5

Do we prefer to push the negation down or pull it up?

Also note that DAGCombiner does have a generic unfold of the multiply that
seems to have been here since the dawn of LLVM:
https://github.com/llvm/llvm-project/blob/b97afc9dc0e96116b1369e4dbbcf9ada698b8802/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L3839

...but it doesn't fire on the 64-bit example here for x86 because we made the
big constant opaque since it has multiple uses.

Quuxplusone commented 3 years ago

(In reply to Sanjay Patel from comment #5)
> Not sure if it changes anything, but we are missing the canonicalization of
> shl with negate:
> https://alive2.llvm.org/ce/z/-WUDN5
>
> Do we prefer to push the negation down or pull it up?
We should almost always hoist it, because Negator can only sink negations,
so if it could have been sunk, Negator would have taken care of it already,
but now that it has been hoisted, it might manage to tuck it away elsewhere.

It's pretty ugly that we'll have to smear such hoisting code all over
InstCombine, but generally we are not okay with looking upwards the use chain
(def->use), so i'm not sure if we'd be fine with going against the grain for
this?

> Also note that DAGCombiner does have a generic unfold of the multiply that
> seems to have been here since the dawn of LLVM:
> https://github.com/llvm/llvm-project/blob/
> b97afc9dc0e96116b1369e4dbbcf9ada698b8802/llvm/lib/CodeGen/SelectionDAG/
> DAGCombiner.cpp#L3839
>
> ...but it doesn't fire on the 64-bit example here for x86 because we made
> the big constant opaque since it has multiple uses.

Quuxplusone commented 3 years ago

rGc931d35216a increased the throughput cost of i64 multiplies to a more realistic 2cy (instead of 1cy) which fixes this on at least AVX1+ targets where the v2i64 mul cost is pretty low.

To support older targets we're going to either have to split some scalar arithmetic costs based on age (we currently only split on i386-vs-x64), or hint that we're multiplying by negative-pow2.

TTI::OperandValueProperties already hints if we have a pow2 value so that shouldn't be difficult, although we don't do a very good idea of feeding accurate OperandValueProperties values in most cost calls.

Quuxplusone commented 3 years ago

(In reply to Simon Pilgrim from comment #7)
>
> TTI::OperandValueProperties already hints if we have a pow2 value so that
> shouldn't be difficult, although we don't do a very good idea of feeding
> accurate OperandValueProperties values in most cost calls.

I've attempted to add negpow2 support: https://reviews.llvm.org/D111968

Quuxplusone commented 3 years ago

(In reply to Simon Pilgrim from comment #8)
> (In reply to Simon Pilgrim from comment #7)
> >
> > TTI::OperandValueProperties already hints if we have a pow2 value so that
> > shouldn't be difficult, although we don't do a very good idea of feeding
> > accurate OperandValueProperties values in most cost calls.
>
> I've attempted to add negpow2 support: https://reviews.llvm.org/D111968

This negpow2 patch has been expanded and now fixes this bug directly.

Quuxplusone / LLVMBugzillaTest

Missed vectorization due to recently added transformation #50403