[InstCombine] canonicalize IR to funnel shift intrinsics #36390

Open Quuxplusone opened 6 years ago

Quuxplusone commented 6 years ago
Bugzilla Link PR37417
Status NEW
Importance P enhancement
Reported by Sanjay Patel (spatel+llvm@rotateright.com)
Reported on 2018-05-11 09:04:30 -0700
Last modified on 2020-10-25 11:26:31 -0700
Version trunk
Hardware PC All
CC lebedev.ri@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, oneill+llvmbugs@cs.hmc.edu
Blocked by PR46896
See also PR37387, PR38152, PR39624, PR34924, PR15880, PR20750, PR35155
Forking this off from bug 37387.

LICM (I think) may split a rotate pattern in IR across blocks. Once that
happens, DAGCombiner::MatchRotate() can't put the pieces back together again.
CodeGenPrepare must reverse LICM to make the rotate visible to the DAG when the
target has a legal/custom ROTL/ROTR node.

Here's a basic example (we should verify that the unrolled case works too):

void rotateInLoop(unsigned *x, unsigned N, unsigned *a, int b) {
  for (unsigned i = 0; i < N; ++i)
    x[ (a[i] >> b) | (a[i] << (32 - b)) ] = i; // shift amt is loop invariant
}

$ ./clang -O2 ror.c -S -o -  -fno-unroll-loops -emit-llvm
...
define void @rotateInLoop(i32* nocapture %x, i32 %N, i32* nocapture readonly %a, i32 %b) {
entry:
  %cmp12 = icmp eq i32 %N, 0
  br i1 %cmp12, label %for.cond.cleanup, label %for.body.lr.ph

for.body.lr.ph:
  %sub = sub nsw i32 32, %b  <--- this should be moved back into the loop
  %wide.trip.count = zext i32 %N to i64
  br label %for.body

for.cond.cleanup:
  ret void

for.body:
  %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.body ]
  %arrayidx = getelementptr inbounds i32, i32* %a, i64 %indvars.iv
  %0 = load i32, i32* %arrayidx, align 4, !tbaa !3
  %shr = lshr i32 %0, %b
  %shl = shl i32 %0, %sub
  %or = or i32 %shr, %shl
  %idxprom3 = zext i32 %or to i64
  %arrayidx4 = getelementptr inbounds i32, i32* %x, i64 %idxprom3
  %1 = trunc i64 %indvars.iv to i32
  store i32 %1, i32* %arrayidx4, align 4, !tbaa !3
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
  br i1 %exitcond, label %for.cond.cleanup, label %for.body
}
Quuxplusone commented 6 years ago
If we add a rotate intrinsic:
http://lists.llvm.org/pipermail/llvm-dev/2018-May/123292.html
...then we shouldn't need this chunk in CGP.
Quuxplusone commented 5 years ago
Please also consider the UB-avoiding pattern:
(x << (n % width)) | (x >> (-n % width))

And how does this interact with vectorization?
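
For reference, that UB-avoiding rotate in C looks like this (a minimal sketch; the function name is mine, and a 32-bit unsigned is assumed):

// UB-avoiding rotate-left: both shift amounts are reduced mod the
// bit width, so n == 0 never produces a shift by 32 (UB in C).
// Note -n wraps for unsigned n, so -n % 32 == (32 - n) % 32.
unsigned rotl32(unsigned x, unsigned n) {
  return (x << (n % 32)) | (x >> (-n % 32));
}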
Quuxplusone commented 5 years ago

_Bug 20750 has been marked as a duplicate of this bug._

Quuxplusone commented 5 years ago
Apologies for not updating rotate bugs with current status. We added more
general funnel shift intrinsics to LLVM:
https://reviews.llvm.org/D49242

These can be used to represent rotates in IR and will become the canonical form
(hopefully not too far away).

Given that we have these new intrinsics, I'm going to morph this bug into a
request to canonicalize the IR shown here to the intrinsics. (There's no reason
to pursue a backend hack in CGP to accomplish the same thing.)
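
For context, the LangRef semantics of the new intrinsic can be sketched in C (my own reconstruction, not code from D49242): llvm.fshl concatenates the two operands into a double-width value, shifts left by the amount modulo the bit width, and keeps the high half; a rotate is the special case where both operands are the same value.

#include <stdint.h>

// Sketch of llvm.fshl.i32 semantics: concatenate a (high half) and
// b (low half), shift left by c mod 32, return the upper 32 bits.
// rotl(x, c) is then fshl(x, x, c).
uint32_t fshl32(uint32_t a, uint32_t b, uint32_t c) {
  uint64_t wide = ((uint64_t)a << 32) | b;
  return (uint32_t)((wide << (c % 32)) >> 32);
}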
Quuxplusone commented 5 years ago
(In reply to Trass3r from comment #2)
> And how does this interact with vectorization?

We have to enhance the vectorizers to recognize the funnel shift intrinsics to
avoid regressions:
https://reviews.llvm.org/rL346661
Quuxplusone commented 5 years ago
(In reply to Sanjay Patel from comment #5)
> (In reply to Trass3r from comment #2)
> > And how does this interact with vectorization?
>
> We have to enhance the vectorizers to recognize the funnel shift intrinsics
> to avoid regressions:
> https://reviews.llvm.org/rL346661

Getting the cost models in shape:
https://reviews.llvm.org/rL346683
https://reviews.llvm.org/rL346688

Converting narrow-able rotate patterns to minimal shift+logic ops:
https://reviews.llvm.org/rL346713
https://reviews.llvm.org/rL346716
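
As a rough illustration of a narrow-able rotate (my own example, not from those patches): C integer promotion widens an i8 rotate to i32, and the shifts can then be shrunk back to the narrow type.

// The i8 operands are promoted to int, so the straightforward IR
// rotates in i32; the linked patches shrink this back to i8 ops.
unsigned char rotl8(unsigned char x, unsigned char n) {
  unsigned s = n % 8;
  return (unsigned char)((x << s) | (x >> (-s % 8)));
}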
Quuxplusone commented 5 years ago
Bug 34924 shows a rotate variant that includes compare+branch, so we'll need to
handle that as well as a 'select' variant of that pattern:

define i32 @rot_select(i32 %x, i32 %shamt) {
  %cmp = icmp eq i32 %shamt, 0
  %shr = lshr i32 %x, %shamt
  %sub = sub i32 32, %shamt
  %shl = shl i32 %x, %sub
  %or = or i32 %shl, %shr
  %r = select i1 %cmp, i32 %x, i32 %or
  ret i32 %r
}

https://rise4fun.com/Alive/yMQ
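
For reference, that IR corresponds to C source along these lines (a hypothetical reconstruction; the shamt == 0 guard is what avoids the undefined shift by 32):

// Guarded rotate-right: the compare/select on shamt == 0 sidesteps
// the UB of shifting a 32-bit value by 32.
unsigned rot_select(unsigned x, unsigned shamt) {
  return shamt == 0 ? x : (x >> shamt) | (x << (32 - shamt));
}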
Quuxplusone commented 5 years ago
AFAIK, this is the last step that was needed before we add the transform to
produce intrinsics in instcombine:
https://reviews.llvm.org/rL346854

With this change, we reduce the pattern that includes a select into the expected
{sub/and/and/shl/lshr/or} form:
https://reviews.llvm.org/rL346807

And we have better support to reduce the IR once it is in intrinsic form with:
https://reviews.llvm.org/rL346814
Quuxplusone commented 5 years ago
Am I misreading the LangRef (and thus the IR is invalid),
or is the naive funnel shift IR not being canonicalized?
https://godbolt.org/z/W6WxrQ
Quuxplusone commented 5 years ago
(In reply to Roman Lebedev from comment #9)
> Am I misreading the LangRef (and thus the IR is invalid),
> or is the naive funnel shift IR not being canonicalized?
> https://godbolt.org/z/W6WxrQ

The IR looks good. We don't have that pattern, these alternate DAG funnel expansions:
    // fshl: (X << (Z % BW)) | (Y >> (BW - (Z % BW)))
    // fshr: (X << (BW - (Z % BW))) | (Y >> (Z % BW))

...or even the simpler rotate patterns canonicalized in IR yet. I keep thinking
we're close to done with this, but I've been perpetually too optimistic.
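
To make the hazard concrete, here is a C sketch of the first expansion (my own code, not from the backend): when Z % BW == 0 the raw formula would shift Y by the full bit width, so a guard is needed, and that guard is part of what makes these patterns awkward to match.

#include <stdint.h>

// Naive fshl expansion via two shifts. Without the n == 0 guard,
// y would be shifted right by 32, which is UB in C.
uint32_t fshl32_expanded(uint32_t x, uint32_t y, uint32_t z) {
  uint32_t n = z % 32;
  return n ? (x << n) | (y >> (32 - n)) : x;
}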
Quuxplusone commented 5 years ago
(In reply to Sanjay Patel from comment #10)
> (In reply to Roman Lebedev from comment #9)
> Am I misreading the LangRef (and thus the IR is invalid),
> or is the naive funnel shift IR not being canonicalized?
> > https://godbolt.org/z/W6WxrQ
>
> The IR looks good.
Ok, thank you for checking, that is reassuring.

> We don't have that pattern, these alternate DAG funnel expansions:
>     // fshl: (X << (Z % BW)) | (Y >> (BW - (Z % BW)))
>     // fshr: (X << (BW - (Z % BW))) | (Y >> (Z % BW))
>
> ...or even the simpler rotate patterns canonicalized in IR yet. I keep
> thinking we're close to done with this, but I've been perpetually too
> optimistic.
Quuxplusone commented 5 years ago
Copying the IR from the godbolt link here for reference. I'm guessing there are
a number of variations if the shift amount is not the same width as the other
variables (similar to the simpler rotate patterns). This is going to require some
large-scale pattern matching.

Any help in writing tests/patches is appreciated. I'm not sure when I'll get to
these. The tricky part is that we can't canonicalize the simpler UB-unsafe
sequences to the intrinsics because the DAG expansion might perform worse. This
is discussed here:
https://reviews.llvm.org/D55604?id=177858#inline-492396 and in bug 34924.

define i32 @fshl_pow2_bitwidth_using_wide_type(i32 %a, i32 %b, i32 %c) {
  %widea = zext i32 %a to i64
  %wideb = zext i32 %b to i64
  %high = shl nuw i64 %widea, 32
  %wide = or i64 %high, %wideb
  %cmodwidth = and i32 %c, 31
  %cmodwidthwide = zext i32 %cmodwidth to i64
  %shifted = shl i64 %wide, %cmodwidthwide
  %skiplowhalf = lshr i64 %shifted, 32
  %trunc = trunc i64 %skiplowhalf to i32
  ret i32 %trunc
}

define i32 @fshr_pow2_bitwidth_using_wide_type(i32 %a, i32 %b, i32 %c) {
  %widea = zext i32 %a to i64
  %wideb = zext i32 %b to i64
  %high = shl nuw i64 %widea, 32
  %wide = or i64 %high, %wideb
  %cmodwidth = and i32 %c, 31
  %cmodwidthwide = zext i32 %cmodwidth to i64
  %shifted = lshr i64 %wide, %cmodwidthwide
  %trunc = trunc i64 %shifted to i32
  ret i32 %trunc
}
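
For comparison, the fshr variant above corresponds to C along these lines (hypothetical reconstruction): the same wide concatenation, but shifting right and truncating keeps the low half.

#include <stdint.h>

// Wide-type fshr pattern: concatenate a (high) and b (low), shift
// right by c mod 32, and truncate to the low 32 bits.
uint32_t fshr32_wide(uint32_t a, uint32_t b, uint32_t c) {
  uint64_t wide = ((uint64_t)a << 32) | b;
  return (uint32_t)(wide >> (c % 32));
}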
Quuxplusone commented 5 years ago
These are the canonicalizations we have so far:

1. UB-safe shift+logic, matching types
https://reviews.llvm.org/rL350672

2. Various flavors of shift+logic with mismatched types (see the sketch after this list)
https://reviews.llvm.org/rL350419

3. Semi-UB-safe shift+logic+select, matching types
https://reviews.llvm.org/rL350199

4. Semi-UB-safe shift+logic with guard branch, matching types:
https://reviews.llvm.org/rL349396
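
As a rough sketch of the mismatched-types flavor in (2) (my own example, not from the patch): the rotate amount has a narrower type than the rotated value, so zext casts appear around the shift amounts in the IR.

#include <stdint.h>

// Mismatched-type rotate: the 8-bit amount is widened before use,
// which produces the zext casts that pattern 2 has to look through.
uint64_t rotl64_narrow_amt(uint64_t x, uint8_t n) {
  unsigned s = n % 64;
  return (x << s) | (x >> (-s % 64));
}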