Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Poor codegen for targets with no native shifts #42529

Open Quuxplusone opened 5 years ago

Quuxplusone commented 5 years ago
Bugzilla Link PR43559
Status NEW
Importance P normal
Reported by Joan Lluch (joan.lluch@icloud.com)
Reported on 2019-10-04 03:17:20 -0700
Last modified on 2019-10-12 07:06:43 -0700
Version trunk
Hardware All All
CC anton@korobeynikov.info, efriedma@quicinc.com, hfinkel@anl.gov, joan.lluch@icloud.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
This is an extension of bug 43542. This is a report of the same issue for the
following functions:

DAGCombiner::SimplifySelectCC
DAGCombiner::foldSelectCCToShiftAnd
static SDValue foldExtendedSignBitTest

Creation of shifts as a replacement of selects or extensions should be avoided
in DAGCombiner if shifts are not efficiently supported by the target.

The following IR functions are originators or the reported issue through
slightly different paths for the MSP430 target:

; Function Attrs: norecurse nounwind readnone
define dso_local i16 @test0(i16 %a) local_unnamed_addr #0 {
entry:
  %cmp = icmp slt i16 %a, 0
  %cond = select i1 %cmp, i16 -1, i16 0
  ret i16 %cond
}

; Function Attrs: norecurse nounwind readnone
define dso_local i16 @test1(i16 %a) local_unnamed_addr #0 {
entry:
  %cmp = icmp slt i16 %a, 0
  %cond = select i1 %cmp, i16 1, i16 0
  ret i16 %cond
}

; Function Attrs: norecurse nounwind readnone
define dso_local i16 @test2(i16 %a) local_unnamed_addr #0 {
entry:
  %cmp = icmp slt i16 %a, 0
  %cond = select i1 %cmp, i16 2, i16 0
  ret i16 %cond
}

; Function Attrs: norecurse nounwind readnone
define dso_local i16 @test3(i16 %a) local_unnamed_addr #0 {
entry:
  %cmp = icmp slt i16 %a, 0
  %cond = select i1 %cmp, i16 3, i16 0
  ret i16 %cond
}

The resulting code for the MSP430 is this, which is suboptimal:

    .globl  test0
    .p2align    1
    .type   test0,@function
test0:
    swpb    r12
    sxt r12
    rra r12
    rra r12
    rra r12
    rra r12
    rra r12
    rra r12
    rra r12
    ret
.Lfunc_end0:
    .size   test0, .Lfunc_end0-test0

    .globl  test1
    .p2align    1
    .type   test1,@function
test1:
    swpb    r12
    mov.b   r12, r12
    clrc
    rrc r12
    rra r12
    rra r12
    rra r12
    rra r12
    rra r12
    rra r12
    ret
.Lfunc_end1:
    .size   test1, .Lfunc_end1-test1

    .globl  test2
    .p2align    1
    .type   test2,@function
test2:
    swpb    r12
    mov.b   r12, r12
    clrc
    rrc r12
    rra r12
    rra r12
    rra r12
    rra r12
    rra r12
    and #2, r12
    ret
.Lfunc_end2:
    .size   test2, .Lfunc_end2-test2

    .globl  test3
    .p2align    1
    .type   test3,@function
test3:
    swpb    r12
    sxt r12
    rra r12
    rra r12
    rra r12
    rra r12
    rra r12
    rra r12
    rra r12
    and #3, r12
    ret
.Lfunc_end3:
    .size   test3, .Lfunc_end3-test3
Quuxplusone commented 5 years ago

We could add a TLI hook for DAGCombine to denote targets where large shifts are expensive, sure.

You probably want to be careful with how you choose your examples, though... for a right shift by 15, there are branchless sequences which are probably more efficient than a branch.

Quuxplusone commented 5 years ago
(In reply to Eli Friedman from comment #1)
> We could add a TLI hook for DAGCombine to denote targets where large shifts
> are expensive, sure.
>
> You probably want to be careful with how you choose your examples, though...
> for a right shift by 15, there are branchless sequences which are probably
> more efficient than a branch.

A TLI hook would be great, that should give some more flexibility depending on
how it is implemented, but I'm unsure if this is actually needed. So far the
check for legality that was recently proposed in SimplifySelectCC seems to be
enough for our purposes.

About the arithmetic right shift by 15 (SRA), I'm currently matching it, but
that's because the target has a sign extend word instruction that turns it to
zero or to all ones depending on the sign. I'm not aware of the other
transformations that you mention unless you meant replacing a SRL 15 by a
single bit shift left and an add with carry.

In any case, I think that it's better for targets to match these cases from the
actual selects, but not from shifts, as the purpose of this is to remove long
shifts. The targets must decide anyway what to do with selects, most often they
will be turned into branches, but not necessarily. Also branches on these 8 and
16 bit targets are cheaper than many instructions. As a rule of thumb, a taken
branch costs the same as two simple alu instructions.
Quuxplusone commented 5 years ago
(In reply to Eli Friedman from comment #1)
> We could add a TLI hook for DAGCombine to denote targets where large shifts
> are expensive, sure.
>
> You probably want to be careful with how you choose your examples, though...
> for a right shift by 15, there are branchless sequences which are probably
> more efficient than a branch.

A TLI hook would be great, that should give some additional flexibility
depending on how it is implemented, but I'm unsure if this is actually needed.
So far the check for legality that was recently proposed in SimplifySelectCC
seems to be enough for our purposes.

About the arithmetic right shift by 15 (SRA), I'm currently matching it, but
that's because the target has a sign-extend word instruction that sets a
register to zero or to all ones depending on a sign. I'm not aware of other
transformations except the replacement of a SRL 15 by a carry-generating single
bit shift left followed by an add zero with carry, but that is rather target
dependent I think.

In any case, I think that it's better for targets to match these cases from the
actual selects, not from the shifts. The targets must decide anyway what to do
with selects, most often they will be turned into branches, but not
necessarily. Also we must not forget that branches on these 8 and 16 bit
targets are cheap. As a rule of thumb, a not taken branch costs the same as one
alu instruction, and a taken branch costs the same as two alu operations.
Quuxplusone commented 5 years ago
(In reply to Joan Lluch from comment #2)
> (In reply to Eli Friedman from comment #1)
> > We could add a TLI hook for DAGCombine to denote targets where large shifts
> > are expensive, sure.
> >
> > You probably want to be careful with how you choose your examples, though...
> > for a right shift by 15, there are branchless sequences which are probably
> > more efficient than a branch.
>
> A TLI hook would be great, that should give some more flexibility depending
> on how it is implemented, but I'm unsure if this is actually needed. So far
> the check for legality that was recently proposed in SimplifySelectCC seems
> to be enough for our purposes.

Oh, I missed this discussion.

> About the arithmetic right shift by 15 (SRA), I'm currently matching it, but
> that's because the target has a sign extend word instruction that turns it
> to zero or to all ones depending on the sign. I'm not aware of the other
> transformations that you mention unless you meant replacing a SRL 15 by a
> single bit shift left and an add with carry.

Yes, I just meant there are special cases.  For shifts with one result bit, you
can use add with carry or sub with carry.  For other large shift amounts, you
can rotate in the opposite direction, then mask the result, possibly.
Quuxplusone commented 5 years ago
(In reply to Eli Friedman from comment #4)

> Yes, I just meant there are special cases.  For shifts with one result bit,
> you can use add with carry or sub with carry.  For other large shift
> amounts, you can rotate in the opposite direction, then mask the result,
> possibly.

Yes, I know all this, but one of my points is that we must leave the target to
decide this. DAGCombine should not determine if a particular transformation is
profitable when the transform involves instructions that are not "legal", such
as shifts.
Quuxplusone commented 5 years ago

I suggest solving this by implementing a TLI virtual function that might be named "shouldConvertToShift", or something to that effect, that can be overriden by targets. (For example, similar to functions like shouldConvertConstantLoadToIntImm or isLegalAddressingMode)

In the DAGCombine code, use the above function to disable transforms that would create too large shifts for particular targets. The function should return false and pass the amount and type of the queried shift.

The default implementation should return always true, so that shifts are normally allowed as usual.

Targets with expensive shifts may return false for shift amounts above certain threshold. For example I think that a suitable threshold for the MSP430 might be 2.

I think that this approach is more flexible than checking for shift legality, and should suit a greater number of targets. Another advantage is that it won't change anything for any targets until they get updated.

I can help by identifying functions in DAGCombine and elsewhere where performing this check may be useful.

John

Quuxplusone commented 5 years ago

Sorry, I meant the TLI function suggested above to return a bool, not "false" (sorry for the typo)