Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Adding "minsize" attribute to function increases size of opt output 15 times #42294

Open Quuxplusone opened 5 years ago

Quuxplusone commented 5 years ago
Bugzilla Link PR43324
Status NEW
Importance P enhancement
Reported by Mikael Holmén (mikael.holmen@ericsson.com)
Reported on 2019-09-16 04:29:13 -0700
Last modified on 2019-12-26 00:28:14 -0800
Version trunk
Hardware PC Linux
CC bjorn.a.pettersson@ericsson.com, florian_hahn@apple.com, llvm-bugs@lists.llvm.org, spatel+llvm@rotateright.com, yuanfang.chen@sony.com
Fixed by commit(s)
Attachments bbi-32266_2.ll.gz (209495 bytes, application/gzip)
bbi-32266_3.ll (156929 bytes, text/x-matlab)
bbi-32266_4.ll (1571 bytes, text/plain)
Blocks
Blocked by
See also
If we run opt like this:

opt -disable-basicaa -licm -loop-unroll -O1 -unroll-threshold=0 -o bbi-
32266_2.opt.bc bbi-32266_2.ll

and check the size of the output bbi-32266_2.opt.bc I get

-rw-r--r-- 1 uabelho eusers 203308 sep 16 13:09 bbi-32266_2.opt.bc

Then, if I just add a "minsize" attribute to function main in the input
and run opt the same way, I instead get this output:

-rw-r--r-- 1 uabelho eusers 3177960 sep 16 13:12 ./bbi-32266_2.opt.bc

So, adding "minsize" makes the size of the program increase around 15 times.
Quuxplusone commented 5 years ago
This seems to have gotten significantly worse with r364398:

    [LoopUnroll] Add support for loops with exiting headers and uncond latches.

    This patch generalizes the UnrollLoop utility to support loops that exit
    from the header instead of the latch. Usually, LoopRotate would take care
    of must of those cases, but in some cases (e.g. -Oz), LoopRotate does
    not kick in.

    Codesize impact looks relatively neutral on ARM64 with -Oz + LTO.

    Program                                         master     patch     diff
     External/S.../CFP2006/447.dealII/447.dealII   629060.00  627676.00  -0.2%
     External/SPEC/CINT2000/176.gcc/176.gcc        1245916.00 1244932.00 -0.1%
     MultiSourc...Prolangs-C/simulator/simulator   86100.00   86156.00    0.1%
     MultiSourc...arks/Rodinia/backprop/backprop   66212.00   66252.00    0.1%
     MultiSourc...chmarks/Prolangs-C++/life/life   67276.00   67312.00    0.1%
     MultiSourc...s/Prolangs-C/compiler/compiler   69824.00   69788.00   -0.1%
     MultiSourc...Prolangs-C/assembler/assembler   86672.00   86696.00    0.0%

    Reviewers: efriedma, vsk, paquette

    Reviewed By: paquette

    Differential Revision: https://reviews.llvm.org/D61962

    llvm-svn: 364398

If I disable that change the size of the opt output is

-rw-r--r-- 1 uabelho eusers 308556 sep 16 13:33 bbi-32266_2.opt.bc

so minsize still increases the size with 50% compared to not using minsize, but
we don't get the 15 times increase.
Quuxplusone commented 5 years ago

Thanks for filing the issue! I do not think r364398 is the real problem, I suspect the loop unrolling cost modeling makes a wrong decision.

Quuxplusone commented 5 years ago

Attached bbi-32266_2.ll.gz (209495 bytes, application/gzip): reproducer

Quuxplusone commented 5 years ago
(In reply to Florian Hahn from comment #2)
> Thanks for filing the issue! I do not think r364398 is the real problem, I
> suspect the loop unrolling cost modeling makes a wrong decision.

Yes, even the 50% increase with "minsize" is wrong to me so I suppose r364398
exposes the problem much more.
Quuxplusone commented 5 years ago
Slightly reduced reproducer:

 opt -loop-unroll -loop-rotate -loop-unroll -unroll-threshold=0 -o bbi-32266_3.opt.bc bbi-32266_3.ll

gives

 -rw-r--r-- 1 uabelho eusers 61408 sep 17 15:58 bbi-32266_3.opt.bc

then adding "minsize" to main and rerun gives

 -rw-r--r-- 1 uabelho eusers 930088 sep 17 15:58 bbi-32266_3.opt.bc
Quuxplusone commented 5 years ago

Attached bbi-32266_3.ll (156929 bytes, text/x-matlab): reduced reproducer

Quuxplusone commented 5 years ago
Ok, now really reduced reproducer. The input is just 8 BBs large and contains 8
instructions apart from the branches:

  opt -loop-unroll -loop-rotate -loop-unroll -unroll-threshold=0 -o bbi-32266_4.opt.bc bbi-32266_4.ll

gives

 -rw-r--r-- 1 uabelho eusers 1784 sep 25 10:15 bbi-32266_4.opt.bc

and when adding minsize to main we get

 -rw-r--r-- 1 uabelho eusers 421288 sep 25 10:16 bbi-32266_4.opt.bc

The function tryToUnrollLoop in LoopUnrollPass.cpp contains this:

  // When optimizing for size, use LoopSize + 1 as threshold (we use < Threshold
  // later), to (fully) unroll loops, if it does not increase code size.
  if (OptForSize)
    UP.Threshold = std::max(UP.Threshold, LoopSize + 1);

so the "MinSize" indeed has some special handling (which seems to ignore the -
unroll-threshold=0 flag), but it seems like it fails to adhere to the "if it
does not increase code size" comment above for some reason.
Quuxplusone commented 5 years ago

Attached bbi-32266_4.ll (1571 bytes, text/plain): really reduced reproducer

Quuxplusone commented 5 years ago
Ok, so without minsize we bail out at

  // Exit early if unrolling is disabled. For OptForSize, we pick the loop size
  // as threshold later on.
  if (UP.Threshold == 0 && (!UP.Partial || UP.PartialThreshold == 0) &&
      !OptForSize)
    return LoopUnrollResult::Unmodified;

since we're using -unroll-threshold=0. So we get no unrolling at all.

With minsize, as the comment says we don't bail out but instead we set the
threshold to

  // When optimizing for size, use LoopSize + 1 as threshold (we use < Threshold
  // later), to (fully) unroll loops, if it does not increase code size.
  if (OptForSize)
    UP.Threshold = std::max(UP.Threshold, LoopSize + 1);

Then we end up in computeUnrollCount where we see that the loop has a
loop unroll pragma and we do

  // 2nd priority is unroll count set by pragma.
  unsigned PragmaCount = UnrollCountPragmaValue(L);
  if (PragmaCount > 0) {
    UP.Count = PragmaCount;
    UP.Runtime = true;
    UP.AllowExpensiveTripCount = true;
    UP.Force = true;
    if ((UP.AllowRemainder || (TripMultiple % PragmaCount == 0)) &&
        getUnrolledLoopSize(LoopSize, UP) < PragmaUnrollThreshold)
      return true;

so then we just set the unroll count (UP.Count) to the number set in the
pragma, and we don't consider the UP.Threshold that we previously calculated at
all?

As far as I can see, UnrollLoop then just follows orders and unrolls the loop
UP.count times, and we get the first step of code expansion.
Quuxplusone commented 5 years ago

Yes, I think we have to account for user-set unrolling thresholds there I think, if -unroll-threshold=0 is used to disable unrolling.

There still might be an issue with the unrolling cost modeling (or further down the pipeline), if it believes the unrolled loop is the same size as the original loop.

Quuxplusone commented 4 years ago
I've been looking a bit at

  opt -passes='print<loops>,unroll,print<loops>,loop(rotate),print<loops>,unroll' -unroll-threshold=0 bbi-32266_4_with_minsize.ll -o - -S -debug-only=loop-unroll

where bbi-32266_4_with_minsize.ll is the "really reduced reproducer" with a #0
added on the function to set the minsize attribute.

(1) Before the first unroll we have these loops:

Loop at depth 1 containing: %bb.7<header><latch>
Loop at depth 1 containing:
%bb.1<header><exiting>,%bb.2,%bb.3<latch>,%bb.4,%bb.5,%bb.6
    Loop at depth 2 containing: %bb.2<header><exiting>,%bb.4<latch>,%bb.5<latch>,%bb.6
        Loop at depth 3 containing: %bb.4<header><exiting>,%bb.5<exiting>,%bb.6<latch>
Loop at depth 1 containing: %bb.7<header><latch>

(2) And then the unroller fails to unroll the inner loops on depth 2 and 3:

Loop Unroll: F[main] Loop %bb.4
  Loop Size = 7
  Can't unroll; loop not terminated by a conditional branch in the latch or header.
Loop Unroll: F[main] Loop %bb.2
  Loop Size = 12
  will not try to unroll loop with runtime trip count -unroll-runtime not given

(3) The outer loop is however unrolled:

Loop Unroll: F[main] Loop %bb.1
  Loop Size = 17
Trying runtime unrolling on Loop:
Loop at depth 1 containing:
%bb.1<header><exiting>,%bb.2,%bb.3<latch>,%bb.4,%bb.5,%bb.6,%bb.4.preheader,%bb.2.loopexit,%bb.2.preheader
    Loop at depth 2 containing: %bb.2<header><exiting>,%bb.4,%bb.5,%bb.6,%bb.4.preheader,%bb.2.loopexit<latch>
        Loop at depth 3 containing: %bb.4<header><exiting>,%bb.5<exiting>,%bb.6<latch>
Using prolog remainder.
Loop latch not terminated by a conditional branch.
UNROLLING loop %bb.1 by 63!

(4) After the first unrolling we got lots of loops, basically 63 versions of
each of the inner loops (so the list below is heavily reduced):

Loop at depth 1 containing: %bb.7<header><latch>
Loop at depth 1 containing: %bb.1<header><exiting>, .............
   .......................
   Loop at depth 2 containing: %bb.2<header><exiting>,%bb.4,%bb.5,%bb.6,%bb.4.preheader,%bb.2.loopexit<latch>
        Loop at depth 3 containing: %bb.4<header><exiting>,%bb.5<exiting>,%bb.6<latch>
    Loop at depth 2 containing: %bb.2.1<header><exiting>,%bb.4.preheader.1,%bb.4.1,%bb.5.1,%bb.2.loopexit.1<latch>,%bb.6.1
        Loop at depth 3 containing: %bb.4.1<header><exiting>,%bb.5.1<exiting>,%bb.6.1<latch>
    Loop at depth 2 containing: %bb.2.2<header><exiting>,%bb.4.preheader.2,%bb.4.2,%bb.5.2,%bb.2.loopexit.2<latch>,%bb.6.2
        Loop at depth 3 containing: %bb.4.2<header><exiting>,%bb.5.2<exiting>,%bb.6.2<latch>
...
...
    Loop at depth 2 containing: %bb.2.61<header><exiting>,%bb.4.preheader.61,%bb.4.61,%bb.5.61,%bb.2.loopexit.61<latch>,%bb.6.61
        Loop at depth 3 containing: %bb.4.61<header><exiting>,%bb.5.61<exiting>,%bb.6.61<latch>
    Loop at depth 2 containing: %bb.2.62<header><exiting>,%bb.4.preheader.62,%bb.4.62,%bb.5.62,%bb.2.loopexit.62<latch>,%bb.6.62
        Loop at depth 3 containing: %bb.4.62<header><exiting>,%bb.5.62<exiting>,%bb.6.62<latch>

Loop at depth 1 containing: %bb.7<header><latch>

(5) When we get to the unroller the second time (after loop rotate) we still
got all these 63 loops at depth 2, each with one inner loop so we have 63 depth
3 loops as well. This time unrolling is possible for the depth 2 loops. So we
kind of get the following 63 times (but for bb.5.1, %bb.5.2, ..., %bb.5.62):

Loop Unroll: F[main] Loop %bb.4.preheader.1
  Loop Size = 327
  will not try to unroll loop with runtime trip count -unroll-runtime not given
Loop Unroll: F[main] Loop %bb.5.2
  Loop Size = 6
Trying runtime unrolling on Loop:
Loop at depth 3 containing: %bb.5.2<header><exiting>,%bb.4.2<latch><exiting>
Using prolog remainder.
Multiple exit/exiting blocks in loop and multi-exit unrolling not enabled!
UNROLLING loop %bb.5.2 by 53!

Whenever we decide to unroll a loop that decision is partly made based on the
size of the inner loop (even for unroll by loop pragma there is one large
threshold that should not be exceeded). I guess that is one reason for always
unrolling the innermost loop first. But in this case we end up unrolling the
outer loop first. Since the decision to unroll the outer loop is based on that
we have not unrolled the inner loop (that would possibly give a much larger
cost and exceeding our thresholds) it seems wrong to allow unrolling of the
inner loops at the second execution of loop-unroll. That seems like a general
problem with the unrolling to me.

Wouldn't it be better to mark inner loops as "already unrolled" whenever we
decide to unroll an outer loop?

For example adding something like this

+  // Decisions for unrolling (thresholds etc) have been based on that we do not
+  // unroll inner loops later (even if they for example have unroll count
+  // pragmas). So we better mark inner loops as already unrolled. Go ahead and
+  // do it directly before we start to rewrite L.
+  for (Loop *SubLoop : *L)
+    SubLoop->setLoopAlreadyUnrolled();

in llvm::UnrollLoop at this place: https://github.com/llvm/llvm-
project/blob/8b0780f795eb58fca0a2456e308adaaa1a0b5013/llvm/lib/Transforms/Utils/LoopUnroll.cpp#L530
Quuxplusone commented 4 years ago
(In reply to bjorn.a.pettersson from comment #11)
> I've been looking a bit at
>
>   opt
> -passes='print<loops>,unroll,print<loops>,loop(rotate),print<loops>,unroll'
> -unroll-threshold=0 bbi-32266_4_with_minsize.ll -o - -S
> -debug-only=loop-unroll
>
> where bbi-32266_4_with_minsize.ll is the "really reduced reproducer" with a
> #0 added on the function to set the minsize attribute.
>
>
> (1) Before the first unroll we have these loops:
>
> Loop at depth 1 containing: %bb.7<header><latch>
> Loop at depth 1 containing:
> %bb.1<header><exiting>,%bb.2,%bb.3<latch>,%bb.4,%bb.5,%bb.6
>     Loop at depth 2 containing:
> %bb.2<header><exiting>,%bb.4<latch>,%bb.5<latch>,%bb.6
>         Loop at depth 3 containing:
> %bb.4<header><exiting>,%bb.5<exiting>,%bb.6<latch>
> Loop at depth 1 containing: %bb.7<header><latch>
>
>
> (2) And then the unroller fails to unroll the inner loops on depth 2 and 3:
>
> Loop Unroll: F[main] Loop %bb.4
>   Loop Size = 7
>   Can't unroll; loop not terminated by a conditional branch in the latch or
> header.
> Loop Unroll: F[main] Loop %bb.2
>   Loop Size = 12
>   will not try to unroll loop with runtime trip count -unroll-runtime not
> given
>
> (3) The outer loop is however unrolled:
>
> Loop Unroll: F[main] Loop %bb.1
>   Loop Size = 17
> Trying runtime unrolling on Loop:
> Loop at depth 1 containing:
> %bb.1<header><exiting>,%bb.2,%bb.3<latch>,%bb.4,%bb.5,%bb.6,%bb.4.preheader,
> %bb.2.loopexit,%bb.2.preheader
>     Loop at depth 2 containing:
> %bb.2<header><exiting>,%bb.4,%bb.5,%bb.6,%bb.4.preheader,%bb.2.
> loopexit<latch>
>         Loop at depth 3 containing:
> %bb.4<header><exiting>,%bb.5<exiting>,%bb.6<latch>
> Using prolog remainder.
> Loop latch not terminated by a conditional branch.
> UNROLLING loop %bb.1 by 63!
>
>
> (4) After the first unrolling we got lots of loops, basically 63 versions of
> each of the inner loops (so the list below is heavily reduced):
>
> Loop at depth 1 containing: %bb.7<header><latch>
> Loop at depth 1 containing: %bb.1<header><exiting>, .............
>    .......................
>    Loop at depth 2 containing:
> %bb.2<header><exiting>,%bb.4,%bb.5,%bb.6,%bb.4.preheader,%bb.2.
> loopexit<latch>
>         Loop at depth 3 containing:
> %bb.4<header><exiting>,%bb.5<exiting>,%bb.6<latch>
>     Loop at depth 2 containing:
> %bb.2.1<header><exiting>,%bb.4.preheader.1,%bb.4.1,%bb.5.1,%bb.2.loopexit.
> 1<latch>,%bb.6.1
>         Loop at depth 3 containing:
> %bb.4.1<header><exiting>,%bb.5.1<exiting>,%bb.6.1<latch>
>     Loop at depth 2 containing:
> %bb.2.2<header><exiting>,%bb.4.preheader.2,%bb.4.2,%bb.5.2,%bb.2.loopexit.
> 2<latch>,%bb.6.2
>         Loop at depth 3 containing:
> %bb.4.2<header><exiting>,%bb.5.2<exiting>,%bb.6.2<latch>
> ...
> ...
>     Loop at depth 2 containing:
> %bb.2.61<header><exiting>,%bb.4.preheader.61,%bb.4.61,%bb.5.61,%bb.2.
> loopexit.61<latch>,%bb.6.61
>         Loop at depth 3 containing:
> %bb.4.61<header><exiting>,%bb.5.61<exiting>,%bb.6.61<latch>
>     Loop at depth 2 containing:
> %bb.2.62<header><exiting>,%bb.4.preheader.62,%bb.4.62,%bb.5.62,%bb.2.
> loopexit.62<latch>,%bb.6.62
>         Loop at depth 3 containing:
> %bb.4.62<header><exiting>,%bb.5.62<exiting>,%bb.6.62<latch>
>
> Loop at depth 1 containing: %bb.7<header><latch>
>
>
> (5) When we get to the unroller the second time (after loop rotate) we still
> got all these 63 loops at depth 2, each with one inner loop so we have 63
> depth 3 loops as well. This time unrolling is possible for the depth 2
> loops. So we kind of get the following 63 times (but for bb.5.1, %bb.5.2,
> ..., %bb.5.62):
>
> Loop Unroll: F[main] Loop %bb.4.preheader.1
>   Loop Size = 327
>   will not try to unroll loop with runtime trip count -unroll-runtime not
> given
> Loop Unroll: F[main] Loop %bb.5.2
>   Loop Size = 6
> Trying runtime unrolling on Loop:
> Loop at depth 3 containing: %bb.5.2<header><exiting>,%bb.4.2<latch><exiting>
> Using prolog remainder.
> Multiple exit/exiting blocks in loop and multi-exit unrolling not enabled!
> UNROLLING loop %bb.5.2 by 53!
>
>
> Whenever we decide to unroll a loop that decision is partly made based on
> the size of the inner loop (even for unroll by loop pragma there is one
> large threshold that should not be exceeded). I guess that is one reason for
> always unrolling the innermost loop first. But in this case we end up
> unrolling the outer loop first. Since the decision to unroll the outer loop
> is based on that we have not unrolled the inner loop (that would possibly
> give a much larger cost and exceeding our thresholds) it seems wrong to
> allow unrolling of the inner loops at the second execution of loop-unroll.
> That seems like a general problem with the unrolling to me.
>
> Wouldn't it be better to mark inner loops as "already unrolled" whenever we
> decide to unroll an outer loop?
>
> For example adding something like this
>
> +  // Decisions for unrolling (thresholds etc) have been based on that we do
> not
> +  // unroll inner loops later (even if they for example have unroll count
> +  // pragmas). So we better mark inner loops as already unrolled. Go ahead
> and
> +  // do it directly before we start to rewrite L.
> +  for (Loop *SubLoop : *L)
> +    SubLoop->setLoopAlreadyUnrolled();
>
> in llvm::UnrollLoop at this place:
> https://github.com/llvm/llvm-project/blob/
> 8b0780f795eb58fca0a2456e308adaaa1a0b5013/llvm/lib/Transforms/Utils/
> LoopUnroll.cpp#L530

Minor correction, we actually need something like this to mark all nestled sub
loops as already unrolled:

  for (Loop *InnerLoop : L->getLoopsInPreorder())
    if (InnerLoop != L)
      InnerLoop->setLoopAlreadyUnrolled();
Quuxplusone commented 4 years ago
Thanks or looking into this. I think the problem is that with minsize/optsize,
we do not skip the early exit checking threshold=0. We should probably respect
the user provided threshold=0 as a way to disable loop-unrolling.

id you mean 'getNumOccurrences'?
diff --git a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
index 4c2b079c6bb..0e9ca841b53 100644
--- a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
@@ -1050,8 +1050,7 @@ static LoopUnrollResult tryToUnrollLoop(

   // Exit early if unrolling is disabled. For OptForSize, we pick the loop size
   // as threshold later on.
-  if (UP.Threshold == 0 && (!UP.Partial || UP.PartialThreshold == 0) &&
-      !OptForSize)
+  if ((UnrollThreshold.getNumOccurrences() > 0 && UP.Threshold == 0) &&
(!UP.Partial || UP.PartialThreshold == 0))
     return LoopUnrollResult::Unmodified;

   SmallPtrSet<const Value *, 32> EphValues;

But I am not sure if that's actually the right way to handle this case. IMO
using -unroll-threshold=0 is kind of a hack. To disable all unrolling at the
clang level, -fno-unroll-loops should be used instead.

In the test case, we have unroll counts in the metadata, and that takes
precedence over any cost modeling, unless the threshold=0. I would be curious
what the original input/use case was.

The observation about the handling of unrolling nested loops is interesting!
But I think not directly related to the issue here.
Quuxplusone commented 4 years ago
(For the record, I'm working on the same project as Mikael who wrote this PR,
and we have just stopped using -unroll-threshold=0 to disable unrolling and we
are now using -fno-unroll-loops instead, as well as TTI hooks to set the
threshold to 0 in UnrollingPreferences for our target in certain situations.)

Checking "UnrollThreshold.getNumOccurrences() > 0" would make it different to
set the threshold to 0 using UnrollingPreferences and when using the command
line option. That might be confusing, but on the other hand those settings
should have different priorities (right?). By checking getNumOccurrences() we
can distinguish where the value comes from (while still using a single
UP.Threshold variable), so that might be useful.

One question is if -unroll-threshold=0 on the command line should override
forced unroll (e.g. by pragma). Right now it is doing that when !OptForSize,
but not with OptForSize (that is weird and hard to grasp for me).
Letting -unroll-threshold=0 override forced unroll regardless of OptForSize
would make sense to me.

Another question is if UP.Threshold=0 set by TTI hook should override forced
unroll (e.g. by pragma). Right now it is doing that when !OptForSize, but not
with OptForSize (that is weird and hard to grasp for me).
Here I assume that we want the pragma to have higher precedence, regardless of
OptForSize. Or we should bail our regardless of OptForSize. I guess that using
UP.Threshold=1 (kind of) gives the expected behavior right now, that forced
unroll does not depend on OptForSize.

To summarize, I think it is weird that Threshold=0 has different impact
depending on OptForSize. It could make a little bit more sense if OptForSize
disabled forced unrolling in combination with Threshold=0, but right now it is
doing the opposite.

About the test case:

The test case here is a little bit weird and maybe not that realistic (I think
that it is a reduced test case created by some fuzzy test case generator).

By having two nestled loops with #pragma unroll count(X) and #pragma unroll
count(Y) the test case actually should expect to get unrolling by a factor X*Y.
So maybe this PR should ask why that only happens when we add the
optsize/minsize attributes (given that UP.Threshold=0).

About the analysis at "2019-12-04 20:06:50 CET":

Sorry for not being that clear about my earlier analysis. That was kind of a
follow up to the earlier comment from Florian about "There still might be an
issue with the unrolling cost modeling (or further down the pipeline), if it
believes the unrolled loop is the same size as the original loop.".

Given that this specific test case has forced unroll by pragma unroll count the
user has requested X*Y unrolls. So it is not incorrect in itself. But I think
there is an upper threshold also for pragma unroll, and that cost analysis does
not work if we end up unrolling the outer loop before the inner loop (so
depending on the X and Y values unrolling should be limited, right?).

I actually think the problem related to "unrolling an outer loop first, based
on the cost of having a rolled inner loop, and later ending up unrolling all
the inner loops and by that breaking the cost analysis done for the outer loop"
could happen also without forced unroll, without OptForSize etc. It would make
sense to treat that as a different problem, so I can write a separate PR for
that.