Quuxplusone / LLVMBugzillaTest


LLVM failed at deleting my loops #11227

Closed Quuxplusone closed 13 years ago

Quuxplusone commented 13 years ago
Bugzilla Link PR11034
Status RESOLVED FIXED
Importance P normal
Reported by Chandler Carruth (chandlerc@gmail.com)
Reported on 2011-09-28 17:57:42 -0700
Last modified on 2011-10-22 14:58:45 -0700
Version trunk
Hardware PC All
CC atrick@apple.com, baldrick@free.fr, clattner@nondot.org, geek4civic@gmail.com, llvm-bugs@lists.llvm.org, llvm@sunfishcode.online, michael@lunarg.com, nicholas@mxc.ca
Fixed by commit(s)
Attachments arr1.ll (1323 bytes, text/plain)
pr11034-1.patch (12808 bytes, text/plain)
pr11034-2.patch (14115 bytes, text/plain)
pr11034-3.patch (13076 bytes, text/plain)
pr11034-4.patch (13351 bytes, text/plain)
Blocks
Blocked by
See also
I was hacking on something totally different, and went to show the *obviously*
trivial code to Nick. And LLVM didn't delete it. WTF.

; ModuleID = 'arr1.cc'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"

%struct.X = type { [27 x i8], [2 x i8] }

@_ZZ5testXiiPcE1x = private unnamed_addr constant %struct.X { [27 x i8] c"abcd\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00", [2 x i8] c"a\00" }, align 1

define i32 @_Z5testXiiPc(i32 %i, i32 %j, i8* nocapture %c) nounwind uwtable readnone {
entry:
  br label %for.body

for.body:                                         ; preds = %for.body, %entry
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %sum.02 = phi i32 [ 0, %entry ], [ %add, %for.body ]
  %arrayidx = getelementptr inbounds %struct.X* @_ZZ5testXiiPcE1x, i64 0, i32 0, i64 %indvars.iv
  %0 = load i8* %arrayidx, align 1, !tbaa !0
  %conv = sext i8 %0 to i32
  %add = add nsw i32 %conv, %sum.02
  %indvars.iv.next = add i64 %indvars.iv, 1
  %lftr.wideiv = trunc i64 %indvars.iv.next to i32
  %exitcond = icmp eq i32 %lftr.wideiv, 27
  br i1 %exitcond, label %for.end, label %for.body

for.end:                                          ; preds = %for.body
  ret i32 %add
}

!0 = metadata !{metadata !"omnipotent char", metadata !1}
!1 = metadata !{metadata !"Simple C/C++ TBAA", null}

Why is there a loop body here? The result is statically known....

My favorite part: if you make the array size 18, it works. But 19 or higher,
and it gives up. =[
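
For context, a plausible C++ reconstruction of the test case behind this IR; the original arr1.cc is not shown in the report, so the struct and member names below are assumptions inferred from the mangled symbols (_Z5testXiiPc demangles to testX(int, int, char*), and _ZZ5testXiiPcE1x is a function-local static named x).

// Reconstruction (assumption, not the original arr1.cc).
struct X {
  char a[27];  // "abcd" followed by zero bytes
  char b[2];   // "a"
};

int testX(int i, int j, char *c) {
  static const X x = {"abcd", "a"};
  int sum = 0;
  for (int k = 0; k < 27; ++k)
    sum += x.a[k];   // every loaded byte is a compile-time constant
  return sum;        // statically 'a'+'b'+'c'+'d' == 394, yet the loop survives -O3
}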
Quuxplusone commented 13 years ago
For reference, compare the following two for.body blocks:

--- original ----
for.body:                                         ; preds = %for.body, %entry
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %sum.02 = phi i32 [ 0, %entry ], [ %add, %for.body ]
  %arrayidx = getelementptr inbounds %struct.X* @_ZL2gx, i64 0, i32 0, i64 %indvars.iv
  %0 = load i8* %arrayidx, align 1, !tbaa !0
  %conv = sext i8 %0 to i32
  %add = add nsw i32 %conv, %sum.02
  %indvars.iv.next = add i64 %indvars.iv, 1
  %lftr.wideiv = trunc i64 %indvars.iv.next to i32
  %exitcond = icmp eq i32 %lftr.wideiv, 19
  br i1 %exitcond, label %for.end, label %for.body
----

---- mine ----
for.body:                                         ; preds = %for.body, %entry
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %sum.02 = phi i32 [ 0, %entry ], [ %add, %for.body ]
  %arrayidx = getelementptr inbounds %struct.X* @_ZL2gx, i64 0, i32 0, i64 %indvars.iv
  %0 = load i8* %arrayidx, align 1, !tbaa !0
  %conv = sext i8 %0 to i32
  %add = add nsw i32 %conv, %sum.02
  %indvars.iv.next = add i64 %indvars.iv, 1
  %exitcond = icmp eq i32 %indvars.iv.next, 19
  br i1 %exitcond, label %for.end, label %for.body
----

I've removed the pointless trunc on the indvar. That's it. Mine gets deleted
like it should. =[

Also, specifying -enable-iv-rewrite makes this go away.
Quuxplusone commented 13 years ago

With -enable-iv-rewrite the loop test will be rewritten as icmp i64, so no trunc is inserted.

We now leave the loop test as-is without any good reason to change it, since a trunc should be free. But why is the trunc pointless? And how can you remove it without changing the icmp width?

Using a small array results in full unrolling, so straight-line DCE kicks in.

I'm not sure yet why the trunc interferes with this optimization... need to dig a little deeper.

Quuxplusone commented 13 years ago
(In reply to comment #2)
> With -enable-iv-rewrite the loop test will be rewritten as icmp i64, so no
> trunc is inserted.
>
> We now leave the loop test as-is without any good reason to change it, since a
> trunc should be free. But why is the trunc pointless? And how can you remove it
> without changing the icmp width?

Sorry, I only meant it was pointless as the only use of the trunc was an icmp
with a literal. I removed it *and* changed the icmp width.

> I'm not sure yet why the trunc interferes with this optimization... need to
> dig a little deeper.

Sure. I'm not worried about the existence of the trunc, I'm just mystified by
the loop not getting deleted, and noticed that it was deleted if I did the
rewrite by hand.
Quuxplusone commented 13 years ago
Which pass does the optimization? I'm not seeing optimized code from opt -O3
-enable-iv-rewrite. If you attach the exact .ll and opt command, I'd like to
see the "obvious" optimization in action.
Quuxplusone commented 13 years ago

Attached arr1.ll (1323 bytes, text/plain): Test case

Quuxplusone commented 13 years ago
Ok thanks.  I was using your original test case with 27 iterations and saw no
optimization. With 19 iterations and -enable-iv-rewrite=true, the loop is fully
unrolled, thus GVN'd and DCE'd.

With -enable-iv-rewrite=false (now default), there is one extra trunc. That
instruction x19 puts it over the default unroll threshold:

  Loop Size = 8
  Too large to fully unroll with count: 19 because size: 152>150

You can work around it with -unroll-threshold=8x(array size).

You would probably be in worse shape with -Os.

I'm looking into changing the loop size estimator.
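
For concreteness, a minimal sketch of the size check described above, using the numbers from the debug output (variable names are illustrative; this is not the actual LoopUnroll code):

const unsigned LoopSize  = 8;    // per-iteration estimate from "Loop Size = 8" (includes the extra trunc)
const unsigned TripCount = 19;   // array size in the reduced test case
const unsigned Threshold = 150;  // default -unroll-threshold at the time
const bool FullyUnroll = LoopSize * TripCount <= Threshold;  // 152 > 150, so no full unroll
// Raising the threshold to at least LoopSize * TripCount (roughly 8 x the array
// size, e.g. -unroll-threshold=152 here) lets the full unroll, and then DCE, happen.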
Quuxplusone commented 13 years ago
(In reply to comment #6)
> Ok thanks.  I was using your original test case with 27 iterations and saw no
> optimization. With 19 iterations and -enable-iv-rewrite=true, the loop is fully
> unrolled, thus GVN'd and DCE'd.

Ahh! That makes sense.

> With -enable-iv-rewrite=false (now default), there is one extra trunc. That
> instruction x19 puts it over the default unroll threshold:
>
>   Loop Size = 8
>   Too large to fully unroll with count: 19 because size: 152>150
>
> You can work around it with -unroll-threshold=8x(array size).
>
> You would probably be in worse shape with -Os.
>
> I'm looking into changing the loop size estimator.

Cool. I wonder, is there a general way to determine that unrolling this loop
(regardless of the loop body size) is OK because we can statically compute the
result? I find it a bit alarming that the ability to analyze and compute
statically known things relies on unrolling the loop completely, but that may
just be a naive understanding of the problems...

Anyway, feel free to close this bug or keep it open if you think it's useful
for tracking your size-estimation work.
Quuxplusone commented 13 years ago

You're right. Unrolling should be driven by identification of opportunities, not a context-independent size threshold. I suggest you file another bug on that. I plan to close this one after making a minor fix to the heuristics.

Quuxplusone commented 13 years ago
Actually, I was wondering why it wasn't being constant evaluated by the
constant string analysis logic in ComputeExitLimitFromICmp:

  // Handle common loops like: for (X = "string"; *X; ++X)

I haven't had a chance to look at it yet, but I think that should probably be
extended to handle this case before we ever get to unrolling the loop.
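
For reference, a minimal C++ example of the loop shape that comment refers to (illustrative only; not taken from the bug):

// ComputeExitLimitFromICmp can already fold the exit test of loops like this,
// because the pointer walks a constant string:
int literalLength() {
  int len = 0;
  for (const char *X = "string"; *X; ++X)
    ++len;
  return len;  // trip count is a compile-time constant (6)
}
// The loop in this bug is similar, except that what needs folding is the
// accumulated sum, not just the trip count.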
Quuxplusone commented 13 years ago

I don't think we have any problem computing a trip count. Rather, we're unable to compute the exit value of the Sum based on that trip count. But you could probably take some of the logic from ComputeExitLimitFromICmp to handle constant strings, and apply it in getConstantEvolutionLoopExitValue to solve the problem without unrolling.
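To make the distinction concrete, a small sketch (assuming the 27-byte constant from arr1.ll):

// The trip count (27) is already computable. The missing piece is the exit
// value of the accumulator: the byte sum of the constant initializer,
// i.e. 'a' + 'b' + 'c' + 'd' plus 23 zero bytes.
constexpr int ExpectedExitValue = 'a' + 'b' + 'c' + 'd';  // == 394
static_assert(ExpectedExitValue == 394, "the whole function should fold to ret i32 394");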

Quuxplusone commented 13 years ago

The difference in behavior with -enable-iv-rewrite should go away after r140919. I'll leave this open in case Nick wants to implement a new "string constant evolution" optimization.

Quuxplusone commented 13 years ago
You made quick work of this problem. I reviewed it as best I could and
it seems solid. Comments...

+ DenseMap<PHINode *, Constant *> PrevIterPHIVals = PHIVals;

Why do you need to copy the map? You don't want to reuse the phi's
previous values in the next iteration. So can't you just create an
empty nextIterPHIVals map, fill it in for each phi, then std::swap
maps before moving to the next iteration?
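
A minimal sketch of the suggested pattern, assuming LLVM's DenseMap; the helper and variable names are hypothetical, not the actual patch:

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Instructions.h"   // PHINode, Constant (header path is era-dependent)

// Hypothetical stand-in for EvaluateExpression on the PHI's back-edge value.
llvm::Constant *evolveOneIteration(llvm::PHINode *PN,
                                   llvm::DenseMap<llvm::PHINode *, llvm::Constant *> &Prev);

void evolveLoopPHIs(const llvm::SmallVectorImpl<llvm::PHINode *> &LoopPHIs,
                    unsigned TripCount) {
  llvm::DenseMap<llvm::PHINode *, llvm::Constant *> CurIterVals, NextIterVals;
  for (unsigned Iter = 0; Iter != TripCount; ++Iter) {
    NextIterVals.clear();                       // scratch map; no per-iteration copy
    for (llvm::PHINode *PN : LoopPHIs)
      NextIterVals[PN] = evolveOneIteration(PN, CurIterVals);
    CurIterVals.swap(NextIterVals);             // hand the new values to the next iteration
  }
}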

We should have comments in BuildConstantFromSCEV explaining why it is
necessary. It's not immediately obvious why the constant folding
wasn't handled in the usual SCEV creation step. I actually had to step
through the code in a debugger to figure out why you needed it. Does
it all stem from the problem that pointer constants are wrapped in
SCEVUnknown?

By evolving all phis within getSCEVAtScope, it looks like we're making
this one operation quadratic in the number of phis. Maybe it would be
better to evolve all phis at once and cache those values across
getSCEVAtScope. Do we want a comment or TODO regarding this?
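
Rough sketch of that caching idea (hypothetical names; not ScalarEvolution's actual interface):

#include "llvm/ADT/DenseMap.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Instructions.h"   // PHINode, Constant (header path is era-dependent)

// Hypothetical helper that evolves every PHI of L in a single pass and
// records the results.
void evolveAllPHIsOnce(const llvm::Loop *L,
                       llvm::DenseMap<llvm::PHINode *, llvm::Constant *> &Out);

llvm::DenseMap<llvm::PHINode *, llvm::Constant *> EvolvedPHIs;

llvm::Constant *getEvolvedValue(llvm::PHINode *PN, const llvm::Loop *L) {
  if (EvolvedPHIs.empty())
    evolveAllPHIsOnce(L, EvolvedPHIs);  // do the expensive work once per loop
  return EvolvedPHIs.lookup(PN);        // subsequent queries are cheap lookups
}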
Quuxplusone commented 13 years ago

Attached pr11034-1.patch (12808 bytes, text/plain): patch

Quuxplusone commented 13 years ago

I'm investigating enc-rc4, mandel-2, dhrystone to see if this optimization is really doing the right thing here.

I may also glance at compile time to make sure it's not a problem.

Quuxplusone commented 13 years ago

Attached pr11034-2.patch (14115 bytes, text/plain): patch

Quuxplusone commented 13 years ago

Ok. I do want this patch to go in, but I should take a look at the aforementioned benchmarks first. I just got hit with a totally unexpected and unrelated problem, so it may take me a couple more days before I get back to this. Let me know if you're blocked on it.

Quuxplusone commented 13 years ago

With my test-suite make options (-O3, no -flto) mandel-2, dry, and enc-rc4 are untouched by the new constant folding.

However, I did find several other benchmarks that are potentially affected. I'll have compile and execution time results for those tomorrow morning, and we can go ahead with the patch if it looks reasonable.

Quuxplusone commented 13 years ago
With patch-3 I'm hitting this assertion:
Assertion failed: (canConstantEvolve(I, L) && "cannot evaluate expression in
this loop"), function EvaluateExpression, file
/Volumes/Storage/patch2/lib/Analysis/ScalarEvolution.cpp, line 4761.

I haven't debugged further. So it could be a simple merge problem. But it is
preventing me from looking at benchmark results because the failing benchmark
list is almost identical to the benchmarks in which the constant folding kicks
in:

External/Povray/povray
External/SPEC/CFP2006/447.dealII/447.dealII
External/SPEC/CINT2000/175.vpr/175.vpr
External/SPEC/CINT2000/176.gcc/176.gcc
External/SPEC/CINT2000/253.perlbmk/253.perlbmk
External/SPEC/CINT2000/254.gap/254.gap
External/SPEC/CINT2006/445.gobmk/445.gobmk
External/SPEC/CINT2006/456.hmmer/456.hmmer
External/SPEC/CINT2006/464.h264ref/464.h264ref
External/SPEC/CINT95/099.go/099.go
External/SPEC/CINT95/132.ijpeg/132.ijpeg
MultiSource/Applications/ClamAV/clamscan
MultiSource/Applications/JM/ldecod/ldecod
MultiSource/Applications/JM/lencod/lencod
MultiSource/Benchmarks/Bullet/bullet
MultiSource/Benchmarks/Fhourstones-3.1/fhourstones3.1
MultiSource/Benchmarks/Fhourstones/fhourstones
MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan
MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg
MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame
MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm
MultiSource/Benchmarks/PAQ8p/paq8p
MultiSource/Benchmarks/Prolangs-C/assembler/assembler
MultiSource/Benchmarks/Prolangs-C/simulator/simulator
MultiSource/Benchmarks/Trimaran/netbench-url/netbench-url
MultiSource/Benchmarks/mediabench/gsm/toast/toast
MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg
MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode
MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4
SingleSource/Benchmarks/Shootout-C++/matrix
SingleSource/Benchmarks/Shootout/matrix
SingleSource/Benchmarks/Stanford/IntMM
Quuxplusone commented 13 years ago

Attached pr11034-3.patch (13076 bytes, text/plain): patch

Quuxplusone commented 13 years ago

Attached pr11034-4.patch (13351 bytes, text/plain): patch

Quuxplusone commented 13 years ago

I benchmarked Nick's latest patch on x86 and ARM, running the full LLVM test-suite including SPEC. I didn't see any clear wins, but no regressions either; just the usual noise. I was concerned about compile time, but strangely my data shows improved compile times after the patch. That's just weird, but I won't question it.

I say go ahead and commit this optimization since we know it improves something.

Quuxplusone commented 13 years ago

Thanks! Committed in r142731.

Improved compile time is not surprising; when SCEV fails to fully solve a loop, it will get queries for that loop over and over again (sometimes without SCEV having been preserved in between) and will spend a lot of time trying to resolve it. Solving it once up front is faster.