Missed opportunity to reduce number of loop iterations?

Quuxplusone commented 6 years ago


Bugzilla Link	PR39702
Status	NEW
Importance	P enhancement
Reported by	Vedant Kumar (vsk@apple.com)
Reported on	2018-11-18 20:35:49 -0800
Last modified on	2018-11-26 12:04:27 -0800
Version	trunk
Hardware	PC All
CC	dblaikie@gmail.com, ditaliano@apple.com, efriedma@quicinc.com, hfinkel@anl.gov, llvm-bugs@lists.llvm.org
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

In llvm, there's plenty of hot code that looks like this:

  if (pred_size(BB) != 2)
    return false; // From InstCombiner::mergeStoreIntoSuccessor.

Here, pred_size takes linear time, but there's never a need to visit more than two predecessors. In a stage2 Release build of llc, we do 45 million unnecessary linked list iterations just at this one call site [1].

Is there a way for the optimizer to recognize that no more than two iterations are needed?

Reduced example (https://godbolt.org/z/TOtQD6):

struct linked_list {
    int x = 0;
    linked_list *next = nullptr;
};

int size(linked_list *ll) {
    int s = 0;
    while (ll) {
        ++s;
        ll = ll->next;
    }
    return s;
}

int no_more_than_two_iterations(linked_list *ll) {
    int s = size(ll);
    if (s != 2)
        return 0;
    return 1;
}

Currently we produce:

define dso_local i32 @_Z27no_more_than_two_iterationsP11linked_list(%struct.linked_list* readonly) local_unnamed_addr #0 {
  %2 = icmp eq %struct.linked_list* %0, null
  br i1 %2, label %13, label %3

; <label>:3: ; preds = %1, %3
  %4 = phi i32 [ %6, %3 ], [ 0, %1 ]
  %5 = phi %struct.linked_list* [ %8, %3 ], [ %0, %1 ]
  %6 = add nuw nsw i32 %4, 1
  %7 = getelementptr inbounds %struct.linked_list, %struct.linked_list* %5, i64 0, i32 1
  %8 = load %struct.linked_list*, %struct.linked_list** %7, align 8, !tbaa !2
  %9 = icmp eq %struct.linked_list* %8, null
  br i1 %9, label %10, label %3

; <label>:10: ; preds = %3
  %11 = icmp eq i32 %6, 2
  %12 = zext i1 %11 to i32
  ret i32 %12

; <label>:13: ; preds = %1
  ret i32 0
}

Could we somehow make the loop exit early if the sum ever exceeds 2? Like:

  %9 = icmp eq %struct.linked_list* %8, null
  %enough-seen = icmp ugt i32 %6, 2
  %early-exit = or i1 %9, %enough-seen
  br i1 %early-exit, label %10, label %3

[1] I defined 'wasted iterations' as pred_size(BB) - 2, whenever that difference would be positive. I did the stage2 build with r347179.

Quuxplusone commented 5 years ago

In terms of the LLVM source code itself, we should probably add something analogous to Value::hasNUses().

In terms of optimizing code like this, we might be able to figure it out. If a loop has no side-effects (memory writes/atomics/etc.), and the only value live out of the loop is an induction variable, and the only use of that induction variable is an equality comparison, and the induction variable can't wrap, we can transform the loop to exit early.

Quuxplusone commented 5 years ago

In some ways I'd love to see this just fixed in the optimizers without needing to change LLVM's source (you know what I mean - yes, changing the optimizer is changing the source... but not changing all the call sites that query the length of linked lists). Just from a purist sort of perspective that'd be "nice".

That said, yeah, querying for a specific linked list size (or greater than, or less than, etc) seems like something APIs should support, and I imagine at some point maybe someone'll propose it for std::{forward_}list & having similar APIs for our lists is probably quite reasonable rather than relying on optimization heroics.

(& I don't really have the skills/domain knowledge to really work on fixing this)

Eli/others - do you happen to know any particular optimizations that might be doing any parts of this work that could/should be enhanced to do the rest? Or in what optimization pass (if there is an existing one) this could/should live in generally?

Some kind of loop unrolling perhaps? Well I guess it doesn't have to be unrolled at all, it could involve the comparison being moved into the loop condition. (guess that's an optimization tradeoff though - if a list was always of size 2 or 3, comparing on every iteration might slow things down (the comparison at 1 would always fail, for instance)?)

Quuxplusone commented 5 years ago

Probably makes sense as a separate loop optimization pass; none of our existing loop passes change the trip count, or do any similar analysis of induction variables. We have some existing code in SCEV for figuring out whether a loop has side-effects, and checking the live-out values is trivial in LCSSA form, so it's probably not that hard to write.

Probably the transform itself should move the comparison into the loop, and then we can unroll it afterwards if it makes sense. Even if we always want to unroll, it's simpler to rewrite the latch, as opposed to trying to modify the unroller. I haven't really considered what the heuristics should look like, though.

Quuxplusone / LLVMBugzillaTest

Missed opportunity to reduce number of loop iterations? #38675