Quuxplusone / LLVMBugzillaTest


Excessive loop unrolling #41957

Open Quuxplusone opened 5 years ago

Quuxplusone commented 5 years ago
Bugzilla Link PR42987
Status NEW
Importance P enhancement
Reported by agner@agner.org (agner@agner.org)
Reported on 2019-08-13 10:22:36 -0700
Last modified on 2020-01-19 20:54:30 -0800
Version 6.0
Hardware PC All
CC blitzrakete@gmail.com, chandlerc@gmail.com, craig.topper@gmail.com, dgregor@apple.com, erik.pilkington@gmail.com, florian_hahn@apple.com, hfinkel@anl.gov, jdoerfert@anl.gov, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, llvm@meinersbur.de, richard-llvm@metafoo.co.uk, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also PR26532, PR44593
I think we need to re-evaluate the advantages and disadvantages of loop
unrolling.

Clang often unrolls loops excessively, even in cases where unrolling provides
no advantage.

Loop unrolling is advantageous when the loop overhead is costly or when
expressions or branches that depend on the loop counter can be simplified. But
loop unrolling gives no advantage when the bottleneck lies elsewhere.

If a loop contains floating point or vector code, the limiting factor is
likely to be the CPU's floating point/vector unit. The loop overhead is often
reduced to an integer addition and a fused compare-and-branch instruction.
The integer unit has plenty of resources to run the loop overhead
simultaneously with the floating point or vector code at zero extra cost.
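To make this concrete, here is a minimal sketch (not from the original report) of a floating-point loop whose control overhead is exactly the pattern described above:

```cpp
#include <cstddef>

// Minimal sketch: a floating-point loop whose per-iteration overhead is one
// integer increment plus a compare-and-branch, which current x86 CPUs
// typically macro-fuse into a single micro-op. The floating-point additions,
// not the loop control, are the bottleneck, so unrolling this loop buys
// little on its own.
double sum(const double* x, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)  // overhead: inc + fused cmp/branch
        s += x[i];                       // the real (FP-bound) work
    return s;
}
```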

The situation is no better if the instruction decoder is the bottleneck, which
is quite often the case. A tiny loop will fit into the micro-op cache or
loopback buffer of modern CPUs so that the loop will run on decoded
instructions only. A large unrolled loop is unlikely to fit into these buffers,
which means that the unrolled loop is slower.

Even if the unrolled loop is not slower when measured in isolation, it can slow
down other parts of the program because it consumes excessive amounts of code
cache.

A simple example:

const int size = 58;
double a[size], b[size], c[size];

void test () {
    for (int i = 0; i < size; i++) {
        a[i] = b[i] + c[i];
    }
}

clang -O2 -m64 will unroll this loop completely up to size = 59

clang -O3 -m64 will unroll this loop completely up to size = 119

Clang is vectorizing the loop, which is a good thing, but there is no advantage
in unrolling further.
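For readers who want to experiment with this, unrolling can be suppressed per-loop. The sketch below is the same example, assuming clang's loop pragma (`#pragma clang loop unroll(disable)`); other compilers simply ignore the unknown pragma:

```cpp
const int size = 58;
double a[size], b[size], c[size];

void test() {
    // Clang-specific: keep the vectorized loop but suppress unrolling,
    // so the effect of unrolling alone can be measured in isolation.
#pragma clang loop unroll(disable)
    for (int i = 0; i < size; i++) {
        a[i] = b[i] + c[i];
    }
}
```

Alternatively, the whole translation unit can be compiled with `-fno-unroll-loops` to compare code size and performance with and without unrolling.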

Literature: I have described the loopback buffer, micro-op cache, and other
details of different CPUs in the manual "The microarchitecture of Intel, AMD
and VIA CPUs" https://www.agner.org/optimize/microarchitecture.pdf
Quuxplusone commented 5 years ago
We have, indeed, always considered full loop unrolling part of the early
canonicalization process. We do, essentially, limit the partial unrolling
factor based on an estimate of the number of uops and the target-specific size
of the uop cache (the thresholds are now set by the LoopMicroOpBufferSize
variable in the various lib/Target/X86/X86Sched*.td files). Full unrolling,
however, we don't limit in the same way. I believe the rationale was that full
unrolling tends to enable other optimizations, and so we do this limited only
by some heuristic practicality threshold.

It might be interesting to conduct the following experiment. In
include/llvm/CodeGen/BasicTTIImpl.h, in getUnrollingPreferences, where we have
this:

  UP.PartialThreshold = MaxOps;

add:

  UP.Threshold = MaxOps;

and see how that affects things.
Quuxplusone commented 5 years ago
As Hal mentioned, (full) unrolling may enable other optimizations. Examples
include folding an index expression such as i*8+1 into a constant, removing a
constant table lookup, or letting the SLP vectorizer vectorize a stream of
instructions that the LoopVectorizer cannot.
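A hypothetical illustration of the constant-table case: after full unrolling, every index into the table is a compile-time constant, so the loads can fold away entirely:

```cpp
// Hypothetical example: once the loop is fully unrolled, each table[i] has a
// constant index, so the compiler can replace the whole function body with
// the constant 15 -- an enabling effect that partial unrolling or
// vectorization alone would not expose.
static const int table[4] = {1, 2, 4, 8};

int sum_table() {
    int s = 0;
    for (int i = 0; i < 4; i++)
        s += table[i];  // constant load after full unroll
    return s;
}
```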

However, this requires different heuristics than the ones we currently have,
which estimate the code size in instructions. The inliner heuristic, for
example, also takes into account parameters that become constant.