Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Re-evaluate loop unrolling for size #32100

Open Quuxplusone opened 7 years ago

Quuxplusone commented 7 years ago
Bugzilla Link PR33128
Status NEW
Importance P enhancement
Reported by Davide Italiano (ditaliano@apple.com)
Reported on 2017-05-22 10:47:24 -0700
Last modified on 2017-05-22 16:15:50 -0700
Version trunk
Hardware PC Windows NT
CC davidxl@google.com, eraman@google.com, filcab@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, simon.f.whittaker@gmail.com, tejohnson@google.com, xinliangli@gmail.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

Currently, LLVM doesn't run -loop-unroll at all at -Os. GCC, OTOH, does it, and this allows them to reduce the size of the generated code, e.g. when unrolling loops with small (constant) trip counts:

https://godbolt.org/g/PpF4DG

extern int f(int);

int tinkywinky(void) { int x; for (int i = 0; i < 2; ++i) { x += f(i); } return x; }

[davide@cupiditate unroll]$ gcc -Os unroll.c -c -o unroll-gcc.o [davide@cupiditate unroll]$ ../clang -Os unroll.c -c -o unroll-clang.o [davide@cupiditate unroll]$ size unroll-gcc.o text data bss dec hex filename 80 0 0 80 50 unroll-gcc.o [davide@cupiditate unroll]$ size unroll-clang.o text data bss dec hex filename 86 0 0 86 56 unroll-clang.o

We could also consider having this driven by profile informations to unroll regions that are hot (and never touch regions that are cold). This has to make conservative decisions as loop unrolling is an optimization that generally increases the size of the final executable.

CC:ing some GCC/PGO experts for thoughts.

Quuxplusone commented 7 years ago

Also, note, GCC seems to ignore attribute((cold)) in this case [or at least can prove that the size win in every case, here].

Quuxplusone commented 7 years ago

Complete unroll may result in code size reduction, so GCC allows it. By default, complete unrolling is not allowed to increase size, but with O3 or -funroll-loops etc on, the size increase is allowed.

LLVM can possibly do the same.

Quuxplusone commented 7 years ago
(In reply to Xinliang David Li from comment #2)
> Complete unroll may result in code size reduction, so GCC allows it. By
> default, complete unrolling is not allowed to increase size, but with O3 or
> -funroll-loops etc on, the size increase is allowed.
>
> LLVM can possibly do the same.

I'll try. One possible concern is that you can't very accurately estimate the
size before codegen so that could result in size increase anyway. I guess I'll
run some experiments to see what's the behaviour.

While I have your attention, I've been also thinking (inspired by the inliner)
of trying to be more aggressive (even at -Os) and unroll hot loops [i.e.
increase the threshold], and at the same time being more conservative for cold
loops [i.e. decrease the threshold].

When profile informations are not available, this is like flipping a coin, but
when we have profile data, it's a different story. My argument here is that
anecdotally/in my experience large part of the program is cold, so, that could
result in a size win.

[This is not necessarily -Os specific, a similar change could be made for -Os I
guess].

I don't know if GCC uses a similar heuristic. Does this make sense to you?
Quuxplusone commented 7 years ago

Right. Our experience is PGO is not only a performance win, but a size win in general given that most of the code is cold. GCC sets optimize for size to true for cold functions.

Quuxplusone commented 7 years ago
(In reply to Davide Italiano from comment #3)
> (In reply to Xinliang David Li from comment #2)
> > Complete unroll may result in code size reduction, so GCC allows it. By
> > default, complete unrolling is not allowed to increase size, but with O3 or
> > -funroll-loops etc on, the size increase is allowed.
> >
> > LLVM can possibly do the same.
>
> I'll try. One possible concern is that you can't very accurately estimate
> the size before codegen so that could result in size increase anyway. I
> guess I'll run some experiments to see what's the behaviour.
>
> While I have your attention, I've been also thinking (inspired by the
> inliner) of trying to be more aggressive (even at -Os) and unroll hot loops
> [i.e. increase the threshold], and at the same time being more conservative
> for cold loops [i.e. decrease the threshold].
>
> When profile informations are not available, this is like flipping a coin,
> but when we have profile data, it's a different story. My argument here is
> that anecdotally/in my experience large part of the program is cold, so,
> that could result in a size win.
>
> [This is not necessarily -Os specific, a similar change could be made for
> -Os I guess].
>

[What I mean is: "this is not necessarily -Os specific, a similar change could
be made for -O3"]

> I don't know if GCC uses a similar heuristic. Does this make sense to you?
Quuxplusone commented 7 years ago

Actually, LLVM used to unroll at -Os but now it doesn't anymore. https://reviews.llvm.org/D23388

I'm under the impression this needs to be evaluated again.

Also, here's an example showing that we're completely ignoring profile informations:

$ cat unroll.c

include

include

include

extern uint64_t ext(uint64_t arg);

attribute((noinline)) uint64_t f1(uint64_t x) { for (unsigned i = 0; i < 5; ++i) x /= ext(x); return x; }

attribute((noinline)) uint64_t f2(uint64_t y) { for (unsigned i = 0; i < 5; ++i) y *= ext(y); return y; }

int main(int argc, char **argv) { uint64_t res = 0;

if (argc != 3) { printf("usage: test iter1 iter2\n"); return -1; }

uint64_t x = atoi(argv[1]); uint64_t y = atoi(argv[2]);

for (unsigned i = 0; i < x; ++i) { res += f1(x); }

for (unsigned i = 0; i < y; ++i) { res += f2(y); }

printf("%lu\n", res); return 0; }

$ cat helper.c

include

uint64_t ext(uint64_t arg) { return arg; }

Ran with ./unroll 10000000 1 Results in the following counters:

$ ~/work/llvm/build-release/bin/llvm-profdata show -all-functions -detailed-summary -counts patatino.profdata Counters: main: Hash: 0x000000000000a104 Counters: 4 Function count: 1 Block counts: [0, 100000000, 1] ext: Hash: 0x0000000000000000 Counters: 1 Function count: 1000000010 Block counts: [] f1: Hash: 0x0000000000000004 Counters: 2 Function count: 100000000 Block counts: [1000000000] atoi: Hash: 0x0000000000000000 Counters: 1 Function count: 2 Block counts: [] f2: Hash: 0x0000000000000004 Counters: 2 Function count: 1 Block counts: [10] Functions shown: 5 Total functions: 5 Maximum function count: 1000000010 Maximum internal block count: 1000000000 Detailed summary: Total number of blocks: 10 Total count: 2200000025 2 blocks with count >= 1000000000 account for 80 percentage of the total counts. 2 blocks with count >= 1000000000 account for 90 percentage of the total counts. 4 blocks with count >= 100000000 account for 95 percentage of the total counts. 4 blocks with count >= 100000000 account for 99 percentage of the total counts. 4 blocks with count >= 100000000 account for 99.9 percentage of the total counts. 4 blocks with count >= 100000000 account for 99.99 percentage of the total counts. 4 blocks with count >= 100000000 account for 99.999 percentage of the total counts.

But in the final executable, both loops get unrolled. I think f2 shouldn't have been unrolled.