Closed mrutt92 closed 4 years ago
Are there side-effects? Do we want this flag set in every kernel?
I see that there's value (thank you for providing the code), but reading the flag description suggests that this will unroll EVERY loop with compile-time bounds.
https://gcc.gnu.org/onlinedocs/gcc-10.1.0/gcc/Optimize-Options.html#Optimize-Options
-frename-registers seems like a better global flag even if it doesn't obtain the same performance. Then -funroll-loops can be applied selectively.
Well actually it will unroll loops where N can be determined at compile time or upon entry to the loop.
-funroll-loops
Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop. -funroll-loops implies -frerun-cse-after-loop, -fweb and -frename-registers. It also turns on complete loop peeling (i.e. complete removal of loops with a small constant number of iterations). This option makes code larger, and may or may not make it run faster.
Enabled by -fprofile-use and -fauto-profile.
Still, your point is well taken given that I$ is a precious resource.
What about just -frerun-cse-after-loop -fweb -frename-registers?
To be a little more clear here about what value this provides:
Omitting these flags is catastrophic. We're advising users to make use of #pragma GCC unroll X because of the strong benefits of ILP in HammerBlade. But without these flags those benefits are completely unrealized because the unrolled code is effectively serialized.
The only reason we might have seen benefits from unrolling before is a special case: when a loop is completely unrolled, the compiler does the right thing anyway. But if N is dynamic, the benefits of unrolling are toast without these flags.
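For readers landing here later, a minimal sketch of the dynamic-N case being discussed. The function name, signature, and unroll factor are hypothetical and not taken from the HammerBlade sources; the only thing assumed from the thread is that the loop carries #pragma GCC unroll and that the trip count is known only at run time.

```c
#include <stddef.h>

/* Hypothetical kernel: c[i] = a[i] + s for n elements.
 * n is only known at run time, so the loop cannot be completely unrolled. */
void scalar_vector_add(float *restrict c, const float *restrict a,
                       float s, size_t n)
{
    /* Ask GCC to replicate the loop body 4 times (factor chosen arbitrarily). */
#pragma GCC unroll 4
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] + s;
    }
}
```

The pragma only requests the unrolling. Whether the copies of the body end up independent enough to overlap depends on the flags discussed in this thread, for example building with -frerun-cse-after-loop -fweb -frename-registers in addition to the usual optimization level.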
I understand the value, just wanted to make sure we understood the cost, too.
I like the second solution
It's done.
Sure - I thought I would clarify for the sake of posterity.
Always appreciated
Just to confirm - does this still unroll all loops?
No.
The behavior of #pragma GCC unroll X depends heavily on this flag being set. The pass in GCC that responds to this pragma introduces anti-dependencies by reusing the same virtual register in each copy of the loop body. It relies on a later pass, which is not enabled by default, to go through and rename these virtual registers to remove those anti-dependencies. Without that pass, the effect cascades all the way through register allocation and the final instruction scheduling.
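A source-level analogy of the anti-dependency pattern described above, as a rough sketch. This is only an illustration: the real problem lives in GCC's internal representation on virtual registers, and a compiler will normally see through the C version below. The function names and the unroll factor of 2 are hypothetical.

```c
#include <stddef.h>

/* Both copies of the body share one temporary. The second write to t
 * cannot be moved above the first read of t (a write-after-read, or
 * anti-dependency), so the two copies are ordered one after the other. */
void add2_reused(float *restrict c, const float *restrict a,
                 float s, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        float t;
        t = a[i];
        c[i] = t + s;
        t = a[i + 1];          /* must wait until the previous use of t is done */
        c[i + 1] = t + s;
    }
    for (; i < n; ++i) {       /* remainder when n is odd */
        c[i] = a[i] + s;
    }
}

/* Same computation after "renaming": each copy has its own temporary,
 * so both loads can be issued before either add. */
void add2_renamed(float *restrict c, const float *restrict a,
                  float s, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        float t0 = a[i];
        float t1 = a[i + 1];
        c[i] = t0 + s;
        c[i + 1] = t1 + s;
    }
    for (; i < n; ++i) {       /* remainder when n is odd */
        c[i] = a[i] + s;
    }
}
```

Both functions compute the same result. The difference is only in how much freedom a scheduler has to overlap the two copies of the body, which is what the renaming pass restores for the unrolled loop.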
While -frename-registers fixes this, we have noticed that it results in slightly worse code than -funroll-loops. This is likely due to other missing passes on which loop unrolling relies.
Example scalar-vector-add C code:
GCC output assembly without -funroll-loops:
GCC output with -funroll-loops: