bespoke-silicon-group / baseline

An Experimental Repository for BSG Bladerunner using CUDA Lite

Adds unroll-loops flag. #34

Closed mrutt92 closed 4 years ago

mrutt92 commented 4 years ago

The behavior of #pragma GCC unroll X depends heavily on this flag being set.

The pass in GCC that responds to this pragma introduces anti-dependencies by reusing the same virtual register in each copy of the loop body. It relies on a later pass, which is not enabled by default, to go through and rename these virtual registers to remove those anti-dependencies. Without that pass, the effect cascades all the way through register allocation and the final instruction scheduling.

While -frename-registers fixes this, we have noticed that it results in slightly worse code than -funroll-loops. This is likely due to other missing passes on which loop unrolling relies.

Example scalar-vector-add C code:

int svadd(int * __restrict O,
          int * __restrict I,
          int b,
          int N)
{
    #pragma GCC unroll 2
    for (int i = 0; i < N; ++i) {
        O[i] = I[i] + b;
    }
    return 0;
}

GCC output assembly without -funroll-loops:

svadd(int*, int*, int, int):
        blez    a3,.L2
        addiw   a3,a3,-1
        slli    a3,a3,32
        srli    a3,a3,30
        addi    a6,a1,4
        add     a3,a3,a6
        sub     a5,a3,a1
        addi    a5,a5,-4
        srli    a5,a5,2
        andi    a5,a5,1
        bnez    a5,.L3
        lw      a4,0(a1)
        addi    a0,a0,4
        mv      a1,a6
        addw    a5,a4,a2
        sw      a5,-4(a0)
        beq     a6,a3,.L2
.L3: 
// the unrolled loop body
        lw      a4,0(a1)
        addi    a1,a1,8
        addi    a0,a0,8
        addw    a5,a4,a2
// notice that there is a WAR dependence here on a4
// this forces the above addw to be scheduled before this load
        lw      a4,-4(a1)
        sw      a5,-8(a0)
        addw    a5,a4,a2
        sw      a5,-4(a0)
        bne     a1,a3,.L3
.L2:
        li      a0,0
        ret

GCC output with -funroll-loops:

svadd(int*, int*, int, int):
        blez    a3,.L2
        addiw   a3,a3,-1
        slli    t0,a3,32
        srli    t1,t0,30
        addi    a4,a1,4
        add     t2,t1,a4
        sub     a5,t2,a1
        addi    a6,a5,-4
        srli    a7,a6,2
        andi    t3,a7,1
        bnez    t3,.L3
        lw      t4,0(a1)
        addi    a0,a0,4
        mv      a1,a4
        addw    t5,t4,a2
        sw      t5,-4(a0)
        beq     a4,t2,.L2
.L3:
// here the WAR dependence has been removed and we can schedule the second load
// before the addw
        lw      t6,0(a1)
        lw      a3,4(a1)
        addi    a1,a1,8
        addw    t0,t6,a2
        addw    t1,a3,a2
        sw      t0,0(a0)
        sw      t1,4(a0)
        addi    a0,a0,8
        bne     a1,t2,.L3
.L2:
        li      a0,0
        ret
drichmond commented 4 years ago

Are there side-effects? Do we want this flag set in every kernel?

drichmond commented 4 years ago

I see that there's value (thank you for providing the code), but reading the flag description suggests that this will unroll EVERY loop with compile-time bounds.

https://gcc.gnu.org/onlinedocs/gcc-10.1.0/gcc/Optimize-Options.html#Optimize-Options

-frename-registers seems like a better global flag even if it doesn't obtain the same performance. Then -funroll-loops can be applied selectively.
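As one way to apply -funroll-loops selectively, GCC supports per-function optimization flags through the optimize function attribute. This is a minimal sketch using the scalar-vector-add example from above; the attribute enables the unroll machinery only for this function while the global flags stay conservative.

```c
#include <assert.h>

/* Enable -funroll-loops (and the passes it implies, including
 * -frename-registers) for this function only. */
__attribute__((optimize("unroll-loops")))
int svadd(int * __restrict O,
          int * __restrict I,
          int b,
          int N)
{
    /* Request a 2x unroll of the loop body. */
    #pragma GCC unroll 2
    for (int i = 0; i < N; ++i) {
        O[i] = I[i] + b;
    }
    return 0;
}
```

Note that the attribute is GCC-specific and its interaction with the command-line flags is per-function, so kernels that don't benefit from unrolling pay no I$ cost.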

mrutt92 commented 4 years ago

Well, actually it will only unroll loops whose iteration count can be determined at compile time or upon entry to the loop.

-funroll-loops
Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop. -funroll-loops implies -frerun-cse-after-loop, -fweb and -frename-registers. It also turns on complete loop peeling (i.e. complete removal of loops with a small constant number of iterations). This option makes code larger, and may or may not make it run faster.

Enabled by -fprofile-use and -fauto-profile.

Still, your point is well taken given that I$ is a precious resource. What about just -frerun-cse-after-loop -fweb -frename-registers?

mrutt92 commented 4 years ago

To be a little more clear here about what value this provides:

Omitting these flags is catastrophic. We're advising users to make use of #pragma GCC unroll X because of the strong benefits of ILP in HammerBlade. But without these flags those benefits are completely unrealized because the unrolled code is effectively serialized.

The only reason we might have seen benefits from unrolling before is that there is a special case, where loops are completely unrolled, in which the compiler does the right thing anyway. But if N is dynamic, the benefits of unrolling are toast without these flags.

drichmond commented 4 years ago

I understand the value, just wanted to make sure we understood the cost, too.

drichmond commented 4 years ago

I like the second solution

mrutt92 commented 4 years ago

It's done.

mrutt92 commented 4 years ago

Sure - I thought I would clarify for the sake of posterity.

drichmond commented 4 years ago

Always appreciated

drichmond commented 4 years ago

Just to confirm - does this still unroll all loops?

mrutt92 commented 4 years ago

No.