cornell-zhang / heterocl

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Heterogeneous Computing
https://cornell-zhang.github.io/heterocl/
Apache License 2.0

Support explicit unroll at certain loop axis #308

Closed hecmay closed 3 years ago

hecmay commented 4 years ago

Aside from unrolling a loop implicitly (i.e. by adding #pragma unroll and letting the EDA tools unroll the loop), we also want to unroll a loop into multiple PEs explicitly. This allows users to generate multiple PEs for a single stage and connect the PEs in different ways to generate custom dataflow accelerators.

An example of 1D convolution kernel:

def kernel(W, X):
    # reduction axis over the K filter taps
    k = hcl.reduce_axis(0, K)
    return hcl.compute((size,), lambda x: hcl.sum(X[x+k]*W[k], axis=k), "Y")

# unroll the inner loop into PEs
pes = s[kernel].unroll(axis=1)
pe0, pe1, pe2 = pes

Each PE returned by the unroll() primitive will correspond to a different (non-inlined) kernel function call. HCL compiler should create separate kernel definitions and function calls for each PE.

For the 1D convolution example above, assume the trip count of the unrolled loop is 3. In this case, we will generate three separate functions (i.e. pe0, pe1, pe2) and call them in a dataflow region so that they can run in parallel:

void pe0() {
    //...
}

void pe1() {
    //...
}

void pe2() {
    //...
}

void top() {
    #pragma dataflow
    pe0();
    pe1();
    pe2();
}
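To make the intended semantics concrete, here is a minimal pure-Python sketch (no HeteroCL required) of what explicitly unrolling the reduction loop could mean for the 1D convolution above. The helper names `make_pe` and `conv1d_reference`, the sizes, and in particular the way partial sums are passed from one PE to the next are assumptions for illustration only; the issue does not specify how the PEs are connected.

```python
# Illustrative only: plain Python, not the HeteroCL API.
K = 3        # reduction trip count (assumed, matching the three-PE example)
SIZE = 4     # number of outputs (assumed)

def conv1d_reference(W, X):
    """Rolled version: a single loop nest with the reduction over k inside."""
    return [sum(X[x + k] * W[k] for k in range(K)) for x in range(SIZE)]

def make_pe(k):
    """Build the PE for reduction index k as its own, separately named function."""
    def pe(W, X, partial):
        # Each PE adds the contribution of its single filter tap
        # to the running partial sums.
        return [partial[x] + X[x + k] * W[k] for x in range(SIZE)]
    pe.__name__ = f"pe{k}"
    return pe

# Explicit unroll: one separate function per iteration of the k loop.
pes = [make_pe(k) for k in range(K)]
pe0, pe1, pe2 = pes

def top(W, X):
    """Chain the PEs (assumed connection pattern); in hardware they would
    be separate modules inside a dataflow region."""
    partial = [0] * SIZE
    for pe in (pe0, pe1, pe2):
        partial = pe(W, X, partial)
    return partial

W = [1, 2, 3]
X = [1, 0, 2, 1, 3, 1]   # length SIZE + K - 1
assert top(W, X) == conv1d_reference(W, X)
```

The chained form is only one possible connection; because each PE is a distinct function, the same three PEs could instead feed an adder tree or be wired in some other custom dataflow.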

hecmay commented 4 years ago

@seanlatias @zhangzhiru

zhangzhiru commented 4 years ago

Looks good. This is pretty much what we agreed on. To distinguish it from the current unrolling support, maybe we should use another primitive, say parallel(), to indicate the explicit duplication?

zhangzhiru commented 4 years ago

Another (perhaps cleaner) solution is to look at the left-hand side of the statement when we call this primitive. If we return a list of named objects, we explicitly duplicate the loop body.

hecmay commented 4 years ago

I will try to add a parallel() primitive first to avoid messing up anything in the original unroll() primitive. We can switch to the second solution later.

Since we need to create some new stages in the schedule, we may need to do something like s.parallel(stage, axis=1) (i.e. IR transformation at the schedule level) instead of s[stage].parallel(axis=1) (i.e. IR transformation inside the stage).
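The difference between the two call styles can be illustrated with a toy model. The sketch below is not HeteroCL's internal representation: the `Stage` and `Schedule` classes, their fields, and the `_pe{i}` naming scheme are all hypothetical. It only shows why a transformation that adds stages is naturally a method on the schedule, which owns the stage list.

```python
# Toy model, not HeteroCL internals: all class and method names are hypothetical.
class Stage:
    def __init__(self, name, trip_counts):
        self.name = name
        self.trip_counts = trip_counts  # trip count of each loop axis


class Schedule:
    def __init__(self, stages):
        self.stages = list(stages)

    def parallel(self, stage, axis):
        """Replace `stage` with one new PE stage per iteration of `axis`."""
        n = stage.trip_counts[axis]
        # Each PE keeps every loop except the one that was duplicated.
        remaining = stage.trip_counts[:axis] + stage.trip_counts[axis + 1:]
        pes = [Stage(f"{stage.name}_pe{i}", remaining) for i in range(n)]
        idx = self.stages.index(stage)
        # The stage list itself changes here, which a transformation
        # confined to a single stage could not express.
        self.stages[idx:idx + 1] = pes
        return pes


Y = Stage("Y", [4, 3])   # outer loop x (4 iterations), reduction loop k (3)
s = Schedule([Y])
pe0, pe1, pe2 = s.parallel(Y, axis=1)
print([st.name for st in s.stages])
```

In this model the original stage disappears from the schedule and three PE stages take its place, so any later primitive would be applied to `pe0`, `pe1`, or `pe2` individually.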

hecmay commented 3 years ago

Support added already. Test cases: https://github.com/cornell-zhang/heterocl/blob/heteroflow/tests/test_schedule_systolic.py