Closed hecmay closed 3 years ago
@seanlatias @zhangzhiru
Looks good. This is pretty much what we agreed on. To distinguish this from the current unrolling support, maybe we should use another primitive, say parallel(), to indicate the explicit duplication?
Another (perhaps cleaner) solution is to look at the left-hand side of the statement when we call this primitive. If we return a list of named objects, we explicitly duplicate the loop body.
I will try to add a parallel() primitive first to avoid messing up anything in the original unroll() primitive. We can switch to the second solution later.
Since we need to create some new stages in the schedule, we may need to do something like s.parallel(stage, axis=1) (i.e. an IR transformation at the schedule level) instead of s[stage].parallel(axis=1) (i.e. an IR transformation inside the stage).
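To illustrate why the schedule-level form is the natural fit, here is a minimal toy model in plain Python. It is not the HeteroCL implementation: the `Stage` and `Schedule` classes, their fields, and the `_pe<i>` naming are all hypothetical. It only shows that a stage-level call can annotate the stage it is called on, whereas replacing one stage with several new stages requires access to the schedule's stage list.

```python
class Stage:
    """Hypothetical stand-in for a schedule stage (not the HeteroCL class)."""

    def __init__(self, name, extents):
        self.name = name
        self.extents = list(extents)   # loop trip count per axis
        self.annotations = {}          # axis -> annotation string

    def unroll(self, axis):
        # Stage-level transformation: only this stage is touched.
        self.annotations[axis] = "unroll"
        return self


class Schedule:
    """Hypothetical stand-in for a schedule that owns an ordered stage list."""

    def __init__(self, stages):
        self.stages = list(stages)

    def parallel(self, stage, axis):
        # Schedule-level transformation: replace `stage` with one new
        # stage per iteration of `axis`, and return the new stages.
        idx = self.stages.index(stage)
        trip_count = stage.extents[axis]
        remaining = [e for a, e in enumerate(stage.extents) if a != axis]
        pes = [Stage(f"{stage.name}_pe{i}", remaining) for i in range(trip_count)]
        self.stages[idx:idx + 1] = pes
        return pes


load = Stage("load", [8])
conv = Stage("conv", [4, 3])
store = Stage("store", [8])
s = Schedule([load, conv, store])

pes = s.parallel(conv, axis=1)          # creates three new stages
stage_names = [st.name for st in s.stages]
print(stage_names)
```

In this model `Stage.unroll` has no way to insert siblings into the schedule, which is the reason for preferring `s.parallel(stage, axis=1)` when new stages must be created.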
Support added already. Test cases: https://github.com/cornell-zhang/heterocl/blob/heteroflow/tests/test_schedule_systolic.py
Aside from unrolling a loop implicitly (i.e., by adding #pragma unroll and letting the EDA tools unroll the loop), we also want to unroll a loop into multiple PEs explicitly. This allows users to generate multiple PEs for a single stage and connect the PEs in different ways to build custom dataflow accelerators.
An example of a 1D convolution kernel:
Each PE returned by the unroll() primitive will correspond to a different (non-inlined) kernel function call. The HCL compiler should create separate kernel definitions and function calls for each PE. For the 1D convolution example above, assume the loop trip count is 3. In this case, we will generate three separate functions (i.e. pe0, pe1, pe2) and call them in a dataflow region, so that they can run in parallel:
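As a rough illustration of the idea (not the code HeteroCL generates, and not its API), the sketch below models explicit unrolling of a 3-tap 1D convolution in plain Python. The helper names `make_pe`, `conv1d_explicit`, and `conv1d_reference`, as well as the `pe0`/`pe1`/`pe2` function names, are assumptions made for this example. Each tap of the reduction loop becomes its own function, and the functions are chained so that partial sums flow from one PE to the next. Python executes them one after another; in hardware the separate functions would sit in a dataflow region and could run concurrently.

```python
def conv1d_reference(x, w):
    """Plain 1D convolution (valid mode, no kernel flip), used as a check."""
    n_out = len(x) - len(w) + 1
    return [sum(x[i + k] * w[k] for k in range(len(w))) for i in range(n_out)]


def make_pe(k, weight):
    """Build one PE: a separate function handling tap k of the kernel."""
    def pe(x, partial_in):
        # Add this tap's contribution to every incoming partial sum.
        return [p + x[i + k] * weight for i, p in enumerate(partial_in)]
    pe.__name__ = f"pe{k}"
    return pe


def conv1d_explicit(x, w):
    """Unroll the tap loop explicitly: one PE function per iteration."""
    n_out = len(x) - len(w) + 1
    pes = [make_pe(k, wk) for k, wk in enumerate(w)]
    partial = [0] * n_out
    for pe in pes:               # chain the PEs; partial sums flow through
        partial = pe(x, partial)
    return partial, [pe.__name__ for pe in pes]


x = [1, 2, 3, 4, 5]
w = [1, 0, -1]                   # trip count of the unrolled loop is 3
result, pe_names = conv1d_explicit(x, w)
print(pe_names, result)
```

The chained connection here is only one possible topology; the point of returning the PEs as separate objects is that they can be wired differently.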