cornell-zhang / heterocl

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Heterogeneous Computing
https://cornell-zhang.github.io/heterocl/
Apache License 2.0

Loops with trip count one are not eliminated thoroughly #272

Open chhzh123 opened 4 years ago

chhzh123 commented 4 years ago

Some loops with trip count one are not eliminated by the current simplification logic (linked below) when hcl.compute is given the attrs argument, which is very common in the current hlib implementation. https://github.com/cornell-zhang/heterocl/blob/d3173471e877c32fd9327e882575499c46f10f69/tvm/HalideIR/src/arithmetic/Simplify.cpp#L4795-L4798

This can cause errors when Vivado HLS automatically unrolls these loops, which blurs the stage boundaries inside the dataflow region.
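
To make the behaviour concrete, here is a toy model in plain Python. It is not the HalideIR code linked above: `Stmt`, `Loop`, `substitute`, `simplify`, and the `keep_annotated` flag are hypothetical names, and the sketch assumes the guard simply skips elimination whenever a loop carries attrs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Stmt:
    """A leaf statement kept as text, indexing arrays as [var]."""
    text: str


@dataclass
class Loop:
    var: str
    extent: int
    body: List["Node"]
    attrs: Dict[str, str] = field(default_factory=dict)


Node = Union[Stmt, Loop]


def substitute(node, var, value):
    """Replace index [var] with a constant in every nested statement."""
    if isinstance(node, Stmt):
        return Stmt(node.text.replace(f"[{var}]", f"[{value}]"))
    return Loop(node.var, node.extent,
                [substitute(n, var, value) for n in node.body],
                dict(node.attrs))


def simplify(node, keep_annotated=True):
    """Return a list of nodes with trip-count-one loops removed.

    With keep_annotated=True, a loop that carries attrs is left alone.
    This models the guard discussed in this issue (an assumption).
    """
    if isinstance(node, Stmt):
        return [node]
    body = [s for n in node.body for s in simplify(n, keep_annotated)]
    if node.extent == 1 and not (keep_annotated and node.attrs):
        # The loop runs exactly once, so inline its body at index 0.
        return [substitute(n, node.var, 0) for n in body]
    return [Loop(node.var, node.extent, body, dict(node.attrs))]


# Stage B from the (1, 10) example: outer loop has trip count one.
inner = Loop("j", 10, [Stmt("B[i][j] = A[i][j] + 1")])
stage_b = Loop("i", 1, [inner], attrs={"app": "B"})

kept = simplify(stage_b)                            # guard on: loop i survives
dropped = simplify(stage_b, keep_annotated=False)   # guard off: loop i removed
```

With the guard on, the outer `i` loop stays in the tree. With it off, only the `j` loop remains and the statement indexes row 0 directly.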

seanlatias commented 4 years ago

Can you show your use case? I'm thinking maybe I wrote this for some specific reasons. I guess Simplify is used in many places and sometimes we don't want to lose the attribute information.

seanlatias commented 4 years ago

I'm wondering what kind of HLS error you are facing. A loop with trip count 1 shouldn't cause any error in HLS, should it?

seanlatias commented 4 years ago

I sort of know why I did this. It was because we wanted to connect our flow with PolySA before. Actually, the attributes in hlib might not be that useful now.

chhzh123 commented 4 years ago

> Can you show your use case? I'm thinking maybe I wrote this for some specific reasons. I guess Simplify is used in many places and sometimes we don't want to lose the attribute information.

This case is somewhat tricky, but it does cause an error. See the code below.

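from collections import OrderedDict

import numpy as np
import heterocl as hcl
import heterocl.tvm as tvm  # assumed import path, as used in hlib

def test_dataflow():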
    A = hcl.placeholder((1,10), "A")

    def kernel(A):
        B = hcl.compute(A.shape, 
                lambda i, j: A[i, j] + 1, "B", attrs=OrderedDict([('app',tvm.make.StringImm('B'))]))
        C = hcl.compute(B.shape,
                lambda i, j: B[i, j] + 1, "C", attrs=OrderedDict([('app',tvm.make.StringImm('C'))]))
        D = hcl.compute(C.shape,
                lambda i, j: C[i, j] + 1, "D", attrs=OrderedDict([('app',tvm.make.StringImm('D'))]))
        return D

    target = hcl.platform.zc706
    target.config(compile="vivado_hls", mode="csyn")
    s = hcl.create_schedule([A], kernel)
    s.to([A], target.xcel)
    s.to(kernel.D, target.host)
    s.to(kernel.B, s[kernel.C])
    s.to(kernel.C, s[kernel.D])
    f = hcl.build(s, target)
    np_A = np.zeros((1,10))
    np_D = np.zeros((1,10))
    hcl_A = hcl.asarray(np_A)
    hcl_D = hcl.asarray(np_D)
    f(hcl_A, hcl_D)

Because attrs are attached to each hcl.compute, the loops with trip count 1 are not eliminated, and the generated HLS code looks like this:

void test(bit32 A[1][10], bit32 D[1][10]) {
    bit32 B_pipe_1[1][10];
    #pragma HLS stream variable=B_pipe_1 depth=1
    #pragma HLS dataflow
    B_i: for (bit32 i = 0; i < 1; ++i) {
      B_j: for (bit32 j = 0; j < 10; ++j) {
        bit32 B_temp;
        B_temp = (A[i][j] + 1);
        B_pipe_1[i][j] = B_temp;
      }
    }
    bit32 C_pipe_2[1][10];
    #pragma HLS stream variable=C_pipe_2 depth=2
    C_i1: for (bit32 i1 = 0; i1 < 1; ++i1) {
      C_j1: for (bit32 j1 = 0; j1 < 10; ++j1) {
        bit32 B_temp1;
        B_temp1 = B_pipe_1[i1][j1];
        bit32 C_temp;
        C_temp = (B_temp1 + 1);
        C_pipe_2[i1][j1] = C_temp;
      }
    }
    D_i2: for (bit32 i2 = 0; i2 < 1; ++i2) {
      D_j2: for (bit32 j2 = 0; j2 < 10; ++j2) {
        bit32 C_temp1;
        C_temp1 = C_pipe_2[i2][j2];
        D[i2][j2] = (C_temp1 + 1);
      }
    }
  }

When this code is passed to Vivado HLS, the trip-count-one loops are automatically unrolled. After that, Vivado HLS can no longer distinguish the stages (only one process function, Block_codeRepl8_proc7, is detected here), which causes a synthesis error.

INFO: [XFORM 203-502] Unrolling small iteration loop 'B_i' (kernel.cpp:16) in function 'test' automatically.
INFO: [XFORM 203-502] Unrolling small iteration loop 'C_i1' (kernel.cpp:25) in function 'test' automatically.
INFO: [XFORM 203-502] Unrolling small iteration loop 'D_i2' (kernel.cpp:34) in function 'test' automatically.
INFO: [XFORM 203-501] Unrolling loop 'B_i' (kernel.cpp:16) in function 'test' completely.
INFO: [XFORM 203-501] Unrolling loop 'C_i1' (kernel.cpp:25) in function 'test' completely.
INFO: [XFORM 203-501] Unrolling loop 'D_i2' (kernel.cpp:34) in function 'test' completely.
INFO: [XFORM 203-712] Applying dataflow to function 'test', detected/extracted 1 process function(s): 
         'Block_codeRepl8_proc7'.
ERROR: [XFORM 203-123] Cannot stream  'C_pipe_2.V2': a local variable is streamable only if it is in a dataflow region.
ERROR: [HLS 200-70] Pre-synthesis failed.
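
A quick way to spot this failure mode is to compare the number of process functions reported in the log with the number of stages you expect. The following is a minimal sketch in plain Python; the helper names are hypothetical, and the regular expressions only assume the message wording shown in the log above.

```python
import re


def count_dataflow_processes(log_text):
    """Return the number of process functions reported, or None if absent."""
    match = re.search(r"detected/extracted (\d+) process function\(s\)", log_text)
    return int(match.group(1)) if match else None


def unrolled_loops(log_text):
    """Return the labels of loops the log reports as unrolled completely."""
    return re.findall(r"Unrolling loop '([^']+)' .* completely\.", log_text)


# Excerpt of the log above.
log = """\
INFO: [XFORM 203-501] Unrolling loop 'B_i' (kernel.cpp:16) in function 'test' completely.
INFO: [XFORM 203-501] Unrolling loop 'C_i1' (kernel.cpp:25) in function 'test' completely.
INFO: [XFORM 203-501] Unrolling loop 'D_i2' (kernel.cpp:34) in function 'test' completely.
INFO: [XFORM 203-712] Applying dataflow to function 'test', detected/extracted 1 process function(s):
         'Block_codeRepl8_proc7'.
"""

expected_stages = 3  # B, C and D in the example kernel
found = count_dataflow_processes(log)
loops = unrolled_loops(log)
stages_merged = found is not None and found < expected_stages
```

Here `stages_merged` is true because the three stages collapsed into a single process after the three outer loops were unrolled.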

seanlatias commented 4 years ago

@chhzh123, so if the loops are eliminated, it works? I'm wondering whether the HLS tool unrolls loops not only with trip count = 1 but also with, say, trip count = 2. If so, this looks more like an HLS bug.

chhzh123 commented 4 years ago

> If the loops are eliminated, they can work?

Yes, it works. HLS then detects three functions.

seanlatias commented 4 years ago

I just tried trip count=2 and it works, so I guess only the special case of trip count=1 causes the problem. @zhangzhiru do you think this is an HLS bug? We can definitely remove the loops with trip count=1 ourselves, but that shouldn't be necessary.

chhzh123 commented 4 years ago

Maybe the best way is to generate modules/functions to explicitly distinguish different stages.
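
A rough sketch of that idea, written as a small Python code generator that emits one C++ function per stage plus a dataflow top function. This is not HeteroCL's actual code generator: `emit_stage`, `emit_top`, the `stage_` prefix, and the stream depth of 1 are placeholders, and whether Vivado HLS keeps the stages separate with this layout has not been verified here.

```python
def emit_stage(name, src, dst, rows, cols):
    """Emit one stage as its own C++ function (hypothetical naming scheme)."""
    shape = f"[{rows}][{cols}]"
    return "\n".join([
        f"void stage_{name}(bit32 {src}{shape}, bit32 {dst}{shape}) {{",
        f"  for (bit32 i = 0; i < {rows}; ++i) {{",
        f"    for (bit32 j = 0; j < {cols}; ++j) {{",
        f"      {dst}[i][j] = {src}[i][j] + 1;",
        "    }",
        "  }",
        "}",
    ])


def emit_top(names, rows, cols, top="test", first="A"):
    """Emit a dataflow top function chaining the stages via local buffers."""
    shape = f"[{rows}][{cols}]"
    last = names[-1]
    lines = [f"void {top}(bit32 {first}{shape}, bit32 {last}{shape}) {{",
             "  #pragma HLS dataflow"]
    src = first
    for name in names:
        if name != last:
            # Intermediate buffer between two stages; depth is a placeholder.
            lines.append(f"  bit32 {name}{shape};")
            lines.append(f"  #pragma HLS stream variable={name} depth=1")
        lines.append(f"  stage_{name}({src}, {name});")
        src = name
    lines.append("}")
    return "\n".join(lines)


stage_code = emit_stage("B", "A", "B", 1, 10)
top_code = emit_top(["B", "C", "D"], 1, 10)
```

Each stage is then a separate function call in the dataflow region, so the stage boundary no longer depends on whether a trip-count-one loop survives simplification or unrolling.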

seanlatias commented 4 years ago

@chhzh123 please go ahead and remove that logic in your code for now. I'll need to double-check to see if that logic is indeed needed.