Inconsistent numbers of FIFO reads and writes

chhzh123 commented 4 years ago

When streaming data, the number of FIFO reads must equal the number of FIFO writes. We need an analysis for streaming buffers to ensure this holds.

The following example illustrates inconsistent FIFO reads and writes, where C accumulates the bits in each element of B.

import heterocl as hcl
import numpy as np

def test():
    dtype = hcl.UInt(8)
    A = hcl.placeholder((10,), "A", dtype)

    def kernel(A):
        B = hcl.compute(A.shape, 
                lambda i: A[i] + 1, "B", dtype)
        rb = hcl.reduce_axis(0, 8)
        C = hcl.compute(B.shape,
                lambda i: hcl.sum(B[i][rb], axis=rb), "C", dtype)
        return C

    target = hcl.platform.zc706
    target.config(compile="vivado_hls", mode="csim|csyn")
    s = hcl.create_schedule([A], kernel)
    s.to([A], target.xcel)
    s.to(kernel.C, target.host)
    s.to(kernel.B, s[kernel.C])
    f = hcl.build(s, target)
    np_A = np.zeros((10,))
    np_C = np.zeros((10,))
    hcl_A = hcl.asarray(np_A)
    hcl_C = hcl.asarray(np_C)
    f(hcl_A, hcl_C)

Currently, HeteroCL only replaces the original buffers with streaming buffers, without any further code transformation, which causes stage C to read from an empty FIFO.

void test(ap_uint<8> A[10], ap_uint<8> C[10]) {
    ap_uint<8> B[10];
    ap_uint<8> B_pipe_1[10];
    #pragma HLS dataflow
    #pragma HLS stream variable=B_pipe_1 depth=1
    B_i: for (bit32 i = 0; i < 10; ++i) {
      ap_uint<8> B_temp;
      B_temp = ((ap_uint<8>)(((ubit32)A[i]) + 1U));
      B_pipe_1[i] = B_temp;
      B[i] = B_temp;
    }
    ap_uint<8> LB;
    C_i1: for (bit32 i1 = 0; i1 < 10; ++i1) {
      bit32 sum;
      sum = 0;
      C_ra0: for (bit32 ra0 = 0; ra0 < 8; ++ra0) {
        ap_uint<8> B_temp1;
        B_temp1 = B_pipe_1[i1];
        sum = ((bit32)(((ap_int<34>)B_temp1[ra0]) + ((ap_int<34>)sum)));
      }
      C[i1] = ((ap_uint<8>)sum);
    }
  }

In this case, B_pipe_1 should be read outside the inner loop. As generated, stage C reads it 80 times (10 × 8), although stage B writes only 10 elements.
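To make the mismatch concrete, here is a minimal Python model of the two stages. It is only a sketch: the function names are hypothetical, a `collections.deque` stands in for the hardware FIFO, and an empty read raises `IndexError` here, whereas a blocking read in hardware would stall instead.

```python
from collections import deque

def produce(a):
    """Stage B: write one element per input into the FIFO."""
    fifo = deque()
    for x in a:
        fifo.append((x + 1) & 0xFF)  # 8-bit wraparound, like ap_uint<8>
    return fifo

def consume_inner_read(fifo, n):
    """Stage C as generated: one FIFO read per (element, bit) pair, 8 * n in total."""
    c = []
    for _ in range(n):
        total = 0
        for bit in range(8):
            b = fifo.popleft()  # raises IndexError once the FIFO runs dry
            total += (b >> bit) & 1
        c.append(total)
    return c

def consume_hoisted_read(fifo, n):
    """Stage C with the read hoisted: one FIFO read per element, n in total."""
    c = []
    for _ in range(n):
        b = fifo.popleft()  # read once, then reuse b for all 8 bits
        total = sum((b >> bit) & 1 for bit in range(8))
        c.append(total)
    return c
```

With a 10-element input, `consume_inner_read` runs out of data during the second outer iteration, while `consume_hoisted_read` drains the FIFO exactly.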

chhzh123 commented 4 years ago

.reuse_at cannot capture this kind of pattern, so it is not able to move the read of B_pipe_1 out of the inner loop.

chhzh123 commented 4 years ago

The following line specifies which tensor should be reused. HeteroCL accepts it but does not generate the corresponding C code. (The generated C code is identical before and after adding this instruction.)

s.reuse_at(kernel.B, s[kernel.C], kernel.B.axis[0], "LB")

chhzh123 commented 4 years ago

Actually, this problem shows up in C simulation when hls::stream is used. After changing the code to the following,

void test(ap_uint<8> A[10], ap_uint<8> C[10]) {
    hls::stream<ap_uint<8> > B_pipe_1;
    #pragma HLS dataflow
    #pragma HLS stream variable=B_pipe_1 depth=1
    B_i: for (bit32 i = 0; i < 10; ++i) {
      ap_uint<8> B_temp;
      B_temp = ((ap_uint<8>)(((ubit32)A[i]) + 1U));
      B_pipe_1.write(B_temp);
    }
    ap_uint<8> LB;
    C_i1: for (bit32 i1 = 0; i1 < 10; ++i1) {
      bit32 sum;
      sum = 0;
      C_ra0: for (bit32 ra0 = 0; ra0 < 8; ++ra0) {
        ap_uint<8> B_temp1;
        B_temp1 = B_pipe_1.read();
        sum = ((bit32)(((ap_int<34>)B_temp1[ra0]) + ((ap_int<34>)sum)));
      }
      C[i1] = ((ap_uint<8>)sum);
    }
  }

Vivado HLS gives the following warning.

WARNING: Hls::stream 'hls::stream<ap_uint<8>, 0>.1' is read while empty, which may result in RTL simulation hanging.

zhangzhiru commented 4 years ago

@seanlatias any comment on the reuse_at problem?

seanlatias commented 4 years ago

I guess the reason is that the current reuse_at algorithm assumes the sliding window moves. But in this case, the window is stationary (i.e., it is not sliding).

seanlatias commented 4 years ago

Or, to be more specific, I only handle the case where slide=1. This is a case where slide=0.

zhangzhiru commented 4 years ago

Can we first emit a warning when we fail to do data reuse?

chhzh123 commented 4 years ago

Vivado HLS gives the following warning.

WARNING: Hls::stream 'hls::stream<ap_uint<8>, 0>.1' is read while empty, which may result in RTL simulation hanging.

However, Vivado HLS does not report the exact line that causes the warning, which makes the problem very hard to track down in a large design. Thus, HeteroCL needs to count the FIFO reads and writes itself and report an error when the counts do not match.
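One way such a count could work, assuming every FIFO access sits in a loop nest with constant trip counts, is to multiply the trip counts of the enclosing loops on each side. The helper names below are hypothetical and are not part of HeteroCL.

```python
from math import prod

def access_count(trip_counts):
    """Number of times an access executes, given the trip counts of its enclosing loops."""
    return prod(trip_counts)

def check_fifo_balance(name, write_loops, read_loops):
    """Raise ValueError if the static write and read counts of a FIFO differ."""
    writes = access_count(write_loops)
    reads = access_count(read_loops)
    if writes != reads:
        raise ValueError(f"FIFO {name}: {writes} writes but {reads} reads")
    return writes
```

For the example in this issue, the write is nested in one loop of 10 iterations and the read in loops of 10 and 8 iterations, so `check_fifo_balance("B_pipe_1", [10], [10, 8])` raises with the message `FIFO B_pipe_1: 10 writes but 80 reads`.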

zhangzhiru commented 4 years ago

Before we use static analysis, we should first add some support for runtime validation in HeteroCL. I believe we can leverage many Python-specific features to instrument the code and check the legality of the stream/FIFO accesses: they must be continuous and non-repetitive, and the consumption rate must match the production rate.
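A rough sketch of what such a runtime check might look like is given below. The `StreamChecker` class and its hooks are hypothetical, not an existing HeteroCL feature; the sketch assumes instrumented code reports the element index of every access to a streamed buffer.

```python
class StreamChecker:
    """Records the element indices written to and read from one streamed buffer."""

    def __init__(self, name):
        self.name = name
        self.writes = []
        self.reads = []

    def on_write(self, index):
        self.writes.append(index)

    def on_read(self, index):
        self.reads.append(index)

    def validate(self):
        """Return a list of problems; an empty list means the accesses are legal."""
        problems = []
        for kind, seq in (("write", self.writes), ("read", self.reads)):
            # Continuous and non-repetitive: indices must be exactly 0, 1, 2, ...
            if seq != list(range(len(seq))):
                problems.append(f"{kind} indices are not 0, 1, 2, ... in order")
        # Consumption must match production.
        if len(self.writes) != len(self.reads):
            problems.append(f"{len(self.writes)} writes vs {len(self.reads)} reads")
        return problems
```

Replaying the access pattern of stage C from this issue (index `i1` read 8 times per outer iteration) would report both a repeated read index and the 10 versus 80 count mismatch.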

seanlatias commented 4 years ago

> Before we use static analysis, we should first add some support for runtime validation in HeteroCL. I believe we can leverage many Python-specific features to instrument the code and check the legality of the stream/FIFO accesses: they must be continuous and non-repetitive, and the consumption rate must match the production rate.

This is actually what I can do for my course project.

cornell-zhang / heterocl

Inconsistent numbers of FIFO reads and writes #288