cornell-zhang / heterocl

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Heterogeneous Computing
https://cornell-zhang.github.io/heterocl/
Apache License 2.0

[API][Backend] Add Support for Constant Tensors #269

seanlatias closed this pull request 3 years ago

seanlatias commented 4 years ago

In this PR, we develop a new API, hcl.const_tensor, which allows users to declare a constant tensor. The initial values can be given as a Python list or a NumPy array. Example usage is as follows.

import heterocl as hcl

def kernel():
    A = hcl.const_tensor([[2, 2], [3, 4], [1, 2]], "A")
    return hcl.compute(A.shape, lambda x, y: A[x, y] + 1, "B")

s = hcl.create_schedule([], kernel)
f = hcl.build(s)

Or, with a NumPy array:

import numpy as np

def kernel():
    A = hcl.const_tensor(np.array([[2, 2], [3, 4], [1, 2]]), "A")
    return hcl.compute(A.shape, lambda x, y: A[x, y] + 1, "B")

s = hcl.create_schedule([], kernel)
f = hcl.build(s)
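For reference, the result of the example above could be checked with the LLVM CPU backend roughly as follows (a sketch continuing the NumPy example; hcl.asarray and the output-only argument list follow the usual HeteroCL flow and are not part of this PR):

# Only the output tensor B is passed at runtime; the values of A are baked in
# as constants (see the generated HLS code below).
hcl_B = hcl.asarray(np.zeros((3, 2)), dtype=hcl.Int(32))
f(hcl_B)
print(hcl_B.asnumpy())  # expected: [[3, 3], [4, 5], [2, 3]]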

Since they are constant tensors, we do not allow users to initialize a constant tensor from another HeteroCL tensor. Moreover, in this PR, we also implement the codegen for HLS C. The above code results in the following HLS code. An extra header file that contains all the constants is generated.

// global_consts.h
const ap_int<32> A[3][2] = {{2, 2}, {3, 4}, {1, 2}};
// top.cpp
#include "global_consts.h"
void default_function(ap_int<32> B[3][2]) {
  for (ap_int<32> x = 0; x < 3; ++x) {
    for (ap_int<32> y = 0; y < 2; ++y) {
      B[x][y] = A[x][y] + 1;
    }
  }
}

Unit Tests: Please refer to tests/test_compute_basic.py. All data types with different shapes are tested.

Known Issues: Slow CPU execution due to the current implementation. I haven't come up with a better solution yet and will file a separate issue to address it. For now, we encourage people to use CSIM if they declare a large constant array (e.g., more than 1000 elements), as sketched below.
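A minimal sketch of the CSIM flow mentioned above (the "vhls_csim" build target and the way the returned module is invoked are assumptions based on the existing HLS backend and may differ; the tensor W is hypothetical):

import numpy as np
import heterocl as hcl

hcl.init()

def kernel():
    # Hypothetical large constant array (4096 elements, i.e., more than 1000),
    # for which the LLVM CPU simulation is currently slow.
    W = hcl.const_tensor(np.ones((64, 64), dtype=int), "W")
    return hcl.compute(W.shape, lambda x, y: W[x, y] + 1, "B")

s = hcl.create_schedule([], kernel)
# Generate HLS C code and run it through C simulation instead of the LLVM JIT.
f = hcl.build(s, target="vhls_csim")

hcl_B = hcl.asarray(np.zeros((64, 64)), dtype=hcl.Int(32))
f(hcl_B)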

zhangzhiru commented 4 years ago

@chhzh123 Can you try it out?

chhzh123 commented 4 years ago

It's weird that using hcl.const_tensor may significantly increase the running time of CPU simulation. Also, some results involving fixed-point numbers seem incorrect, and I'm figuring out why.

seanlatias commented 4 years ago

@chhzh123 for fixed-point numbers, you know you need to specify the data type in the API, right? We would not be able to infer it.
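For instance, something along these lines (a sketch; the dtype keyword on hcl.const_tensor is assumed to mirror the other HeteroCL compute APIs, and the weight values are made up):

import numpy as np
import heterocl as hcl

def kernel():
    # The fixed-point type must be given explicitly; it cannot be inferred from
    # the floating-point NumPy values. hcl.Fixed(16, 12) = 16 bits total, 12 fractional.
    W = hcl.const_tensor(np.array([[0.25, 0.5], [1.5, -0.75]]),
                         "W", dtype=hcl.Fixed(16, 12))
    return hcl.compute(W.shape, lambda x, y: W[x, y] * 2, "B",
                       dtype=hcl.Fixed(16, 12))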

seanlatias commented 4 years ago

@chhzh123 also, do you mean it's much slower compared with hcl.copy?

chhzh123 commented 4 years ago

@chhzh123 for fixed-point numbers, you know you need to specify the data type in the API, right? We would not be able to infer it.

Yes, I have specified the data type in the API, but the result of my BNN is incorrect. I'm checking which layers go wrong.

chhzh123 commented 4 years ago

@chhzh123 also, do you mean it's much slower compared with hcl.copy?

No, I mean hcl.const_tensor is much slower than the previous implementation that directly passes the tensors in the function arguments. I also notice that not only does the LLVM simulation slow down, but HLS synthesis also gets slower (from 10 min to 70 min).

zhangzhiru commented 4 years ago

I mean hcl.const_tensor is much slower than the previous implementation that directly passes the tensors in the function arguments.

It's possible if we are declaring a const array with a huge size. How large is the weight tensor? Do you see a similar slowdown with the small examples?

chhzh123 commented 4 years ago

I mean hcl.const_tensor is much slower than the previous implementation that directly passes the tensors in the function arguments.

It's possible if we are declaring a const array with a huge size. How large is the weight tensor? Do you see a similar slowdown with the small examples?

The largest weight tensor has 4096 fixed-point numbers (batch norm layer). I tested the design for one convolutional layer, which didn't show an observable slowdown.

seanlatias commented 4 years ago

The largest weight tensor has 4096 fixed-point numbers (batch norm layer). I tested the design for one convolutional layer, which didn't show an observable slowdown.

How many constant numbers are there in total?

chhzh123 commented 4 years ago

The largest weight tensor has 4096 fixed-point numbers (batch norm layer). I tested the design for one convolutional layer, which didn't show an observable slowdown.

How many constant numbers are there in total?

About 6k for the small BNN.

seanlatias commented 4 years ago

@chhzh123 I think there are still some bugs with this API in terms of CPU simulation. Please go ahead and use CSIM instead.

chhzh123 commented 4 years ago

I tested several methods this week and found it would be better to declare these large const arrays as global variables, i.e., declaring them before the top function. For my small BNN design, if the weight tensors are declared as local variables, it takes 2 hours to complete HLS. However, if I move the const tensors outside the function, it only takes 4 minutes to finish!

seanlatias commented 3 years ago

The description is updated according to the fixes. Also, a known issue is added for slow CPU execution.

seanlatias commented 3 years ago

@zhangzhiru, please see if the HLS codegen looks good to you. I'm not sure what the best name is for the header file that contains the constant arrays.

zhangzhiru commented 3 years ago

I suppose we can create a header file per constant array using the name of the corresponding tensor?

seanlatias commented 3 years ago

I suppose we can create a header file per constant array using the name of the corresponding tensor?

Yes, we can do that.