cornell-zhang / heterocl

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Heterogeneous Computing
https://cornell-zhang.github.io/heterocl/
Apache License 2.0

[WIP][API] Enhance the Current Support for Reuse Buffers #159

Closed seanlatias closed 4 years ago

seanlatias commented 4 years ago

In this PR, we will implement the following features.

Automatic Bound Inference

The existing reuse_at primitive assumes that there is no user-specified padding. However, padding is common in practice. Thus, we need to correctly infer the bounds of the region whose data will be stored in the reuse buffers. Below we show two examples that should work correctly after this PR.

import heterocl as hcl

# This is without padding
A = hcl.placeholder((10,))
r = hcl.reduce_axis(0, 3)
# Note that we explicitly change the output shape
B = hcl.compute((8, ), lambda x: hcl.sum(A[x+r], axis=r))
s = hcl.create_schedule([A, B])
s.reuse_at(A, s[B], B.axis[0])

# This is with user-specified padding
A = hcl.placeholder((10,))
r = hcl.reduce_axis(0, 3)
# Note that the output shape is the same as the input shape
B = hcl.compute(
    (10,),
    lambda x: hcl.select(x == 0, A[x] + A[x+1],
              hcl.select(x == 9, A[x-1] + A[x], hcl.sum(A[x-1+r], axis=r))))
s = hcl.create_schedule([A, B])
s.reuse_at(A, s[B], B.axis[0])

Users could certainly create another stage that pads tensor A in the above example. However, this could hurt performance, since it introduces unnecessary memory reads and writes.
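For contrast, a minimal sketch of such a padding stage might look as follows (the name A_pad, the padded shape, and the zero padding are illustrative assumptions, not part of this PR); the extra stage performs one additional read and write per pixel:

A = hcl.placeholder((10,))
r = hcl.reduce_axis(0, 3)
# hypothetical padding stage: one zero element on each side of A
A_pad = hcl.compute((12,),
    lambda x: hcl.select(x == 0, 0,
              hcl.select(x == 11, 0, A[x-1])))
# the kernel itself no longer needs boundary selects
B = hcl.compute((10,), lambda x: hcl.sum(A_pad[x+r], axis=r))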

Support Multiple Output Tensors Reusing the Same Input Tensor

The existing reuse_at primitive only supports a single output tensor reusing a single input tensor. However, multiple outputs can reuse the same input. In this case, we should create a single reuse buffer that covers the entire reuse area. Note that since we have multiple outputs, with the current HeteroCL semantics this can only be described using the imperative DSL. Below we show an example where we can create a line buffer and a window buffer that let tensors B and C reuse the input A.

A = hcl.placeholder((10, 10))
B = hcl.placeholder((8, 8))
C = hcl.placeholder((8, 8))
r = hcl.reduce_axis(0, 3)
c = hcl.reduce_axis(0, 3)
def update_func(y, x):
    B[y, x] = hcl.sum(A[y+r, x], axis=r)
    C[y, x] = hcl.sum(A[y, x+c], axis=c)
D = hcl.mutate((8, 8), lambda y, x: update_func(y, x))
s = hcl.create_schedule([A, B, C, D])
LB = s.reuse_at(A, s[D], D.axis[0])
WB = s.reuse_at(LB, s[D], D.axis[1])
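To make the intent concrete, here is a minimal plain-NumPy sketch (our illustration, not HeteroCL-generated code) of what the line buffer LB and the window buffer WB are expected to hold; the 3x10 and 3x3 shapes follow from the two 3-tap reduce axes above:

import numpy as np

A = np.arange(100).reshape(10, 10)
LB = np.zeros((3, 10), dtype=A.dtype)  # line buffer: 3 most recent rows
WB = np.zeros((3, 3), dtype=A.dtype)   # window buffer: 3x3 sliding window

for y in range(10):
    for x in range(10):
        # update the line buffer: shift column x up, load the new pixel
        LB[0:2, x] = LB[1:3, x]
        LB[2, x] = A[y, x]
        # update the window buffer: shift left, load the newest LB column
        WB[:, 0:2] = WB[:, 1:3]
        WB[:, 2] = LB[:, x]
        if y >= 2 and x >= 2:
            i, j = y - 2, x - 2          # output coordinates
            # WB holds the 3x3 neighborhood A[i:i+3, j:j+3], which covers
            # both the vertical sum for B and the horizontal sum for C
            assert (WB == A[i:i+3, j:j+3]).all()
            assert WB[:, 0].sum() == A[i:i+3, j].sum()  # B[i, j]
            assert WB[0, :].sum() == A[i, j:j+3].sum()  # C[i, j]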

Support One Output Tensor Reusing Multiple Input Tensors

This is the inverse of the previous case. Here, we should create separate reuse buffers for the different inputs. In the following example, D has an extra dimension to store the different outputs. We need two separate reuse_at primitives to generate the two reuse buffers.

A = hcl.placeholder((10, 10))
B = hcl.placeholder((10, 10))
r = hcl.reduce_axis(0, 3)
c = hcl.reduce_axis(0, 3)
D = hcl.compute(
    (8, 8, 2),
    lambda y, x, z: hcl.select(z == 0, hcl.sum(A[y+r, x], axis=r),
                                       hcl.sum(B[y, x+c], axis=c)))
s = hcl.create_schedule([A, B, D])
RA = s.reuse_at(A, s[D], D.axis[0])
RB = s.reuse_at(B, s[D], D.axis[1])

Other Notes

We will see whether it is practical to support a mix of the two cases above, namely multiple output tensors reusing multiple input tensors. In general, the better approach is to decompose such a case into the smaller cases described above.

seanlatias commented 4 years ago

Below we describe the algorithm for correctly inferring the bounds of a reuse buffer.

Assume that every algorithm with reuse can be represented as follows.

for x in range(0, N):
  if x in range(a, b): # with reuse pattern
    B[x] = foo(A[x+a1], A[x+a2], ..., A[x+am])
  else: # without reuse pattern
    B[x] = goo()

For the reuse pattern, define a_min = min(a1, ..., am) and a_max = max(a1, ..., am). Now let's start the algorithm.

  1. The input reuse range is [a+a_min, b+a_max].
  2. The reuse distance is reuse_dist = a_max - a_min.
  3. Now we can calculate the offset distance, namely how many pixels we should load into the reuse buffer before we can start the calculation. It can be calculated by reuse_dist - a_min = a_max.
  4. The extended iteration space is the original space plus the offset: [0, N+a_max].
  5. The final iteration space is the union of the input reuse space and the extended iteration space. Since N >= b, the upper bound is always N+a_max. The lower bound should not go below 0; otherwise we would access pixels out of bounds. Thus, the iteration space is [0, N+a_max], and we can create the loop below.
for x in range(0, N+a_max):
  # do something
  6. Now we can insert the reuse logic, which should cover the input reuse space.
for x in range(0, N+a_max):
  if x in range(a+a_min, b+a_max):
    # update reuse buffer => shift + read from input
  7. Next, we can insert the original computation. Note that we can only do this after we have enough values in the reuse buffer.
for x in range(0, N+a_max):
  if x in range(a+a_min, b+a_max):
    # update reuse buffer => shift + read from input
  if x >= a_max: # the offset
    x1 = x - a_max
    # recover back the original logic
    if x1 in range(a, b):
      B[x1] = foo(A[x1+a1], A[x1+a2], ..., A[x1+am])
    else: # without reuse pattern
      B[x1] = goo()
  8. Finally, we should replace the input accesses with reads from the reuse buffer.
for x in range(0, N+a_max):
  if x in range(a+a_min, b+a_max):
    # update reuse buffer => shift + read from input
  if x >= a_max: # the offset
    x1 = x - a_max
    # recover back the original logic
    if x1 in range(a, b):
      B[x1] = foo(R[0], R[1], ..., R[m-1])
    else: # without reuse pattern
      B[x1] = goo()
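To sanity-check the transformation, here is a small runnable sketch (plain Python, not generated code) that instantiates the final loop with the padded 1-D example from the PR description: N = 10, offsets -1, 0, 1, hence a_min = -1, a_max = 1, and reuse region [a, b) = [1, 9). The shift register R and the reference loop are illustrative assumptions, and the bracketed intervals above are read as half-open Python ranges, as the pseudocode's range() calls suggest.

# N = 10, reuse pattern A[x-1+r] with r in [0, 3)
N, a, b = 10, 1, 9
a_min, a_max = -1, 1
A = list(range(100, 100 + N))

# reference: the original (untransformed) logic
ref = [0] * N
for x in range(N):
    if x == 0:
        ref[x] = A[x] + A[x+1]
    elif x == N - 1:
        ref[x] = A[x-1] + A[x]
    else:
        ref[x] = A[x-1] + A[x] + A[x+1]

# transformed loop with a 3-element shift register R as the reuse buffer
R = [0] * (a_max - a_min + 1)
B = [0] * N
for x in range(0, N + a_max):
    if x in range(a + a_min, b + a_max):  # input reuse space: [0, 10)
        R = R[1:] + [A[x]]                # shift + read from input
    if x >= a_max:                        # wait out the offset
        x1 = x - a_max
        if x1 in range(a, b):
            B[x1] = R[0] + R[1] + R[2]    # foo reads the reuse buffer
        elif x1 == 0:                     # goo: boundaries read A directly
            B[x1] = A[x1] + A[x1+1]
        else:
            B[x1] = A[x1-1] + A[x1]
assert B == ref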
hecmay commented 4 years ago

Why does the offset equal reuse_dist - a_min = a_max? I can understand reuse_dist = a_max - a_min. Is there a typo? Otherwise, offset = reuse_dist - a_min = a_max - 2*a_min.