cornell-zhang / heterocl

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Heterogeneous Computing
https://cornell-zhang.github.io/heterocl/
Apache License 2.0

[API][Schedule][Pass] Add reuse_at primitive to HeteroCL #92

Closed seanlatias closed 5 years ago

seanlatias commented 5 years ago

In this PR, we introduce a new primitive called reuse_at, which creates a reuse buffer for a given target tensor along a specified axis. Users can even specify a reuse buffer as the reuse target of another reuse buffer. Following is a simple example.

A = hcl.placeholder((10, 10))
B = hcl.compute((10, 8), lambda y, x: A[y, x] + A[y, x+1] + A[y, x+2])
s = hcl.create_schedule([A, B])
reuse_buffer = s.reuse_at(A, s[B], B.axis[1])
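
Since a reuse buffer can itself be the target of another reuse_at call, a 2-D stencil can chain a line buffer and a window buffer. The snippet below is only a sketch under that assumption; the names LB and WB are illustrative, not part of the API.

# a sketch of chained reuse buffers for a 2-D stencil; LB and WB are illustrative names
A = hcl.placeholder((10, 10))
B = hcl.compute((8, 8), lambda y, x: A[y, x] + A[y+1, x+1] + A[y+2, x+2])
s = hcl.create_schedule([A, B])
LB = s.reuse_at(A, s[B], B.axis[0])   # reuse A along the y-axis (line buffer)
WB = s.reuse_at(LB, s[B], B.axis[1])  # reuse the line buffer along the x-axis (window buffer)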

In this example, we generate a reuse buffer that reuses (i.e., stores) the values of tensor A along the x-axis. Following is an example of the generated IR.

for y in (0, 10)
  for x in (0, 8)
    reuse A
    B[y][x] = A[y, x] + A[y, x+1] + A[y, x+2]

You can see that there is a new IR node Reuse. The backend can decide whether to lower the IR node or keep it unchanged. By default, it is lowered to other nodes. The shape of the reuse buffer is inferred automatically by the compiler; in this case, the shape is [3]. Following is an example of the lowered IR.

allocate reuse_buffer[3]
for y in (0, 10)
  for x.reuse in (0, 10)
    for i in (0, 2)
      reuse_buffer[i] = reuse_buffer[i+1]
    reuse_buffer[2] = A[y][x.reuse]
    if (x.reuse >= 2)
      B[y][x.reuse-2] = reuse_buffer[0] + reuse_buffer[1] + reuse_buffer[2]
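
To make the behavior concrete, here is a plain-Python (NumPy-only) model of this lowered IR; it is just an illustration, not HeteroCL code, and it checks that the transformed loop nest matches the original computation.

import numpy as np

A = np.random.randint(0, 10, (10, 10))
B = np.zeros((10, 8), dtype=A.dtype)

reuse_buffer = np.zeros(3, dtype=A.dtype)
reads_from_A = 0
for y in range(10):
    for x_reuse in range(10):
        # shift the reuse buffer by one and append the newly read pixel
        for i in range(2):
            reuse_buffer[i] = reuse_buffer[i + 1]
        reuse_buffer[2] = A[y][x_reuse]
        reads_from_A += 1  # exactly one read of A per x iteration
        if x_reuse >= 2:
            B[y][x_reuse - 2] = reuse_buffer[0] + reuse_buffer[1] + reuse_buffer[2]

# matches the original stencil, with 100 reads of A instead of 240
assert (B == A[:, 0:8] + A[:, 1:9] + A[:, 2:10]).all()
assert reads_from_A == 10 * 10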

As you can see, the target tensor A is now read only once for each x iteration, which is exactly the reuse behavior we want (note that originally we needed to read three pixels from A for each x iteration). Also, the main computation now reads directly from the reuse buffer. Following are some examples of which computation patterns can be reused and which cannot.

  1. A[y][x] + A[y][x+1] + A[y][x+2], reuse at x: this infers a shape of [3].
  2. A[y][x] + A[y][x+1] + A[y][x+2], reuse at y: there is nothing to reuse because there are no overlapping pixels between two consecutive y iterations.
  3. A[y][x] + A[y+1][x] + A[y+2][x], reuse at y: this infers a shape of [10, 3].
  4. A[y][x] + A[y+1][x+1] + A[y+2][x+2], reuse at x: this infers a shape of [3, 3]. Note that the reuse buffer covers a rectangular area instead of just the three pixels used in the computation.
  5. A[y][x+r], reuse at x, where r is a reduce axis in (0, 3): this infers a shape of [3]. Reduction operations work the same as normal operations.
  6. A[y][2*x+r], reuse at x, where r is a reduce axis in (0, 3): this pattern is not allowed. Currently we do not support patterns with a stride larger than 1.
  7. A[y][x*x+r], reuse at x, where r is a reduce axis in (0, 3): this pattern is not allowed. We do not support operations with a changing reuse area.

For more examples of how to use reuse buffers, please see the tests.
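
As an illustration, pattern 5 above could be written roughly as follows. This is only a sketch; it assumes hcl.reduce_axis and hcl.sum behave as in other HeteroCL examples.

A = hcl.placeholder((10, 10))
r = hcl.reduce_axis(0, 3)
B = hcl.compute((10, 8), lambda y, x: hcl.sum(A[y, x + r], axis=r))
s = hcl.create_schedule([A, B])
reuse_buffer = s.reuse_at(A, s[B], B.axis[1])  # should infer a reuse buffer of shape [3]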

seanlatias commented 5 years ago

Following is the detailed algorithm of how a reuse buffer is generated.

1. How to infer the shape?

To infer the shape, we first collect all Load operations that read from the target. Let me use two examples to illustrate. In both examples, we assume we reuse target A at axis x.

# example 1 - blur x
B[y][x] = A[y][x] + A[y][x+1] + A[y][x+2]
# example 2 - reduction operation where r, c = [0, 3)
B[y][x] += A[y+r][x+c]

Then, we can collect the Load operations.

# example 1
[[y, x], [y, x+1], [y, x+2]]
# example 2
[[y+r, x+c]]

Next, we find the minimum and maximum Load expressions for each dimension by replacing all axes beneath the reuse axis with their bounds.

# example 1
min_expr = [y, x]
max_expr = [y, x+2]
# example 2
min_expr = [y, x]
max_expr = [y+2, x+2]

You can see that in the second example, r and c are replaced with their corresponding min and max values. After that, we can infer the bound by subtracting the minimum expression from the maximum expression and adding 1. If the bound is not constant, an error is reported.

# example 1
diff_expr = [1, 3]
# example 2
diff_expr = [3, 3]

This difference gives the shape of our reuse buffer (dimensions with extent 1 are collapsed, which is why example 1 ends up with shape [3]).
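
Here is a rough sketch of this shape inference using sympy for the symbolic min/max arithmetic; the real pass works on the compiler's IR expressions, so this only illustrates the math.

import sympy as sp

y, x, r, c = sp.symbols("y x r c")

def infer_shape(load_indices, inner_axes):
    # load_indices: one index tuple per Load operation
    # inner_axes: {axis: (lo, hi)} bounds for the axes beneath the reuse axis
    #             (the reduce axes in example 2)
    shape = []
    for d in range(len(load_indices[0])):
        exprs = [idx[d] for idx in load_indices]
        lo = [e.subs({a: b[0] for a, b in inner_axes.items()}) for e in exprs]
        hi = [e.subs({a: b[1] - 1 for a, b in inner_axes.items()}) for e in exprs]
        extent = sp.simplify(sp.Max(*hi) - sp.Min(*lo) + 1)
        if not extent.is_number:  # non-constant bound: report an error
            raise ValueError("non-constant reuse extent in dimension %d" % d)
        shape.append(int(extent))
    return shape

print(infer_shape([(y, x), (y, x + 1), (y, x + 2)], {}))      # example 1: [1, 3]
print(infer_shape([(y + r, x + c)], {r: (0, 3), c: (0, 3)}))  # example 2: [3, 3]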

2. Check whether there really is data reuse. We find the reuse dimension by comparing the areas covered by two consecutive iterations of the specified reuse axis. We use the min_expr from step 1 for comparison.

# example 1
iteration 1: [y, x]
iteration 2: [y, x+1]
=> reuse dimension : 2nd dimension
# example 2
iteration 1: [y, x]
iteration 2: [y, x+1]
=> reuse dimension : 2nd dimension

Then, we compare the min_expr of the next iteration with the max_expr of the current iteration and see if their ranges overlap. Here we only compare the reuse dimension.

# example 1
iteration 1 max_expr : x+2
iteration 2 min_expr : x+1
# example 2
iteration 1 max_expr : x+2
iteration 2 min_expr : x+1

In both cases, we can see that the ranges do overlap, so there is data to reuse.
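
A rough sketch of this overlap check along the reuse dimension, again using sympy purely for illustration:

import sympy as sp

x = sp.symbols("x")

def has_reuse(min_expr, max_expr, reuse_axis):
    # compare the current iteration's max_expr with the next iteration's
    # min_expr in the reuse dimension; a non-negative gap means overlap
    next_min = min_expr.subs(reuse_axis, reuse_axis + 1)
    return sp.simplify(max_expr - next_min).is_nonnegative

# examples 1 and 2, reuse dimension: min_expr = x, max_expr = x + 2
print(has_reuse(x, x + 2, x))  # True -> consecutive x iterations overlap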

seanlatias commented 5 years ago

3. Now we can generate the reuse logic. We can use the following pseudocode to describe the logic.

for axes other than the reuse dim
  for reuse dim i in (0, bound-1)
    # shift reuse buffer
    reuse_buffer[..., i, ...] = reuse_buffer[..., i+1, ...]
  # read from the target
  reuse_buffer[..., bound-1, ...] = target[update_index]

The first thing we need to do here is to figure out the update_index. First, we compute a reuse_index by replacing all loop variables outside of and including the reuse axis in the min_expr with 0.

# example 1 - replace x and y with 0
min_index   : [y, x]
reuse_index : [0, 0]
# example 2 - replace x and y with 0
min_index   : [y+r, x+c]
reuse_index : [r, c]

Then, we create a new loop variable for each dimension whose bound is not 1.

# example 1
new_loop_vars : [reuse.x]
# example 2
new_loop_vars : [reuse.y, reuse.x]

After that, we replace the reuse index in the min index with the new loop variables. Note that if the reuse index is 0 and the dimension is the reuse dimension, we instead add the new loop variable to the min index.

# example 1
min_index: [y, x]
reuse_index: [0, 0]
update_index: [y, x+reuse.x]
# example 2
min_index: [y+r, x+c]
reuse_index: [r, c]
update_index: [y+reuse.y, x+reuse.x]
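
A rough sympy sketch of this update_index derivation; the real pass operates on IR expressions, and reuse_y and reuse_x here just stand in for the new loop variables.

import sympy as sp

y, x, r, c = sp.symbols("y x r c")
reuse_y, reuse_x = sp.symbols("reuse_y reuse_x")

def make_update_index(min_index, outer_axes, shape, new_vars):
    # min_index: per-dimension minimum index expressions
    # outer_axes: loop variables outside of and including the reuse axis
    # shape: inferred reuse-buffer shape; new_vars: one new loop variable per dimension
    update = []
    for d, expr in enumerate(min_index):
        reuse_idx = expr.subs({a: 0 for a in outer_axes})  # the reuse_index
        if shape[d] == 1:
            update.append(expr)                # no new loop variable for this dimension
        elif reuse_idx == 0:
            update.append(expr + new_vars[d])  # reuse dimension: add the new variable
        else:
            update.append(expr.subs(reuse_idx, new_vars[d]))
    return update

print(make_update_index([y, x], [y, x], [1, 3], [reuse_y, reuse_x]))          # example 1: [y, x + reuse_x]
print(make_update_index([y + r, x + c], [y, x], [3, 3], [reuse_y, reuse_x]))  # example 2: [y + reuse_y, x + reuse_x]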

Finally, we can build the whole reuse logic with the new loop variables.

# example 1
for reuse.x in (0, 2)
  reuse_buffer[reuse.x] = reuse_buffer[reuse.x + 1]
reuse_buffer[2] = A[y, x+2]
# example 2
for reuse.y in (0, 3)
  for reuse.x in (0, 2)
    reuse_buffer[reuse.y, reuse.x] = reuse_buffer[reuse.y, reuse.x+1]
  reuse_buffer[reuse.y, 2] = A[y+reuse.y, x+2]

4. Combine the reuse logic with the original computation. This is also a bit tricky. To do so, we need to extend the original reuse axis: for each iteration, we first update the reuse buffer, and then we can perform the computation.

# example 1
original bound of x = (0, 8)
new bound = original bound + reuse bound - 1 = (0, 10)
# example 2
original bound of x = (0, 8)
new bound = original bound + reuse bound - 1 = (0, 10)

After that, we need to insert an IF statement to ensure the correctness of the program. We also need to recover the original index.

# example 1 - original x = new_x - 2
for y in (0, 10)
  for new_x in (0, 10)
    for reuse.x in (0, 2)
      reuse_buffer[reuse.x] = reuse_buffer[reuse.x + 1]
    reuse_buffer[2] = A[y, (new_x-2)+2]
    if (new_x >= 2)
      # perform computation
# example 2 - original x = new_x - 2
for y in (0, 8)
  for new_x in (0, 10)
    for reuse.y in (0, 3)
      for reuse.x in (0, 2)
        reuse_buffer[reuse.y, reuse.x] = reuse_buffer[reuse.y, reuse.x+1]
      reuse_buffer[reuse.y, 2] = A[y+reuse.y, (new_x-2)+2]
    if (new_x >= 2)
      # perform computation

Finally, we can build the main computation block. Basically, we just need to replace the target tensor with the reuse buffer. Also, for the indices, we need to replace all loop variables outside of and including the reuse axis with 0.

# example 1
for y in (0, 10)
  for new_x in (0, 10)
    for reuse.x in (0, 2)
      reuse_buffer[reuse.x] = reuse_buffer[reuse.x + 1]
    reuse_buffer[2] = A[y, (new_x-2)+2]
    if (new_x >= 2)
      B[y, new_x-2] = reuse_buffer[0] + reuse_buffer[1] + reuse_buffer[2]
# example 2
for y in (0, 8)
  for new_x in (0, 10)
    for reuse.y in (0, 3)
      for reuse.x in (0, 2)
        reuse_buffer[reuse.y, reuse.x] = reuse_buffer[reuse.y, reuse.x+1]
      reuse_buffer[reuse.y, 2] = A[y+reuse.y, (new_x-2)+2]
    if (new_x >= 2)
      for r in (0, 3)
        for c in (0, 3)
          B[y, new_x-2] += reuse_buffer[r, c]
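
To close the loop, here is a plain-Python (NumPy-only) model of this final IR for example 2; it is just a sanity check that the transformed loop nest matches the direct reduction.

import numpy as np

A = np.random.randint(0, 10, (10, 10))
B = np.zeros((8, 8), dtype=A.dtype)

reuse_buffer = np.zeros((3, 3), dtype=A.dtype)
for y in range(8):
    for new_x in range(10):
        # reuse logic: shift each row left by one and load one new column of A
        for reuse_y in range(3):
            for reuse_x in range(2):
                reuse_buffer[reuse_y, reuse_x] = reuse_buffer[reuse_y, reuse_x + 1]
            reuse_buffer[reuse_y, 2] = A[y + reuse_y, (new_x - 2) + 2]
        # main computation, reading only from the reuse buffer
        if new_x >= 2:
            for r in range(3):
                for c in range(3):
                    B[y, new_x - 2] += reuse_buffer[r, c]

# reference: B[y][x] = sum over r, c in [0, 3) of A[y+r][x+c]
ref = np.zeros((8, 8), dtype=A.dtype)
for r in range(3):
    for c in range(3):
        ref += A[r:r+8, c:c+8]
assert (B == ref).all()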