In the following, we describe the algorithm for correctly inferring the bounds of reuse buffers. Assume we can represent all reusable algorithms as follows.
```python
for x in range(0, N):
    if x in range(a, b):  # with reuse pattern
        B[x] = foo(A[x+a1], A[x+a2], ..., A[x+am])
    else:  # without reuse pattern
        B[x] = goo()
```
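For instance, a hypothetical 3-tap filter is one concrete instance of this pattern (the bounds `2` and `N-2` and the zero default are made up for illustration):

```python
for x in range(0, N):
    if x in range(2, N-2):             # with reuse pattern: a = 2, b = N-2
        B[x] = A[x] + A[x+1] + A[x+2]  # access offsets a1, a2, a3 = 0, 1, 2
    else:                              # without reuse pattern
        B[x] = 0
```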
For the reuse pattern, we can find the two values `a_min` and `a_max`, the minimum and maximum of the access offsets `a1, ..., am` (in the 3-tap instance above, `a_min = 0` and `a_max = 2`). Now let's start the algorithm.
1. The reuse pattern reads the input `A` over the index range `[a+a_min, b+a_max]`, so this is the range in which the reuse buffer must be updated.
2. The reuse distance is `reuse_dist = a_max - a_min`.
3. When the loop reads `A[x]` into the buffer, the output that becomes ready is `B[x - a_max]`; in other words, the offset between the loop index and the output index is `offset = reuse_dist + a_min = a_max`.
4. Since `N >= b`, the upper bound of the merged iteration space is always `N+a_max`. The lower bound should not be lower than 0; otherwise, we would be accessing pixels out of bound. Thus, the iteration space is actually `[0, N+a_max]`.
So we can create the loop skeleton below.

```python
for x in range(0, N+a_max):
    # do something
```
Inside it, we first update the reuse buffer whenever the loop index falls in the input-reading range:

```python
for x in range(0, N+a_max):
    if x in range(a+a_min, b+a_max):
        # update reuse buffer => shift + read from input
```
Next, we recover the original computation by subtracting the offset from the loop index:

```python
for x in range(0, N+a_max):
    if x in range(a+a_min, b+a_max):
        # update reuse buffer => shift + read from input
    if x >= a_max:  # the offset
        x1 = x - a_max
        # recover the original logic
        if x1 in range(a, b):
            B[x1] = foo(A[x1+a1], A[x1+a2], ..., A[x1+am])
        else:  # without reuse pattern
            B[x1] = goo()
```
Finally, we replace the input accesses in the reuse pattern with reads from the reuse buffer `R`:

```python
for x in range(0, N+a_max):
    if x in range(a+a_min, b+a_max):
        # update reuse buffer => shift + read from input
    if x >= a_max:  # the offset
        x1 = x - a_max
        # recover the original logic
        if x1 in range(a, b):
            B[x1] = foo(R[0], R[1], ..., R[m-1])
        else:  # without reuse pattern
            B[x1] = goo()
```
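As a sanity check, the following self-contained Python sketch instantiates the transformed loop for the hypothetical 3-tap filter above (`a_min = 0`, `a_max = 2`); the concrete bounds and the zero default are illustrative assumptions, not part of the PR:

```python
N, a, b = 16, 2, 14          # iteration bound and reuse region [a, b)
a_min, a_max = 0, 2          # offsets of the taps A[x], A[x+1], A[x+2]
A = list(range(N))           # arbitrary input data

def foo(p0, p1, p2): return p0 + p1 + p2
def goo(): return 0

# Reference: the original loop.
B_ref = [foo(A[x], A[x+1], A[x+2]) if a <= x < b else goo() for x in range(N)]

# Transformed loop with a shift-register reuse buffer R.
reuse_dist = a_max - a_min            # = 2
offset = reuse_dist + a_min           # = a_max = 2
R = [0] * (reuse_dist + 1)            # one slot per tap
B = [0] * N
for x in range(0, N + a_max):
    if a + a_min <= x < b + a_max:    # update reuse buffer: shift + read input
        R = R[1:] + [A[x]]
    if x >= offset:
        x1 = x - offset               # recover the original index
        if a <= x1 < b:
            B[x1] = foo(R[0], R[1], R[2])
        else:
            B[x1] = goo()

assert B == B_ref                     # both loops produce the same output
```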
Why does the offset equal `reuse_dist + a_min = a_max`? I can understand `reuse_dist = a_max - a_min`, but is the sign a typo? With `offset = reuse_dist - a_min`, the result would be `a_max - 2*a_min`, not `a_max`.
In this PR, we will implement the following features.
## Automatic Bound Inference
With the existing `reuse_at` primitive, we assume that there is no user-specified padding. However, padding is everywhere, so we need to correctly infer the bounds of the region that will be stored in the reuse buffers. Following we show an example of the kind that should work correctly after this PR. The users can definitely create another stage that pads tensor `A` in such an example; however, this could affect performance, since it introduces unnecessary memory reads and writes.
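For instance, a minimal HeteroCL-style sketch of such a padded kernel (the shapes, names, and the `hcl.select`-based zero padding are illustrative assumptions, not the exact example from this PR):

```python
import heterocl as hcl

hcl.init()
A = hcl.placeholder((10,), "A")

def kernel(A):
    # A 3-tap stencil whose window reaches one pixel beyond each border of A.
    # The out-of-bound taps are zero-padded via hcl.select, so the reuse
    # buffer bounds must cover the padded index range [-1, 10] rather than
    # A's declared shape [0, 10).
    return hcl.compute((10,), lambda x:
        hcl.select(x > 0, A[x-1], 0) + A[x] + hcl.select(x < 9, A[x+1], 0),
        "B")

s = hcl.create_schedule([A], kernel)
RB = s.reuse_at(A, s[kernel.B], kernel.B.axis[0])
```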
## Support Multiple Output Tensors Reusing the Same Input Tensor
The existing `reuse_at` primitive only supports one output tensor reusing one input tensor. However, multiple outputs can reuse the same input. In this case, we should create a single reuse buffer that covers the whole reuse area. Note that since we have multiple outputs, with the current HeteroCL semantics, this can only be described using the imperative DSL. Following we show an example, where we can actually create a line buffer and a window buffer for tensors `B` and `C` to reuse the input `A`.
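A sketch of what such an imperative kernel might look like (the 2D shapes and the particular stencils for `B` and `C` are assumptions for illustration):

```python
import heterocl as hcl

hcl.init()
A = hcl.placeholder((10, 10), "A")
B = hcl.placeholder((8, 8), "B")
C = hcl.placeholder((8, 8), "C")

def kernel(A, B, C):
    # Two outputs reuse the same input A: B only reuses pixels within a row,
    # while C reuses pixels across three rows, so a shared line buffer plus
    # a window buffer can serve both.
    with hcl.for_(0, 8) as y:
        with hcl.for_(0, 8) as x:
            B[y][x] = A[y][x] + A[y][x+1] + A[y][x+2]
            C[y][x] = A[y][x] + A[y+1][x+1] + A[y+2][x+2]

s = hcl.create_schedule([A, B, C], kernel)
```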
## Support One Output Tensor Reusing Multiple Input Tensors
This is the inverse case of the previous one. Here we should create separate reuse buffers for the different inputs. In the following example, we have an extra dimension in `C` that stores the different outputs, and we need to use two separate `reuse_at` primitives to generate the two reuse buffers.
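A possible sketch, assuming two 1D inputs, a 3-tap stencil, and the extended `reuse_at` semantics proposed in this PR (all names and shapes are illustrative):

```python
import heterocl as hcl

hcl.init()
A0 = hcl.placeholder((10,), "A0")
A1 = hcl.placeholder((10,), "A1")

def kernel(A0, A1):
    # The extra leading dimension of C selects which input each result
    # comes from: C[0, x] reuses A0 and C[1, x] reuses A1.
    return hcl.compute((2, 8), lambda d, x:
        hcl.select(d == 0,
                   A0[x] + A0[x+1] + A0[x+2],
                   A1[x] + A1[x+1] + A1[x+2]), "C")

s = hcl.create_schedule([A0, A1], kernel)
# One reuse buffer per input, generated by two separate reuse_at calls.
RA0 = s.reuse_at(A0, s[kernel.C], kernel.C.axis[1])
RA1 = s.reuse_at(A1, s[kernel.C], kernel.C.axis[1])
```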
## Other Notes
We will see whether it is practical to also support a mix of the above two cases, namely multiple outputs reusing multiple input tensors. In that situation, it is usually better to decompose the problem into the smaller cases described above.