Case 1: move tensors to HBM with a splitting factor. The input tensors are split into multiple pieces, and each piece is assigned to a separate CU.

```python
A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm, splitting_factor=3)
```
I don't think it's a good idea to mix compute and memory customizations. Here we should combine .to() with a separate .parallel() primitive to clearly indicate which kernel we are duplicating.
We do not have such a kernel here to apply the `parallel` primitive to. That's why I used this entangled approach as a workaround. All stages between the moved tensors form a kernel, as shown in the example here: if we move tensors `A` and `B` to device and move tensor `ret` back to host, then the combination of all stages in the middle (i.e. the `hcl.compute` stages that compute tensors C, D, and E) is considered the kernel in this program.
```python
import heterocl as hcl

A = hcl.placeholder((10, 32))
B = hcl.placeholder((10, 32))

def kernel(A, B):
    C = hcl.compute((10, 32), lambda *args: A[args] + 1, name="C")
    D = hcl.compute((10, 32), lambda *args: B[args] + 1, name="D")
    E = hcl.compute((10, 32), lambda *args: C[args] * D[args], name="E")
    return E
```
In this example, if we move `A` and `B` to device and move tensor `E` back to host, the generated IR code is as follows (where `kernel(int* A, int* B, int* E)` is the device function, called from the host in the last line). The kernel function is generated by analyzing the device-host boundary (i.e. the tensors moved with the `.to()` API). Each kernel can contain more than one stage (here, 3 stages computing tensors C, D, and E, respectively).
```
def kernel(int* A, int* B, int* E) {
  int C[10 * 32];
  for (i, 0, 10) {
    for (j, 0, 32) {
      C[i*32 + j] = A[i*32 + j] + 1;
    }
  }
  int D[10 * 32];
  for (i, 0, 10) {
    for (j, 0, 32) {
      D[i*32 + j] = B[i*32 + j] + 1;
    }
  }
  for (i, 0, 10) {
    for (j, 0, 32) {
      E[i*32 + j] = C[i*32 + j] * D[i*32 + j];
    }
  }
}

// call the kernel function from the host
kernel(A, B, E)
```
The `parallel` primitive is designed to be used on a single stage. However, in this example we want to duplicate the kernel function, which has 3 stages inside (i.e. the `hcl.compute` stages computing tensors C, D, and E).

The problem is that we can only access these inner stages through the HeteroCL schedule (e.g. using `s[kernel.D]` to access and modify their information); the kernel function itself has no corresponding handle that users can address. In fact, the kernel function does not come into play until the IR pass phase: the IR pass analyzes the boundary based on the data movement information and creates the kernel function from it. Before lowering to IR, the kernel function does not really exist and cannot be seen by users.
Some workarounds:
1. Apply the `parallel` primitive to each stage inside the kernel (i.e. device) function. This requires the user to have the CDFG in mind and always be aware of which stages are in the kernel function.

```python
s.parallel([C, D, E], factor=2)
```
2. As discussed with Sean, we can consider there to be only one stage in the device scope. If there is more than one stage, like in the example provided above, we combine them into a single stage using `compute_at`.

```python
# compute tensors E and D in stage C
s[E].compute_at(C)
s[D].compute_at(C)
s[C].parallel(axis=0, factor=2)
```
3. We can let users specify a kernel name and access the imaginary kernel by that name.

```python
s.to(A, target.xcel, kernel_name="test")
# duplicate the kernel function twice
s.parallel(kernel.test, factor=2)
```
@seanlatias @zhangzhiru
I suppose we have a similar problem when partitioning an on-chip memory? We need to come up with a well-thought-out approach to handle both on-chip and off-chip data partitioning in conjunction with the compute duplication. Can we discuss more in our regular meeting?
> I suppose we have a similar problem when partitioning an on-chip memory? We need to come up with a well-thought-out approach to handle both on-chip and off-chip data partitioning in conjunction with the compute duplication. Can we discuss more in our regular meeting?
The on-chip buffer partitioning is actually not a big problem. I discussed this with Sean, and we both think the second solution is the best way to solve the problem. In #171, I will only add initial support for HBM channel allocation without CU duplication.
Please explain why partitioning the on-chip memory is different. We are supposed to duplicate the compute unit as well to make use of the increased bandwidth.
The partitioning should be all the same across the different cases (on-chip or off-chip, with a single CU or multiple CUs) in terms of abstraction; it is more a matter of implementation details to be taken care of. For example, the code generator should create multiple OpenCL buffers for a partitioned off-chip buffer, while we only need to insert a partition pragma for a partitioned on-chip buffer.
As for the CU duplication part: the only viable solution we have now, as I mentioned in the post above, relies heavily on `compute_at`. However, as I found when trying it this afternoon, `compute_at` is kind of buggy and oftentimes errors out with a SegFault. I guess it might be better to create another PR to fix `compute_at`, instead of having everything in a single PR.
Okay, let's file bugs first if compute_at is not working properly.
Solved in #171
In this proposal we use HBM as an example. The channel or bank allocation for DDR and PLRAM fits well with the same interface proposed here.
The assignment of HBM channels comes along with compute unit (CU) replication. We are supposed to assign a different channel to each argument in each CU duplicate to maximize the bandwidth. Here is the proposed interface:
1) We can specify the kernel number (i.e. how many CUs to duplicate) in the data movement API with the `splitting_factor` option. In this case, multiple CU duplicates are created, and the inputs are split evenly and assigned to different HBM channels (if the total number is greater than 32, some arguments will be assigned to the same HBM channel).

2) We can split the input tensors along a single dimension using the `splitting_dim` option. In this case, we can reshape the input tensors and split them along a certain dimension. In this example, we split the input tensor along the 0-th dimension, and 16 CU duplicates are generated accordingly (see the sketch below).
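For example, case 2 might be written analogously to case 1 above. This is just a rough sketch to illustrate the intent, not a finalized API: the `splitting_dim` option and the reuse of `if=p.xcel.hbm` from case 1 are assumptions here.

```python
# Rough sketch only (mirrors case 1 above; splitting_dim is a proposed option,
# not an existing API). Splitting A and B along dimension 0, with a 0-th
# dimension of size 16, yields 16 slices; each slice would be bound to its own
# HBM channel and feed a separate CU duplicate.
A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm, splitting_dim=0)
```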