cornell-zhang / heterocl

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Heterogeneous Computing
https://cornell-zhang.github.io/heterocl/
Apache License 2.0
322 stars 92 forks source link

Support `StreamExpr` in reuse pattern detection #160

Closed hecmay closed 3 years ago

hecmay commented 4 years ago

For now reuse_at can only handle the reuse pattern found in Load expression. However, the Streaming IR pass replaces some potentially reusable Load expressions with StreamExpr expressions, making the generate_reuse_buffer IR pass crash with SegFault.

Example of generated Intel OpenCL code with streaming channel as followed. the Input image is replaced with StreamExpr in this case(i.e., read_channel_intel(c_buf_4)), applying resue_at on this input will result in Seg Fault.

__kernel void grad_weight_y() {
    float g_f[7];
    g_f[0] = 7.550000e-02f;
    g_f[1] = 1.330000e-01f;
    g_f[2] = 1.869000e-01f;
    g_f[3] = 2.903000e-01f;
    g_f[4] = 1.869000e-01f;
    g_f[5] = 1.330000e-01f;
    g_f[6] = 7.550000e-02f;
    for (int32_t y = 0; y < 430; ++y) {
      for (int32_t x = 0; x < 1024; ++x) {
        float reducer2;
        reducer2 = 0.000000e+00f;
        for (int32_t rdx = 0; rdx < 7; ++rdx) {
          reducer2 = ((read_channel_intel(c_buf_4) * g_f[rdx]) + reducer2);
        }
        grad_weight_y_y_filt[((x + (y * 1024)) + 3072)] = reducer2;
      }
    }
    #pragma ii 1
    for (int32_t buf_1 = 0; buf_1 < 1024; ++buf_1) {
      for (int32_t buf_0 = 0; buf_0 < 436; ++buf_0) {
        write_channel_intel(c_buf_5, grad_weight_y_y_filt[(buf_1 + (buf_0 * 1024))]);
      }
    }
zhangzhiru commented 4 years ago

@Hecmay What is read_channel_intel? Note that we should not have target-specific primitive in our IR. This is supposed to be handled by different back-ends.

seanlatias commented 4 years ago

I think he is showing the generated Intel OpenCL code, not the IR. For IR, we only have channel_read and channel_write. The final implementation is back-end dependent.

hecmay commented 4 years ago

read_channel_intel is the API function provided in Intel FPGA SDK for OpenCL to read data from a streaming channel. It corresponds to StreamStmt or StreamExpr IR node in HeteroCL IR. The IR representation is similar to the generated code.

def grad_weight_y(handle64(grad_weight_y.pack[436*1024]), handle64(grad_weight_y.y_filt[436*1024])) {
  // attr [c_buf_5] storage_scope = "local"
  allocate c_buf_5[int32 * 436 * 1024]
  // attr [c_buf_4] storage_scope = "local"
  allocate c_buf_4[int32 * 436 * 1024]
  // attr [g_f] storage_scope = "global"
  allocate g_f[float32 * 7]
  produce g_f {
    // attr [0] extern_scope = 0
    g_f[0] = 0.075500f
    g_f[1] = 0.133000f
    g_f[2] = 0.186900f
    g_f[3] = 0.290300f
    g_f[4] = 0.186900f
    g_f[5] = 0.133000f
    g_f[6] = 0.075500f
  }
  for (y, 0, 430) {
    for (x, 0, 1024) {
      allocate reducer2[float32 * 1]
      reducer2[0] = 0.000000f
      for (rdx, 0, 7) {
        reducer2[0] = ((c_buf_4.read()*g_f[rdx]) + reducer2[0])
      }
      grad_weight_y.y_filt[((x + (y*1024)) + 3072)] = reducer2[0]
    }
  }
  pipelined (buf_1, 0, 1024) {
    for (buf_0, 0, 436) {
      c_buf_5.write(grad_weight_y.y_filt[(buf_1 + (buf_0*1024))])
    }
  }
}
zhangzhiru commented 4 years ago

I see. I was confused by "StreamExpr (i.e., read_channel_intel)". These two are not supposed to be listed in parallel then.

hecmay commented 3 years ago

The issue is closed since it does not exist anymore. In the current implementation, s.to() only inserts some annotations into the IR instead of mutating the IR directly. The lowering function will replace the annotated Load IR nodes with StreamExpr nodes after the reuse buffer is inserted into the IR. In other words, the reuse pattern detection function only needs to handle Load IR nodes, and later passes will replace annotated Load IR nodes with StreamExpr nodes.