cornell-zhang / heterocl

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Heterogeneous Computing
https://cornell-zhang.github.io/heterocl/
Apache License 2.0
322 stars 93 forks source link

AOC HLS cannot pipeline the CONV layer with line/window buffer #307

Open hecmay opened 3 years ago

hecmay commented 3 years ago

The HLS report from AOC indicates that the CONV layer in BNN design is not pipelined because of some "Unsolvable exit condition". Here is the HLS report and the reduced test case (i.e. single layer CONV with reuse buffers):

Screen Shot 2020-11-01 at 5 54 41 PM

The HLS program is functionally correct and can achieve II=1 in Vivado HLS, but it seems that AOC does not like this programming style and cannot create a pipeline for it. Maybe we should use an AOC friendly way to implement the reuse buffers in CONV.

zhangzhiru commented 3 years ago

Do they provide an example on how to implement the line buffer?

hecmay commented 3 years ago

@zhangzhiru I am checking the AOC examples under $INTELFPGAOCLSDKROOT/19.4/hld/examples_aoc folder.

zhangzhiru commented 3 years ago

Let's ask Bain if we need help.

paldebjit commented 3 years ago

I think an example is available in the video_downscaling kernel around L248. There they have the window that is capturing the pixel value from a read channel that will be resued. @Hecmay I am not sure completely but possibly that's a good example.

hecmay commented 3 years ago

Thanks @paldebjit. In that example, they have balanced statements under if and the else branch, I guess in our case, we cannot easily create such a pattern for reuse buffer.

I am wondering if this workaround would work -- can we move the if (xx_reuse>=2) and if(yy_reuse>=2) into the innermost loop body? Namely, we do some CONV2d calculation on some garbage data for the first few iterations, and the innermost loop discards the result computed from the garbage input until some meaningful pixels come in.

I checked a few AOC examples, and found this programming style is more preferable by AOC. I tried to use this workaround and get II>=1 for the yy_reuse and xx_resue loops. @seanlatias @paldebjit

zhangzhiru commented 3 years ago

II>=1 means the problem is still not resolved?

hecmay commented 3 years ago

@zhangzhiru. In AOC's HLS report, the status of the loop is pipelined, and the II is ">=1" (for example, you can see the loop in the 7th row in the snapshot, it has II ">=1"). In my understanding, that means II should equal to 1 if there is not any other runtime conflicts happening.

In our optimized optical flow design for Intel, we also saw many loops with II >= 1 or II ~= 1 in AOC's HLS report. but that should not be a problem, I suppose.

hecmay commented 3 years ago

Here is more detailed example here. After applying line & window buffer optimization in a 2D CONV kernel, we will get the following HLS code from HCL. In the code snippet, we will have 2 condition checking (right under the comments) -- the computation will only happen when the coordinate has x-index and y-index both greater than a threshold.

__attribute__((max_global_work_dim(0)))
__attribute__((autorun))
__kernel void ar20() {
    uint16_t w_layer1_1_conv2[144];
    uint16_t layer1_1_conv2_LB[102];
    uint16_t layer1_1_conv2_WB[9];
    for (int32_t nn14 = 0; nn14 < 1; ++nn14) {
      for (int32_t yy_reuse4 = 0; yy_reuse4 < 34; ++yy_reuse4) {
        for (int32_t xx_reuse4 = 0; xx_reuse4 < 34; ++xx_reuse4) {
          for (int32_t layer1_1_conv2_pad_1 = 0; layer1_1_conv2_pad_1 < 2; ++layer1_1_conv2_pad_1) {
            layer1_1_conv2_LB[(xx_reuse4 + (layer1_1_conv2_pad_1 * 34))] = layer1_1_conv2_LB[((xx_reuse4 + (layer1_1_conv2_pad_1 * 34)) + 34)];
          }
          uint16_t layer1_1_conv2_pad_temp1;
          layer1_1_conv2_pad_temp1 = read_channel_intel(layer1_1_conv2_pad_pipe_23);
          layer1_1_conv2_LB[(xx_reuse4 + 68)] = layer1_1_conv2_pad_temp1;

          # Condition 1: checking the filter y-index
          if (2 <= yy_reuse4) {
            for (int32_t layer1_1_conv2_LB_1 = 0; layer1_1_conv2_LB_1 < 3; ++layer1_1_conv2_LB_1) {
              for (int32_t layer1_1_conv2_LB_0 = 0; layer1_1_conv2_LB_0 < 2; ++layer1_1_conv2_LB_0) {
                layer1_1_conv2_WB[(layer1_1_conv2_LB_0 + (layer1_1_conv2_LB_1 * 3))] = layer1_1_conv2_WB[((layer1_1_conv2_LB_0 + (layer1_1_conv2_LB_1 * 3)) + 1)];
              }
              layer1_1_conv2_WB[((layer1_1_conv2_LB_1 * 3) + 2)] = layer1_1_conv2_LB[(xx_reuse4 + (layer1_1_conv2_LB_1 * 34))];
            }
            for (int32_t ff4 = 0; ff4 < 16; ++ff4) {

              # Condition 2: checking the filter x-index
              if (2 <= xx_reuse4) {
                int8_t layer1_1_conv2_sum;
                for (int32_t layer1_1_conv2_rc = 0; layer1_1_conv2_rc < 1; ++layer1_1_conv2_rc) {
                  for (int32_t layer1_1_conv2_ry = 0; layer1_1_conv2_ry < 3; ++layer1_1_conv2_ry) {
                    for (int32_t layer1_1_conv2_rx = 0; layer1_1_conv2_rx < 3; ++layer1_1_conv2_rx) {
                      for (int32_t layer1_1_conv2_rb = 0; layer1_1_conv2_rb < 16; ++layer1_1_conv2_rb) {
                        layer1_1_conv2_sum = ((int8_t)(((int32_t)(((layer1_1_conv2_WB[(layer1_1_conv2_rx + (layer1_1_conv2_ry * 3))] ^ w_layer1_1_conv2[((layer1_1_conv2_rx + (layer1_1_conv2_ry * 3)) + (ff4 * 9))]) & (1L << layer1_1_conv2_rb)) >> layer1_1_conv2_rb)) + ((int32_t)layer1_1_conv2_sum)));
                      }
                    }
                  }
                }
                int8_t layer1_1_conv2_temp;
                layer1_1_conv2_temp = ((int8_t)(144 - ((int32_t)(layer1_1_conv2_sum << 1))));
                write_channel_intel(layer1_1_conv2_pipe_24, layer1_1_conv2_temp);
              }
            }
          }
        }
      }
    }
}

Intel AOC HLS tool does not like this programming style. I carefully checked their example code. Most of their examples use balanced statements under if and else branch. Or when there is only a if brach, the logic inside is very simple (e.g. just a few read or write operations).

I changed our reuse buffer a bit to adapt to this programming style -- I moved the if (xx_reuse>=2) and if(yy_reuse>=2) into the innermost loop body. Namely, we do some CONV2d calculation on some garbage data for the first few iterations, and the innermost loop discards the result computed from the garbage input until some meaningful pixels come in. This approach will introduce some meaningless computation, but it makes the if condition more simplified.

Here is the code after modification:

__attribute__((max_global_work_dim(0)))
__attribute__((autorun))
__kernel void ar2() {
    uint16_t w_layer1_0_conv1[144]; 
    uint16_t layer1_0_conv1_LB[102];
    uint16_t layer1_0_conv1_WB[9];
    for (int32_t nn2 = 0; nn2 < 1; ++nn2) {
      for (int32_t yy_reuse1 = 0; yy_reuse1 < 34; ++yy_reuse1) {
        for (int32_t xx_reuse1 = 0; xx_reuse1 < 34; ++xx_reuse1) {
          for (int32_t layer1_0_conv1_pad_1 = 0; layer1_0_conv1_pad_1 < 2; ++layer1_0_conv1_pad_1) {
            layer1_0_conv1_LB[(xx_reuse1 + (layer1_0_conv1_pad_1 * 34))] = layer1_0_conv1_LB[((xx_reuse1 + (layer1_0_conv1_pad_1 * 34)) + 34)];
          }
          uint16_t layer1_0_conv1_pad_temp1;
          layer1_0_conv1_pad_temp1 = read_channel_intel(layer1_0_conv1_pad_pipe_5);
          layer1_0_conv1_LB[(xx_reuse1 + 68)] = layer1_0_conv1_pad_temp1;

            for (int32_t layer1_0_conv1_LB_1 = 0; layer1_0_conv1_LB_1 < 3; ++layer1_0_conv1_LB_1) {
              for (int32_t layer1_0_conv1_LB_0 = 0; layer1_0_conv1_LB_0 < 2; ++layer1_0_conv1_LB_0) {
                layer1_0_conv1_WB[(layer1_0_conv1_LB_0 + (layer1_0_conv1_LB_1 * 3))] = layer1_0_conv1_WB[((layer1_0_conv1_LB_0 + (layer1_0_conv1_LB_1 * 3)) + 1)];
              }
              layer1_0_conv1_WB[((layer1_0_conv1_LB_1 * 3) + 2)] = layer1_0_conv1_LB[(xx_reuse1 + (layer1_0_conv1_LB_1 * 34))];
            }
            for (int32_t ff1 = 0; ff1 < 16; ++ff1) {
                int8_t layer1_0_conv1_sum;
                for (int32_t layer1_0_conv1_rc = 0; layer1_0_conv1_rc < 1; ++layer1_0_conv1_rc) {
                  for (int32_t layer1_0_conv1_ry = 0; layer1_0_conv1_ry < 3; ++layer1_0_conv1_ry) {
                    for (int32_t layer1_0_conv1_rx = 0; layer1_0_conv1_rx < 3; ++layer1_0_conv1_rx) {
                      for (int32_t layer1_0_conv1_rb = 0; layer1_0_conv1_rb < 16; ++layer1_0_conv1_rb) {
                        layer1_0_conv1_sum = ((int8_t)(((int32_t)(((layer1_0_conv1_WB[(layer1_0_conv1_rx + (layer1_0_conv1_ry * 3))] ^ w_layer1_0_conv1[((layer1_0_conv1_rx + (layer1_0_conv1_ry * 3)) + (ff1 * 9))]) & (1L << layer1_0_conv1_rb)) >> layer1_0_conv1_rb)) + ((int32_t)layer1_0_conv1_sum)));
                      }
                    }
                  }
                }
                int8_t layer1_0_conv1_temp;
                layer1_0_conv1_temp = ((int8_t)(144 - ((int32_t)(layer1_0_conv1_sum << 1))));

                # if condition is moved here
                if (2 <= yy_reuse1 && (2 <= xx_reuse1) ) {
                  write_channel_intel(layer1_0_conv1_pipe_6, layer1_0_conv1_temp);
                }              
            }
        }
      }
    }
  }