cornell-zhang / hcl-dialect

HeteroCL-MLIR dialect for accelerator design
https://cornell-zhang.github.io/heterocl/index.html
Other
37 stars 15 forks source link

Undefined behavior introduced by `reuse_at` transformation in HeteroCL #185

Open wyanzhao opened 1 year ago

wyanzhao commented 1 year ago

We are currently performing formal verification on HeteroCL, and we have discovered an undefined behavior introduced by the reuse_attransformation. The input code is as follows:

func.func @blur_x(%A: memref<10x10xf32>, %B: memref<8x10xf32>)
{
    %s = hcl.create_op_handle "s"
    %li = hcl.create_loop_handle %s, "i"
    %lj = hcl.create_loop_handle %s, "j"
    affine.for %i = 0 to 8 {
        affine.for %j = 0 to 10 {
            %tmp = affine.load %A[%i, %j] : memref<10x10xf32>
            %tmp1 = affine.load %A[%i+1, %j] : memref<10x10xf32>
            %tmp2 = affine.load %A[%i+2, %j] : memref<10x10xf32>
            %sum = arith.addf %tmp, %tmp1: f32
            %sum1 = arith.addf %sum, %tmp2: f32
            affine.store %sum1, %B[%i, %j] : memref<8x10xf32>
        } { loop_name = "j" }
    } { loop_name = "i", op_name = "s" }
    %buf = hcl.reuse_at(%A: memref<10x10xf32>, %li) -> memref<3x10xf32>
    return
}

After running hcl-opt -opt, the output code is:

  func.func @blur_x(%arg0: memref<10x10xf32>, %arg1: memref<8x10xf32>) {
    %0 = memref.alloc() {name = "s_reuse_0"} : memref<3x10xf32>
    affine.for %arg2 = 0 to 10 {
      affine.for %arg3 = 0 to 10 {
        %1 = affine.load %0[1, %arg3] : memref<3x10xf32>
        affine.store %1, %0[0, %arg3] : memref<3x10xf32>
        %2 = affine.load %0[2, %arg3] : memref<3x10xf32>
        affine.store %2, %0[1, %arg3] : memref<3x10xf32>
        %3 = affine.load %arg0[%arg2, %arg3] : memref<10x10xf32>
        affine.store %3, %0[2, %arg3] : memref<3x10xf32>
      } {spatial}
      affine.if #set0(%arg2) {
        affine.for %arg3 = 0 to 10 {
          %1 = affine.load %0[0, %arg3] : memref<3x10xf32>
          %2 = affine.load %0[1, %arg3] : memref<3x10xf32>
          %3 = affine.load %0[2, %arg3] : memref<3x10xf32>
          %4 = arith.addf %1, %2 : f32
          %5 = arith.addf %4, %3 : f32
          affine.store %5, %arg1[%arg2 - 2, %arg3] : memref<8x10xf32>
        } {loop_name = "j"}
      }
    } {loop_name = "i", op_name = "s"}
    return
  }

The issue occurs when HeteroCL allocates a new memory block (%0 = memref.alloc() {name = "s_reuse_0"} : memref<3x10xf32>) and reads from it without initializing it first (%1 = affine.load %0[1, %arg3] : memref<3x10xf32>.

In MLIR, reading uninitialized memory is considered undefined behavior. The input code only has undefined behavior of reading from A and B when they are uninitialized, which implies that the reuse_at transformation introduces new undefined behavior does not exist in the input code.

chhzh123 commented 1 year ago

Thanks for providing the issue. In this piece of code, those undefined values will not be used in actual computation, so if you plug in some example inputs, the output should be the same as the original program. Indeed, it would be better to initialize those arrays, but we have not figured out a clean way that will not introduce hardware overheads. Explicitly writing for loops to initialize increases latency.

wyanzhao commented 1 year ago

Thank you for your response. According to some research, writing an uninitialized value back to the same location can lead to undefined behavior on some specific backends, such as C-based compilers, as mentioned in this article: https://queue.acm.org/detail.cfm?id=3041020, please check this example in the article:

void f(void) {
  unsigned char x[1]; /* intentionally uninitialized */
  x[0] ^= x[0];
  printf("%d\n", x[0]);
  printf("%d\n", x[0]);
  return;
}
chhzh123 commented 1 year ago

Right, this is problematic. We need to discuss what is the best way to initialize the array. It seems MLIR does not provide shorthands like int a[10] = {0} to initialize all the elements to zero.

wyanzhao commented 1 year ago

It appears that using linalg.fill could be an effective solution for addressing the uninitialized memory issue in HeteroCL. Xilinx adopts this approach to initialize memref in their MLIR-AIR code, as demonstrated in the following documentation: https://xilinx.github.io/mlir-air/AIRTransformPasses.html. Here is an example from the documentation that illustrates the usage of linalg.fill to initialize a memref:

#map = affine_map<()[s0] -> (s0 * 32)>
module attributes {torch.debug_module_name = "mmult"} {
  func.func @forward(%arg0: memref<1024x1024xf32>, %arg1: memref<1024x1024xf32>, %arg2: memref<1024x1024xf32>) {
    %c1024 = arith.constant 1024 : index
    %cst = arith.constant 0.000000e+00 : f32
    %c32 = arith.constant 32 : index
    %c128 = arith.constant 128 : index
    %c0 = arith.constant 0 : index
    %c4 = arith.constant 4 : index
    %0 = memref.alloc() {alignment = 128 : i64} : memref<1024x1024xf32>
    %1 = memref.alloc() {alignment = 128 : i64} : memref<1024x1024xf32>
    linalg.fill ins(%cst : f32) outs(%0 : memref<1024x1024xf32>)
    memref.copy %0, %1 : memref<1024x1024xf32> to memref<1024x1024xf32>
  }
}
zhangzhiru commented 1 year ago

Just a note that in practical hardware design, we do not want to initialize a memory block if it does not affect the functionality. The overhead is too high. So adding a fill operation or an initializer is not the solution here.

@chhzh123 can we avoid reading from the uninitialized array in this specific case?

chhzh123 commented 1 year ago

It also requires some control logic to skip those uninitialized elements, which may also increase hardware complexity.