Open wyanzhao opened 1 year ago
Thanks for providing the issue. In this piece of code, those undefined values will not be used in actual computation, so if you plug in some example inputs, the output should be the same as the original program. Indeed, it would be better to initialize those arrays, but we have not figured out a clean way that will not introduce hardware overheads. Explicitly writing for
loops to initialize increases latency.
Thank you for your response. According to some research, writing an uninitialized value back to the same location can lead to undefined behavior on some specific backends, such as C-based compilers, as mentioned in this article: https://queue.acm.org/detail.cfm?id=3041020, please check this example in the article:
void f(void) {
unsigned char x[1]; /* intentionally uninitialized */
x[0] ^= x[0];
printf("%d\n", x[0]);
printf("%d\n", x[0]);
return;
}
Right, this is problematic. We need to discuss what is the best way to initialize the array. It seems MLIR does not provide shorthands like int a[10] = {0}
to initialize all the elements to zero.
It appears that using linalg.fill
could be an effective solution for addressing the uninitialized memory issue in HeteroCL. Xilinx adopts this approach to initialize memref in their MLIR-AIR code, as demonstrated in the following documentation: https://xilinx.github.io/mlir-air/AIRTransformPasses.html. Here is an example from the documentation that illustrates the usage of linalg.fill to initialize a memref:
#map = affine_map<()[s0] -> (s0 * 32)>
module attributes {torch.debug_module_name = "mmult"} {
func.func @forward(%arg0: memref<1024x1024xf32>, %arg1: memref<1024x1024xf32>, %arg2: memref<1024x1024xf32>) {
%c1024 = arith.constant 1024 : index
%cst = arith.constant 0.000000e+00 : f32
%c32 = arith.constant 32 : index
%c128 = arith.constant 128 : index
%c0 = arith.constant 0 : index
%c4 = arith.constant 4 : index
%0 = memref.alloc() {alignment = 128 : i64} : memref<1024x1024xf32>
%1 = memref.alloc() {alignment = 128 : i64} : memref<1024x1024xf32>
linalg.fill ins(%cst : f32) outs(%0 : memref<1024x1024xf32>)
memref.copy %0, %1 : memref<1024x1024xf32> to memref<1024x1024xf32>
}
}
Just a note that in practical hardware design, we do not want to initialize a memory block if it does not affect the functionality. The overhead is too high. So adding a fill operation or an initializer is not the solution here.
@chhzh123 can we avoid reading from the uninitialized array in this specific case?
It also requires some control logic to skip those uninitialized elements, which may also increase hardware complexity.
We are currently performing formal verification on HeteroCL, and we have discovered an undefined behavior introduced by the
reuse_at
transformation. The input code is as follows:After running
hcl-opt -opt
, the output code is:The issue occurs when HeteroCL allocates a new memory block
(%0 = memref.alloc() {name = "s_reuse_0"} : memref<3x10xf32>)
and reads from it without initializing it first(%1 = affine.load %0[1, %arg3] : memref<3x10xf32>
.In MLIR, reading uninitialized memory is considered undefined behavior. The input code only has undefined behavior of reading from A and B when they are uninitialized, which implies that the
reuse_at
transformation introduces new undefined behavior does not exist in the input code.