cornell-zhang / hcl-dialect

HeteroCL-MLIR dialect for accelerator design
https://cornell-zhang.github.io/heterocl/index.html
Other
38 stars 17 forks source link

Multiple if statements testing same scalar generates fault #148

Closed jcasas00 closed 1 year ago

jcasas00 commented 1 year ago
def test_recexecute():
    hcl.init()
    rshape = (1,)

    def kernel():
        inst_id = hcl.scalar(0, "inst_id", dtype=hcl.UInt(16)).v
        with hcl.if_(inst_id == 0):
            hcl.print((), "match\n")
        with hcl.if_(inst_id == 1):
            hcl.print((), "match\n")
        r = hcl.compute(rshape, lambda _:0, dtype=hcl.UInt(32))
        return r
    #
    s = hcl.create_schedule([], kernel)
    #print(hcl.lower(s))
    hcl_res = hcl.asarray(np.zeros(rshape, dtype=np.uint32), dtype=hcl.UInt(32))
    f = hcl.build(s)

generates the following error:

error: 'affine.load' op operation destroyed but still has uses
LLVM ERROR: operation destroyed but still has uses
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
hcl-dialect-prototype/build/tools/hcl/python_packages/hcl_core/hcl_mlir/_mlir_libs/libHCLMLIRAggregateCAPI.so.15(+0x11be8e4)[0x14d18d68e8e4]
...

A workaround is to NOT lookup the .v value on assignment and instead use the .v where inst_id is used. Intuitively, this change shouldn't matter though. Is there an assumption that each .v result is only used once? If so, the coding style used above where the the .v is dereferenced once but used multiple times will not work.

zzzDavid commented 1 year ago

Same issue as https://github.com/cornell-zhang/hcl-dialect-prototype/issues/71

jcasas00 commented 1 year ago

This issue is turning out to be what's causing the most issues now (and workarounds can be quite invasive). If all references to the .v are within one function, it is easy to fix. The problem is that if it is passed to some other function, then failures would depend on how that function uses it, etc. We're suddenly in a situation where changing some low-level function (e.g., to add a bounds check) will cause a fault because it created additional references to the expression.

jcasas00 commented 1 year ago

The same issue also seems to apply to struct fields.

zzzDavid commented 1 year ago

Thanks! We are discussing about a proper fix for this and many related issues. Hongzheng explained this in one of our conversations, let me paste it here for reference:

I notice most of the bugs of the control flow constructs are actually caused by Python’s language features. For example, #71, #135, #138, #141, #148 are all hcl.if_/hcl.while_ problems. The main complexity comes from the Python evaluation order. Consider hcl.if_(x == 0), the condition x==0 part is evaluated (the IR is already built) before calling the hcl.if_ function. In the current workflow, we need to do the following things:

  1. Traverse back from x==0 (CmpOp) and check whether this expression is affine. If the expression is non-affine, then it is fine. We can simply generate the scf.if statement and plug the result of CmpOp into it.
  2. If the expression is affine, we need to traverse back again to build an affine set expression and remove all the previous operations involving this condition, otherwise the condition will be computed twice. We require this separation between affine and non-affine expression to leverage the affine.if facility.

When traversing back, we may remove some of the useful operations by mistake, since we do not know the scope of the variable before executing the following code. It is not like traditional parser that can read the code from left to right and generate the corresponding IR. Currently we are highly restricted by the host language, so it is a bit counterintuitive when building these control flows with conditions.

zzzDavid commented 1 year ago

Hi @jcasas00, could you try latest commits and see if it works?

jcasas00 commented 1 year ago

It seems to work with the .v usage now.

But it still fails if it is a struct field. For example, if the inst_id in the example above was defined as:

stype = hcl.Struct({"x": hcl.UInt(8), "y": hcl.UInt(8)})
inst_id = hcl.scalar(0x1234, "foo", dtype=stype).v.x

The error in this case is "error: operand #0 does not dominate this use". But the usage pattern that causes the issue is the same (i.e., multiple hcl.if_(i ...) statements).

zzzDavid commented 1 year ago

Ah yes, the scope issue. When a LoadOp is moved to another block I also rebuild one in the original place. I haven't done that for StructGetOp. I'll add this for StructGetOp, GetBitOp, GetSliceOp etc.

jcasas00 commented 1 year ago

The same error (#0 does not dominate this use) in the following scenario (note the "+ 1"):

inst_id = hcl.scalar(0, "inst_id", dtype=hcl.UInt(16)).v + 1

zzzDavid commented 1 year ago

Hi Jeremy, could you pull both repositories and see if it works now? I've added test cases on the CI: https://github.com/cornell-zhang/heterocl/blob/hcl-mlir/tests/mlir/test_scalar.py

The test cases cover

jcasas00 commented 1 year ago

This version seems to work now! Thanks for the quick fixes/updates!!

jcasas00 commented 1 year ago

I'm seeing another failure with the same cause signature (i.e., an expression used multiple times).

  def kernel():
        f = hcl.scalar(0, "f", dtype='uint32')
        t = hcl.scalar(0, "t", dtype='uint32')
        c = f.v == 0
        with hcl.if_(c == 0):
            hcl.print((),"c is false")
            with hcl.while_(c == 0):    # ... hang
                t.v = 0 # some irrelevant expr
        r = hcl.compute(rshape, lambda _:0, dtype=hcl.Int(32))
        return r

In this case, because the hcl.while_ statement is using 'c', the following error is generated:

python: /home/jcasas/dprive/llvm-project/mlir/lib/IR/Block.cpp:231: mlir::Operation* mlir::Block::getTerminator(): Assertion '!empty() && back().mightHaveTrait<OpTrait::IsTerminator>()' failed.

The intent of this code was an assert w/a for now (i.e., print the failure and hang). I could use a different while test (e.g., while((f.v != 0) and get it to work but the system should still not fault with the code above.

zzzDavid commented 1 year ago

This seems like a different issue with the same cause. hcl.while_ used ASTVisitor in "remove" mode to remove and rebuild the condition operations in scf.while's condition block. Consequently, it removed the loadOp built from f.v, causing an assertion fault. I have fixed this issue by replacing the AST remover with an AST mover that doesn't remove any op but instead moves them to the right block. Commit: 592b29864d4b36ec85bfa9d82c0d22209d19d6bd

jcasas00 commented 1 year ago

Removing urgent tag as the primary issue has been resolved.

zzzDavid commented 1 year ago

The scope issue has been resolved by revamping the frontend and adding HCL AST: https://github.com/cornell-zhang/heterocl/pull/473