Xilinx / mlir-aie

An MLIR-based toolchain for AMD AI Engine-enabled devices.

Failing Assertion in ObjectFifoStatefulTransformPass::unrollForLoops #1128

Open · andrej opened this issue 3 months ago

andrej commented 3 months ago

This one should probably be assigned to Andra. It seems some recent changes to the ObjectFIFO code are causing an issue for me. The following compiled fine for me a couple of weeks ago.

Try to build reference_designs/ipu-xrt/matrix_multiplication_array with the following command:

M=256 K=256 N=768 make

The compiler then crashes during this step of the build:

cd build && aiecc.py --aie-generate-cdo --no-compile-host --xclbin-name=final.xclbin \
                        --aie-generate-ipu --ipu-insts-name=insts.txt ../build/aie.mlir

It aborts with the following assertion failure:

/usr/include/c++/11/bits/stl_vector.h:1045: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = int; _Alloc = std::allocator<int>; std::vector<_Tp, _Alloc>::reference = int&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion '__n < this->size()' failed.

Here is a partial stack trace identifying some object fifo code as the culprit:

__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352503744) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) backtrace
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352503744) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737352503744) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737352503744, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7c42476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff7c287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff1f65048 in std::__replacement_assert(char const*, int, char const*, char const*) ()
   from /home/andre/mlir-aie/my_install/mlir_aie/python/aie/_mlir_libs/libAIEAggregateCAPI.so
#6  0x00007ffff20891ff in AIEObjectFifoStatefulTransformPass::duplicateBlock(mlir::OpBuilder&, int, std::vector<mlir::Operation*, std::allocator<mlir::Operation*> >&, std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > >&, mlir::Value, long, bool) [clone .isra.0] ()
   from /home/andre/mlir-aie/my_install/mlir_aie/python/aie/_mlir_libs/libAIEAggregateCAPI.so
#7  0x00007ffff2096021 in AIEObjectFifoStatefulTransformPass::unrollForLoops(xilinx::AIE::DeviceOp&, mlir::OpBuilder&, std::set<xilinx::AIE::TileOp, std::less<xilinx::AIE::TileOp>, std::allocator<xilinx::AIE::TileOp> >)::{lambda(mlir::scf::ForOp)#1}::operator()(mlir::scf::ForOp) const ()
   from /home/andre/mlir-aie/my_install/mlir_aie/python/aie/_mlir_libs/libAIEAggregateCAPI.so
#8  0x00007ffff2096308 in void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) [clone .constprop.5] ()
   from /home/andre/mlir-aie/my_install/mlir_aie/python/aie/_mlir_libs/libAIEAggregateCAPI.so
#9  0x00007ffff209671b in AIEObjectFifoStatefulTransformPass::unrollForLoops(xilinx::AIE::DeviceOp&, mlir::OpBuilder&, std::set<xilinx::AIE::TileOp, std::less<xilinx::AIE::TileOp>, std::allocator<xilinx::AIE::TileOp> >) () from /home/andre/mlir-aie/my_install/mlir_aie/python/aie/_mlir_libs/libAIEAggregateCAPI.so
#10 0x00007ffff209c402 in AIEObjectFifoStatefulTransformPass::runOnOperation() () from /home/andre/mlir-aie/my_install/mlir_aie/python/aie/_mlir_libs/libAIEAggregateCAPI.so
#11 0x00007fffefa8ba9e in mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) () from /home/andre/mlir-aie/my_install/mlir_aie/python/aie/_mlir_libs/libAIEAggregateCAPI.so
#12 0x00007fffefa8bf58 in mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) () from /home/andre/mlir-aie/my_install/mlir_aie/python/aie/_mlir_libs/libAIEAggregateCAPI.so
#13 0x00007fffefa8c5f3 in mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool)::{lambda(mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool)::OpPMInfo&)#1}::operator()(mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool)::OpPMInfo&) const () from /home/andre/mlir-aie/my_install/mlir_aie/python/aie/_mlir_libs/libAIEAggregateCAPI.so
#14 0x00007fffefa8afd5 in mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool) () from /home/andre/mlir-aie/my_install/mlir_aie/python/aie/_mlir_libs/libAIEAggregateCAPI.so
#15 0x00007fffefa8b8cf in mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) () from /home/andre/mlir-aie/my_install/mlir_aie/python/aie/_mlir_libs/libAIEAggregateCAPI.so
#16 0x00007fffefa8bf58 in mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) () from /home/andre/mlir-aie/my_install/mlir_aie/python/aie/_mlir_libs/libAIEAggregateCAPI.so
#17 0x00007fffefa8ceb5 in mlir::PassManager::run(mlir::Operation*) () from /home/andre/mlir-aie/my_install/mlir_aie/python/aie/_mlir_libs/libAIEAggregateCAPI.so
#18 0x00007fffef9f8d79 in mlirPassManagerRunOnOp () from /home/andre/mlir-aie/my_install/mlir_aie/python/aie/_mlir_libs/libAIEAggregateCAPI.so

Thanks in advance for looking into this!

andrej commented 3 months ago

Just updated my comment above: the command to compile the breaking example now reflects the changes in #1056.

andrej commented 3 weeks ago

I ran into this again and dug a little deeper to produce a minimal example that reproduces the issue; see below. This should make it easier to debug than the full matrix multiplication design.

Summary

Error

/usr/include/c++/11/bits/stl_vector.h:1045: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = xilinx::AIE::BufferOp*; _Alloc = std::allocator<xilinx::AIE::BufferOp*>; std::vector<_Tp, _Alloc>::reference = xilinx::AIE::BufferOp*&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: 
Assertion '__n < this->size()' failed.
Aborted (core dumped)

Compilation Command

aiecc.py --aie-generate-cdo --no-compile-host --xclbin-name=bug.xclbin \
                         --aie-generate-npu --npu-insts-name=bug.txt bug.mlir

Code

The "unique" thing about this code is that the middle loop has only a single iteration. If we give it multiple iterations, the error does not happen. The error also does not happen when there are only two, rather than three, nested loops.

module {
  aie.device(npu1_4col) {

    %tile_0_1 = aie.tile(0, 1)
    %tile_0_2 = aie.tile(0, 2)

    aie.objectfifo @fifoA(%tile_0_2, {%tile_0_1}, 2 : i32) : !aie.objectfifo<memref<64x64xbf16>>
    aie.objectfifo @fifoB(%tile_0_1, {%tile_0_2}, 2 : i32) : !aie.objectfifo<memref<64x64xbf16>>

    %core_0_2 = aie.core(%tile_0_2) {

      %c0 = arith.constant 0 : index
      %c1 = arith.constant 1 : index
      %c4 = arith.constant 4 : index
      %c4294967295 = arith.constant 4294967295 : index

      scf.for %arg0 = %c0 to %c4294967295 step %c1 {
        scf.for %arg1 = %c0 to %c1 step %c1 {
          %0 = aie.objectfifo.acquire @fifoA(Produce, 1) : !aie.objectfifosubview<memref<64x64xbf16>>
          %1 = aie.objectfifo.subview.access %0[0] : !aie.objectfifosubview<memref<64x64xbf16>> -> memref<64x64xbf16>
          scf.for %arg2 = %c0 to %c4 step %c1 {
            %2 = aie.objectfifo.acquire @fifoB(Consume, 1) : !aie.objectfifosubview<memref<64x64xbf16>>
            %3 = aie.objectfifo.subview.access %2[0] : !aie.objectfifosubview<memref<64x64xbf16>> -> memref<64x64xbf16>
            aie.objectfifo.release @fifoB(Consume, 1)
          }
          aie.objectfifo.release @fifoA(Produce, 1)
        }
      }

      aie.end

    }
  }
}

Alternative error

If we remove the two aie.objectfifo.subview.access operations, the error instead becomes:

/home/github/actions-runner/_work/mlir-aie/mlir-aie/mlir/src/python/MLIRPythonExtension.Core/IRModule.h:433:
mlir::python::PyMlirContext::ErrorCapture::~ErrorCapture(): Assertion `errors.empty() && "unhandled captured errors"' failed.
Aborted (core dumped)
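
For reference, with those two operations removed the loop bodies reduce to bare acquire/release pairs; the core body is then roughly the following (a hand-edited sketch of the example above, not verbatim retested output):

scf.for %arg0 = %c0 to %c4294967295 step %c1 {
  scf.for %arg1 = %c0 to %c1 step %c1 {
    // Acquire from fifoA without accessing the object.
    %0 = aie.objectfifo.acquire @fifoA(Produce, 1) : !aie.objectfifosubview<memref<64x64xbf16>>
    scf.for %arg2 = %c0 to %c4 step %c1 {
      // Acquire from fifoB without accessing the object.
      %1 = aie.objectfifo.acquire @fifoB(Consume, 1) : !aie.objectfifosubview<memref<64x64xbf16>>
      aie.objectfifo.release @fifoB(Consume, 1)
    }
    aie.objectfifo.release @fifoA(Produce, 1)
  }
}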

Workaround

In the Python code that generates the MLIR, check whether a loop has only a single iteration. If so, do not emit the loop; emit its body directly instead (see the sketch below).
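
For illustration, here is a hand-written sketch (an assumption about what the generator could emit, not verified output) of the core body with the single-iteration middle loop elided and its body placed directly inside the outer loop:

scf.for %arg0 = %c0 to %c4294967295 step %c1 {
  // Body of the former single-iteration loop, emitted directly.
  %0 = aie.objectfifo.acquire @fifoA(Produce, 1) : !aie.objectfifosubview<memref<64x64xbf16>>
  %1 = aie.objectfifo.subview.access %0[0] : !aie.objectfifosubview<memref<64x64xbf16>> -> memref<64x64xbf16>
  scf.for %arg2 = %c0 to %c4 step %c1 {
    %2 = aie.objectfifo.acquire @fifoB(Consume, 1) : !aie.objectfifosubview<memref<64x64xbf16>>
    %3 = aie.objectfifo.subview.access %2[0] : !aie.objectfifosubview<memref<64x64xbf16>> -> memref<64x64xbf16>
    aie.objectfifo.release @fifoB(Consume, 1)
  }
  aie.objectfifo.release @fifoA(Produce, 1)
}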

cc @AndraBisca

andrej commented 3 weeks ago

After some more testing, this appears to affect not just loops with a single iteration. For example, the following code, which gives the middle loop nine iterations and the inner loop four, produces the same error:

module {
  aie.device(npu1_4col) {

    %tile_0_1 = aie.tile(0, 1)
    %tile_0_2 = aie.tile(0, 2)

    aie.objectfifo @fifoA(%tile_0_2, {%tile_0_1}, 2 : i32) : !aie.objectfifo<memref<64x64xbf16>>
    aie.objectfifo @fifoB(%tile_0_1, {%tile_0_2}, 2 : i32) : !aie.objectfifo<memref<64x64xbf16>>

    %core_0_2 = aie.core(%tile_0_2) {

      %c0 = arith.constant 0 : index
      %c1 = arith.constant 1 : index
      %c9 = arith.constant 9 : index
      %c4 = arith.constant 4 : index
      %c4294967295 = arith.constant 4294967295 : index

      scf.for %arg0 = %c0 to %c4294967295 step %c1 {
        scf.for %arg1 = %c0 to %c9 step %c1 {     // <- changed: nine iterations instead of one

          %0 = aie.objectfifo.acquire @fifoA(Produce, 1) : !aie.objectfifosubview<memref<64x64xbf16>>
          %1 = aie.objectfifo.subview.access %0[0] : !aie.objectfifosubview<memref<64x64xbf16>> -> memref<64x64xbf16>

          scf.for %arg2 = %c0 to %c4 step %c1 {

            %2 = aie.objectfifo.acquire @fifoB(Consume, 1) : !aie.objectfifosubview<memref<64x64xbf16>>
            %3 = aie.objectfifo.subview.access %2[0] : !aie.objectfifosubview<memref<64x64xbf16>> -> memref<64x64xbf16>
            aie.objectfifo.release @fifoB(Consume, 1)

          }

          aie.objectfifo.release @fifoA(Produce, 1)

        }
      }
      aie.end
    }
  }
}

I also noticed the ObjectFIFO depth has to be > 1 for the error to trigger. (I think for depth=1, the loops are not unrolled.)
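
For comparison, per the observation above, declarations with depth 1 (which do not trigger the crash) only differ in the depth operand:

aie.objectfifo @fifoA(%tile_0_2, {%tile_0_1}, 1 : i32) : !aie.objectfifo<memref<64x64xbf16>>
aie.objectfifo @fifoB(%tile_0_1, {%tile_0_2}, 1 : i32) : !aie.objectfifo<memref<64x64xbf16>>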