Xilinx / mlir-air

Multi Core DMA Matrix Scalar Add Example Fails #624

Closed hunhoffe closed 2 weeks ago

hunhoffe commented 3 weeks ago

On branch debugging_matrix_scalar_add, I am trying to get my multi_core_dma example working. The single-core version works on this branch, but when I increase the herd size from 1x1 to 2x2, the example no longer works.

To reproduce, run:

cd programming_examples/matrix_scalar_add/multi_core_dma
make

When I inspect programming_examples/matrix_scalar_add/multi_core_dma/build/air_project/npu.air.mlir, it looks like all of the data is going to a single core instead of being distributed across all four cores of the 2x2 herd:

      aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 0][1, 1, 8, 16][0, 0, 32]) {id = 0 : i64, metadata = @airMemcpyId9} : memref<32x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 16, 0][1, 1, 8, 16][0, 0, 32]) {id = 1 : i64, metadata = @airMemcpyId9} : memref<32x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 16][1, 1, 8, 16][0, 0, 32]) {id = 2 : i64, metadata = @airMemcpyId9} : memref<32x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 16, 16][1, 1, 8, 16][0, 0, 32]) {id = 3 : i64, metadata = @airMemcpyId9} : memref<32x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 0][1, 1, 8, 16][0, 0, 32]) {id = 4 : i64, metadata = @airMemcpyId10} : memref<32x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 16, 0][1, 1, 8, 16][0, 0, 32]) {id = 5 : i64, metadata = @airMemcpyId10} : memref<32x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 16][1, 1, 8, 16][0, 0, 32]) {id = 6 : i64, metadata = @airMemcpyId10} : memref<32x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 16, 16][1, 1, 8, 16][0, 0, 32]) {id = 7 : i64, metadata = @airMemcpyId10} : memref<32x16xi32>
      aiex.npu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}
      aiex.npu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}
      aiex.npu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}
      aiex.npu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}
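
The listing above can also be checked mechanically. The following is a minimal sketch, not part of mlir-air: the helper name group_by_metadata and the regular expression are hypothetical, and it assumes the dump keeps the textual form shown above. It groups the aiex.npu.dma_memcpy_nd ops by their metadata symbol. Reading a shared symbol as "same destination" is an assumption, although it is consistent with the symptom described here.

```python
import re
from collections import defaultdict

# The eight dma_memcpy_nd lines quoted in this issue, copied verbatim.
MLIR_DUMP = """
aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 0][1, 1, 8, 16][0, 0, 32]) {id = 0 : i64, metadata = @airMemcpyId9} : memref<32x16xi32>
aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 16, 0][1, 1, 8, 16][0, 0, 32]) {id = 1 : i64, metadata = @airMemcpyId9} : memref<32x16xi32>
aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 16][1, 1, 8, 16][0, 0, 32]) {id = 2 : i64, metadata = @airMemcpyId9} : memref<32x16xi32>
aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 16, 16][1, 1, 8, 16][0, 0, 32]) {id = 3 : i64, metadata = @airMemcpyId9} : memref<32x16xi32>
aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 0][1, 1, 8, 16][0, 0, 32]) {id = 4 : i64, metadata = @airMemcpyId10} : memref<32x16xi32>
aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 16, 0][1, 1, 8, 16][0, 0, 32]) {id = 5 : i64, metadata = @airMemcpyId10} : memref<32x16xi32>
aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 16][1, 1, 8, 16][0, 0, 32]) {id = 6 : i64, metadata = @airMemcpyId10} : memref<32x16xi32>
aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 16, 16][1, 1, 8, 16][0, 0, 32]) {id = 7 : i64, metadata = @airMemcpyId10} : memref<32x16xi32>
"""

# Hypothetical pattern matching the textual form above: captures the argument
# name, the first bracketed list (offsets), the op id, and the metadata symbol.
PATTERN = re.compile(
    r"aiex\.npu\.dma_memcpy_nd\(\d+, \d+, (%\w+)\[([^\]]*)\].*?"
    r"id = (\d+) : i64, metadata = (@\w+)"
)


def group_by_metadata(text):
    """Map each metadata symbol to the (id, argument, offsets) of its transfers."""
    groups = defaultdict(list)
    for match in PATTERN.finditer(text):
        arg, offsets, op_id, symbol = match.groups()
        parsed_offsets = tuple(int(value) for value in offsets.split(","))
        groups[symbol].append((int(op_id), arg, parsed_offsets))
    return dict(groups)


groups = group_by_metadata(MLIR_DUMP)
for symbol, transfers in groups.items():
    print(f"{symbol}: {len(transfers)} transfers")
    for op_id, arg, offsets in transfers:
        print(f"  id={op_id} {arg} offsets={offsets}")
```

For the excerpt above, this reports only two metadata symbols, @airMemcpyId9 and @airMemcpyId10, each shared by four transfers with distinct offsets. To run the same check on a full build, the text of build/air_project/npu.air.mlir could be passed to group_by_metadata in place of MLIR_DUMP.
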

hunhoffe commented 3 weeks ago

The failure should now also be reproducible on the minimal-matrix-scalar-add branch.

erwei-xilinx commented 2 weeks ago

https://github.com/Xilinx/mlir-air/pull/637 should enable the test now. There was an error in the logic for generating symbolic linkage between AIRRt dma ops and aie.device.

hunhoffe commented 2 weeks ago

I believe this is fixed in the mentioned PR, so closing this issue!