Xilinx / mlir-air

MIT License
70 stars 26 forks source link

Multi Core Channel Matrix Scalar Add Example Fails #625

Closed hunhoffe closed 6 hours ago

hunhoffe commented 1 week ago

From branch debugging_matrix_scalar_add, I am working on getting my example multi_core_channel working (this file). The single core version works in this branch, but when I increase the herd size from 1x1 to 2x2, the example does not work any more.

To be specific, to replicate run:

cd programming_examples/matrix_scalar_add/multi_core_channel
make

When I inspect programming_examples/matrix_scalar_add/multi_core_channel/build/air_project/npu.air.mlir, I notice that it looks like all of the data is only going to one core instead of being distributed to all of the cores.

      aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 0][1, 1, 8, 16][0, 0, 32]) {id = 0 : i64, metadata = @airMemcpyId3} : memref<32x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 0][1, 1, 8, 16][0, 0, 32]) {id = 1 : i64, metadata = @airMemcpyId4} : memref<32x16xi32>
      aiex.npu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}
      aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 16][1, 1, 8, 16][0, 0, 32]) {id = 0 : i64, metadata = @airMemcpyId3} : memref<32x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 16][1, 1, 8, 16][0, 0, 32]) {id = 1 : i64, metadata = @airMemcpyId4} : memref<32x16xi32>
      aiex.npu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}
      aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 16, 0][1, 1, 8, 16][0, 0, 32]) {id = 0 : i64, metadata = @airMemcpyId3} : memref<32x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 16, 0][1, 1, 8, 16][0, 0, 32]) {id = 1 : i64, metadata = @airMemcpyId4} : memref<32x16xi32>
      aiex.npu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}
      aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 16, 16][1, 1, 8, 16][0, 0, 32]) {id = 0 : i64, metadata = @airMemcpyId3} : memref<32x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 16, 16][1, 1, 8, 16][0, 0, 32]) {id = 1 : i64, metadata = @airMemcpyId4} : memref<32x16xi32>
      aiex.npu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}
hunhoffe commented 1 week ago

It should now be possible to also replicate the failure in the minimal-matrix-scalar-add branch

erwei-xilinx commented 6 days ago

I saw that the example was instantiating one air.channel (not broadcasting) is trying to access 4 different aie cores at different time phases via a for-loop nest. This is not going to work for now because air.channel lowers to circuit-switched routings using aie.flow ops, which are static connections. This might be enabled in the future when in the future we lower to dynamically routed packet-switched routing.

For now, I can only see the test to work by instantiating four air.channels (or a bundle of [2,2] air.channels), spatially parallelized via either hand unrolling or using a parent scf.parallel/forall loop.

erwei-xilinx commented 6 days ago

https://github.com/Xilinx/mlir-air/pull/637 also contains some minor fixes.

hunhoffe commented 6 days ago

Thank you for the fix ups!

I will go ahead and rewrite the example in a way that uses additional channels (one of the ways you suggested).

hunhoffe commented 6 days ago

I believe I have a version that works here: https://github.com/Xilinx/mlir-air/pull/641

hunhoffe commented 6 hours ago

Working version merged in!