hunhoffe opened 4 months ago
We do not have an example in AIR today that parametrises a herd using launch induction variables, or with any runtime scalar parameter passed in from the host. For now, we only have launch induction variables parametrising the SHIM DMA BDs for streaming the correct data into the herd.
It would be really cool to have an example in AIR which can lower runtime parameters into the herd as MLIR-AIE RTPs (runtime parameters).
I'm trying my hand at this to see if I can figure it out.
My example is here: https://github.com/Xilinx/mlir-air/blob/1bbc92a4b1dcaa66cbab50dc494af53bec1d8472/programming_examples/matrix_scalar_add/multi_launch_channel/multi_launch_channel.py
The initial AIR MLIR looks somewhat reasonable (to my novice eye):
#map = affine_map<()[s0] -> (s0 * 16)>
#map1 = affine_map<()[s0, s1] -> (s0 + s1)>
module {
  air.channel @ChanIn []
  air.channel @ChanOut []
  func.func @copy(%arg0: memref<32x16xi32>, %arg1: memref<32x16xi32>) {
    %c2 = arith.constant 2 : index
    %c2_0 = arith.constant 2 : index
    air.launch (%arg2, %arg3) in (%arg4=%c2, %arg5=%c2_0) args(%arg6=%arg0, %arg7=%arg1) : memref<32x16xi32>, memref<32x16xi32> {
      // Derive the scalar parameter from the launch induction variables:
      // %2 = 16 * %arg2 + %arg3.
      %0 = affine.apply #map()[%arg2]
      %1 = affine.apply #map()[%arg3]
      %2 = affine.apply #map1()[%0, %arg3]
      %3 = arith.index_cast %2 : index to i32
      %c8 = arith.constant 8 : index
      %c16 = arith.constant 16 : index
      %c32 = arith.constant 32 : index
      %c1 = arith.constant 1 : index
      air.channel.put @ChanIn[] (%arg6[%0, %1] [%c8, %c16] [%c32, %c1]) : (memref<32x16xi32>)
      %c8_1 = arith.constant 8 : index
      %c16_2 = arith.constant 16 : index
      %c32_3 = arith.constant 32 : index
      %c1_4 = arith.constant 1 : index
      air.channel.get @ChanOut[] (%arg7[%0, %1] [%c8_1, %c16_2] [%c32_3, %c1_4]) : (memref<32x16xi32>)
      // Percolate the scalar through the segment to the herd as an argument.
      air.segment @seg args(%arg8=%3) : i32 {
        %c1_5 = arith.constant 1 : index
        %c1_6 = arith.constant 1 : index
        air.herd @xaddherd tile (%arg9, %arg10) in (%arg11=%c1_5, %arg12=%c1_6) args(%arg13=%arg8) : i32 {
          %alloc = memref.alloc() : memref<16x8xi32, 2 : i32>
          %alloc_7 = memref.alloc() : memref<16x8xi32, 2 : i32>
          air.channel.get @ChanIn[] (%alloc[] [] []) : (memref<16x8xi32, 2 : i32>)
          %c0 = arith.constant 0 : index
          %c8_8 = arith.constant 8 : index
          %c1_9 = arith.constant 1 : index
          scf.for %arg14 = %c0 to %c8_8 step %c1_9 {
            %c0_10 = arith.constant 0 : index
            %c16_11 = arith.constant 16 : index
            %c1_12 = arith.constant 1 : index
            scf.for %arg15 = %c0_10 to %c16_11 step %c1_12 {
              // Add the launch-derived scalar %arg13 to each element.
              %4 = memref.load %alloc[%arg15, %arg14] : memref<16x8xi32, 2 : i32>
              %5 = arith.addi %4, %arg13 : i32
              memref.store %5, %alloc_7[%arg15, %arg14] : memref<16x8xi32, 2 : i32>
            }
          }
          air.channel.put @ChanOut[] (%alloc_7[] [] []) : (memref<16x8xi32, 2 : i32>)
          memref.dealloc %alloc : memref<16x8xi32, 2 : i32>
          memref.dealloc %alloc_7 : memref<16x8xi32, 2 : i32>
        }
      }
    }
    return
  }
}
However, my current compilation error is:
mlir-air/programming_examples/matrix_scalar_add/multi_launch_channel$ make clean && make
rm -rf build __pycache__
mkdir -p build
cd build && python3 mlir-air/programming_examples/matrix_scalar_add/multi_launch_channel/run.py
Traceback (most recent call last):
  File "mlir-air/programming_examples/matrix_scalar_add/multi_launch_channel/run.py", line 35, in <module>
    test_main(build_module, verbose=args.verbose)
  File "mlir-air/programming_examples/matrix_scalar_add/common.py", line 54, in test_main
    addone = backend.compile_and_load(mlir_module)
  File "mlir-air/install-xrt/python/air/backend/xrt.py", line 166, in compile_and_load
    c = self.compile(module)
  File "mlir-air/install-xrt/python/air/backend/xrt.py", line 89, in compile
    aircc.run(air_module, aircc_options)
  File "mlir-air/install-xrt/python/air/compiler/aircc/main.py", line 414, in run
    run_passes(
  File "mlir-air/install-xrt/python/air/compiler/aircc/main.py", line 112, in run_passes
    PassManager.parse(pass_pipeline).run(mlir_module.operation)
air._mlir_libs._site_initialize.<locals>.MLIRError: Failure while executing pass pipeline:
error: "-":21:9: branch has 0 operands for successor #0, but target block has 1
note: "-":21:9: see current operation: "cf.br"()[^bb2] : () -> ()
make: *** [Makefile:7: run] Error 1
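In case it helps anyone else parse that message: the verifier is complaining that an unconditional branch jumps to a block which declares a block argument without passing a matching operand. A minimal standalone snippet (nothing to do with the actual pass pipeline internals, purely to illustrate the invariant) that triggers the same error:

// Intentionally invalid MLIR: ^bb1 declares one i32 block argument, but
// the branch passes zero operands, so the verifier reports
// "branch has 0 operands for successor #0, but target block has 1".
func.func @repro() {
  cf.br ^bb1
^bb1(%x: i32):
  return
}

So somewhere in the lowering a block argument (plausibly the value being percolated into the herd) is being dropped from a branch.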
I haven't pinned down exactly at what stage things go wrong, but I thought I'd record my progress here!
@erwei-xilinx Right now I'm trying to percolate the value from the launch through the segment to the herd using parameters.
An alternative option I could think of would be to write the value to memory somewhere in the launch and use a channel to percolate it through to the herd.
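Roughly, I'd imagine that second option looking something like the sketch below, modeled on the channel ops in the IR dump above (@ParamChan, the buffer shapes, and the memory spaces are all made up for illustration and untested):

module {
  air.channel @ParamChan []
  func.func @param_via_channel() {
    %c1 = arith.constant 1 : index
    air.launch (%x, %y) in (%sx=%c1, %sy=%c1) {
      %c0 = arith.constant 0 : index
      // Compute the runtime scalar in the launch and stage it in a
      // one-element buffer.
      %v = arith.index_cast %x : index to i32
      %buf = memref.alloc() : memref<1xi32>
      memref.store %v, %buf[%c0] : memref<1xi32>
      air.channel.put @ParamChan[] (%buf[] [] []) : (memref<1xi32>)
      air.segment @seg {
        %c1_0 = arith.constant 1 : index
        air.herd @h tile (%tx, %ty) in (%hx=%c1_0, %hy=%c1_0) {
          %c0_1 = arith.constant 0 : index
          // Receive the scalar into tile-local (L1) memory and load it.
          %lbuf = memref.alloc() : memref<1xi32, 2 : i32>
          air.channel.get @ParamChan[] (%lbuf[] [] []) : (memref<1xi32, 2 : i32>)
          %p = memref.load %lbuf[%c0_1] : memref<1xi32, 2 : i32>
          memref.dealloc %lbuf : memref<1xi32, 2 : i32>
        }
      }
    }
    return
  }
}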
Do you have any insight as to which strategy seems most reasonable?
Here's one example from mlir-aie which passes runtime parameters from the host to the AIE design: https://github.com/Xilinx/mlir-aie/blob/a764c8b7a6f944a5a892491ed781c81f618f1437/programming_examples/ml/conv2d_fused_relu/aie2.py#L232
Currently AIR doesn't have any compilation pass that can lower to that op, but I would imagine that's the way to pass the parameter from the host into the design.
> Currently AIR doesn't have any compilation pass that can lower to that op, but I would imagine that's the way to pass the parameter from the host into the design.
Maybe you missed it, but there is work in progress here: https://github.com/Xilinx/mlir-air/pull/585. It is a side project, though, so it isn't advancing very quickly. It still requires some cleanup of how the air->airrt->npu lowering works, i.e. we need to spend time uncutting some corners.
Oh yeah, indeed I missed that PR. It would be great if we could use that feature to generate runtime-parametrisable designs.
The air->airrt flow currently has three implementations, dealing with AIE1, AIE2, and AIE2 with objectFifo respectively. It would be super useful to unify them.
I tried unrolling the launches (so I'm manually creating four 1x1 launches in a Python for-loop). However, I get a segfault when I attempt to compile. I was wondering if anyone has a few minutes to check the new code (it is here) to see whether it is a reasonable way to use the AIR abstractions (at which point I'll try to see how far I get in debugging), or whether I should redesign it?
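For reference, the unrolled version is essentially the following shape (a trimmed, hypothetical sketch; the real code also moves one data quadrant per launch through channels):

module {
  func.func @copy(%arg0: memref<32x16xi32>, %arg1: memref<32x16xi32>) {
    %c1 = arith.constant 1 : index
    // Launch 0 of 4: a 1x1 launch that should handle one data tile.
    air.launch (%x0, %y0) in (%sx0=%c1, %sy0=%c1) args(%a0=%arg0, %b0=%arg1) : memref<32x16xi32>, memref<32x16xi32> {
      // ... put/get one quadrant, segment, 1x1 herd ...
    }
    // Launch 1 of 4 (launches 2 and 3 follow the same pattern).
    air.launch (%x1, %y1) in (%sx1=%c1, %sy1=%c1) args(%a1=%arg0, %b1=%arg1) : memref<32x16xi32>, memref<32x16xi32> {
      // ... put/get the next quadrant ...
    }
    return
  }
}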
Something that I'm also a bit unclear on is how the original arguments to the module are sent/received when there are multiple launches. If someone could shed some light on that process, I'd appreciate it!
Yeah lowering a design with multiple air.launches is not yet well exercised.
Ah, so this issue wasn't actually fixed. I fixed the multi-launch example so that it fails correctly in this PR: https://github.com/Xilinx/mlir-air/pull/673
I'm working on an example where I have a 2D matrix of input data; I break it into four data tiles, and then I attempt to process one data tile per compute tile in a variety of ways using AIR constructs. I am sanity-checking my programs by making each compute core add a unique tile_num to each value in the data tile it modifies, so I can reassure myself that the compute tile I think is doing the work is actually the compute tile doing the work. Anyways, I am trying to compose an example of this scenario that uses four launches, where the herd size is 1x1. My first attempt is here, where I have a while loop within the herd, because I hear the kernel will be persistent across launches.
Anyways, even with that persistence, I'd like to somehow parameterize the herd with the launch indices so I can calculate a unique tile_num per launch. Is this something that is possible to do? If not, how do I reassure myself that one data tile is being processed per launch?
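To make the ask concrete, here's a hypothetical sketch of the non-unrolled variant (the affine map and names are illustrative only; in the four-launch version each launch would instead receive a distinct constant):

#tile_num = affine_map<()[s0, s1] -> (s0 * 2 + s1)>
module {
  func.func @per_launch_tile_num() {
    %c2 = arith.constant 2 : index
    air.launch (%x, %y) in (%sx=%c2, %sy=%c2) {
      // Unique per-launch id for a 2x2 launch grid: tile_num = x * 2 + y.
      %tn = affine.apply #tile_num()[%x, %y]
      %tn_i32 = arith.index_cast %tn : index to i32
      // %tn_i32 would then be percolated into the herd by one of the two
      // strategies discussed earlier (segment/herd args, or a channel).
    }
    return
  }
}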