Xilinx / mlir-air

MIT License

Multiple launches, herd with single core #627

Open hunhoffe opened 1 week ago

hunhoffe commented 1 week ago

I'm working on an example where I have an 2D matrix of input data, then I break it into four data tiles, and then I am attempt to process one data tile per one compute tile in a variety of ways using AIR constructs. I am sanity checking my programs by making each compute core add a unique tile_num to each value in the data tile they modify, so I can reassure myself that the compute tile I think is doing some work is actually the compute tile doing the work.

Anyways, I am trying to compose an example of this scenario that uses four launches, where the herd size is 1x1. My first attempt is here, where I have a while loop within the herd because I hear the kernel will be persistent across launches.

Even with that persistence, I'd like to somehow parameterize the herd with the launch indices so I can calculate a unique tile_num per launch. Is this possible to do? If not, how do I reassure myself that one data tile is being processed per launch?
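For reference, the sanity check described above can be modeled on the host in plain Python. This is a hedged sketch: the 32x16 input and 16x8 tile shapes match the MLIR further down this thread, but the row-major tile_num assignment here is an assumption for illustration, not taken from the actual design.

```python
# Hypothetical model of the sanity check: a 32x16 input split into a 2x2
# grid of 16x8 tiles, one tile per launch / compute core. The row-major
# tile_num ordering is an assumption for this sketch.
IN_ROWS, IN_COLS = 32, 16
TILE_ROWS, TILE_COLS = 16, 8

def expected_output(inp):
    """Each core adds its unique tile_num to every value in its tile."""
    out = [row[:] for row in inp]
    for ti in range(IN_ROWS // TILE_ROWS):
        for tj in range(IN_COLS // TILE_COLS):
            tile_num = ti * (IN_COLS // TILE_COLS) + tj
            for r in range(ti * TILE_ROWS, (ti + 1) * TILE_ROWS):
                for c in range(tj * TILE_COLS, (tj + 1) * TILE_COLS):
                    out[r][c] += tile_num
    return out

inp = [[r * IN_COLS + c for c in range(IN_COLS)] for r in range(IN_ROWS)]
out = expected_output(inp)
```

Comparing the hardware result against `expected_output` tells you at a glance which tile (if any) was processed by the wrong core, since each tile carries a distinct additive signature.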

erwei-xilinx commented 1 week ago

We do not have an example in AIR today that parametrises a herd using launch induction variables, or any runtime scalar parameter passed in from the host. For now, we only have launch induction variables parametrising the SHIM DMA BDs for streaming the correct data into the herd.

It would be really cool to have an example in AIR which can lower runtime parameters into the herd as MLIR-AIE RTPs.

hunhoffe commented 6 days ago

I'm trying my hand at this to see if I can figure it out.

My example is here: https://github.com/Xilinx/mlir-air/blob/1bbc92a4b1dcaa66cbab50dc494af53bec1d8472/programming_examples/matrix_scalar_add/multi_launch_channel/multi_launch_channel.py

The initial AIR MLIR looks somewhat reasonable (to my novice eye):

#map = affine_map<()[s0] -> (s0 * 16)>
#map1 = affine_map<()[s0, s1] -> (s0 + s1)>
module {
  air.channel @ChanIn []
  air.channel @ChanOut []
  func.func @copy(%arg0: memref<32x16xi32>, %arg1: memref<32x16xi32>) {
    %c2 = arith.constant 2 : index
    %c2_0 = arith.constant 2 : index
    air.launch (%arg2, %arg3) in (%arg4=%c2, %arg5=%c2_0) args(%arg6=%arg0, %arg7=%arg1) : memref<32x16xi32>, memref<32x16xi32> {
      %0 = affine.apply #map()[%arg2]
      %1 = affine.apply #map()[%arg3]
      %2 = affine.apply #map1()[%0, %arg3]
      %3 = arith.index_cast %2 : index to i32
      %c8 = arith.constant 8 : index
      %c16 = arith.constant 16 : index
      %c32 = arith.constant 32 : index
      %c1 = arith.constant 1 : index
      air.channel.put  @ChanIn[] (%arg6[%0, %1] [%c8, %c16] [%c32, %c1]) : (memref<32x16xi32>)
      %c8_1 = arith.constant 8 : index
      %c16_2 = arith.constant 16 : index
      %c32_3 = arith.constant 32 : index
      %c1_4 = arith.constant 1 : index
      air.channel.get  @ChanOut[] (%arg7[%0, %1] [%c8_1, %c16_2] [%c32_3, %c1_4]) : (memref<32x16xi32>)
      air.segment @seg  args(%arg8=%3) : i32 {
        %c1_5 = arith.constant 1 : index
        %c1_6 = arith.constant 1 : index
        air.herd @xaddherd  tile (%arg9, %arg10) in (%arg11=%c1_5, %arg12=%c1_6) args(%arg13=%arg8) : i32 {
          %alloc = memref.alloc() : memref<16x8xi32, 2 : i32>
          %alloc_7 = memref.alloc() : memref<16x8xi32, 2 : i32>
          air.channel.get  @ChanIn[] (%alloc[] [] []) : (memref<16x8xi32, 2 : i32>)
          %c0 = arith.constant 0 : index
          %c8_8 = arith.constant 8 : index
          %c1_9 = arith.constant 1 : index
          scf.for %arg14 = %c0 to %c8_8 step %c1_9 {
            %c0_10 = arith.constant 0 : index
            %c16_11 = arith.constant 16 : index
            %c1_12 = arith.constant 1 : index
            scf.for %arg15 = %c0_10 to %c16_11 step %c1_12 {
              %4 = memref.load %alloc[%arg15, %arg14] : memref<16x8xi32, 2 : i32>
              %5 = arith.addi %4, %arg13 : i32
              memref.store %5, %alloc_7[%arg15, %arg14] : memref<16x8xi32, 2 : i32>
            }
          }
          air.channel.put  @ChanOut[] (%alloc_7[] [] []) : (memref<16x8xi32, 2 : i32>)
          memref.dealloc %alloc : memref<16x8xi32, 2 : i32>
          memref.dealloc %alloc_7 : memref<16x8xi32, 2 : i32>
        }
      }
    }
    return
  }
}
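As a quick sanity check of the index arithmetic in this IR: `#map` computes `s0 * 16` and `#map1` computes `s0 + s1`, so the per-launch value `%3` (the candidate tile_num) can be evaluated for all four launch index pairs with a small Python sketch. This only models the two affine maps shown above, not anything AIR executes:

```python
def map_(s0):
    # #map = affine_map<()[s0] -> (s0 * 16)>
    return s0 * 16

def map1(s0, s1):
    # #map1 = affine_map<()[s0, s1] -> (s0 + s1)>
    return s0 + s1

# %0 = #map()[%arg2], %1 = #map()[%arg3], %2 = #map1()[%0, %arg3]
results = {}
for arg2 in range(2):        # launch induction variables of the 2x2 launch
    for arg3 in range(2):
        row_off = map_(arg2)
        col_off = map_(arg3)
        tile_num = map1(row_off, arg3)
        results[(arg2, arg3)] = (row_off, col_off, tile_num)
```

Note that the resulting tile_num values are 0, 1, 16, and 17 rather than 0 through 3; whether that is the intended per-launch numbering depends on the design.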

However, my current compilation error is:

mlir-air/programming_examples/matrix_scalar_add/multi_launch_channel$ make clean && make
rm -rf build __pycache__
mkdir -p build
cd build &&  python3 mlir-air/programming_examples/matrix_scalar_add/multi_launch_channel/run.py
Traceback (most recent call last):
  File "mlir-air/programming_examples/matrix_scalar_add/multi_launch_channel/run.py", line 35, in <module>
    test_main(build_module, verbose=args.verbose)
  File "mlir-air/programming_examples/matrix_scalar_add/common.py", line 54, in test_main
    addone = backend.compile_and_load(mlir_module)
  File "mlir-air/install-xrt/python/air/backend/xrt.py", line 166, in compile_and_load
    c = self.compile(module)
  File "mlir-air/install-xrt/python/air/backend/xrt.py", line 89, in compile
    aircc.run(air_module, aircc_options)
  File "mlir-air/install-xrt/python/air/compiler/aircc/main.py", line 414, in run
    run_passes(
  File "mlir-air/install-xrt/python/air/compiler/aircc/main.py", line 112, in run_passes
    PassManager.parse(pass_pipeline).run(mlir_module.operation)
air._mlir_libs._site_initialize.<locals>.MLIRError: Failure while executing pass pipeline:
error: "-":21:9: branch has 0 operands for successor #0, but target block has 1
 note: "-":21:9: see current operation: "cf.br"()[^bb2] : () -> ()
make: *** [Makefile:7: run] Error 1

I haven't pinned down exactly at what stage things seem to go wrong, but I thought I'd record my progress here!
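For reference, the verifier error above means that some pass produced a `cf.br` carrying no operands whose target block declares a block argument. A minimal, hand-written MLIR sketch of ill-formed IR of this kind (illustrative only, not taken from the failing pipeline):

```mlir
func.func @bad() {
  // 0 operands passed, but ^bb2 expects 1 -> verifier rejects this branch
  cf.br ^bb2
^bb2(%x: i32):  // block argument with no incoming value from the branch
  return
}
```

Control-flow lowerings that drop loop iteration arguments when converting structured ops to `cf` branches are one common source of this shape of error, though where it arises in this pipeline would need to be pinned down per pass.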

hunhoffe commented 6 days ago

@erwei-xilinx Right now I'm trying to percolate the value from the launch through the segment to the herd using parameters.

An alternative I can think of would be to write the value to memory somewhere in the launch and use a channel to percolate it through to the herd.

Do you have any insight as to which strategy seems most reasonable?
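To make the second option concrete, here is a host-level Python model of the idea. This is purely illustrative: plain queues stand in for AIR channels, and none of the names here correspond to real AIR API.

```python
from queue import Queue

# Stand-ins for AIR channels (illustrative only, not real AIR API).
chan_in = Queue()     # data tile streamed into the herd
chan_param = Queue()  # 1-element "tile" carrying the runtime scalar

def launch(tile, tile_num):
    # In the launch: derive tile_num from the induction variables, write it
    # to a small buffer, and send it through its own channel with the data.
    chan_in.put(tile)
    chan_param.put([tile_num])

def herd_core():
    # In the herd: receive both the data tile and the scalar, then compute.
    tile = chan_in.get()
    tile_num = chan_param.get()[0]
    return [v + tile_num for v in tile]

launch([1, 2, 3], 5)
out = herd_core()
```

The trade-off versus the parameter-passing approach is that this one stays entirely within constructs AIR already lowers (channels and DMAs), at the cost of a dedicated channel and a tiny DMA transfer per launch for a single scalar.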

erwei-xilinx commented 6 days ago

Here's one example from mlir-aie which passes runtime parameters from host to the aie design: https://github.com/Xilinx/mlir-aie/blob/a764c8b7a6f944a5a892491ed781c81f618f1437/programming_examples/ml/conv2d_fused_relu/aie2.py#L232

Currently AIR doesn't have any compilation pass that can lower to that op, but I would imagine that's the way to pass the parameter from host into the design.

fifield commented 6 days ago

> Currently AIR doesn't have any compilation pass that can lower to that op, but I would imagine that's the way to pass the parameter from host into the design.

Maybe you missed it, but there is work in progress here: https://github.com/Xilinx/mlir-air/pull/585, but it is a side project so isn't advancing very quickly. It still requires some cleanup of how the air->airrt->npu lowering works. i.e. we need to spend time uncutting some corners.

erwei-xilinx commented 6 days ago

> > Currently AIR doesn't have any compilation pass that can lower to that op, but I would imagine that's the way to pass the parameter from host into the design.
>
> Maybe you missed it, but there is work in progress here: #585, but it is a side project so isn't advancing very quickly. It still requires some cleanup of how the air->airrt->npu lowering works. i.e. we need to spend time uncutting some corners.

Oh yeah indeed I missed that PR. It would be great if we could use that feature to generate runtime parametrisable designs.

The air->airrt flow currently has three implementations, dealing with AIE1, AIE2, and AIE2 with objectFifo respectively. It would be super useful to unify them.

hunhoffe commented 6 days ago

I tried unrolling the launches (so I'm manually creating four 1x1 launches in a Python for-loop). However, I get a segfault when I attempt to compile. I was wondering if anyone has a few minutes to check the new code (it is here) to see whether it is a reasonable way to use the AIR abstractions (in which case I'll see how far I get in debugging), or whether I should redesign it?

Something that I'm also a bit unclear on is how the original arguments to the module are sent/received when there are multiple launches. If someone could shed some light on that process, I'd appreciate it!

erwei-xilinx commented 6 days ago

Yeah, lowering a design with multiple air.launch ops is not yet well exercised.