Open hunhoffe opened 1 week ago
Thanks for investigating into worker-to-worker communication scenario. This scenario hasn't been exercised, so a bug at air compiler is possible. The simpleDMAChannelAllocation
is implemented with a naive logic to allocate DMA channels to data movements with endpoints of different memory space; in your case the endpoints are both L1, which might lead to unexpected behaviour.
And this assumption could affect other methods and passes in air compiler.
Ah, I see!
As a path forward, would it make sense to:
An update on this plan:
I will leave this issue up until the l1-l1 worker-to-self example is supported (and the lit test passes) or it is decided this is definitely an illegal action, which can be caught by the verifier (hopefully with a helpful error).
This is an issue that directly addresses accessing l2 allocations in the channel: https://github.com/Xilinx/mlir-air/issues/666
This PR takes that fix and attempts to fix the worker-to-self example using this new capability; however, it still fails with:
python3: mlir-air/mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp:956: void xilinx::air::simpleDMAChannelAllocation(std::vector<MemcpyBundleAsFlow> &, ShimDMAAllocator &, MemTileDMAAllocator &, TileDMAAllocator &): Assertion `core' failed.
Aborted (core dumped)
I am working on writing a worker-to-worker data transfer example for channels (as part of the grouping of examples that exercise various features of channels, https://github.com/Xilinx/mlir-air/issues/648).
Draft PR is here: https://github.com/Xilinx/mlir-air/pull/653
I am basing it off the code in the channel_size example (PR waiting to be merged here: https://github.com/Xilinx/mlir-air/pull/642)
The channel_size example works well for me. As an intermediate step to adding worker-to-worker communication to that example, I tried to have each worker send data to itself over a channel. That is the version of the code that is pushed in the draft PR https://github.com/Xilinx/mlir-air/pull/653. The particular file of interest is this one.
When I run with this intermediate step, I get the following error:
My question is: