Implementation of Fast Token Delivery flow within Dynamatic

Description

The work from Unleashing Parallelism in Elastic Circuits with Faster Token Delivery [1] is still missing in the codebase. The goal of this issue is to outline a roadmap for their implementation, highlighting what needs to be done.

The Fast Token Delivery (FTD) methodology is a different algorithm to build the handshake IR of the the cf IR. It gets rid of the concepts of basic blocks in the original CFG, relying only on the relationship between producers and consumers among different operations. In the paper, it was shown it can provide significant execution time advantage, together with a reduction of area utilization.

The methodology aims at getting rid of anything related to control dependencies between basic blocks. However, at a first stage of the implementation, the network of control merges is maintained. This allows to allocate the memory operations to the LSQ, and to indicate the termination of the execution of the kernel to the memory interfaces.

The idea of the issue is to design an alternative version of the current CfToHandshake pass (let's call it FtdCfToHandshake) which can be enabled through a flag at compile time (compile --fast-token-delivery). In this way, the current version of Dynamatic requires no change. Also, since this methodology is about the conversion from the cf dialect to the handshake dialect, the rest of the Dynamatic flow remains unchanged (considering both the transformations to the handshake IR and the following lowering to the hw dialect).

Steps

[x] Finish a Boolean Logic library in Dynamatic. The FTD methodology requires many analysis of boolean conditions within the circuit (it is mainly about understanding which conditions must be satisfied to go from node A to node B following a given path). A draft of this library is already present in the Experimental folder. However, some new features must be added in order for it to be used in the algorithm. We are aware that having a custom library might be redundant in the codebase - taking into account that Espresso is already among the dependencies of the project. Once that FTD is implemented, we aim at replacing the current custom library with something else.
[ ] Implement an analysis pass to obtain control dependency information on the CFG. This analysis pass (based on [4]) is a fundamental step of the FTD algorithm and the GSA conversion.
[ ] Implement an analysis pass which extracts the GSA information out of the MLIR's SSA representation. The FTD methodology is based on GSA, an alternative version to SSA which employs boolean conditions together with phi nodes (phi nodes are not explicit in MLIR, but encoded through block arguments in the cf dialect). There are many papers showing how to approach such a problem, such as [2] and [3]. The main consequence of having GSA is that, instead of using merge nodes for phi functions, we can go for multiplexers.
[ ] Implement the FTD methodology according to [1]. This step is the largest among the ones presented in the issue, as it requires to fully rewrite the CfToHandshake pass. In this first phase, the way memories are handled remains identical to what is done in the current Dynamatic (thus without SQ, using the network of control merges to allocate blocks in the LSQ). This can be further divided in the following steps:
- [ ] Initial pass structure and required classes (intermediate non complete step);
- [ ] Use the GSA information to get rid of block arguments in basic blocks and substitute them with multiplexers (intermediate non complete step);
- [ ] Use the start signal to trigger constants and undefined values in the circuit (intermediate non complete step);
- [ ] Add the regeneration mechanism for signals in loops (intermediate non complete step);
- [ ] Add the suppression mechanism for signals (final complete step);

Additional comments

As I mentioned above, the rest of the flow is not modified. Some issues might arise from the interaction with other passes. Since this is something not strictly related to the FTD algorithm (but rather, on the consistency of the whole flow) some new issues will be opened later on.

[1] A. Elakhras, A. Guerrieri, L. Josipović, and P. Ienne, “Unleashing parallelism in elastic circuits with faster token delivery,” in 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL), IEEE, 2022, pp. 253–261. Accessed: Oct. 14, 2024. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/10035134/ [2] P. Havlak, “Construction of thinned gated single-assignment form,” in Languages and Compilers for Parallel Computing, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, Eds., Berlin, Heidelberg: Springer, 1994, pp. 477–499. doi: 10.1007/3-540-57659-2_28. [3] P. Tu and D. Padua, “Efficient Building and Placing of Gating Functions,” ACM SIGPLAN Notices, vol. 30, no. 6, pp. 47–55, Jan. 1995, doi: 10.1145/223428.207115. [4] J. Ferrante, K. J. Ottenstein, and J. D. Warren, “The program dependence graph and its use in optimization,” ACM Trans. Program. Lang. Syst., vol. 9, no. 3, pp. 319–349, Jul. 1987, doi: 10.1145/24039.24041.

@pcineverdies Thanks for formatting the PR! I will forward the issue to experts on this (TBH I am not familiar with the compiler things that you are doing here). In any case, here are some thoughts on the plan:

Comment on step 1:

We are aware that having a custom library might be redundant in the codebase - taking into account that Espresso is already among the dependencies of the project. Once that FTD is implemented, we aim at replacing the current custom library with something else.

I think the goal is to make the code introduced in every PR clean (which is the whole point of having a plan like this); now since we all agree that the Boolean library is redundant, it is better to clean it up in the first place.

Comment on step 2 and 3:

Implement an analysis pass which extracts the GSA information out of the MLIR's SSA representation

Isn't the plan to do a conversion pass from SSA to GSA, and then do something like GsaToHandshake? This implementation looks like a hack around the cf dialect though...

Update after talking to others It would be nice (in the issue description) to have a clear separation between the analysis pass and the conversion pass; so that in case we implement a GSA dialect in the future, many of the code can be reused.

Thanks a lot for doing this, it is great to be able to see a plan like this before PRs start pouring in. Few comments below.

We are aware that having a custom library might be redundant in the codebase - taking into account that Espresso is already among the dependencies of the project. Once that FTD is implemented, we aim at replacing the current custom library with something else.

Just to be clear, I (and everybody else I assume) have 0 emotional attachment to Espresso. I added it to the project over the summer because the people who did the initial work on this told me it was necessary. If it turns out not to be, I am happy to rm -r the folder.

Implement an analysis pass which extracts the GSA information out of the MLIR's SSA representation

Isn't the plan to do a conversion pass from SSA to GSA, and then do something like GsaToHandshake? This implementation looks like a hack around the cf dialect though...

It would be nice (in the issue description) to have a clear separation between the analysis pass and the conversion pass; so that in case we implement a GSA dialect in the future, many of the code can be reused.

Maybe we already all agree on this but I don't think that MLIR can truly support a GSA representation in the IR so it seems difficult to implement an "SSA to GSA" transformation pass (unless you are relying on special operation attributes perhaps?). To me an analysis pass seems like the right choice for this, as this allows you to maintain whatever "GSA information" you need in memory instead of in the IR itself.

For what regards the boolean library, I agree that it's important to come up with a clean implementation. Espresso is a nice software when it comes to boolean optimization, but it was not primarily designed to be a boolean logic library. Either I find a library which comprises both the functionalities (thus having a new reliable dependency, getting rid of both Espresso and the current hand-crafted library) or we continue with what we have right now. I'll be back with my proposal to this Issue in a couple of days.

Right now, GSA is implemented as an analysis pass which provides information to FtdCfToHandshake to perform the conversion (instead of instantiating MERGEs for the live-in values of each block it relies on MUXes). I see the point of having a gsa dialect as an intermediate step between cf and handshake. This would consist in making implicit SSA MLIR representation (which was a fundamental design choice in the framework) explicit.

Here is an example just for clarity.

C code

unsigned loop_sum(in_int_t a[N]) {
  int x = 0;
  for (int i = 0; i < N; i++) x += a[i];
  return x;
}

cf dialect

module {
  func.func @loop_sum(%arg0: memref<8xi32> {handshake.arg_name = "a"}) -> i32 {
    %c0 = arith.constant {handshake.name = "constant2"} 0 : index
    %c0_i32 = arith.constant {handshake.name = "constant3"} 0 : i32
    cf.br ^bb1(%c0, %c0_i32 : index, i32) {handshake.name = "br0"}
  ^bb1(%0: index, %1: i32):  // 2 preds: ^bb0, ^bb1
    %c8 = arith.constant {handshake.name = "constant4"} 8 : index
    %c1 = arith.constant {handshake.name = "constant5"} 1 : index
    %2 = memref.load %arg0[%0] {handshake.mem_interface = #handshake.mem_interface<MC>, handshake.name = "load1"} : memref<8xi32>
    %3 = arith.addi %1, %2 {handshake.name = "addi0"} : i32
    %4 = arith.addi %0, %c1 {handshake.name = "addi1"} : index
    %5 = arith.cmpi ult, %4, %c8 {handshake.name = "cmpi0"} : index
    cf.cond_br %5, ^bb1(%4, %3 : index, i32), ^bb2 {handshake.name = "cond_br0"}
  ^bb2:  // pred: ^bb1
    return {handshake.name = "return0"} %3 : i32
  }
}

gsa dialect (explicit phi without block arguments, see [2] for semantics)

module {
  func.func @loop_sum(%arg0: memref<8xi32> {handshake.arg_name = "a"}) -> i32 {
    %c0 = arith.constant {handshake.name = "constant2"} 0 : index
    %c0_i32 = arith.constant {handshake.name = "constant3"} 0 : i32
    cf.br ^bb1 {handshake.name = "br0"}
  ^bb1():  // 2 preds: ^bb0, ^bb1
    %0 = gsa.mu(%c0, %4); :index
    %1 = gsa.mu(%c0_i32, %3); :i32
    %c8 = arith.constant {handshake.name = "constant4"} 8 : index
    %c1 = arith.constant {handshake.name = "constant5"} 1 : index
    %2 = memref.load %arg0[%0] {handshake.mem_interface = #handshake.mem_interface<MC>, handshake.name = "load1"} : memref<8xi32>
    %3 = arith.addi %1, %2 {handshake.name = "addi0"} : i32
    %4 = arith.addi %0, %c1 {handshake.name = "addi1"} : index
    %5 = arith.cmpi ult, %4, %c8 {handshake.name = "cmpi0"} : index
    cf.cond_br %5, ^bb1, ^bb2 {handshake.name = "cond_br0"}
  ^bb2:  // pred: ^bb1
    return {handshake.name = "return0"} %3 : i32
  }
}

On the one hand, this latter approach seems cleaner, and it would allow to handle GSA gates as normal operations in the IR (rather than having a custom data-structure with the GSA information). On the other hand, as Lucas mentioned, gsa.mu and gsa.gamma would just be some placeholders without an intrinsic semantics attached, which might be an issue when it comes to consistency with the rest of the framework.

Currently - in our working branch - we are relying on the analysis pass, but it's also immediate to introduce a gsa dialect and an intermediate conversion (cfToGsa -> FtdGsaToHandshake). I'm open to both solutions!

gsa dialect (explicit phi without block arguments, see [2] for semantics)

That's what I meant, I am 99% sure MLIR won't let you do what you do with gsa.mu in your example. The reason is that you reference %4 and %3 before they are defined in the graph dominance sense, which is not allowed within an SSACFG region, the kind kind of region inside a func.func operation. Even if your operation's "internal semantics" would make it so that these values are never "picked" before ^bb1 executes once, this is completely opaque to MLIR.

FYI, there is another kind of region called a graph region which allows you to do such things as using an SSA value before it is defined in the IR (a non-external handshake.func operation contains a single of those regions) as long as you have a single basic block in the region, which is unfortunately not the case here.

Therefore I strongly suggest going for an analysis pass. I am not saying you could not dirtily hack around a semi-explicit version of GSA in the IR itself, but I'm saying it won't be practical to use, assuming it is even doable without losing one's sanity.

@pcineverdies @lucas-rami thank you for the clarifications and discussions! The current plan sounds good to me. I will give a look into #147

EPFL-LAP / dynamatic