Redesign `ControlID` to provide a better Interface

calebmkim commented 1 year ago

Overview Currently, the ControlID pass provides methods that can label the control stmts with u64's, and we can query for the label of a given control stmt. We want to improve this analysis to provide a better interface for the user.

There are two main ideas discussed in #1358. The first is a more minor change to improve the interface of ControlID, the second is a more major change to get ControlID to actually build a full CDFG.

Improve ControlID Interface to Give Tokens instead of u64's

The first thing would be to change Control ID so that we actually define instances of the ControlID struct, instead of it just providing functions to the user like it currently does. This way, when a ControlID object is instantiated, it automatically labels the nodes. Therefore, we know that as long as we are referring to the same ControlID object, we will have a consistent labeling of nodes.
We should also make the ControlID methods spit out tokens (i.e., Par(u64), see: New Type Idiom) instead of u64. This way, we can build an interface in which the users of the ControlID Analysis never actually look at the u64's inside the tokens.
One thing to note is that sometimes, we might sometimes want different "labels" for different tokens. For example, sometimes we will want a ParThread(u64) to signify the control stmt is a direct child of a par block, but other times, we might want the same control statement to be wrapped as a Seq(u64) (or whatever that control statement actually is).
Finally, we will probably want to produce some sort of error if the user makes an "incorrect" query: for example, if the user asks for a ParThread(u64) token for a control statement that isn't actually the direct child/thread of a par block, then we will probably want to produce an error

ControlID should produce a CFG

ControlIDshould take in a control block, and internally build a full CFG. Then, users should just be able to make queries about the CFG, which ControlID should also support by internally analyzing the CFG that it has built.
1225 has already discussed this same idea. For each possible control stmt (invoke, enable, par, seq, while, if), the corresponding graph will have exactly one entry node and one exit node. Also, syncs are given their own nodes, which makes sense. It also include labels for edges, which the program should take based on whether the conditional is true or not.
One thing to add to this proposal is some notion of data flow/dependency: this could be helpful if we want to do some sort of "unsharing" in order to increase parallelization.
One broader question I have is that in #1225 @rachitnigam mentioned that we should be trying to build a CDFG, somewhat like LLVM does. On that note, there are still a few things where I'm unclear: 1) CDFGs, from what I understand, have basic blocks (i.e., blocks of instructions that you know they are just going to always just execute sequentially). We could try to support something like this for Calyx, organizing the graph into blocks of groups that we know are always just going to execute sequentially. That seems a bit different than the current sketch, where single groups/invokes themselves seem to be the "basic blocks" of the CFG. 2) @rachit also mentioned SSA form. Is this something that we should try to implement for Calyx? It seems like we would have to aggresively instantiate new cells if we were to do this.

rachitnigam commented 1 year ago

Thank you for the super awesome writeup @calebmkim!

One thing to note is that sometimes, we might sometimes want different "labels" for different tokens.

This is something I've been pondering as well–we use the controlID stuff in slightly, but importantly, different ways in various passes like TDCC (which, AFAIK, doesn't even use ControlID). I think a good question is what would be a good representation/set of node ID that works for every pass?

ControlIDshould take in a control block, and internally build a full CFG.

We can even do one step further and change ControlID::new to take full ownership of the ir::Control struct. This means that instead of indirectly referencing to nodes in a modified ir::Control, the user must "detach" the control program and give it to the Analysis. This way, we can make queries like "give me an iterator over the control program in the order of the control flow edges".

Once a pass is done with it's CFG-based computation, we can use a ControlID::take function to return the original control program.

The big thing to note about this design point is that it doesn't let you add anything to the CFG, just query it and modify the groups underneath it. However, it is still powerful for analyses. Maybe we can cleverly design a Visitor-like interface for CFG-based passes that gives us control-flow order of things as well as the ability to modify it

One thing to add to this proposal is some notion of data flow/dependency: this could be helpful if we want to do some sort of "unsharing" in order to increase parallelization.

Yes! In fact, the current unsharing pass does something similar in computing a CFG but does so in a hacky way. @EclecticGriffin might have more insight on how exactly it works.

CDFGs, from what I understand, have basic blocks (i.e., blocks of instructions that you know they are just going to always just execute sequentially

I think we can use groups in place of basic blocks. Also, it is not necessary for us to build a CDFG for the kinds of passes we perform today but maybe useful to think about in the future.

sampsyo commented 1 year ago

Yes indeed; thanks for the really useful writeup, @calebmkim!

I think you did a good job mapping out what it would take to approach either of these two ideas. Perhaps it goes without saying, but this discussion also implicates the idea of "flattening" the entire IR so that integer IDs are already available (they are the way you identify control nodes), as in #1183 and friends.

I think it would be good to direct the discussion toward deciding what we will do: one of these, both, or some kind of intermediate point? Maybe a way to make that decision would be to resolve this question from @rachitnigam:

I think a good question is what would be a good representation/set of node ID that works for every pass?

Could we do an audit of the way in which these IDs are use so we can confidently describe what all these passes want from a set of IDs?

calyxir / calyx

Redesign `ControlID` to provide a better Interface #1366