bentsherman / codeflow

Flow graphs for your code
0 stars 0 forks source link

Add data flow graphs #3

Open bentsherman opened 2 years ago

bentsherman commented 2 years ago

Currently we can generate control flow graphs, which show the flow of execution like a traditional flow chart. The next step is to leverage this approach to generate data flow graphs, which show the flow of data dependencies through tasks.

We still need to think about exactly what we mean by a "data flow graph". For example, there doesn't seem to be a UML diagram that corresponds to a DFG. The closest thing might be the activity diagram, which is really more like a control-flow graph. Furthermore, data flow analysis seems to be different from what we're trying to do.

We can look to dataflow programming for guidance. The goal is to show the data dependencies, and thereby the potential task parallelism. Tasks that do not share data dependencies can be performed in parallel, and the dataflow graph makes this clear. The dataflow graph is also a nice visual description of the code, just like the CFG.

It would also be interesting to try to show the potential data parallelism within loops. We can consider each iteration of a loop as a separate task, and if subsequent iterations don't depend on each other then they can be performed in parallel.

Pipeline parallelism is like a combination of task and data parallelism, in which there are a sequence of tasks but the entire sequence is performed on multiple inputs, so the sequence forms a "pipeline" that can process inputs in parallel.

In summary, the dataflow graph is basically visualizing the code as a circuit diagram. Whereas a CFG models statements as nodes and execution paths as edges, the DFG is a bit more precise because it explicitly defines the inputs and outputs to each task. Additionally, the task decomposition can be coarse-grained or fine-grained or anywhere in between, but that is really a separate discussion.

bentsherman commented 2 years ago

Some more thoughts now that I have a better understanding of the AST and how to traverse it.

To construct a data flow graph, we need to:

  1. decompose the source code into tasks
  2. identify the inputs and outputs for each task
  3. connect the dots

The simplest way to start will be to keep each statement as a separate task, like the control flow graph. Then we have to traverse each statement, identify variables, and determine whether each variable is an input or output. based on whether it is being modified.

After defining the tasks and their inputs/outputs, constructing the entire graph should be straightforward, but there is one caveat -- we can't simply equate variables by their name, because variables can be overwritten. Not sure yet how to properly deal with that.