LineaLabs / lineapy

Move fast from data science prototype to pipeline. Capture, analyze, and transform messy notebooks into data pipelines with just two lines of code.
https://lineapy.org
Apache License 2.0

Intermediate Representations and Static Analysis #291

Status: Open · saulshanabrook opened this issue 2 years ago

saulshanabrook commented 2 years ago

I wanted to open this issue to make sure we wrote down some thoughts we had after @yifanwu and I met with Sarah Chasins, Rohan, and Rolando yesterday at Berkeley.

Notes from Sarah

We presented Sarah with a list of our use cases, which I believe included:

We then asked her if she had any pointers on theoretical foundations that might make sense to build on. We mentioned that we have something working that mixes static and runtime analysis, but that as we kept iterating we felt we kept uncovering more holes. We were curious whether there was some insight, or way of framing the problem, that might help us design a robust core to build from.

At a high level, she said something like: "If you can find the right form for your intermediate representation (IR), or a number of forms for different stages, then many of these problems become easier to solve." In terms of pointers to existing work describing these kinds of IRs, she pointed us to:

More generally, she also mentioned that although we might not be able to reuse the code from previous academic work in this area, the ideas might still be useful.

@yifanwu please edit or comment with anything else or other interpretations you had of our conversation!

Static Type Analysis

If we did want to move into static analysis, one big step would be designing a type system that maps well enough onto the kinds of constructs we see in Python.

I did some version of this in metadsl, to be able to do all the type analysis before any execution. For example, using this definition of a "Vector" class, the type analysis lets us know that the result of this code is an int, before executing it:

```python
x = Vec.create(1, 2, 3)
res = x[Integer.create(0)]
```

In my current type analysis, I was using Python's type objects as the values to manipulate; I was in the process of refactoring to extract them out into a more explicit structure. The basic operations involve unifying different mappings of type variables to values, and then matching concrete types against generic ones to extract those mappings. For example, if you have Vec.create(1), we know that T is now int in the definition of the create method, so the resulting vector is Vec[T=int].
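To make that unification step concrete, here is a minimal sketch (not metadsl's actual implementation): generic types are represented as hypothetical (name, args) tuples, type variables are plain typing.TypeVar objects, and matching a generic type against a concrete one extracts the type-variable bindings.

```python
from typing import TypeVar

T = TypeVar("T")


def unify(generic, concrete, bindings=None):
    """Match a generic type against a concrete one, accumulating TypeVar
    bindings. Raises TypeError on a mismatch or a conflicting binding.

    Types here are a toy representation: a generic type is a
    (name, args) tuple, e.g. ("Vec", (T,)); everything else is compared
    for plain equality.
    """
    if bindings is None:
        bindings = {}
    if isinstance(generic, TypeVar):
        # A type variable binds to whatever concrete type it meets,
        # but may only bind once.
        if generic in bindings and bindings[generic] != concrete:
            raise TypeError(f"conflicting bindings for {generic}")
        bindings[generic] = concrete
        return bindings
    if isinstance(generic, tuple) and isinstance(concrete, tuple):
        g_name, g_args = generic
        c_name, c_args = concrete
        if g_name != c_name or len(g_args) != len(c_args):
            raise TypeError(f"cannot unify {generic} with {concrete}")
        for g, c in zip(g_args, c_args):
            unify(g, c, bindings)
        return bindings
    if generic == concrete:
        return bindings
    raise TypeError(f"cannot unify {generic} with {concrete}")


# Vec.create(1) matches the declared Vec[T] against the argument type int:
bindings = unify(("Vec", (T,)), ("Vec", (int,)))
# bindings now maps T -> int, so the result type is Vec[T=int].
```

A real implementation would also need substitution into result types, subtyping, and variance; this sketch only shows the binding-extraction step described above.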

However, notice that here I am basically building a nice pure functional DSL embedded in Python, where I provide all the typings so that the analysis can be done precisely. Our task is harder, in that we want to meet folks where they are, and the type analysis for something like pandas is much hairier.

SSA

I also think it might be worth creating an explicit SSA form, sitting between something like the bytecode or AST and the dataflow graph. This form would break up our compilation: we could do all the scope and block analysis to get to SSA, and then start from that higher level to get to the dataflow graph or PDG.
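As a toy illustration of what the SSA stage buys us, here is a sketch that renames assignments so every name is defined exactly once. It assumes straight-line code made of plain assignments only, so there is no control flow and therefore no need for phi nodes; handling branches is exactly the hard part a real SSA pass would add.

```python
import ast
from collections import defaultdict


def to_ssa(source: str) -> str:
    """Rename each assignment target in straight-line Python to a fresh
    versioned name (x -> x_0, x_1, ...), so every name is defined once.

    Toy sketch: only handles a sequence of single-target assignments,
    with no control flow (so no phi nodes are needed).
    """
    counters = defaultdict(int)  # variable -> next version number
    current = {}                 # variable -> its latest SSA name
    lines = []
    for stmt in ast.parse(source).body:
        assert isinstance(stmt, ast.Assign), "sketch handles plain assignments only"
        # Rewrite reads on the right-hand side to the latest SSA names.
        for node in ast.walk(stmt.value):
            if isinstance(node, ast.Name) and node.id in current:
                node.id = current[node.id]
        (target,) = stmt.targets
        ssa_name = f"{target.id}_{counters[target.id]}"
        counters[target.id] += 1
        current[target.id] = ssa_name
        lines.append(f"{ssa_name} = {ast.unparse(stmt.value)}")
    return "\n".join(lines)


print(to_ssa("x = 1\nx = x + 1\ny = x * 2"))
# x_0 = 1
# x_1 = x_0 + 1
# y_0 = x_1 * 2
```

Once every name is defined exactly once, mapping SSA names onto dataflow-graph nodes is direct, which is why this intermediate stage would make getting to the dataflow graph or PDG easier.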

The bytecode is pretty close; we would just have to traverse it and evaluate the bytecode stack abstractly, in order to understand how values flow through it. We would also need some model of the Python program stack, to understand the different scope levels. Making an SSA form that maps faithfully back and forth to bytecode would probably teach us something about the problem space, and is still well constrained enough not to become an endless exploration.
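A first step toward that abstract evaluation can be sketched with the stdlib dis module: track the operand-stack depth before each instruction using dis.stack_effect. A real analysis would keep a symbolic stack recording which instruction produced each slot (to recover dataflow), and would have to handle jumps; this sketch assumes straight-line code and only tracks depth.

```python
import dis


def stack_depths(func):
    """Abstractly evaluate a function's bytecode, recording the operand-stack
    depth *before* each instruction via dis.stack_effect.

    Sketch only: jumps are ignored, so this is valid for straight-line
    code such as simple expression functions.
    """
    depth = 0
    trace = []
    for instr in dis.get_instructions(func):
        trace.append((instr.opname, depth))
        depth += dis.stack_effect(instr.opcode, instr.arg)
    return trace


# The depth rises as operands are loaded and falls as they are consumed;
# exact opcode names vary across CPython versions.
for opname, depth in stack_depths(lambda a, b: a + b):
    print(depth, opname)
```

The same loop, with a list of symbolic values instead of an integer depth, is what would let us wire producers of stack slots to their consumers and so recover the dataflow through the bytecode.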

saulshanabrook commented 2 years ago

I just had a call with @marov to go through why we aren't currently supporting static analysis. We talked about how, if we wanted to do that, we would have to do type analysis of Python and annotate every function that users call with the proper typings.