LineaLabs / lineapy

Move fast from data science prototype to pipeline. Capture, analyze, and transform messy notebooks into data pipelines with just two lines of code.
https://lineapy.org
Apache License 2.0

Intermediate Representations and Static Analysis #291

Status: Open · saulshanabrook opened this issue 2 years ago

saulshanabrook commented 2 years ago

I wanted to open this issue to make sure we wrote down some thoughts we had after @yifanwu and I met with Sarah Chasins, Rohan, and Rolando yesterday at Berkeley.

Notes from Sarah

We presented Sarah with a list of our use cases, which I believe included:

We then asked her if she had any pointers on theoretical foundations that might make sense to build on. We mentioned that we have something working that mixes static and runtime analysis, but that as we kept iterating we felt we kept uncovering more holes. We were curious whether there was some insight, or way of framing the problem, that might help us design a robust core to build from.

At a high level, she said something like: "If you can find the right form for your intermediate representation (IR), or a number of forms for different stages, then many of these problems become easier to solve." In terms of pointers to existing work describing these kinds of IRs, she pointed us to:

More generally, she also mentioned that although we might not be able to reuse the code from previous academic work in this area, the ideas might still be useful.

@yifanwu please edit or comment with anything else or other interpretations you had of our conversation!

Static Type Analysis

If we did want to move into static analysis, one big step would be designing a type system that maps well enough onto the kinds of constructs we see in Python.

I did some version of this in metadsl, to be able to do all the type analysis before any execution. For example, using this definition of a "Vector" class, the type analysis lets us know that the result of this code is an int, before executing it:

```python
x = Vec.create(1, 2, 3)
res = x[Integer.create(0)]
```

In my current type analysis, I was using Python's type objects as the values to manipulate; I was in the process of refactoring to extract them out into a more explicit structure. The basic operations involve unifying different mappings of type variables to values, and then matching concrete types against generic ones to extract those mappings. For example, if you have Vec.create(1), we know that T is now int in the definition of the create method, so the resulting vector is Vec[T=int].
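To make that unification step concrete, here is a minimal sketch (not metadsl's actual implementation): generic types are represented as hypothetical (name, args) tuples, type variables are plain typing.TypeVar objects, and matching a generic type against a concrete one extracts the type-variable bindings.

```python
from typing import TypeVar

T = TypeVar("T")


def unify(generic, concrete, bindings=None):
    """Match a generic type against a concrete one, accumulating TypeVar
    bindings. Raises TypeError on a mismatch or a conflicting binding.

    Types here are a toy representation: a generic type is a
    (name, args) tuple, e.g. ("Vec", (T,)); everything else is compared
    for plain equality.
    """
    if bindings is None:
        bindings = {}
    if isinstance(generic, TypeVar):
        # A type variable binds to whatever concrete type it meets,
        # but may only bind once.
        if generic in bindings and bindings[generic] != concrete:
            raise TypeError(f"conflicting bindings for {generic}")
        bindings[generic] = concrete
        return bindings
    if isinstance(generic, tuple) and isinstance(concrete, tuple):
        g_name, g_args = generic
        c_name, c_args = concrete
        if g_name != c_name or len(g_args) != len(c_args):
            raise TypeError(f"cannot unify {generic} with {concrete}")
        for g, c in zip(g_args, c_args):
            unify(g, c, bindings)
        return bindings
    if generic == concrete:
        return bindings
    raise TypeError(f"cannot unify {generic} with {concrete}")


# Vec.create(1) matches the declared Vec[T] against the argument type int:
bindings = unify(("Vec", (T,)), ("Vec", (int,)))
# bindings now maps T -> int, so the result type is Vec[T=int].
```

A real implementation would also need substitution into result types, subtyping, and variance; this sketch only shows the binding-extraction step described above.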

However, notice that here I am basically building a nice pure functional DSL embedded in Python, where I provide all the typings so that the analysis can be done precisely. Our task is harder, in that we want to meet folks where they are, and the type analysis for something like pandas is much hairier.

SSA

I also think it might be worth creating an explicit SSA form, sitting between something like the bytecode or AST and the dataflow graph. This form would break up our compilation: we could do all the scope and block analysis to get to SSA, and then start from that higher level to get to the dataflow graph or PDG.
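As a toy illustration of what the SSA stage buys us, here is a sketch that renames assignments so every name is defined exactly once. It assumes straight-line code made of plain assignments only, so there is no control flow and therefore no need for phi nodes; handling branches is exactly the hard part a real SSA pass would add.

```python
import ast
from collections import defaultdict


def to_ssa(source: str) -> str:
    """Rename each assignment target in straight-line Python to a fresh
    versioned name (x -> x_0, x_1, ...), so every name is defined once.

    Toy sketch: only handles a sequence of single-target assignments,
    with no control flow (so no phi nodes are needed).
    """
    counters = defaultdict(int)  # variable -> next version number
    current = {}                 # variable -> its latest SSA name
    lines = []
    for stmt in ast.parse(source).body:
        assert isinstance(stmt, ast.Assign), "sketch handles plain assignments only"
        # Rewrite reads on the right-hand side to the latest SSA names.
        for node in ast.walk(stmt.value):
            if isinstance(node, ast.Name) and node.id in current:
                node.id = current[node.id]
        (target,) = stmt.targets
        ssa_name = f"{target.id}_{counters[target.id]}"
        counters[target.id] += 1
        current[target.id] = ssa_name
        lines.append(f"{ssa_name} = {ast.unparse(stmt.value)}")
    return "\n".join(lines)


print(to_ssa("x = 1\nx = x + 1\ny = x * 2"))
# x_0 = 1
# x_1 = x_0 + 1
# y_0 = x_1 * 2
```

Once every name is defined exactly once, mapping SSA names onto dataflow-graph nodes is direct, which is why this intermediate stage would make getting to the dataflow graph or PDG easier.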

The bytecode is pretty close; we would just have to traverse it and evaluate the bytecode stack abstractly, in order to understand how values flow through it. We would also need some model of the Python program stack, to understand the different scope levels. Making an SSA form that maps faithfully back and forth to bytecode would probably teach us something about the problem space, and is still well constrained enough not to become an endless exploration.
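A first step toward that abstract evaluation can be sketched with the stdlib dis module: track the operand-stack depth before each instruction using dis.stack_effect. A real analysis would keep a symbolic stack recording which instruction produced each slot (to recover dataflow), and would have to handle jumps; this sketch assumes straight-line code and only tracks depth.

```python
import dis


def stack_depths(func):
    """Abstractly evaluate a function's bytecode, recording the operand-stack
    depth *before* each instruction via dis.stack_effect.

    Sketch only: jumps are ignored, so this is valid for straight-line
    code such as simple expression functions.
    """
    depth = 0
    trace = []
    for instr in dis.get_instructions(func):
        trace.append((instr.opname, depth))
        depth += dis.stack_effect(instr.opcode, instr.arg)
    return trace


# The depth rises as operands are loaded and falls as they are consumed;
# exact opcode names vary across CPython versions.
for opname, depth in stack_depths(lambda a, b: a + b):
    print(depth, opname)
```

The same loop, with a list of symbolic values instead of an integer depth, is what would let us wire producers of stack slots to their consumers and so recover the dataflow through the bytecode.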

saulshanabrook commented 2 years ago

I just had a call with @marov to go through why we aren't currently supporting static analysis. We talked about how, if we wanted to do that, we would have to do type analysis of Python and annotate every function that users call with the proper typings.