IBM / semanticflowgraph

Semantic flow graphs for data science
Apache License 2.0
28 stars 10 forks source link

Ingesting julia programs #11

Open jpfairbanks opened 5 years ago

jpfairbanks commented 5 years ago

Based on my reading of the docs and paper, this project currently supports ingesting R and Python scripts but not Julia scripts. Is there a roadmap for julia support? I think that representing Julia scripts would be relatively straightforward with Base.Meta.parse. and MacroTools.

epatters commented 5 years ago

You are right that the project currently supports ingesting Python and R but not Julia. For the time being, I have my hands full trying to support Python and R with a reasonable degree of robustness, and I don't have the bandwidth to add Julia to the mix. That being said, if you (or anyone else) wants to take on that challenge, I would be happy to provide as much guidance, code review, and documentation as needed.

I haven't yet thought about how to do the code analysis for Julia, but I suspect you're right that Julia's strong support for metaprogramming will help.

jpfairbanks commented 5 years ago

So I am coming back around on this now that I understand wiring diagrams. What would it take to go from julia AST (Expr object) to a RawFlowGraph?

jpfairbanks commented 5 years ago

I've currently got a macro that takes a function definition and converts it to a Catlab.WiringDiagram. it handles things like

y = f(x)
z1 = g(y)
z2 = h(y, z1)

by "undoing" common subexpression elimination. So you the the WD corresponding to the AST

h(f(x), g(f(x)))

I'd like to get mcopy nodes in the WD, but that requires more bookkeeping, unless there is an elegant way to do it that exceeds my knowledge of WDs.

epatters commented 5 years ago

So here is the strategy that I currently use, in some form, for both the Python and R raw flow graphs. It is a hybrid (static/dynamic) analysis:

  1. Static phase: Rewrite the AST to add special function calls for recording the program, while preserving the program's behavior. Function calls, variable reads, and variable assignments must be handled, as well as language-specific features such as indexing.

  2. Dynamic phase: Run the rewritten program and use the recording events to build the raw flow graph during run-time. To handle variable reuse (copies), a table mapping variables to box output ports is maintained at every scope.

The main advantage of this approach is that it bypasses the decidability problems around static analysis. The major disadvantage, of course, is that you have to run the program!

It sounds like you're experimenting with static analysis. In fact, I think the purely static approach is worth exploring. Even in this setting, you should be able to handle variable reuse using the strategy outlined above. I would avoid creating explicit copy nodes until you've finished constructing the wiring diagram. Instead, just reuse the output ports by adding multiple wires out of them.

If it would be helpful, I'd be happy to discuss further by phone.

epatters commented 4 years ago

A simple version of this is now implemented in Catlab.jl (https://github.com/epatters/Catlab.jl/issues/52), solely as a convenient syntax for constructing wiring diagrams.

A version designed to handle more general Julia programs, comparable to the Python or R program analysis projects, would be a much larger undertaking.