datamade / Pweave

Pweave is a scientific report generator and a literate programming tool for Python. It can capture the results and plots from data analysis and works well with numpy, scipy and matplotlib.
http://mpastell.com/pweave

Start implementing caching for unchanging code blocks #1

Open fgregg opened 6 years ago

fgregg commented 6 years ago

In order for this to be useful, we need to restore the state of variables after each code chunk (or, equivalently, restore a snapshot of the kernel as it was at the end of each code chunk).

Relevant work:

https://stackoverflow.com/questions/633127/viewing-all-defined-variables?noredirect=1&lq=1
https://stackoverflow.com/questions/34342155/how-to-pickle-or-store-jupyter-ipython-notebook-session-for-later
https://github.com/yihui/knitr/blob/master/R/cache.R
https://beta.observablehq.com/@mbostock/how-observable-runs
https://github.com/dataflownb/dfkernel
https://multithreaded.stitchfix.com/blog/2017/07/26/nodebook/

fgregg commented 6 years ago

Here's Brandon Willard talking about the approaches he was thinking about:

https://github.com/mpastell/Pweave/issues/19#issuecomment-381015795

brandonwillard commented 6 years ago

@fgregg, looks like you're going down the same rabbit hole that I revisit every year!

A lot of the possible (and worthwhile) functionality should exist in its own project. For instance, the type of caching that invalidates entries based on a variable/call dependency graph, efficient incremental bytecode storage and updating (good for interactive sessions), automatic caching based on instrumentation/runtimes, etc.

As we discussed, there are necessarily low-level language implementation details, but a considerable amount of the logic surrounding those functions can be abstracted at a level sufficient for orchestration in Python (perhaps within the Jupyter framework).
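One concrete reading of that dependency-graph idea, as a minimal stdlib-only sketch (the helper names here are illustrative, not anything from Pweave): derive each chunk's read and written names from its AST, then invalidate any chunk downstream of a changed one.

```python
import ast

def reads_writes(src):
    """Return (reads, writes) name sets for a code chunk, via the stdlib ast module."""
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(src)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                writes.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                reads.add(node.id)
    return reads, writes

def stale_chunks(chunks, changed):
    """Given ordered chunk sources and a set of changed chunk indices,
    return every index whose cache entry should be invalidated."""
    stale = set(changed)
    dirty_vars = set()
    for i, src in enumerate(chunks):
        reads, writes = reads_writes(src)
        if i in stale or reads & dirty_vars:
            stale.add(i)
            dirty_vars |= writes
    return stale
```

This ignores out-of-scope assigns inside functions/classes and attribute mutation, but it captures the invalidation direction: changes flow forward through variables.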

If you're interested, we could probably knock out a full fledged example for Python sessions/kernels and start the generalization from there.

fgregg commented 6 years ago

That sounds great, @brandonwillard. Do you have a suggestion about how we should proceed?

piccolbo commented 6 years ago

I was wondering if you are aware of the dill package and its ability to save a session. From https://pypi.python.org/pypi/dill:

dill provides the ability to save the state of an interpreter session in a single command

I have not used this feature, but I have used other serialization features, and I think it's a high-quality package with a very responsive dev. If storing the entire session is agreeable, then it's a pretty simple algorithm: restore the session before a chunk that has changed; evaluate the chunk; save the session after the chunk. If anything changed in the saved session, re-evaluate the next chunk. Premature optimization etc etc
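That whole-session algorithm can be sketched like this; plain `pickle` on a dict of variables stands in for `dill.dump_session`/`dill.load_session`, which operate on the real `__main__` module:

```python
import pickle

# Analogue of dill.dump_session / dill.load_session, pickling a dict of
# variables instead of the real __main__ module.
def save_session(ns):
    return pickle.dumps(ns)

def load_session(blob):
    return pickle.loads(blob)

ns = {}
exec("x = 10", ns)                 # run a chunk
ns.pop('__builtins__', None)       # exec() injects this; drop it before pickling
snapshot = save_session(ns)        # save the session after the chunk

exec("x = 999", ns)                # a later edit clobbers x ...
ns = load_session(snapshot)        # ... restore before re-evaluating the next chunk
```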

fgregg commented 6 years ago

Thanks for sharing that, @piccolbo.

It's helped me see that serializing and restoring sessions is not something that I want.

Here's a common pattern I have.

<<setup, cache=False>>
import psycopg2
conn = psycopg2.connect('postgres:///my_db')
c = conn.cursor()
@

<<expensive_query, cache=True>>
c.execute('''VERY EXPENSIVE QUERY''')
results = c.fetchall()
@

I don't really want to serialize the connection or the cursor (neither of which can really be serialized anyway). Even if I could sanely serialize and hydrate the connections, I wouldn't want the rehydration of expensive_query to clobber the value of conn or c which restoring the session would do.
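One possible workaround, sketched with the stdlib (`pickle` standing in for dill, which offers `dill.pickles()` for the same kind of check): probe each value for picklability and keep live handles like `conn` and `c` out of the snapshot entirely.

```python
import pickle
import socket

def picklable(value):
    """True if value survives pickling; connections, cursors,
    sockets, and the like typically do not."""
    try:
        pickle.dumps(value)
        return True
    except Exception:
        return False

session = {
    "results": [(1, "a"), (2, "b")],   # plain data: fine to snapshot
    "sock": socket.socket(),           # stand-in for conn/c: unpicklable
}
snapshot = {k: v for k, v in session.items() if picklable(v)}
session["sock"].close()
```

Restoring such a snapshot would then update only the picklable names, leaving `conn` and `c` from the uncached setup chunk intact.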

brandonwillard commented 6 years ago

@piccolbo, yes, I was originally using dill's session pickling in my Pweave caching branch (it's commented-out in that commit, though), but — as @fgregg said — it was overkill. That's also what motivated my thinking about incremental caching.

@fgregg, regarding first steps, here's a small example of AST-based "assign" statement caching. I left out any considerations for out-of-scope variable assigns within functions and classes, as well as variable-to-block dependencies (the kind that would invalidate caches for dependent blocks), etc.
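For reference, a minimal sketch of what such assign-statement caching might look like (the helper names are hypothetical, not the code from the linked branch): on a cache hit, only the values bound by the chunk's top-level assigns are restored, instead of a whole session.

```python
import ast
import hashlib

def assigned_names(src):
    """Names bound by simple top-level assign statements in a chunk
    (ignores AugAssign, tuple targets, and assigns inside functions)."""
    names = []
    for stmt in ast.parse(src).body:
        if isinstance(stmt, ast.Assign):
            for target in stmt.targets:
                if isinstance(target, ast.Name):
                    names.append(target.id)
    return names

def run_chunk(src, ns, cache):
    key = hashlib.sha256(src.encode()).hexdigest()
    if key in cache:
        ns.update(cache[key])     # restore only the assigned values
    else:
        exec(src, ns)
        cache[key] = {name: ns[name] for name in assigned_names(src)}
```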

This is the kind of thing that might work for org-mode and Pweave, but I think it should exist closer to the Jupyter level (e.g. in a client or kernel). Anyway, from here, we should figure out exactly where this sort of logic should exist and start considering which other languages to support and whether or not we can easily tease-out assign statements. Either that or start on the aforementioned missing features.

One idea I kept having involved Pygments. It has a wide array of lexers and they might be useful for obtaining assigns in a similar, and highly generic, way. That approach is extremely limited, but somewhat promising for a broad, slightly-less-than-naive caching similar to the example given here. Tools like this and, say, Antlr are a nice way to keep the work going in one language (e.g. Python) while covering more languages. Plus, much of what we've discussed doesn't directly rely on bytecode compilation and/or execution; for instance, the caller's frame and exec-like expressions in my example code can — and probably should — be replaced with remote code execution calls via Jupyter.

fgregg commented 6 years ago

Thanks for sharing that @brandonwillard.

It looks like you are hoping to implement a dependency graph to invalidate the cached code if a dependency changes.

That's a very neat idea, but that would seem to assume that none of the dependencies were connections to external resources.

In the example code I posted above, conn and c are not serializable, so anything that depended upon them would also need to be re-run? In a certain way that seems sane, because this code can't know whether I have made changes to the database I'm connecting to. However, while that's probably the most conservative behavior, it's not the one I want?

<<setup, cache=False>>
import psycopg2
conn = psycopg2.connect('postgres:///my_db')
c = conn.cursor()
@

<<expensive_query, cache=True>>
c.execute('''VERY EXPENSIVE QUERY''')
results = c.fetchall()
@
brandonwillard commented 6 years ago

What I think is reasonable to implement in caching logic currently stops short at the bytecode level.

Nonetheless, there are worthwhile, albeit less-than-automatic, ways around specific problems — like the one you mention — and I imagine most would involve some intervention by the user (e.g. specifying a form to evaluate that would determine whether or not remote content was changed).

If you wanted to get really fancy, though, you could add caching logic with awareness specific to libraries like psycopg2. In this specific instance, it's quite possible (but maybe not worthwhile) to determine the tables involved in the relevant data-generating queries, build dependency graphs for those, and use the same caching logic. Even better, if this sort of logic was tailored for data-frame abstraction libraries like Ibis and Blaze, it might be easier to implement and be more applicable!
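A naive sketch of that psycopg2-aware idea (the regex and the `table_versions` mapping are illustrative assumptions; a real version would use a SQL parser and ask the database for table state): fold the versions of the tables a query touches into its cache key, so the cache invalidates only when a relevant table changes.

```python
import hashlib
import re

def tables_in(query):
    """Very naive: names following FROM/JOIN. A real implementation
    would use a proper SQL parser."""
    return set(re.findall(r'\b(?:from|join)\s+([A-Za-z_][\w.]*)',
                          query, flags=re.IGNORECASE))

def cache_key(query, table_versions):
    """table_versions is a hypothetical {table: version} mapping,
    e.g. derived from row counts or modification timestamps."""
    parts = [query] + sorted(
        f"{t}={table_versions.get(t)}" for t in tables_in(query))
    return hashlib.sha256("\n".join(parts).encode()).hexdigest()
```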

fgregg commented 6 years ago

Okee doke. I think I'm going to start by implementing what knitr does by default.

In pseudocode:

old_globals = dict(globals())
for chunk in chunks:
    if chunk in cache:
        chunk.results = cache[chunk].results
        globals().update(cache[chunk].objects)
    else:
        chunk.results = eval(chunk)
        cache[chunk].results = chunk.results
        # what this chunk added or changed in globals
        cache[chunk].objects = {k: v for k, v in globals().items()
                                if k not in old_globals or old_globals[k] is not v}
    old_globals = dict(globals())

I think there are better things we can do in the future, but this seems pretty simple, and from working with knitr, it seems acceptable.
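A runnable version of that pseudocode, using a plain dict as the namespace and SHA-256 of the chunk source as the cache key (both illustrative choices):

```python
import hashlib

def run_document(chunks, cache):
    """Knitr-style per-chunk caching: on a hit, restore the objects the
    chunk created; on a miss, execute it and diff the namespace."""
    ns = {}
    for src in chunks:
        key = hashlib.sha256(src.encode()).hexdigest()
        if key in cache:
            ns.update(cache[key])
        else:
            before = dict(ns)
            exec(src, ns)
            ns.pop('__builtins__', None)   # injected by exec(); not chunk output
            cache[key] = {k: v for k, v in ns.items()
                          if k not in before or before[k] is not v}
    return ns
```

A second run with a primed cache skips execution entirely and still ends with the same namespace.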

brandonwillard commented 6 years ago

With source-string hashing for chunk validation, though, right?

fgregg commented 6 years ago

Yes.

brandonwillard commented 6 years ago

By the way, a great next step could involve AST-based hashing for cache and/or block validation, instead of string hashing. That way, inconsequential string changes in blocks wouldn't affect the cache (e.g. white spaces, comments, reorderings, var name changes, etc.)
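A quick sketch of the difference (whitespace and comments vanish in the AST; renames and reorderings, as noted, would still need more work than a plain `ast.dump`):

```python
import ast
import hashlib

def source_hash(src):
    """Plain string hashing: any edit, even whitespace, busts the cache."""
    return hashlib.sha256(src.encode()).hexdigest()

def ast_hash(src):
    """Hash the parsed AST instead, so comments and whitespace drop out."""
    return hashlib.sha256(ast.dump(ast.parse(src)).encode()).hexdigest()

a = "x = 1 + 2"
b = "x  =  1+2   # same thing"
```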

fgregg commented 6 years ago

I agree! Although, from working with knitr, I've often found it useful to bust the cache by adding in meaningless white space. Haha.

brandonwillard commented 6 years ago

Ha, yeah, I've had to do that, as well. However, I think this functionality could be provided by a [temporary] chunk option.

You know, this whole idea of AST-based caching really makes me wish I was working with JVM languages again. Seems like one could cover a whole lot of ground with a unified bytecode like Java's.