executablebooks / jupyter-cache

A defined interface for working with a cache of executed jupyter notebooks
https://jupyter-cache.readthedocs.io
MIT License

Cell-level caching #89

Open JanPalasek opened 2 years ago

JanPalasek commented 2 years ago

Context

I work with notebooks that are expensive to compute. If I change one code cell at the end of the notebook, I do not expect the entire cache to be invalidated and the whole notebook to need recomputing; I want only the dependent cells to be recomputed.

Proposal

Assumption: notebooks are executed from top to bottom (I come from Quarto). If we work directly with Jupyter notebooks, imho we don't need caching like this; Jupyter does it pretty well on its own. I don't know if this package attempts to cover that case as well. I'd propose a cell-level cache: we would remember each cell individually, and if its source code or output changes, we would recompute only the changed cell and all cells that come after it. This would greatly improve performance when prototyping a notebook, because we would only recompute the dependent cells. I assume there are like 100 problems that I don't see; if you see any, please fill me in. It's also possible that this is a problem of Quarto and I misjudged the scope of this project.
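To make the proposed rule concrete, here is a minimal sketch (hypothetical, not part of jupyter-cache's current API; the function name is illustrative) of finding the first cell that needs re-execution by comparing old and new cell sources; everything from that index onward would be re-run:

def first_invalid_cell(old_sources, new_sources):
    # Return the index of the first cell whose source changed; all cells
    # from this index onward would be re-executed.
    for i, (old, new) in enumerate(zip(old_sources, new_sources)):
        if old != new:
            return i
    # Cells were only appended (or nothing changed): start after the common
    # prefix; an index equal to len(new_sources) means nothing to re-run.
    return min(len(old_sources), len(new_sources))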

Tasks and updates

To be filled in later, if the proposed solution turns out to be viable.

chrisjsewell commented 2 years ago

Heya. Well, the key problem (also for https://github.com/jupyter/nbclient/issues/248) is: what would you cache? You can't start execution halfway through a notebook unless you cached the entire state of the kernel. E.g. say you have three cells:

a = 1
b = 2
c = a + b

You can't run from cell 3, unless you've cached (and reloaded) the variables a and b

I don't know of an easy way to do this robustly?

JanPalasek commented 2 years ago

Ye, sorry for the (duplicate?) issue. I looked into nbclient and it seemed like this might be something that needs to be done in nbclient, though I'm not totally sure. I didn't see a method there to skip execution of already-cached cells.

Ye, that's true. I gave it some thought today, and it might be done by serializing the entire kernel state with dill. It has a function for that: dill.dump_module (previously dump_session). It supports serialization of all base objects except frame, generator, and traceback objects, and it works for pandas etc. However, if some object used in the notebook isn't supported by dill, it would always be possible to fall back to the current implementation of caching.
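For reference, a minimal sketch of what that could look like (this would have to run inside the kernel process, since that's where the state lives; dill.dump_module exists in dill >= 0.3.6, while older releases expose dump_session/load_session; the file name is just illustrative):

import dill

# After executing cells 0..n, persist the state of __main__ (the default module).
dill.dump_module("checkpoint_after_cell_n.pkl")

# Later, in a fresh kernel, restore that state before re-executing the cells
# that come after the checkpoint.
dill.load_module("checkpoint_after_cell_n.pkl")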

To make the caching efficient, we could build something like a checkpoint system: a checkpoint would be made after the nth cell and would serialize the entire state. Each checkpoint would carry a hash of the source code of all cells up to that point, and if any of those sources changed, the checkpoint would be invalidated.
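A hypothetical sketch of that lookup (names are illustrative): each checkpoint is keyed by a hash of the cell sources up to that cell, and execution resumes from the deepest checkpoint whose prefix hash still matches.

import hashlib

def prefix_hash(sources, upto):
    # Hash the source of cells 0..upto (inclusive), with a separator so
    # cell boundaries are part of the hash.
    h = hashlib.sha256()
    for src in sources[: upto + 1]:
        h.update(src.encode("utf-8"))
        h.update(b"\x00")
    return h.hexdigest()

def deepest_valid_checkpoint(sources, checkpoints):
    # checkpoints maps a prefix hash to a serialized kernel state (e.g. a dill dump).
    # Return (cell_index, state_file) for the deepest checkpoint that is still
    # valid, or None if execution has to start from the top.
    best = None
    for i in range(len(sources)):
        key = prefix_hash(sources, i)
        if key in checkpoints:
            best = (i, checkpoints[key])
    return best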

Further optimizations could be made to prevent the cache from being so memory hungry, such as:

The main things that imo need to be tested:

I'm very interested in your opinion about these suggestions. I could also potentially help with some of the tasks.

JanPalasek commented 1 year ago

@chrisjsewell Will you accept a PR if someone manages to come up with a good solution? (probably taking some inspiration from knitr)

chrisjsewell commented 1 year ago

Heya yeh definitely interested thanks