executablebooks / jupyter-cache

A defined interface for working with a cache of executed jupyter notebooks
https://jupyter-cache.readthedocs.io
MIT License

Cell-level caching #89

Open JanPalasek opened 2 years ago

JanPalasek commented 2 years ago

Context

I work with notebooks that are expensive to compute. If I change one code cell at the end of the notebook, I do not expect the entire cache to be invalidated and the whole notebook to need recomputing; I want only the dependent cells to be recomputed.

Proposal

Assumption: notebooks are executed from top to bottom (I come from Quarto). If we work directly with Jupyter notebooks, imho we don't need caching like this; Jupyter does it pretty well on its own. I don't know if this package attempts to cover that case as well. I'd propose a cell-level cache: we would remember each cell individually, and if its source code or output changes, we would recompute only the changed cell and all cells that come after it. This would greatly improve performance when prototyping a notebook, because we would only recompute the dependent cells. I assume there are like 100 problems that I don't see; if you see any, please fill me in. It's also possible that this is a problem of Quarto and I misjudged the scope of this project.
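To make the proposed rule concrete, here is a minimal sketch (hypothetical, not part of jupyter-cache's current API; the function name is illustrative) of finding the first cell that needs re-execution by comparing old and new cell sources; everything from that index onward would be re-run:

def first_invalid_cell(old_sources, new_sources):
    # Return the index of the first cell whose source changed; all cells
    # from this index onward would be re-executed.
    for i, (old, new) in enumerate(zip(old_sources, new_sources)):
        if old != new:
            return i
    # Cells were only appended (or nothing changed): start after the common
    # prefix; an index equal to len(new_sources) means nothing to re-run.
    return min(len(old_sources), len(new_sources))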

Tasks and updates

To be filled in later, if the proposed solution turns out to be viable.

chrisjsewell commented 2 years ago

Heya. Well, the key problem (also for https://github.com/jupyter/nbclient/issues/248) is: what would you cache? You can't start execution halfway through a notebook unless you cached the entire state of the kernel. E.g. say you have three cells:

a = 1
b = 2
c = a + b

You can't run from cell 3, unless you've cached (and reloaded) the variables a and b

I don't know of an easy way to do this robustly?

JanPalasek commented 2 years ago

Ye, sorry for the (duplicate?) issue. I looked into nbclient and it seemed like this might be something that needs to be done in nbclient, though I'm not totally sure. I didn't see a method there to skip execution of already-cached cells.

Ye, that's true. I gave it some thought today, and it might be done by serializing the entire kernel state with dill. It has a function for that: dill.dump_module (previously dump_session). It supports serialization of all base objects except frame, generator, and traceback objects, and it works for pandas etc. However, if some object used in the notebook isn't supported by dill, it would always be possible to fall back to the current implementation of caching.
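For reference, a minimal sketch of what that could look like (this would have to run inside the kernel process, since that's where the state lives; dill.dump_module exists in dill >= 0.3.6, while older releases expose dump_session/load_session; the file name is just illustrative):

import dill

# After executing cells 0..n, persist the state of __main__ (the default module).
dill.dump_module("checkpoint_after_cell_n.pkl")

# Later, in a fresh kernel, restore that state before re-executing the cells
# that come after the checkpoint.
dill.load_module("checkpoint_after_cell_n.pkl")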

To make the caching efficient, we could build something like a checkpoint system: a checkpoint would be made after the nth cell and would serialize the entire state. Each checkpoint would carry a hash of the source code of all cells up to that point, and if any of those sources changed, the checkpoint would be invalidated.
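A hypothetical sketch of that lookup (names are illustrative): each checkpoint is keyed by a hash of the cell sources up to that cell, and execution resumes from the deepest checkpoint whose prefix hash still matches.

import hashlib

def prefix_hash(sources, upto):
    # Hash the source of cells 0..upto (inclusive), with a separator so
    # cell boundaries are part of the hash.
    h = hashlib.sha256()
    for src in sources[: upto + 1]:
        h.update(src.encode("utf-8"))
        h.update(b"\x00")
    return h.hexdigest()

def deepest_valid_checkpoint(sources, checkpoints):
    # checkpoints maps a prefix hash to a serialized kernel state (e.g. a dill dump).
    # Return (cell_index, state_file) for the deepest checkpoint that is still
    # valid, or None if execution has to start from the top.
    best = None
    for i in range(len(sources)):
        key = prefix_hash(sources, i)
        if key in checkpoints:
            best = (i, checkpoints[key])
    return best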

Further optimizations could be made to prevent the cache from being so memory hungry, such as:

The main things that imo need to be tested:

I'm very interested in your opinion about these suggestions. I could also potentially help with some of the tasks.

JanPalasek commented 1 year ago

@chrisjsewell Will you accept a PR if someone manages to come up with a good solution? (probably taking some inspiration from knitr)

chrisjsewell commented 1 year ago

Heya yeh definitely interested thanks