executablebooks / jupyter-cache

A defined interface for working with a cache of executed jupyter notebooks
https://jupyter-cache.readthedocs.io
MIT License

Review notebook cacheing and execution packages #3

Open choldgraf opened 4 years ago

choldgraf commented 4 years ago

A place to discover and list other tools that provide some form of notebook caching, execution, or storage abstraction.

chrisjsewell commented 4 years ago

TinyDB's `CachingMiddleware` wraps a storage class so that reads are served from memory and writes are flushed to disk only periodically or on close:

```python
>>> from tinydb import TinyDB
>>> from tinydb.storages import JSONStorage
>>> from tinydb.middlewares import CachingMiddleware
>>> db = TinyDB('/path/to/db.json', storage=CachingMiddleware(JSONStorage))
```
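To make the idea concrete, here is a minimal stdlib-only sketch of the write-caching pattern that `CachingMiddleware` implements (the class name and `write_every` parameter are hypothetical, not TinyDB's actual internals):

```python
import json

class CachingStorage:
    """Sketch of a write-caching storage: reads hit the in-memory
    cache, and writes are flushed to disk only every `write_every`
    calls or on close (the idea behind TinyDB's CachingMiddleware)."""

    def __init__(self, path, write_every=10):
        self.path = path
        self.write_every = write_every
        self._writes = 0
        try:
            with open(path) as f:
                self._cache = json.load(f)
        except FileNotFoundError:
            self._cache = {}

    def read(self):
        # reads never touch the disk
        return self._cache

    def write(self, data):
        self._cache = data
        self._writes += 1
        if self._writes >= self.write_every:
            self.flush()

    def flush(self):
        with open(self.path, "w") as f:
            json.dump(self._cache, f)
        self._writes = 0

    def close(self):
        self.flush()
```

The trade-off is the usual one: far fewer disk writes, at the cost of losing unflushed data if the process dies before `close()`.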
chrisjsewell commented 4 years ago

scrapbook contains (in-memory only) classes representing a collection of notebooks (`Scrapbook`) and a single notebook (`Notebook`).

Of note is that these have methods for returning notebook/cell execution metrics (like time taken), which they presumably store during notebook execution.

They also provide methods to access 'scraps', which are outputs stored under name identifiers (see ExecutableBookProject/myst_parser#46).
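A hypothetical sketch of the kind of interface described above (these class shapes and field names are illustrative, not scrapbook's actual API): a `Notebook` carrying named scraps and per-cell timing, collected into a `Scrapbook`:

```python
from dataclasses import dataclass, field

@dataclass
class Notebook:
    """A single notebook: named output 'scraps' plus per-cell
    execution timings recorded during a run."""
    path: str
    scraps: dict = field(default_factory=dict)       # name -> stored output
    cell_timing: dict = field(default_factory=dict)  # cell index -> seconds

    @property
    def metrics(self):
        # aggregate execution metric across all cells
        return {"total_seconds": sum(self.cell_timing.values())}

@dataclass
class Scrapbook:
    """A collection of notebooks, keyed by path."""
    notebooks: dict = field(default_factory=dict)

    def add(self, nb: Notebook):
        self.notebooks[nb.path] = nb

    def collect(self, name):
        # gather a named scrap from every notebook that defines it
        return {p: nb.scraps[name]
                for p, nb in self.notebooks.items() if name in nb.scraps}
```

The useful property for caching is that both the outputs and the execution metrics live alongside the notebook object, so a cache layer can decide what to persist.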

chrisjsewell commented 4 years ago

This is the link to the caching currently implemented by @mmcky and @AakashGfude: https://github.com/QuantEcon/sphinxcontrib-jupyter/blob/b5d9b2e77fdc571c4c718e67847020625d096d6d/sphinxcontrib/jupyter/builders/jupyter_code.py#L119

chrisjsewell commented 4 years ago

Another thought I had is to look at git itself, e.g. via GitPython. I could conceive of the cache being its own small repository: when you add or update a notebook you 'stage' it, then on execution you take all the 'staged' notebooks, run them, and commit the final notebooks back.
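A minimal stdlib-only sketch of that stage/run/commit workflow, using content hashes in place of real git objects (no GitPython; all names here are hypothetical):

```python
import hashlib

class StagingCache:
    """Notebooks are staged by path, executed, and the executed
    versions are committed keyed by a hash of the source, so an
    unchanged notebook is never re-run."""

    def __init__(self):
        self._staged = {}     # path -> source text
        self._committed = {}  # source hash -> executed text

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode()).hexdigest()

    def stage(self, path, source):
        self._staged[path] = source

    def staged(self):
        # staged notebooks whose source hash is not yet committed
        return {p: s for p, s in self._staged.items()
                if self._key(s) not in self._committed}

    def commit(self, source, executed):
        self._committed[self._key(source)] = executed

    def lookup(self, source):
        # cache hit only if this exact source was executed before
        return self._committed.get(self._key(source))
```

Backing the `_committed` store with an actual git repository would add history and rollback for free, which is presumably the appeal of the GitPython route.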

choldgraf commented 4 years ago

> Another thought I had is to look at git itself, e.g. via GitPython. I could conceive of the cache being its own small repository: when you add or update a notebook you 'stage' it, then on execution you take all the 'staged' notebooks, run them, and commit the final notebooks back.

I think this is the kind of thing that some more bespoke notebook UIs do. E.g., I believe that Gigantum.IO (a proprietary cloud interface for notebooks) commits notebooks to a git repository on-the-fly, and then gives you the option to go back in history if needed. I don't believe they do any execution caching, just content caching.

eldad-a commented 4 years ago

Thank you for creating this helpful resource!

As I am on the search myself, here is another pointer (which I still need to explore):

`dask.cache` and `cachey`
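For context, cachey's distinguishing idea is cost-aware caching: entries are scored by how expensive they were to compute, and cheap-to-recompute entries are evicted first. A toy stdlib sketch of that idea (the class, scoring rule, and parameters here are simplified illustrations, not cachey's actual implementation):

```python
class CostAwareCache:
    """Toy cost-aware cache: each entry carries a 'cost' (e.g. compute
    time), and when full, the cheapest-to-recompute entry is evicted,
    but only if the newcomer is more expensive than it."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self._data = {}   # key -> value
        self._cost = {}   # key -> cost score

    def put(self, key, value, cost):
        if len(self._data) >= self.capacity and key not in self._data:
            # candidate victim: the entry cheapest to recompute
            victim = min(self._cost, key=self._cost.get)
            if cost <= self._cost[victim]:
                return  # newcomer is cheaper than everything cached
            del self._data[victim]
            del self._cost[victim]
        self._data[key] = value
        self._cost[key] = cost

    def get(self, key, default=None):
        return self._data.get(key, default)
```

This matters for notebook caching because notebooks vary enormously in execution time, so "keep the expensive ones" is a much better policy than plain LRU.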