WarwickCIM / IM939_handbook

Data Science Across Disciplines' handbook
https://warwickcim.github.io/IM939_handbook/

Remove executed cells from notebooks #4

Closed ccamara closed 3 weeks ago

ccamara commented 1 year ago

AIM: Remove the outputs of executed cells from the notebooks and prevent them from being committed, while making sure that the rendered book still contains those outputs.

Detailed explanation:

Currently, the handbook uses a mix of Markdown (for static pages with no code that needs to be run) and Jupyter notebooks (with Quarto syntax, via the jupyterlab-quarto extension). These notebooks serve two different purposes:

  1. Create the online handbook (quarto will render them and will generate the handbook using quarto render or quarto publish gh-pages)
  2. Be downloaded by the students from the main branch, to be used during the workshop sessions

Jupyter notebooks store the output of any cell that has been executed. Besides adding unnecessary commits to the git history and making diffs difficult to read, this is problematic because students will see the resulting output without needing to run the code cells themselves. We want them to experiment with the code and to (at least) run it themselves.

We are currently addressing that by manually clearing the code cells and then pushing to the repo. Sadly, this is error-prone, especially considering that whenever quarto renders the book, the cells are run and the jupyter notebooks are therefore modified. This means that it is easy to overlook those changes and include them in the commit that renders the book.

We should not rely on manual supervision. Instead, an automated approach should be implemented.
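To make the problem concrete: a notebook is a JSON file in which every code cell carries an `outputs` list and an `execution_count`. The following is only a minimal sketch of the kind of automated check that could replace manual supervision, not something that exists in the repo. The function name `cells_with_output` and the sample notebook are hypothetical, made up for illustration.

```python
import json


def cells_with_output(notebook_json: str) -> list[int]:
    """Return the indices of code cells that still carry execution results."""
    nb = json.loads(notebook_json)
    executed = []
    for index, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells never store outputs
        # A cell counts as "executed" if it has stored outputs
        # or a non-null execution counter.
        if cell.get("outputs") or cell.get("execution_count") is not None:
            executed.append(index)
    return executed


# Hypothetical notebook: one markdown cell, one executed code cell,
# and one code cell that has never been run.
sample = json.dumps({
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
            "source": ["print(1 + 1)"],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["x = 1"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})

print(cells_with_output(sample))
```

A check like this could fail a commit whenever the returned list is not empty, although a ready-made tool is likely the more robust choice.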

After some initial research, a promising approach seems to be to use the pre-commit framework in combination with this hook: https://github.com/kynan/nbstripout

Alternative approaches:

Related readings:

ccamara commented 10 months ago

Try with pre-commit and add this hook: https://github.com/kynan/nbstripout

ccamara commented 1 month ago

@eshasadia , you may want to check this when you have some time.

ccamara commented 1 month ago

I believe this would address the issue.

I have added instructions on the readme file:

This handbook relies on jupyter notebooks. Quarto renders any *.ipynb file into the handbook and displays the output of any code block, according to the settings. Regretfully, that means that it executes every code cell, and jupyter therefore stores the results in the notebook too, which is not what we'd like to happen.

To prevent executed cells from being pushed to the repo in an automated way, the following command must be run once within the repository's root:

nbstripout --install

This will set up a git filter and is only needed once. Any notebook committed to the repo will then have the outputs of its executed cells stripped.
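For anyone wondering what "stripping" means in practice, here is a rough sketch of the idea. It is not nbstripout's actual implementation, which also handles notebook metadata and other details. The function name `strip_outputs` and the sample notebook are hypothetical and only there for illustration.

```python
import copy


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of the notebook with execution results removed."""
    stripped = copy.deepcopy(notebook)  # leave the caller's notebook untouched
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop stored results
            cell["execution_count"] = None  # reset the In[ ] counter
    return stripped


# Hypothetical notebook with a single executed code cell.
executed = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["hello\n"]}
            ],
            "source": ["print('hello')"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

clean = strip_outputs(executed)
print(clean["cells"][0]["outputs"], clean["cells"][0]["execution_count"])
```

The source code of each cell is kept, so students still get the full notebook and only need to run it themselves. Because the filter works on what git stores, the local working copy that quarto has just executed is not modified.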