fonsp commented 8 months ago

This is a good first issue, and requires no JavaScript knowledge!

We recently launched Safe preview (#2563, take a look), but it has one missing feature!

When you open a notebook without running it, and you make a small change (eg fix a typo in a cell), Pluto will save the file, but the order of cells in the file can be drastically different, and you get a big git diff. Try this with one of your more complex notebooks to see what I mean, and take a look at which cells get moved and why.

This is also an issue currently in some more niche areas:

Pluto.activate_notebook_environment and Pluto.update_notebook_environment will modify the embedded pkg info (yay) but also mess up the cell order (boo).

If you open a notebook and shut it down before it finished, you have the same problem.

The cause

Pluto saves cells in the .jl file in an order that allows the cells to be executed as a standalone script. It essentially takes the visual cell order, and then we do a stable sort based on cell dependencies.

The reason why the issue happens is that Pluto often needs to evaluate your code to fully understand cell dependencies. The main examples are

using Plots defines the symbol scatter
a macro can create or remove variable definitions and references, e.g. @gensym x looks like it references x, but it actually defines x

You cannot derive this info statically, so we run code.

When you open a notebook without running it, we only have information that can be derived statically. When we save your notebook, we do the best we can using that information, but this can lead to a very different file order than what was previously derived (based on complete information), leading to a large diff.

Solution idea

My idea for solving this is: when we load a notebook file, also remember the original order in which cells appear in the file. Whenever you save a notebook, check if we have enough information to fully derive the order. If so, remove the memorized order and do as usual. If not, try to stick to the original file order as much as possible (i.e. use the original file order as a way to disambiguate the unknown cell deps).

Relevant code is here:

Pluto's reactivity algorithm. The core is this function that sorts a list of cells into search order https://github.com/fonsp/Pluto.jl/blob/v0.19.29-src/src/analysis/topological_order.jl#L10-L137

The output of this function is: https://github.com/fonsp/Pluto.jl/blob/v0.19.29-src/src/analysis/TopologicalOrder.jl

One of the inputs of this function is: https://github.com/fonsp/Pluto.jl/blob/v0.19.29-src/src/analysis/Topology.jl#L28-L36 which is a snapshot of the cells, and their symbol definitions etc that are known at one point in time.

The Notebook struct: https://github.com/fonsp/Pluto.jl/blob/v0.19.29-src/src/notebook/Notebook.jl#L24-L65

Reading and writing notebook files: https://github.com/fonsp/Pluto.jl/blob/v0.19.29-src/src/notebook/saving%20and%20loading.jl

For tests, search for the word "order" in this file: https://github.com/fonsp/Pluto.jl/blob/v0.19.29-src/test/React.jl Also search for only_versions_or_lineorder_differ in the source code.

filchristou commented 8 months ago

I am quite unfamiliar with the code base, so maybe a that's long shot here.. Remembering the original order in which cells appear in the file means that this TopologicalOrder should be stored in the Notebook struct. That looks invasive, since a new struct field must be added.

I would personally strive for a solution to keep a single TopologicalOrder around. Can we make this choice directly during load_notebook ? So, trust the script *.jl cell orders to directly get the TopologicalOrder (e.g. in _cached_topological_order), without doing static_resolve_topology. A version of static_resolve_topology will serve later to incrementally update the TopologicalOrder if we don't have the means to evaluate code, but the notebook is modified.

I am also troubled a bit with

checking if we have enough information to fully derive the order.

How would you conduct this check ? Is Notebook.process_status reliable for such a thing ?

BioTurboNick commented 8 months ago

I'm poking at this a bit and also generally unfamiliar.

My naive thinking is:

Store initial order on load as the current topological order.
When saving while in safe mode, always use that order, removing cells if deleted or appending any new ones to the end.
When saving while not in safe mode and exiting safe mode, run notebook and use the computed order as normal.

Since this seems simple enough, I feel like I must be missing something?

fonsp commented 8 months ago

You mean storing the order as _cached_topological_order, right? That might work! Adding a new field would be cleaner though, the idea is that _cached_topological_order does not store any information on its own, it just caches the output of topological_order(nb.topology).

But feel free to try! Writing the tests without implementation would already be super helpful.

BioTurboNick commented 8 months ago

Is this understanding correct?

There's no valid "Topology" of a notebook until it has been saved.
There can only be one topology for a notebook at a time.
Topology is recalculated each time it is saved/run, overwrites the previous, and the file is reordered as needed without respect to the previous order.
In safe mode, there is no ability to calculate topology, so it is falling back to some naive ordering.

So, is there any fundamental reason why this could not be done easily?

On load, calculate an initial topology from the file ordering and store the same way any topology is stored.
On save, when in safe mode, reuse stored topology instead of recalculating. (With removed/appended cells as needed)

fonsp commented 8 months ago

Yes that sounds correct and like a good idea!

One thing to note is that generally, saving does not cause calculations, but calculations (like a new topology) are caused by code execution (and obviously, code edits). For example, when you exit safe mode, we execute all cells, which means that more information about the toplogy becomes available (like an executed using or macro definitions) and we recalculate the toplogy, and save again.

Feel free to start a draft PR! I think your approach sounds good. I'm curious to see what it looks like when the notebook.topology is used to store the "imported topology". I can imagine that a new field eg notebook.imported_topology will be cleaner but that would be a small tweak :) Thanks for taking the time to look into this!

BioTurboNick commented 7 months ago

Is there an example notebook that has an issue? Because in safe preview it seems like it's ordering cells appropriately in simple examples.

fonsp / Pluto.jl

Preserve cell order in Safe preview #2691

The cause

Solution idea

Relevant code is here: