Design document

Goal

The goal here is to describe and finish designing the mechanism to "rebase" changes.

When user A tries to commit a change, the commit currently fails if user B has committed since A's session started. This is the best and safest default, but it's not necessarily what A wants every time. For example, maybe A wrote to array /array_a and B wrote to /array_b, and those changes are unrelated. In a case like that, A may decide to go ahead with the commit anyway, accepting the risks, provided they know exactly what B's changes were.
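
The current behavior can be illustrated with a minimal sketch. The `Branch` class, the `ConflictError` exception, and the snapshot id format are all hypothetical names introduced for illustration only; they are not the actual API.

```python
class ConflictError(Exception):
    """Raised when the branch tip moved after the session started."""


class Branch:
    # Hypothetical stand-in for a branch whose tip is a snapshot id.
    def __init__(self):
        self.tip = "snapshot-0"
        self.counter = 0

    def commit(self, base, written_paths):
        # Safe default: refuse to commit if anyone committed after `base`,
        # regardless of which paths were written.
        if base != self.tip:
            raise ConflictError(f"tip is {self.tip}, session started at {base}")
        self.counter += 1
        self.tip = f"snapshot-{self.counter}"
        return self.tip


branch = Branch()
base_a = branch.tip  # user A starts a session
base_b = branch.tip  # user B starts a session

branch.commit(base_b, {"/array_b"})  # B commits first

try:
    branch.commit(base_a, {"/array_a"})
    a_failed = False
except ConflictError:
    # A fails even though /array_a and /array_b are unrelated.
    a_failed = True
```

In this sketch A's commit is rejected purely because the tip moved, which is the situation a rebase is meant to address.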

A rebase, then, is the process of "merging" a change, potentially modifying it, on top of other pre-existing changes.

We want to provide:

- A mechanism for users to execute a rebase after a failed commit.
- A way for users to define which changes are OK to rebase and which are not, and how their changes must be modified for a clean rebase. Example: if the user wrote to an array but a previous commit deleted that array, the user may choose either to fail their commit or to rebase while ignoring any writes to that array.
- An explanation of why a rebase failed, whenever it does.

Transaction logs

As part of this change we will introduce the concept of a TransactionLog. These are files we will store on disk, in their own prefix, and with the same id as the corresponding snapshot. The transaction log contains a somewhat expanded serialization of the ChangeSet.

They provide at least two benefits:

- An easy way to know what the conflicting commits changed, so that rebases can be executed without comparing snapshots (which would be very expensive).
- In the future, an easy way to provide diff functionality.
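
As a rough sketch of the first point, assume each transaction log can be reduced to the set of node paths its commit touched. The `conflicting_paths` helper and the shape of `logs` are assumptions made for illustration, not an existing API.

```python
def conflicting_paths(my_paths, logs_since_base):
    """Return, per snapshot id, the paths touched by both us and that commit."""
    overlaps = {}
    for snapshot_id, their_paths in logs_since_base.items():
        common = my_paths & their_paths
        if common:
            overlaps[snapshot_id] = common
    return overlaps


# Hypothetical logs for the two commits made after our session started.
logs = {
    "snap-1": {"/array_b"},
    "snap-2": {"/array_a", "/group/c"},
}
result = conflicting_paths({"/array_a"}, logs)
```

Only the small logs are read here; no snapshot contents need to be loaded or compared.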

Transaction logs will be generated from the ChangeSet (and probably a bit of extra information, like the list of existing nodes), and they will be written during the commit process.
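
One possible shape for this step is sketched below. The `ChangeSet` fields, the `transactions/` prefix, and the JSON encoding are all assumptions made for the example; the real ChangeSet and on-disk format may differ.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ChangeSet:
    # Hypothetical, simplified fields; the real ChangeSet holds more.
    new_arrays: set = field(default_factory=set)
    deleted_arrays: set = field(default_factory=set)
    updated_attributes: set = field(default_factory=set)
    # array path -> set of chunk coordinates written
    written_chunks: dict = field(default_factory=dict)


def transaction_log_key(snapshot_id):
    # Assumed layout: logs live in their own prefix, keyed by snapshot id.
    return f"transactions/{snapshot_id}"


def serialize_transaction_log(change_set):
    """Expand a ChangeSet into a JSON document with deterministic ordering."""
    doc = {
        "new_arrays": sorted(change_set.new_arrays),
        "deleted_arrays": sorted(change_set.deleted_arrays),
        "updated_attributes": sorted(change_set.updated_attributes),
        "written_chunks": {
            path: sorted(list(coord) for coord in coords)
            for path, coords in sorted(change_set.written_chunks.items())
        },
    }
    return json.dumps(doc)


cs = ChangeSet(
    new_arrays={"/array_a"},
    written_chunks={"/array_a": {(0, 1), (0, 0)}},
)
key = transaction_log_key("SNAPSHOT_ID")
payload = serialize_transaction_log(cs)
```

Sorting the entries keeps the output stable across runs, which would make logs easier to compare and test.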

Transaction logs can be made optional. For ultimate performance users may choose not to use them, but in that case, they'll be giving up on rebase and diff functionality.

Conflict resolution

In the most detailed case, conflict resolution could be done interactively. Users may want to investigate their own change, together with the diffs of the conflicting changes, and decide in full detail how to modify their change for the rebase. This sounds like a very advanced use case, and we don't need to support it initially. We just need to make sure it remains possible in the future.

In the simpler case, the user will run a rebase after a commit fails with a conflict. They will call a rebase function, passing a ConflictSolver that encodes the policy for dealing with different types of conflicts.
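
A minimal sketch of that flow follows. Only the name ConflictSolver comes from this document; the `solve` method, the `on_write_to_deleted` option, and the `Change`, `TransactionLog`, and `RebaseFailed` types are hypothetical and cover a single conflict type.

```python
from dataclasses import dataclass, field


class RebaseFailed(Exception):
    """Carries a human-readable explanation of why the rebase stopped."""


@dataclass
class Change:
    # array path -> set of chunk coordinates written in this session
    written_chunks: dict = field(default_factory=dict)


@dataclass
class TransactionLog:
    snapshot_id: str
    deleted_arrays: set = field(default_factory=set)


class ConflictSolver:
    """Policy object: returns a (possibly modified) change or raises."""

    def __init__(self, on_write_to_deleted="fail"):
        self.on_write_to_deleted = on_write_to_deleted  # "fail" or "ignore"

    def solve(self, log, change):
        clashing = set(change.written_chunks) & log.deleted_arrays
        if not clashing:
            return change
        if self.on_write_to_deleted == "ignore":
            kept = {
                path: coords
                for path, coords in change.written_chunks.items()
                if path not in clashing
            }
            return Change(written_chunks=kept)
        raise RebaseFailed(
            f"{log.snapshot_id} deleted {sorted(clashing)}, "
            "which this change writes to"
        )


def rebase(change, logs, solver):
    # Replay our change on top of each conflicting commit, oldest first.
    for log in logs:
        change = solver.solve(log, change)
    return change


logs = [TransactionLog("snap-1", deleted_arrays={"/array_a"})]
mine = Change(written_chunks={"/array_a": {(0, 0)}, "/array_b": {(1, 1)}})

rebased = rebase(mine, logs, ConflictSolver(on_write_to_deleted="ignore"))

try:
    rebase(mine, logs, ConflictSolver(on_write_to_deleted="fail"))
    failure = None
except RebaseFailed as err:
    failure = str(err)
```

With the "ignore" policy the writes to the deleted array are dropped and the rest of the change survives; with "fail" the exception message says which commit caused the failure and why.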

Some conflict resolution examples

- If two changes write to the same chunk, the user can select ours or theirs
- If writes happen to an entity deleted in a previous change, we may support: ignore the write or fail the rebase
- TODO: more
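
The ours/theirs choice for chunks could look roughly like the sketch below. The `ChunkPolicy` enum and `surviving_writes` function are hypothetical, and chunks are represented as a plain mapping from coordinates to an opaque payload reference.

```python
from enum import Enum


class ChunkPolicy(Enum):
    OURS = "ours"      # keep the chunk written by the change being rebased
    THEIRS = "theirs"  # keep the chunk from the previously committed change


def surviving_writes(ours, their_coords, policy):
    """Return the subset of our chunk writes to apply after the rebase."""
    if policy is ChunkPolicy.OURS:
        return dict(ours)
    # THEIRS: drop every write that collides with an already committed chunk.
    return {
        coord: payload
        for coord, payload in ours.items()
        if coord not in their_coords
    }


ours = {(0, 0): "ours-00", (0, 1): "ours-01"}
their_coords = {(0, 0), (1, 1)}

keep_ours = surviving_writes(ours, their_coords, ChunkPolicy.OURS)
keep_theirs = surviving_writes(ours, their_coords, ChunkPolicy.THEIRS)
```

Chunks that only one side wrote are never in conflict, so they are unaffected by the policy.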

Exhaustive list of conflicts and resolutions

This is WIP

When a previous change deleted an array:
- if chunks were written to it: recoverable by not applying the change
- if user attributes were set: recoverable by not applying the change
- if metadata was changed: recoverable by not applying the change
When a previous change deleted a group:
- if user attributes were set on it: recoverable by not applying the change
- if a new array is created inside of it: recoverable by re-creating the implicit group
When a previous change creates an array:
- if a node is created on the same path: recoverable by not applying the change
- if an implicit group is created on the same path: recoverable by not applying the change
When a previous change creates a group:
- if a node is created on the same path, except if it's implicit
When a previous change updates user attributes:
- if the same node attributes are also updated
When a previous change updates zarr metadata
When a previous change writes/deletes a chunk
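
To make the delete cases in this list concrete, here is a sketch that detects them and attaches the resolution stated above. It covers only the rows that already have a resolution; the `Changes` type and `detect_conflicts` function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Changes:
    # Hypothetical, flattened view of what one commit did, keyed by node path.
    deleted_arrays: set = field(default_factory=set)
    deleted_groups: set = field(default_factory=set)
    new_arrays: set = field(default_factory=set)
    chunk_writes: set = field(default_factory=set)
    attribute_updates: set = field(default_factory=set)
    metadata_updates: set = field(default_factory=set)


SKIP = "do not apply the change"
RECREATE = "re-create the implicit group"


def detect_conflicts(previous, current):
    """Return (path, description, resolution) tuples for the delete cases."""
    found = []
    for path in sorted(previous.deleted_arrays):
        if path in current.chunk_writes:
            found.append((path, "chunks written to a deleted array", SKIP))
        if path in current.attribute_updates:
            found.append((path, "attributes set on a deleted array", SKIP))
        if path in current.metadata_updates:
            found.append((path, "metadata changed on a deleted array", SKIP))
    for path in sorted(previous.deleted_groups):
        if path in current.attribute_updates:
            found.append((path, "attributes set on a deleted group", SKIP))
        prefix = path.rstrip("/") + "/"
        for array in sorted(current.new_arrays):
            if array.startswith(prefix):
                found.append(
                    (array, "array created inside a deleted group", RECREATE)
                )
    return found


previous = Changes(deleted_arrays={"/a"}, deleted_groups={"/g"})
current = Changes(chunk_writes={"/a"}, new_arrays={"/g/x", "/other"})
conflicts = detect_conflicts(previous, current)
```

Returning a description alongside each conflict is one way to support explaining why a rebase failed.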

earth-mover / icechunk

Conflict detection and rebase #374