earth-mover / icechunk

Open-source, cloud-native transactional tensor storage engine
https://icechunk.io
Apache License 2.0

Add `git status` functionality #309

Open paraseba opened 3 weeks ago

dcherian commented 1 week ago

Design Doc

Goal

Our goal is to have .status() surface useful information to the user. Since status is inherently backward-looking, I expect users will approach it with four questions in mind:

  1. Where did I start?
  2. What did I do?
    • Did I create a new group?
    • Did I delete an existing group?
    • Did I update an existing group?
      • Did I update attributes of this group?
      • Did I create a new array?
      • Did I delete an array?
      • Did I modify existing arrays?
      • Did I modify array attributes?
      • Are these chunk modifications?
        • Did I write new chunks?
        • Did I delete existing chunks?
        • Did I overwrite existing chunks?
      • How did I modify the chunks (if relevant)?
        • What fraction of the array was modified?
        • Did I append?
        • Did I write a region?
  3. Wait, I didn't know I did that?
    • We should let the user interrogate the changeset to see exactly which changes were made, e.g. "you added an attribute to the array at this path".
  4. What do I do next?
    • Commit!

Information to surface

  1. For Q1, we should surface VCS history information:
    a. Repo bucket
    b. Base Snapshot ID and commit time
    c. Do we need to surface a truncated commit message too?
    d. Current branch

  2. For Q2, we should surface information in the current ChangeSet:

    • new_groups, new_arrays
    • deleted_groups, deleted_arrays
    • updated_arrays -> zarr array metadata
    • updated_attributes -> group/array user attributes
    • set_chunks -> modified chunks (create/delete/overwrite)
  3. For Q3, we punt to later :)

  4. For Q4, we should encourage the user to make a commit by saying loud and clear that these are "uncommitted changes" and will be lost (?).
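A rough sketch of how the Q1, Q2 and Q4 information could come together in a one-line headline. Everything here is hypothetical: `PendingChanges` is a simplified stand-in for the real ChangeSet (paths as plain strings, field names borrowed from the list above), and `headline` and its wording are illustration only.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified view of the session's change set.
/// Each field holds the paths touched in that category.
#[derive(Default)]
struct PendingChanges {
    new_groups: BTreeSet<String>,
    new_arrays: BTreeSet<String>,
    deleted_groups: BTreeSet<String>,
    deleted_arrays: BTreeSet<String>,
    updated_arrays: BTreeSet<String>,
    updated_attributes: BTreeSet<String>,
    set_chunks: BTreeSet<String>, // arrays with created/deleted/overwritten chunks
}

impl PendingChanges {
    /// Number of (category, path) entries; a path touched in two
    /// categories is counted twice.
    fn total(&self) -> usize {
        [
            &self.new_groups,
            &self.new_arrays,
            &self.deleted_groups,
            &self.deleted_arrays,
            &self.updated_arrays,
            &self.updated_attributes,
            &self.set_chunks,
        ]
        .iter()
        .map(|set| set.len())
        .sum()
    }
}

/// Build the first line of a status report: where the user started (Q1),
/// how much changed (Q2), and a nudge to commit (Q4).
fn headline(branch: &str, base_snapshot: &str, changes: &PendingChanges) -> String {
    let n = changes.total();
    if n == 0 {
        format!("On branch {}, at snapshot {}: nothing to commit", branch, base_snapshot)
    } else {
        format!(
            "On branch {}, at snapshot {}: {} uncommitted change(s), commit to keep them",
            branch, base_snapshot, n
        )
    }
}
```

Sorted sets keep the output stable across runs, which should make two status outputs easier to compare.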

How to surface

As a tree?

One particularly neat way to surface this information would be to show the hierarchy as a tree. We could construct a tree for the snapshot + changeset (the new-tree), then iterate through the changeset and add information to the appropriate node of the tree. During diffing, if two nodes are in the same place, we examine the changeset for metadata and chunk modifications and annotate the new-tree appropriately. The diffed-tree structure could then be rendered to text in Rust and transferred to Python for rendering with rich (for example).
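A minimal sketch of the annotated-tree idea, under simplifying assumptions: paths are plain `/`-separated strings, annotations are free-form notes, and the tree is built only from the paths being annotated, not from a real snapshot. `Node`, `annotate` and `render` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// One node of the hierarchy; `notes` holds the change annotations.
#[derive(Default)]
struct Node {
    children: BTreeMap<String, Node>, // BTreeMap keeps siblings sorted by name
    notes: Vec<String>,
}

impl Node {
    /// Attach `note` to the node at `path` (e.g. "/group/array"),
    /// creating intermediate nodes as needed.
    fn annotate(&mut self, path: &str, note: &str) {
        let mut node = self;
        for part in path.split('/').filter(|p| !p.is_empty()) {
            node = node.children.entry(part.to_string()).or_default();
        }
        node.notes.push(note.to_string());
    }

    /// Render all descendants as an indented text tree.
    fn render(&self, depth: usize, out: &mut String) {
        for (name, child) in &self.children {
            out.push_str(&"  ".repeat(depth));
            out.push_str(name);
            if !child.notes.is_empty() {
                out.push_str(&format!(" [{}]", child.notes.join(", ")));
            }
            out.push('\n');
            child.render(depth + 1, out);
        }
    }
}

/// Example usage with made-up paths and notes.
fn example_tree() -> String {
    let mut root = Node::default();
    root.annotate("/ocean/temperature", "chunks written");
    root.annotate("/ocean", "attributes updated");
    root.annotate("/atmosphere", "group created");
    let mut out = String::new();
    root.render(0, &mut out);
    out
}
```

In this sketch the text rendering happens on the Rust side; a Python layer could take the same structure and hand it to rich instead.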

As plain text?

We could simply output formatted lists of created/updated/deleted groups/arrays.
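For comparison, a sketch of the plain-text variant. The function names and section titles are hypothetical; paths are plain strings, and each section is sorted so that output is easy to scan.

```rust
/// Format one titled section as a sorted list of paths.
/// Returns an empty string when there is nothing to show.
fn section(title: &str, paths: &[&str]) -> String {
    if paths.is_empty() {
        return String::new();
    }
    let mut sorted = paths.to_vec();
    sorted.sort_unstable();
    let mut out = format!("{}:\n", title);
    for path in sorted {
        out.push_str("  ");
        out.push_str(path);
        out.push('\n');
    }
    out
}

/// Concatenate the created/updated/deleted sections into one report.
fn plain_status(created: &[&str], updated: &[&str], deleted: &[&str]) -> String {
    let mut out = String::new();
    out.push_str(&section("Created", created));
    out.push_str(&section("Updated", updated));
    out.push_str(&section("Deleted", deleted));
    out
}
```

Empty sections are skipped entirely, so a session with only deletions prints a single short list.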

Misc

Some things to be careful about:

  1. Initial commit should not be confusing to the user.
  2. Comparing two different status outputs should be easy.
    1. Information should be easy to scan; sorting is a good idea.

paraseba commented 1 week ago

I'll add another option to "How to surface":

That could help us answer question 3 in the future.

Example:

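// One entry per changed path, each with the list of changes made there.
struct Status(Vec<(Path, Vec<Change>)>);

// A single change recorded against a node in the hierarchy.
enum Change {
    ArrayCreated(ZarrMetadata),
    ArrayMetadataUpdated(ZarrMetadata),
    ChunksDeleted(Vec<ChunkIndices>),
    ChunksWritten(Vec<ChunkIndices>),
    GroupDeleted,
    // ....
}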

This would give us a high-level language. Then we can build different types of formatters on top of it.
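To illustrate, here is one possible formatter over such a structure. The types are stand-ins, not icechunk's real definitions: `Path` is a `String`, `ChunkIndices` is a `Vec<u32>`, `ZarrMetadata` is an empty placeholder, and only the variants named above are included. `describe` and `format_plain` are hypothetical names.

```rust
// Stand-ins for the real types, simplified for this sketch (assumptions).
type Path = String;
type ChunkIndices = Vec<u32>;
struct ZarrMetadata; // placeholder with no fields

enum Change {
    ArrayCreated(ZarrMetadata),
    ArrayMetadataUpdated(ZarrMetadata),
    ChunksDeleted(Vec<ChunkIndices>),
    ChunksWritten(Vec<ChunkIndices>),
    GroupDeleted,
}

struct Status(Vec<(Path, Vec<Change>)>);

/// One short human-readable phrase per change.
fn describe(change: &Change) -> String {
    match change {
        Change::ArrayCreated(_) => "array created".to_string(),
        Change::ArrayMetadataUpdated(_) => "array metadata updated".to_string(),
        Change::ChunksDeleted(chunks) => format!("{} chunk(s) deleted", chunks.len()),
        Change::ChunksWritten(chunks) => format!("{} chunk(s) written", chunks.len()),
        Change::GroupDeleted => "group deleted".to_string(),
    }
}

/// A plain-text formatter: one line per path, sorted by path.
fn format_plain(status: &Status) -> String {
    let mut entries: Vec<&(Path, Vec<Change>)> = status.0.iter().collect();
    entries.sort_by(|a, b| a.0.cmp(&b.0));
    let mut out = String::new();
    for (path, changes) in entries {
        let parts: Vec<String> = changes.iter().map(describe).collect();
        out.push_str(&format!("{}: {}\n", path, parts.join(", ")));
    }
    out
}

/// Example usage with made-up paths.
fn example_status() -> String {
    let status = Status(vec![
        (
            "/ocean/temperature".to_string(),
            vec![Change::ChunksWritten(vec![vec![0, 0], vec![0, 1]])],
        ),
        ("/atmosphere".to_string(), vec![Change::GroupDeleted]),
    ]);
    format_plain(&status)
}
```

Because the formatter only reads `Status`, a tree renderer or a machine-readable output could sit next to it without touching the data structure.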

paraseba commented 1 week ago

@dcherian I realize my proposal is very similar to work I'll have to do for transaction log support. That structured status looks a lot like a transaction log. Maybe we shouldn't try to do both things in parallel, but we can discuss more.

dcherian commented 1 week ago

Ah nice, yes that makes sense.

I'm also thinking the formatting of the structure should live in Rust, and we may want to expose it through a CLI in the future?

rabernat commented 1 week ago

We should probably design both status and transaction log as abstract data structures and then build ways to transform them into something for display in different contexts, e.g.