bluesky / tiled

API to structured data
https://blueskyproject.io/tiled
BSD 3-Clause "New" or "Revised" License

Should we store revisions as diffs? #655

Open danielballan opened 5 months ago

danielballan commented 5 months ago

In the Tiled Catalog SQL database, we store a history of metadata revisions by keeping a snapshot of the previous versions of metadata in a separate table, metadata_revisions. In an aside with @Kezzsim, @jmaruland, @gwbischof, and @padraic-shafer, the question was raised, "Should we store the history of metadata revisions in the Tiled Catalog SQL database as RFC 6902 JSON patches?"

In a DUTC training, @dutc asked "Does git store the repo history in .git/ as diffs or as snapshots of the files?" I got the answer wrong. (IIRC most of us did.) Git in fact stores snapshots, not the diff.

Git can produce a diff-compressed representation, which is useful for transmitting the data with maximum efficiency. But the canonical storage at rest is the plain contents of each object (~file). I have read that this choice was made because the state of the art for diffing algorithms is a moving target on time scales that git cares about, and Linus did not want to make any particular diff representation core to git's data model.

In our use case, we know that we always have JSON, not arbitrary text/data like git has, and I think we would be on safe ground betting on RFC 6902 as a durable standard.
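For concreteness, here is a rough sketch of what "revision as a patch" could look like. It is not Tiled code: `apply_patch` is a hypothetical helper that handles only a subset of RFC 6902 (`add`, `remove`, `replace`), the metadata keys are made up, and storing *reverse* patches (current version in full, older versions as patches back from it) is just one assumed layout among several.

```python
import copy


def _decode(token):
    # RFC 6901 pointer escaping: "~1" means "/", "~0" means "~".
    return token.replace("~1", "/").replace("~0", "~")


def apply_patch(doc, patch):
    """Apply a subset of RFC 6902 (add, remove, replace); return a new document.

    Hypothetical helper for illustration only. It does not handle the
    whole-document path "" or the move, copy, and test operations.
    """
    doc = copy.deepcopy(doc)
    for op in patch:
        tokens = [_decode(t) for t in op["path"].split("/")[1:]]
        # Walk down to the container that holds the target location.
        parent = doc
        for token in tokens[:-1]:
            parent = parent[int(token)] if isinstance(parent, list) else parent[token]
        key = tokens[-1]
        if isinstance(parent, list):
            # "-" addresses the position just past the end of the array.
            index = len(parent) if key == "-" else int(key)
            if op["op"] == "add":
                parent.insert(index, op["value"])
            elif op["op"] == "replace":
                parent[index] = op["value"]
            elif op["op"] == "remove":
                del parent[index]
            else:
                raise ValueError(f"unsupported op: {op['op']}")
        else:
            if op["op"] == "replace" and key not in parent:
                raise KeyError(key)  # "replace" requires an existing target
            if op["op"] in ("add", "replace"):
                parent[key] = op["value"]
            elif op["op"] == "remove":
                del parent[key]
            else:
                raise ValueError(f"unsupported op: {op['op']}")
    return doc


# Made-up metadata for the current revision of a node.
current = {"sample": "Cu foil", "temperature": 300, "tags": ["raw", "reviewed"]}

# Instead of a full snapshot of the previous revision, store only the
# patch that turns the current metadata back into the previous one.
reverse_patch = [
    {"op": "replace", "path": "/temperature", "value": 295},
    {"op": "remove", "path": "/tags/1"},
]

previous = apply_patch(current, reverse_patch)
```

One thing the sketch makes visible: for small metadata documents the patch is not necessarily smaller than the snapshot, so any storage win depends on documents being large relative to the typical change. Reading revision N also requires applying a chain of patches rather than a single row lookup.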

Considerations....

I genuinely don't have a good feel for this question yet. When it comes to databases and long-lived state, making "boring" simple choices is good, but if diffs can be used in a solid, obvious way that buys scaling wins, it's worth considering.

I am interested in broad input.

padraic-shafer commented 5 months ago

I wonder if this really belongs in an audit log or event stream outside of Tiled. If so, then maybe the piece for Tiled to focus on would be emitting log entries with enough info to enable that functionality. Of course, visibility of data then becomes a potential concern.

Kezzsim commented 4 months ago

This has become increasingly relevant in a discussion with Dan, Yugene and Seher at DSSI for flyscanning operations as part of the Diamond bluesky Flyscanning collaboration. There will need to be a native "append" feature added to Tiled, which naturally will create new revisions. If each revision is a copy of the previous object, then the collection will grow very rapidly.

Storing only the diff would mean each revision just has the newly appended content from that event.
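
A back-of-the-envelope sketch of that growth, under assumed conditions: a made-up document with a single `events` list, one small item appended per revision, and storage cost measured as the length of the serialized JSON. `storage_cost` is a hypothetical helper, not anything in Tiled.

```python
import json


def storage_cost(n_appends):
    """Return (snapshot_bytes, patch_bytes) of history after n_appends appends."""
    doc = {"events": []}
    snapshot_bytes = 0
    patch_bytes = 0
    for i in range(n_appends):
        item = {"seq": i}
        # Snapshot approach: keep a full copy of the version being replaced.
        snapshot_bytes += len(json.dumps(doc))
        # Diff approach: keep only an RFC 6902 "add" that appends one item
        # ("/events/-" addresses the end of the array).
        patch = [{"op": "add", "path": "/events/-", "value": item}]
        patch_bytes += len(json.dumps(patch))
        doc["events"].append(item)
    return snapshot_bytes, patch_bytes


snap_100, patch_100 = storage_cost(100)
snap_200, patch_200 = storage_cost(200)
print(f"100 appends: snapshots={snap_100} patches={patch_100}")
print(f"200 appends: snapshots={snap_200} patches={patch_200}")
```

In this toy setup, doubling the number of appends roughly quadruples the snapshot history while the patch history only doubles, because each snapshot repeats everything appended so far.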