VizierDB / vizier-scala

The Vizier kernel-free notebook programming environment

Make Vizier GIT-friendly #317

Open okennedy opened 3 months ago

okennedy commented 3 months ago

What pain point is this feature intended to address? Please describe.

At present, there's no (easy) way to share Vizier database files. The current state of the art is to zip up the vizier.db directory, or to use the 'export' feature. Neither is particularly amenable to collaborative development.

Describe the solution you'd like

Fundamentally, it would be nice to have a way to drop a vizier.db folder into a VCS. The limiting factors at the moment are:

  1. The Vizier.db SQLite database can get quite large. We did add a GC/Dedup feature that should keep it more in check, but even so, it is not unlikely that a database will eventually exceed the file size limit of public VCS hosts like GitHub.
  2. The Vizier.db SQLite database is updated on every edit, and is a binary file. This means that, for all practical purposes, the SQLite database can't be delta'd and ends up getting pushed in its entirety on every edit.
  3. The SQLite database can't (easily) be diffed. Beyond being a binary file, even a logical diff of the database would need to account for semantic considerations such as key identifiers (which could conflict if two people add workflows, etc. in parallel) and foreign key relationships (e.g., filenames that need to correspond to artifact identifiers). This means any conflict requires extensive, error-prone manual resolution by the user.
  4. File artifacts could exceed the size cap of a VCS host; Git LFS support should be included.
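
To make the key-identifier concern in item 3 concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `workflow` table and its columns are hypothetical stand-ins, not Vizier's actual schema. Two independent copies of the same database each add one row, and both end up assigning the same integer key to different workflows.

```python
import sqlite3

def new_clone():
    # Hypothetical stand-in schema; NOT Vizier's real tables.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE workflow (id INTEGER PRIMARY KEY, name TEXT)")
    # Every clone starts from the same shared baseline row.
    db.execute("INSERT INTO workflow (name) VALUES ('shared baseline')")
    return db

# Two collaborators working on independent copies of the database.
alice, bob = new_clone(), new_clone()

alice_id = alice.execute(
    "INSERT INTO workflow (name) VALUES ('alice: clean data')"
).lastrowid
bob_id = bob.execute(
    "INSERT INTO workflow (name) VALUES ('bob: plot results')"
).lastrowid

# Both copies hand out the same key for different workflows.
print(alice_id, bob_id)
```

Any row-level merge of the two copies would have to renumber one of these rows and every foreign key that points at it, which is the kind of manual resolution described above.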

Describe alternatives you've considered

okennedy commented 3 months ago

Vizier already supports computing deltas of workflows. One approach that hits bullets 1-3 might be:

  1. Add a log directory to vizier.db
  2. Add a .gitignore to vizier.db that explicitly ignores Vizier.db (maybe this means we can move the cache directory here too!)
  3. Automatically log updates (more or less the contents of the delta bus) to a logfile. We could open a new logfile (marked with a timestamp for chronological integration) or use something like Git's hash-based versioning. Treat this logfile as the canonical system state.
  4. On launch (and maybe while running) detect the presence of new logfiles in the log directory and patch the database accordingly.
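
The steps above could be sketched roughly as follows. This is an illustration only, written in Python rather than Vizier's Scala. The `append_delta` and `replay_new_logs` helpers, the JSON delta format, and the `cell` and `applied_log` tables are all assumptions made for the example; none of them come from the Vizier codebase. Each update goes into its own timestamp-named file, so concurrent writers never touch the same file, and the local SQLite database is treated as a rebuildable cache.

```python
import json
import sqlite3
import time
import uuid
from pathlib import Path

def append_delta(log_dir: Path, delta: dict) -> Path:
    """Write one update to its own logfile (hypothetical format)."""
    log_dir.mkdir(parents=True, exist_ok=True)
    # Zero-padded nanosecond timestamp so that names sort chronologically;
    # the random suffix keeps two writers from colliding on a filename.
    name = f"{time.time_ns():020d}-{uuid.uuid4().hex}.json"
    path = log_dir / name
    path.write_text(json.dumps(delta, sort_keys=True))
    return path

def replay_new_logs(db: sqlite3.Connection, log_dir: Path) -> int:
    """Apply logfiles not yet seen by this database; return how many."""
    # Both tables are illustrative assumptions, not Vizier's schema.
    db.execute("CREATE TABLE IF NOT EXISTS applied_log (name TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE IF NOT EXISTS cell (id TEXT PRIMARY KEY, source TEXT)")
    seen = {row[0] for row in db.execute("SELECT name FROM applied_log")}
    applied = 0
    for path in sorted(log_dir.glob("*.json")):
        if path.name in seen:
            continue  # already patched into this database
        delta = json.loads(path.read_text())
        db.execute(
            "INSERT OR REPLACE INTO cell (id, source) VALUES (?, ?)",
            (delta["cell"], delta["source"]),
        )
        db.execute("INSERT INTO applied_log (name) VALUES (?)", (path.name,))
        applied += 1
    db.commit()
    return applied
```

Calling `replay_new_logs` at launch, and again after every pull, would patch the database with whatever logfiles are new. One limitation of this sketch: a freshly pulled logfile whose timestamp is older than updates already applied is still applied last, so a real implementation would need an ordering or conflict rule for concurrent edits to the same cell.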