VizierDB / vizier-scala

The Vizier kernel-free notebook programming environment

Dataset Provenance View #27

Open okennedy opened 4 years ago

okennedy commented 4 years ago

Several folks have now requested something to highlight the evolution / provenance of a given dataset. The details would have to be discussed, but at a minimum, this would contain (for a given target dataset):

One way to accomplish this would be through a subway diagram, analogous to the ones we had in the Mimir UI several years back (Region 3 in the screenshot below).

[Image: Mimir-Overview-1038x576]

okennedy commented 4 years ago

Leaving aside the UX component of this (for discussion), the mechanisms required to compute which workflow steps contribute to a given dataset are largely there. Dataset dependencies come up from the Vizier notebook, and we might be able to get finer-grained dependencies for individual cells/records if there are no intervening tables. I'll start thinking through how the backend for this might be implemented.

okennedy commented 3 years ago

There's been a bit of offline discussion about how to implement this. There's a prototype flow chart at

vizier-db/api/projects/{projectId:int}/branches/{branchId:int}/head/graph

However, this is really tricky to follow.

Some lighter-weight options would be:

okennedy commented 1 year ago

This issue has sat in the background for a long time, and there have been more discussions about how to pull this off in the context of the new UI (e.g., #65). The current thinking seems to be that we want a few things:

okennedy commented 1 year ago

With respect to the subway diagram, a good place to put it might be either the Artifact Inspector or the Table of Contents.

The data flow graph itself can be computed from the Workflow's modules or moduleViewsWithEdits field (the former is the actual modules/cells in the workflow, the latter also includes debuggers or cells that haven't been formally created yet). Specifically, note that both are effectively lists of WorkflowElement. Each WorkflowElement includes an outputs field (a reactive object listing all of the artifacts emitted by the node). In principle it should also include an inputs field... I'll work on that tomorrow.

okennedy commented 1 year ago

Probably the big problem that needs to be solved is building the dependency graph. The information is (mostly) present, but the dependency graph itself would need to be generated. There's sort of an example of this already in https://github.com/VizierDB/vizier-scala/blob/v2.0/vizier/ui/src/info/vizierdb/ui/components/WorkflowElement.scala#L57 which works by:

Node N takes the artifacts it has in its input (visibleArtifacts), figures out which artifacts it updated/deleted (visibleArtifactsAfterSelf), and then updates the list of artifacts available at node N+1 (next.foreach { replaceArtifacts }; note that foreach on an `Option` is basically a "do this if the Option is a Some").
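To make the propagation step concrete, here is a minimal standalone sketch of the same idea. It is not the code in WorkflowElement.scala: `Cell`, `writes`, `deletes`, `visibleAfter`, and `scopes` are hypothetical names, artifacts are reduced to a name-to-identifier map, and the reactive machinery is left out entirely.

```scala
object ArtifactScopeSketch {
  // Hypothetical, simplified stand-in for a workflow cell (not the real
  // WorkflowElement): the artifacts it writes (name -> identifier) and the
  // artifact names it deletes.
  final case class Cell(
    id: String,
    writes: Map[String, String] = Map.empty,
    deletes: Set[String] = Set.empty
  )

  // Artifacts visible after a cell runs: whatever it could see, minus what it
  // deleted, with anything it wrote added or overwritten.
  def visibleAfter(visibleBefore: Map[String, String], cell: Cell): Map[String, String] =
    (visibleBefore -- cell.deletes) ++ cell.writes

  // The artifacts visible as input to each cell, in notebook order.
  // scanLeft yields one more scope than there are cells (the scope after the
  // last cell); zip drops that trailing entry.
  def scopes(cells: Seq[Cell]): Seq[(String, Map[String, String])] = {
    val inputs = cells.scanLeft(Map.empty[String, String])(visibleAfter)
    cells.map(_.id).zip(inputs)
  }
}
```

Each cell's input scope depends only on the cells before it, which is why a single left-to-right pass is enough here.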

However, this doesn't look at dependencies. WorkflowElement could be modified to include some sort of dependency-building functionality, or we could add an entirely new module to conduct dependency analysis.
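As a rough illustration of what a separate dependency-analysis module might do, the sketch below assumes each cell can report the artifact names it reads and writes. The `Cell` type, its `reads`/`writes` fields, and both functions are hypothetical and are not part of the existing WorkflowElement API. A cell depends on the most recent earlier cell that wrote each artifact it reads, and the provenance of a cell is the transitive closure of those edges.

```scala
import scala.annotation.tailrec

object DependencySketch {
  // Hypothetical cell summary: artifact names read and written by the cell.
  final case class Cell(id: String, reads: Set[String], writes: Set[String])

  // Map each cell id to the ids of the cells whose outputs it reads directly.
  def dependencies(cells: Seq[Cell]): Map[String, Set[String]] = {
    val start = (Map.empty[String, String], Map.empty[String, Set[String]])
    val (_, deps) = cells.foldLeft(start) {
      case ((lastWriter, acc), cell) =>
        // Resolve reads against the writers seen so far, before recording
        // this cell's own writes, so a cell that rewrites an artifact
        // depends on the previous writer rather than on itself.
        val upstream = cell.reads.flatMap(name => lastWriter.get(name))
        val updated = lastWriter ++ cell.writes.map(name => name -> cell.id)
        (updated, acc + (cell.id -> upstream))
    }
    deps
  }

  // All cells that contribute, directly or transitively, to the target cell.
  def provenance(deps: Map[String, Set[String]], target: String): Set[String] = {
    @tailrec
    def loop(frontier: Set[String], seen: Set[String]): Set[String] =
      if (frontier.isEmpty) seen
      else {
        val next = frontier.flatMap(id => deps.getOrElse(id, Set.empty[String])) -- seen
        loop(next, seen ++ next)
      }
    loop(Set(target), Set.empty)
  }
}
```

The result of `provenance` is the set of cells a subway diagram for one dataset would need to show; cells outside that set can be hidden or collapsed.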