Dataset Provenance View

okennedy commented 4 years ago

Several folks have now requested something to highlight the evolution / provenance of a given dataset. The details would have to be discussed, but at a minimum, this would contain (for a given target dataset):

Which datasets the target was derived from.
References to the cells used to transform/modify the dataset.
An indication of at least the number of caveats introduced at each stage.
User-provided annotations on the different steps (e.g., a description of the dataset)
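
As a starting point for that discussion, the four items above could be carried by a small record per step. This is only a sketch: `Step`, `DatasetProvenance`, and all field names are hypothetical and do not correspond to existing Vizier classes.

```scala
object ProvenanceModel {
  // Hypothetical: one entry per cell that touched the target dataset.
  case class Step(
    cellId: String,              // reference to the transforming cell
    derivedFrom: Set[String],    // datasets this step read
    caveatsIntroduced: Int,      // number of caveats added at this stage
    annotation: Option[String]   // optional user-provided description
  )

  // Hypothetical: the full history of one target dataset.
  case class DatasetProvenance(target: String, steps: Seq[Step]) {
    // Every dataset read along the way, excluding the target itself.
    def sources: Set[String] = steps.flatMap(_.derivedFrom).toSet - target

    // Total caveats introduced across all stages.
    def totalCaveats: Int = steps.map(_.caveatsIntroduced).sum
  }
}
```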

One way to accomplish this would be through a subway diagram, analogous to the ones we had in the Mimir UI several years back (Region 3 below).

[Image: Mimir-Overview-1038x576]

okennedy commented 4 years ago

Leaving aside the UX component of this (for discussion), the mechanisms required to compute which workflow steps are involved are largely there. Dataset dependencies come up from the Vizier notebook, and we might be able to get finer-grained dependencies for individual cells/records if there are no intervening tables. I'll start thinking through how the backend for this might be implemented.

okennedy commented 3 years ago

There's been a bit of offline discussion about how to implement this. There's a prototype flow chart at

vizier-db/api/projects/{projectId:int}/branches/{branchId:int}/head/graph

However, this is really tricky to follow.

Some lighter-weight options would be:

Look at the provenance of a specific dataset
Highlight dependencies in the ToC

okennedy commented 1 year ago

This issue has sat in the background for a long time, and there have been more discussions about how to pull this off related to the new UI (e.g., #65). The current thinking seems to be that we want a few things:

[x] Extend the workflow view so that hovering the mouse over an artifact highlights the cell in which the artifact was created
[x] Extend the Table of Contents view so that hovering the mouse over an artifact highlights the cell in which the artifact was created
[ ] Extend the Table of Contents (and possibly workflow?) view so that hovering the mouse over an artifact highlights the cell(s) used to generate the artifact (its ancestors in the provenance graph)
[ ] Extend the Table of Contents (and possibly workflow?) view so that hovering the mouse over an artifact highlights the cell(s) that read the artifact and their transitive dependencies.
[ ] Add a subway diagram focused on one specific artifact, rather than the full-project graph shown in the prototype above.
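
For the two unchecked highlighting items, the sets of cells to highlight are just the ancestors and descendants of a cell in the provenance graph. A minimal sketch, assuming the graph is already available as a map from a cell id to the ids of the cells it reads from directly (the `Graph` type and all function names here are hypothetical, not part of the Vizier UI):

```scala
object ProvenanceTraversal {
  // Hypothetical representation: cell id -> ids of the cells it reads from directly.
  type Graph = Map[Int, Set[Int]]

  /** All cells reachable from `start` by following `edges` one or more times. */
  def reachable(start: Int, edges: Graph): Set[Int] = {
    @scala.annotation.tailrec
    def loop(frontier: List[Int], seen: Set[Int]): Set[Int] = frontier match {
      case Nil => seen
      case head :: tail =>
        // Only visit neighbours we have not already collected.
        val next = edges.getOrElse(head, Set.empty[Int]) -- seen
        loop(next.toList ::: tail, seen ++ next)
    }
    loop(List(start), Set.empty[Int])
  }

  /** Flip every edge, so that readers can be looked up from the cells they read. */
  def reverse(edges: Graph): Graph =
    edges.toSeq
      .flatMap { case (reader, writers) => writers.map(_ -> reader) }
      .groupBy(_._1)
      .map { case (writer, pairs) => writer -> pairs.map(_._2).toSet }

  /** Cells used, directly or transitively, to generate the outputs of `cell`. */
  def ancestors(cell: Int, deps: Graph): Set[Int] = reachable(cell, deps)

  /** Cells that read the outputs of `cell`, plus their transitive dependents. */
  def descendants(cell: Int, deps: Graph): Set[Int] = reachable(cell, reverse(deps))
}
```

Hovering over an artifact would then look up the cell that created it and highlight either `ancestors` or `descendants` of that cell.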

okennedy commented 1 year ago

With respect to the subway diagram, a good place to put it might be either the Artifact Inspector or the Table of Contents.

The data flow graph itself can be computed from the Workflow's modules or moduleViewsWithEdits field (the former is the actual modules/cells in the workflow, the latter also includes debuggers or cells that haven't been formally created yet). Specifically, note that both are effectively lists of WorkflowElement. Each WorkflowElement includes an outputs field (a reactive object listing all of the artifacts emitted by the node). In principle it should also include an inputs field... I'll work on that tomorrow.
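
To make that concrete, here is a rough sketch of how the data flow edges could be derived once both fields exist. `Element` is a hypothetical, non-reactive stand-in for WorkflowElement: the real `outputs` field is a reactive object, and the `inputs` field is only proposed above, so plain sets of artifact names are assumed for both.

```scala
object DataFlowGraph {
  // Hypothetical stand-in for WorkflowElement, with artifact names as plain strings.
  case class Element(id: String, inputs: Set[String], outputs: Set[String])

  // One edge per artifact that flows from the cell that wrote it to a cell that reads it.
  case class Edge(from: String, to: String, artifact: String)

  /** Walk the elements in notebook order, linking each input to its most recent writer. */
  def edges(elements: Seq[Element]): Seq[Edge] = {
    val start = (Map.empty[String, String], Vector.empty[Edge])
    val (_, result) = elements.foldLeft(start) {
      case ((lastWriter, acc), el) =>
        // Inputs with no earlier writer produce no edge.
        val incoming = el.inputs.toSeq.sorted.flatMap { artifact =>
          lastWriter.get(artifact).map(writer => Edge(writer, el.id, artifact))
        }
        // After this element runs, it is the latest writer of everything it emits.
        val updated = lastWriter ++ el.outputs.map(_ -> el.id)
        (updated, acc ++ incoming)
    }
    result
  }
}
```

Note that an element that overwrites an artifact becomes the source for later readers, so the graph follows versions of an artifact rather than just its name.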

okennedy commented 1 year ago

Probably the big problem that needs to be solved is building the dependency graph. The information is (mostly) present, but the dependency graph itself would need to be generated. There's sort of an example of this already in https://github.com/VizierDB/vizier-scala/blob/v2.0/vizier/ui/src/info/vizierdb/ui/components/WorkflowElement.scala#L57 which works by:

Node N takes the artifacts it has as input (visibleArtifacts), figures out which artifacts it updated or deleted (visibleArtifactsAfterSelf), and then updates the list of artifacts available at node N+1 (next.foreach { replaceArtifacts }; note that foreach on an Option is basically "do this if the Option is a Some").
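
The same propagation can be written as a non-reactive sketch over an immutable list. This is a simplification for illustration, not the linked code: `Node`, `visibleAfter`, and `scopes` are hypothetical names, and artifacts are reduced to a name mapped to a type label.

```scala
object ArtifactScope {
  // Hypothetical simplification: a node writes some artifacts (name -> type) and deletes others.
  case class Node(id: Int, written: Map[String, String], deleted: Set[String])

  /** The artifacts visible after `node` runs, given what was visible before it. */
  def visibleAfter(visible: Map[String, String], node: Node): Map[String, String] =
    (visible -- node.deleted) ++ node.written

  /**
   * Element i is the scope visible to node i as input;
   * the final element is the scope left after the last node.
   */
  def scopes(nodes: Seq[Node]): Seq[Map[String, String]] =
    nodes.scanLeft(Map.empty[String, String])(visibleAfter)
}
```

As in the original, this only tracks which artifacts exist at each point; it records nothing about which node read what.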

However, this doesn't look at dependencies. WorkflowElement could be modified to include some sort of dependency-building functionality, or we could add an entirely new module to conduct dependency analysis.

VizierDB / vizier-scala

Dataset Provenance View #27