MarquezProject / marquez

Collect, aggregate, and visualize a data ecosystem's metadata
https://marquezproject.ai
Apache License 2.0

Run-level Graph #2772

Open wslulciuc opened 7 months ago

wslulciuc commented 7 months ago

Run-level Graph

A run-level graph represents the relationships between dataset and run metadata. It is a directed graph with three node types: dataset version, job version, and run (see Figure 1). A run node may have one or more versioned inputs and one or more versioned outputs as edges. Each run node also keeps an edge to a job version node, which represents the version of the job (that is, a link to its source code) at the time of execution.

Figure 1: Run-level graph relationships between dataset versions, job versions, and runs.

Note that a dataset is assumed to be modified as the result of a successful run. For a run to be marked successful, it must transition from the RUNNING state to the COMPLETED state. A run-level graph dynamically captures all modifications made to a given dataset from run to run.
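
As a rough illustration of the model described above, here is a minimal Python sketch of a run node and its success rule. The class and field names are hypothetical and are not Marquez's implementation. Only the RUNNING and COMPLETED states come from the text; NEW and FAILED are assumed here so the example has a start and a failure state. The edge directions noted in the comments are also an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class RunState(Enum):
    NEW = "NEW"              # assumed initial state (not named in the text)
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"        # assumed failure state (not named in the text)


@dataclass
class Run:
    run_id: str
    job_version: str  # edge: run -> job version
    input_versions: List[str] = field(default_factory=list)   # assumed direction: dataset version -> run
    output_versions: List[str] = field(default_factory=list)  # assumed direction: run -> dataset version
    state: RunState = RunState.NEW

    def start(self) -> None:
        self.state = RunState.RUNNING

    def complete(self, output_versions: List[str]) -> None:
        # Only a RUNNING -> COMPLETED transition marks the run successful,
        # and only a successful run is assumed to modify its output datasets.
        if self.state is not RunState.RUNNING:
            raise ValueError(f"cannot complete a run in state {self.state.name}")
        self.state = RunState.COMPLETED
        self.output_versions.extend(output_versions)


run = Run(run_id="a03422cf", job_version="947c0388", input_versions=["695888e2"])
run.start()
run.complete(["a03422cf"])
```

Recording output versions only inside `complete` keeps the assumption explicit: a run that never reaches COMPLETED contributes no new dataset versions to the graph.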

Introduction

A run-level graph is fundamental to troubleshooting data issues. For example, the data type of a column within a table may change, resulting in unanticipated downstream job failures.

Determining why a given job is failing is often both challenging and time-consuming. Using the run-level graph, you can inspect the upstream lineage of the failing job, which simplifies troubleshooting by highlighting that, for example, a column is now a STRING upstream while the failing job still processes it as an INT downstream.
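
The comparison step in that scenario could be sketched as follows. This is only an illustration: it assumes each dataset version carries a schema expressed as a column-to-type mapping, and the function name and column names are made up for the example.

```python
def changed_column_types(schema_before, schema_after):
    """Return {column: (old_type, new_type)} for columns whose type differs.

    Both arguments are hypothetical {column_name: type_name} mappings taken
    from two versions of the same upstream dataset.
    """
    return {
        column: (old_type, schema_after[column])
        for column, old_type in schema_before.items()
        if column in schema_after and schema_after[column] != old_type
    }


# Hypothetical schemas of the upstream dataset before and after the change.
before = {"order_id": "INT", "delivered_at": "TIMESTAMP"}
after = {"order_id": "STRING", "delivered_at": "TIMESTAMP"}

print(changed_column_types(before, after))  # {'order_id': ('INT', 'STRING')}
```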

Graph Data Model

A run-level graph consists of the following nodes:

Nodes

Dataset version
ID: dataset:{namespace}:{dataset}#{version}
Example: dataset:food_delivery:public.top_delivery_times#947c0388..

Job version
ID: job:{namespace}:{job}#{version}
Example: job:food_delivery:orders_popular_day_of_week#947c0388..

Run
ID: run:{id}
Example: run:a03422cf..
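
The ID formats above can be built and parsed with a few lines of Python. This is a sketch, not Marquez code: `NodeId`, `format_node_id`, and `parse_node_id` are hypothetical names, and the parser assumes that a namespace contains no `:` and that a dataset or job name contains no `#`. The version strings are the abbreviated values shown in the examples.

```python
from typing import NamedTuple, Optional


class NodeId(NamedTuple):
    kind: str                  # "dataset", "job", or "run"
    namespace: Optional[str]   # None for run nodes
    name: str                  # dataset name, job name, or run id
    version: Optional[str]     # None for run nodes


def format_node_id(node: NodeId) -> str:
    if node.kind == "run":
        return f"run:{node.name}"
    return f"{node.kind}:{node.namespace}:{node.name}#{node.version}"


def parse_node_id(text: str) -> NodeId:
    kind, sep, rest = text.partition(":")
    if not sep or kind not in ("dataset", "job", "run"):
        raise ValueError(f"unrecognized node ID: {text!r}")
    if kind == "run":
        return NodeId("run", None, rest, None)
    # Split the version off the right so dots in the name are left alone.
    body, hash_sep, version = rest.rpartition("#")
    namespace, ns_sep, name = body.partition(":")
    if not hash_sep or not ns_sep:
        raise ValueError(f"malformed {kind} node ID: {text!r}")
    return NodeId(kind, namespace, name, version)


node = parse_node_id("dataset:food_delivery:public.top_delivery_times#947c0388")
print(node.namespace, node.name, node.version)
```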

Edges

Example

Run a03422cf

First, we create the run a03422cf for orders_popular_day_of_week that consumes the input version 695888e2 and produces the output version a03422cf:

Figure 2: Run a03422cf consumes input version 695888e2 and produces output version a03422cf.

Run ec6abf8b

Then, we create another run ec6abf8b that consumes the same input version 695888e2, but produces a new output version ec44fed4:

Figure 3: Run ec6abf8b consumes the same input version 695888e2 and produces output version ec44fed4.
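
The two runs can be reproduced with a small adjacency-set sketch in Python. It is an illustration under assumptions: the helper names are hypothetical, the edge directions (input version to run, run to output version, run to job) are assumed, and the node labels are simplified because the text names neither the datasets nor the job version. The job name stands in for the job version node.

```python
from collections import defaultdict

# node -> set of successor nodes (directed edges)
edges = defaultdict(set)


def add_run(run_id, job, input_versions, output_versions):
    run = f"run:{run_id}"
    edges[run].add(f"job:{job}")  # job name stands in for the job version
    for version in input_versions:
        edges[f"dataset_version:{version}"].add(run)
    for version in output_versions:
        edges[run].add(f"dataset_version:{version}")


def consumers_of(version):
    """Runs that read the given dataset version."""
    return set(edges.get(f"dataset_version:{version}", set()))


def outputs_of(run_id):
    """Dataset versions written by the given run."""
    successors = edges.get(f"run:{run_id}", set())
    return {node for node in successors if node.startswith("dataset_version:")}


add_run("a03422cf", "orders_popular_day_of_week", ["695888e2"], ["a03422cf"])
add_run("ec6abf8b", "orders_popular_day_of_week", ["695888e2"], ["ec44fed4"])

print(sorted(consumers_of("695888e2")))  # ['run:a03422cf', 'run:ec6abf8b']
print(sorted(outputs_of("ec6abf8b")))    # ['dataset_version:ec44fed4']
```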

Run diff from a03422cf to ec6abf8b

A diff graph represents the changes between two run nodes of a run-level graph. It compares changes starting at a given run node A, up to and including a given run node B. Below is a run-based comparison for the job orders_popular_day_of_week between runs a03422cf and ec6abf8b:

Figure 4: Diff from a03422cf to ec6abf8b
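
One possible reading of this comparison, sketched in Python: given a job's runs in execution order, take the span from run A through run B (inclusive) and report which input and output versions differ between its endpoints. `RunRecord` and `diff_runs` are hypothetical names, and the proposal does not specify the actual diff semantics, so treat this as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    run_id: str
    inputs: frozenset   # input dataset versions
    outputs: frozenset  # output dataset versions


def diff_runs(history, run_a, run_b):
    """Compare run_a with run_b, where history is ordered by execution."""
    ids = [record.run_id for record in history]
    start, end = ids.index(run_a), ids.index(run_b)
    if start > end:
        raise ValueError("run_a must not come after run_b")
    span = history[start:end + 1]  # run B is included
    first, last = span[0], span[-1]
    return {
        "runs": [record.run_id for record in span],
        "inputs_added": sorted(last.inputs - first.inputs),
        "inputs_removed": sorted(first.inputs - last.inputs),
        "outputs_added": sorted(last.outputs - first.outputs),
        "outputs_removed": sorted(first.outputs - last.outputs),
    }


history = [
    RunRecord("a03422cf", frozenset({"695888e2"}), frozenset({"a03422cf"})),
    RunRecord("ec6abf8b", frozenset({"695888e2"}), frozenset({"ec44fed4"})),
]

result = diff_runs(history, "a03422cf", "ec6abf8b")
print(result["outputs_removed"], result["outputs_added"])  # ['a03422cf'] ['ec44fed4']
```

Here the input version is unchanged across both runs, so only the output side of the diff is non-empty.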

zqqqqz2000 commented 7 months ago

Very nice feature, looking forward to its completion. Is there currently a schedule for completion?