A run-level graph represents the relationships between dataset and run metadata. It is a directed graph with three node types: dataset version, job version, and run (see Figure 1). A run node may have one or more versioned inputs and versioned outputs as edges. Each run node also has an edge to a job version node, which records the version of the job (that is, a link to its source code) at the time of execution.
Figure 1: Run-level graph relationships between dataset versions, job versions, and runs.
Note that a dataset is assumed to be modified as the result of a successful run. For a run to be marked successful, it must transition from the RUNNING state to the COMPLETED state. A run-level graph dynamically captures all modifications made to a given dataset from run to run.
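The success rule above can be sketched as a small state machine. This is only an illustration: the text names just the RUNNING and COMPLETED states, so the NEW and FAILED states, the `ALLOWED` transition table, and the `Run` class below are assumptions, not part of the described model.

```python
from enum import Enum


class RunState(Enum):
    NEW = "NEW"              # assumed initial state (not named in the text)
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"        # assumed failure state (not named in the text)


# Assumed transition table: a run can only complete from RUNNING.
ALLOWED = {
    RunState.NEW: {RunState.RUNNING},
    RunState.RUNNING: {RunState.COMPLETED, RunState.FAILED},
    RunState.COMPLETED: set(),
    RunState.FAILED: set(),
}


class Run:
    def __init__(self, run_id):
        self.run_id = run_id
        self.state = RunState.NEW
        self.output_versions = []

    def transition(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        self.state = new_state

    def complete(self, output_versions):
        """Mark the run successful and record the dataset versions it produced."""
        self.transition(RunState.COMPLETED)
        self.output_versions = list(output_versions)

    @property
    def successful(self):
        return self.state is RunState.COMPLETED


run = Run("a03422cf")
run.transition(RunState.RUNNING)
run.complete(["a03422cf"])  # output version recorded only on success
```

Recording output versions inside `complete` keeps the assumption from the text explicit: a dataset is treated as modified only when the run that writes it succeeds.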
Introduction
A run-level graph is fundamental to troubleshooting data issues. For example, the data type of a column within a table may change, resulting in unanticipated downstream job failures.
Determining why a given job is failing is often both challenging and time consuming. With the run-level graph, you can inspect the upstream lineage of the failing job. This simplifies troubleshooting by highlighting, for example, that a column is now a STRING upstream while the failing job still processes it as an INT downstream.
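As a rough sketch of this troubleshooting flow, the snippet below walks upstream from a failing run and compares column types between two versions of an upstream dataset. The run names, dataset version names, schemas, and both helper functions are hypothetical and exist only for illustration.

```python
def upstream_versions(run_id, inputs_of, produced_by):
    """Return every dataset version reachable upstream of a run."""
    seen = []
    stack = list(inputs_of.get(run_id, []))
    while stack:
        version = stack.pop()
        if version in seen:
            continue
        seen.append(version)
        producer = produced_by.get(version)
        if producer is not None:
            stack.extend(inputs_of.get(producer, []))
    return seen


def type_changes(old_schema, new_schema):
    """Return {column: (old_type, new_type)} for columns whose type changed."""
    return {
        column: (old_schema[column], new_schema[column])
        for column in old_schema.keys() & new_schema.keys()
        if old_schema[column] != new_schema[column]
    }


# Hypothetical lineage: failing_run reads orders#v2, which load_run wrote
# after reading raw#v1.
inputs_of = {"failing_run": ["orders#v2"], "load_run": ["raw#v1"]}
produced_by = {"orders#v2": "load_run"}

# Hypothetical schemas for two versions of the same upstream dataset.
schemas = {
    "orders#v1": {"order_id": "INT", "day_of_week": "INT"},
    "orders#v2": {"order_id": "INT", "day_of_week": "STRING"},
}

lineage = upstream_versions("failing_run", inputs_of, produced_by)
changes = type_changes(schemas["orders#v1"], schemas["orders#v2"])
print(lineage)
print(changes)
```

Here the comparison reports that `day_of_week` changed from INT to STRING upstream, which is the kind of mismatch the paragraph above describes.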
Graph Data Model
A run-level graph consists of the following nodes:
Dataset Version: A read-only, immutable version of a dataset.
Job Version: A read-only, immutable version of a job, with a unique, referenceable link to its code that preserves the reproducibility of builds from source.
Run: A discrete instantiation of a job version, with a unique run ID used to update each stage of execution.
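One way to picture the three node types is as small record types. This is a minimal sketch and not the actual data model: the field names, the `code_location` attribute, and the example URL are assumptions. Frozen dataclasses stand in for the read-only, immutable versions, while a run stays mutable so its execution stage can be updated.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class DatasetVersion:
    namespace: str
    dataset: str
    version: str


@dataclass(frozen=True)
class JobVersion:
    namespace: str
    job: str
    version: str
    code_location: str  # hypothetical field: link to the job's source code


@dataclass
class Run:
    run_id: str
    job_version: JobVersion
    inputs: List[DatasetVersion] = field(default_factory=list)
    outputs: List[DatasetVersion] = field(default_factory=list)
    state: str = "RUNNING"  # updated as execution progresses


job_version = JobVersion(
    "food_delivery",
    "orders_popular_day_of_week",
    "947c0388",
    code_location="https://example.com/repo/commit/947c0388",  # placeholder URL
)
run = Run("a03422cf", job_version)
run.state = "COMPLETED"  # runs are the only mutable node in this sketch
```

Attempting to reassign a field on `job_version` raises `dataclasses.FrozenInstanceError`, which mirrors the immutability described for dataset and job versions.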
First, we create the run a03422cf for orders_popular_day_of_week that consumes the input version 695888e2 and produces the output version a03422cf:
Figure 2:
Run ec6abf8b
Then, we create another run ec6abf8b that consumes the same input version 695888e2, but produces a new output version ec44fed4:
Figure 3:
Run diff from a03422cf to ec6abf8b
A diff graph represents the changes between two run nodes of a run-level graph. The graph compares changes starting at a given run node A, up to a given run node B (inclusive). Below we show a run-based comparison for the job orders_popular_day_of_week between runs a03422cf and ec6abf8b:
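A run diff of this kind could be computed roughly as follows. The sketch makes several assumptions: it compares only the two endpoint runs rather than every run between A and B, the `run_diff` helper and the dictionary layout are hypothetical, the dataset names `input_dataset` and `output_dataset` are placeholders, and both runs are assumed to share the job version 947c0388.

```python
def run_diff(run_a, run_b):
    """Return what changed between two runs of the same job.

    Each run is a dict with 'job_version' plus 'inputs' and 'outputs'
    mapping dataset name -> dataset version.
    """
    diff = {}
    if run_a["job_version"] != run_b["job_version"]:
        diff["job_version"] = (run_a["job_version"], run_b["job_version"])
    for side in ("inputs", "outputs"):
        names = run_a[side].keys() | run_b[side].keys()
        changed = {
            name: (run_a[side].get(name), run_b[side].get(name))
            for name in sorted(names)
            if run_a[side].get(name) != run_b[side].get(name)
        }
        if changed:
            diff[side] = changed
    return diff


# Runs from the example; dataset names are placeholders.
run_a03422cf = {
    "job_version": "947c0388",
    "inputs": {"input_dataset": "695888e2"},
    "outputs": {"output_dataset": "a03422cf"},
}
run_ec6abf8b = {
    "job_version": "947c0388",
    "inputs": {"input_dataset": "695888e2"},
    "outputs": {"output_dataset": "ec44fed4"},
}

print(run_diff(run_a03422cf, run_ec6abf8b))
# {'outputs': {'output_dataset': ('a03422cf', 'ec44fed4')}}
```

Because both runs consume the same input version, only the output version appears in the diff.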
Run-level Graph
Nodes
Dataset version: dataset:{namespace}:{dataset}#{version}
Example: dataset:food_delivery:public.top_delivery_times#947c0388..
Job version: job:{namespace}:{job}#{version}
Example: job:food_delivery:orders_popular_day_of_week#947c0388..
Run: run:{id}
Example: run:a03422cf..
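The node ID formats above can be built and parsed with a few string helpers. This is a sketch under assumptions: the function names are hypothetical, the version values are the shortened prefixes shown in the examples, and the parser assumes namespaces contain no `:` and that the last `#` separates the name from the version.

```python
def dataset_node_id(namespace, dataset, version):
    return f"dataset:{namespace}:{dataset}#{version}"


def job_node_id(namespace, job, version):
    return f"job:{namespace}:{job}#{version}"


def run_node_id(run_id):
    return f"run:{run_id}"


def parse_node_id(node_id):
    """Split a node ID into its parts; raise ValueError if it is malformed."""
    kind, _, rest = node_id.partition(":")
    if kind == "run" and rest:
        return {"type": "run", "id": rest}
    if kind in ("dataset", "job"):
        qualified_name, hash_sep, version = rest.rpartition("#")
        namespace, colon_sep, name = qualified_name.partition(":")
        if hash_sep and colon_sep and namespace and name and version:
            return {
                "type": kind,
                "namespace": namespace,
                "name": name,
                "version": version,
            }
    raise ValueError(f"malformed node ID: {node_id!r}")


node_id = dataset_node_id("food_delivery", "public.top_delivery_times", "947c0388")
parsed = parse_node_id(node_id)
```

Using `rpartition("#")` means a `#` inside a dataset name would still parse, while a `:` inside a namespace would not; a real implementation would need to settle on an escaping rule.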
Edges
{dataset:*, TO, run:*}
{run:*, TO, dataset:*}
{run:*, IS_VERSION_OF, job:*}