AbsaOSS / spline-spark-agent

Spline agent for Apache Spark
https://absaoss.github.io/spline/
Apache License 2.0

Capturing Lineage when Data is not Written #132

Open jcnorman48 opened 3 years ago

jcnorman48 commented 3 years ago

I have a question about capturing lineage for operations where data is not written, but reads are still performed.

Take, for example, this code: result = spark.sql("select * from hive.table limit 1").collect()

This produces a materialized dataframe, which could be used to parameterize subsequent queries, or to produce reports or datasets that may be written to resources outside Spark, but I don't believe Spline captures lineage in this case.

We have quite a few use cases where we want to capture lineage in this scenario, and we want to gauge what it would take for these lineage records to flow to the LineageDispatcher. We are already using a custom LineageDispatcher. We will also be enriching these lineage records with extra meta information so that they retain attribution to the job (code) or user who performed the read or write.
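For concreteness, here is a minimal sketch of the shape such a dispatcher takes, assuming the LineageDispatcher trait with its two send overloads as described in this repo's README. The exact package and model class names vary between agent versions, so treat them as assumptions; the enrichment helpers are hypothetical placeholders.

```scala
import org.apache.commons.configuration.Configuration
import za.co.absa.spline.harvester.dispatcher.LineageDispatcher
import za.co.absa.spline.producer.model.{ExecutionEvent, ExecutionPlan}

// Sketch of a dispatcher that enriches lineage records with job/user
// attribution before forwarding them. Package and model class names are
// assumptions and may differ between agent versions.
class AttributingLineageDispatcher(conf: Configuration) extends LineageDispatcher {

  override def send(plan: ExecutionPlan): Unit =
    forward(withAttribution(plan))

  override def send(event: ExecutionEvent): Unit =
    forward(event)

  // Hypothetical helpers: replace with real enrichment and transport logic,
  // e.g. stamping job/user identifiers into the record and posting it to a
  // message bus or REST endpoint.
  private def withAttribution(plan: ExecutionPlan): ExecutionPlan = plan
  private def forward(record: AnyRef): Unit = ()
}
```

Such a dispatcher would then be registered via the agent configuration, e.g. spline.lineageDispatcher=attributing and spline.lineageDispatcher.attributing.className=com.example.AttributingLineageDispatcher (key names per the agent's README; the class name here is made up).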

I understand why Spline doesn't capture lineage without a write, since it's a dataset-to-dataset lineage tool, but I want to start a discussion to understand what would be needed to enable this, and whether it's realistic in general.

Thanks for the advice.

wajda commented 3 years ago

Hi @jcnorman48, thanks for the good question. I honestly tried to answer it in a few words but failed :) The problem you've described looks a bit like an XY problem to me.

Well, the topic can be approached from different sides. The answer to your question depends on what you are trying to capture: the data lineage, or the execution plan of the job. Technically speaking, an execution plan can be captured from virtually any Spark action (as long as Spark provides a listener for that type of action), and yes, the Spline agent for Spark can capture it (see the initialization sketch after the list below). However, Spline is a data lineage tracking tool. To answer what difference that makes (if any), we first need to talk briefly about the data lineage concept behind Spline. Data lineage is a complex thing, generally defined as a cohesion of three mutually orthogonal abstractions:

- Data origin
- Transformation
- Execution event

We call them the What, the How and the When respectively. Each term describes a different aspect of the picture independently of the other two, and all three added together form a data lineage.
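As for the technical point above about capturing execution plans: the agent hooks into Spark via a listener, and is attached roughly like this (a minimal sketch following this repo's README):

```scala
import org.apache.spark.sql.SparkSession
import za.co.absa.spline.harvester.SparkLineageInitializer._

val spark = SparkSession.builder().appName("lineage-demo").getOrCreate()

// Enable the Spline agent on this session. The agent registers a query
// execution listener, so lineage is captured when a write action runs.
spark.enableLineageTracking()
```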

Data origin is a static term that describes just that: the data location (e.g. a file, a table in a database, a topic in a messaging system, etc.). It can be anything that is unambiguously identifiable in a given information system. The notion of data origin plays a central role in the data lineage definition, as it refers to the data whose lineage we are tracking.

So, back to your example: result = spark.sql("select * from hive.table limit 1").collect(). If I got it right, you said that result is a

dataframe, which could be used to ... produce reports or ... be written to resources outside spark

By that I understand that result can be seen as part of a Transformation of the data flow between two Data origins: hive.table on one side, and the reports or resources outside Spark on the other. In this case the latter is your target data origin, and the write event you might be interested in is the report generation rather than the .collect() call. Looking at it from this perspective, the original problem transforms from "how to track lineage when data isn't written" into "how to capture the missing part of the lineage outside Spark".

If that's what you are really after, then we are on the same track, and Spline has something to offer. Otherwise, tracking data that has only been read and then abandoned, with no knowledge of where it landed (if anywhere), doesn't quite match Spline's goal or vision.
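One way to illustrate that reframing: if the report generation itself is routed through Spark rather than done on a collected local result, the write becomes visible to the agent, and the lineage from hive.table to the report target is captured end to end. A minimal sketch, where the output path is hypothetical:

```scala
// Assuming an active SparkSession `spark` with the Spline agent enabled.
// Persisting the report through Spark (instead of collect()-ing and writing
// it outside Spark) produces a write event linking both data origins.
spark.sql("select * from hive.table limit 1")
  .write
  .mode("overwrite")
  .parquet("/reports/hive_table_sample") // hypothetical report location
```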

wajda commented 2 years ago

related to #361, #181, #33