AbsaOSS / spline-spark-agent

Spline agent for Apache Spark
https://absaoss.github.io/spline/
Apache License 2.0

Spark Streaming Support #41

Open lucienfregosi opened 6 years ago

lucienfregosi commented 6 years ago

Is it planned to add an integration with Spark Streaming? It would be useful to be able to track lineage for both batch and streaming data.

mn-mikke commented 6 years ago

Hi Lucien, we are currently enhancing Spline to also support Structured Streaming. This feature will come with Spline version 0.3.

Regards, Marek Novotny


lucienfregosi commented 6 years ago

Perfect :)

I'm writing a blog post about Spline after testing it (in French first, maybe in English later), so I will be able to include this information in my post.

vackosar commented 6 years ago

@lucienfregosi Hi, we have some basic support in version 0.3, but it is disabled at the moment. I will now be working on full support, including Structured Streaming, as the highest priority. The deadline is the end of August.

vackosar commented 6 years ago

@lucienfregosi We will not support the old RDD-based streaming at the moment. Would it be a problem for you to switch to Structured Streaming, which will be supported? It seems to be treated as the successor to the old streaming API.

wajda commented 5 years ago

We are withdrawing streaming support from Spline 0.4 as it was not implemented properly. Streaming is not a priority for us at the moment. We'll return to it later.

wajda commented 5 years ago

A test case - AbsaOSS/spline#331

NickDudu commented 2 years ago

Hi @wajda, may I confirm that Structured Streaming, such as the writeStream API, is not supported? Thanks

wajda commented 2 years ago

No, streaming is not supported due to fundamental problems with the definition and representation of data lineage in the context of streaming. The topic remains unclear.

NickDudu commented 2 years ago

Hi @wajda, no problem, thanks for the confirmation.

jozefbakus commented 2 years ago

Hello everyone, we have been investigating Spline and Spark Structured Streaming. We have been able to implement a Spline agent for Spark Structured Streaming using Spark's StreamingQueryListener, in a similar way to what is described here: https://absaoss.github.io/spline/2018/10/04/Spline-Data-Lineage-For-Structured-Streaming.html (9:02 - 11:23). The code for our POC can be found here: https://github.com/jozefbakus/spline-spark-agent/pull/1

Along the way we came across one major problem: linking, that is, connecting streaming parent-child lineages. Currently, time linking is used: https://absaoss.github.io/spline/2018/10/04/Spline-Data-Lineage-For-Structured-Streaming.html (18:33 - 20:14). Time linking is not sufficient for streaming jobs, so we are trying to find a more suitable type of linking. One solution might be to use Kafka offsets, in a similar way to what is described here: https://absaoss.github.io/spline/2018/10/04/Spline-Data-Lineage-For-Structured-Streaming.html (20:14 - 22:15).
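To make the idea of offset-based linking concrete, here is a minimal sketch in plain Python. It is not code from the POC or from Spline: the `BatchLineage` record, the `find_parents` helper, and the job and topic names are all hypothetical, and it assumes each micro-batch can report the Kafka offset ranges it read and wrote as half-open intervals per topic partition. A consumer batch is linked to every producer batch whose written range overlaps the range it read.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# (topic, partition) -> half-open offset range [start, end)
OffsetRanges = Dict[Tuple[str, int], Tuple[int, int]]


@dataclass
class BatchLineage:
    """Hypothetical record of one micro-batch; not a Spline data model."""
    job: str
    batch_id: int
    read: OffsetRanges     # offsets consumed by this micro-batch
    written: OffsetRanges  # offsets produced by this micro-batch


def overlaps(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """True if two half-open ranges share at least one offset."""
    return a[0] < b[1] and b[0] < a[1]


def find_parents(child: BatchLineage,
                 candidates: List[BatchLineage]) -> List[BatchLineage]:
    """Return the batches whose written offsets overlap what the child read."""
    parents = []
    for cand in candidates:
        for key, read_range in child.read.items():
            written_range = cand.written.get(key)
            if written_range is not None and overlaps(read_range, written_range):
                parents.append(cand)
                break  # one overlapping partition is enough to link
    return parents


# Illustrative data only: three producer batches writing to topic "events".
producer_batches = [
    BatchLineage("ingest", 0, read={}, written={("events", 0): (0, 100)}),
    BatchLineage("ingest", 1, read={}, written={("events", 0): (100, 250)}),
    BatchLineage("ingest", 2, read={}, written={("events", 0): (250, 300)}),
]
# A consumer batch that read offsets 80..119 of the same partition.
consumer_batch = BatchLineage("aggregate", 7,
                              read={("events", 0): (80, 120)}, written={})

linked = find_parents(consumer_batch, producer_batches)
print([(p.job, p.batch_id) for p in linked])
```

Unlike time linking, this matching does not depend on when the batches ran, only on which records they touched, which is why it needs reliable write-side offsets.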

To link parent-child lineages, both source and destination offsets (read and write offsets) are required. Spark gives us source offsets out of the box; the problem lies in the destination offsets, because Spark does not provide information about which offsets the data was written to. Getting destination offsets in a clean, pluggable way is the issue we are trying to resolve before we can move forward.
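The asymmetry can be seen in the progress report that a streaming query listener receives. The sketch below parses a hand-written JSON document shaped like a Structured Streaming progress report for a Kafka source. The values, the query id, and the sink description are made up for illustration, and the `source_offset_ranges` helper is hypothetical. The source entry carries start and end offsets per topic partition, while the sink entry has no offset information at all.

```python
import json

# Hand-written sample shaped like a Structured Streaming progress report.
# All values are illustrative assumptions, not output from a real job.
SAMPLE_PROGRESS = """
{
  "id": "hypothetical-query-id",
  "batchId": 7,
  "sources": [
    {
      "description": "KafkaV2[Subscribe[events]]",
      "startOffset": {"events": {"0": 80, "1": 40}},
      "endOffset": {"events": {"0": 120, "1": 55}},
      "numInputRows": 55
    }
  ],
  "sink": {
    "description": "hypothetical Kafka sink",
    "numOutputRows": 55
  }
}
"""


def source_offset_ranges(progress: dict) -> dict:
    """Map (topic, partition) -> (start, end) for Kafka-style sources.

    The start is None when the report has no start offset for a partition,
    for example on the first micro-batch.
    """
    ranges = {}
    for source in progress.get("sources", []):
        start = source.get("startOffset") or {}
        end = source.get("endOffset") or {}
        for topic, partitions in end.items():
            if not isinstance(partitions, dict):
                continue  # skip offset formats that are not topic -> partition maps
            for partition, end_offset in partitions.items():
                start_offset = start.get(topic, {}).get(partition)
                ranges[(topic, int(partition))] = (start_offset, end_offset)
    return ranges


progress = json.loads(SAMPLE_PROGRESS)
read_ranges = source_offset_ranges(progress)
sink_fields = sorted(progress["sink"].keys())

print(read_ranges)   # read side: an offset range per topic partition
print(sink_fields)   # write side: no offset fields to link against
```

In a real agent the JSON would come from the progress event delivered to the listener rather than from a string literal. Write offsets would have to be captured some other way, for example inside the sink itself, which is the pluggability problem described above.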

Read/write offset linking might not be the only way, so we are also investigating other types of lineage linking.

wajda commented 1 year ago

The Spark Streaming support has been deprioritized, so I'm removing this feature from the active backlog.