lucienfregosi opened this issue on Feb 14, 2018:

Is it planned to add an integration with Spark Streaming? It could be useful to be able to apply lineage to both batch and streaming data.
Hi Lucien, We are currently enhancing Spline to also support Structured Streaming. This feature will come with Spline version 0.3.
Regards, Marek Novotny
Perfect :)
I'm writing a blog post about Spline after testing it (in French first, maybe in English later), so I will be able to include this information in my post.
@lucienfregosi Hi, we have some basic support in version 0.3, but it is disabled at the moment. I will now be working on full support, including Structured Streaming, as the highest priority. The deadline is the end of August.
@lucienfregosi we will not support the old RDD-based streaming API at the moment. Would it be an issue for you to switch to Structured Streaming instead, which will be supported? It seems to be treated as the successor of the old streaming API.
We are withdrawing streaming support from Spline 0.4 as it was not implemented properly. Streaming is not a priority for us at the moment. We'll return to it later.
A test case - AbsaOSS/spline#331
Hi @wajda, may I confirm that Structured Streaming (e.g. the writeStream API) is not supported? Thanks
No, streaming is not supported due to fundamental problems with the definition and representation of data lineage in the context of streaming. The topic remains unclear.
Hi @wajda No problem, thanks for the confirmation.
Hello everyone, we have been investigating Spline and Spark Structured Streaming. We have been able to implement a Spline agent for Spark Structured Streaming using Spark's StreamingQueryListener, in a similar way to what is described here: https://absaoss.github.io/spline/2018/10/04/Spline-Data-Lineage-For-Structured-Streaming.html (9:02 - 11:23). Code for our POC can be found here: https://github.com/jozefbakus/spline-spark-agent/pull/1
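For readers unfamiliar with the hook mentioned above, here is a minimal sketch of the StreamingQueryListener approach. It only uses the standard Spark API; the class name LineageCaptureListener and the println bodies are invented for illustration and are not taken from the linked POC.

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Hypothetical listener (not the POC code): a lineage agent can observe the
// lifecycle of every structured streaming query running on the SparkSession.
class LineageCaptureListener extends StreamingQueryListener {

  override def onQueryStarted(event: QueryStartedEvent): Unit =
    // A natural place to inspect the query and emit an "execution plan" lineage event.
    println(s"query started: id=${event.id} runId=${event.runId} name=${event.name}")

  override def onQueryProgress(event: QueryProgressEvent): Unit =
    // Fired after every micro-batch; event.progress carries per-source offset ranges.
    println(s"micro-batch ${event.progress.batchId} of query ${event.progress.id}")

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"query terminated: id=${event.id}")
}

// Registration against an existing SparkSession:
//   spark.streams.addListener(new LineageCaptureListener)
```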
Along the way we came across one major problem: linking, i.e. connecting streaming parent-child lineages. Currently, time-based linking is used: https://absaoss.github.io/spline/2018/10/04/Spline-Data-Lineage-For-Structured-Streaming.html (18:33 - 20:14). Time-based linking is not sufficient for streaming jobs, so we are trying to find a suitable type of linking for them. One of the solutions might be using Kafka offsets in a similar way to what is described here: https://absaoss.github.io/spline/2018/10/04/Spline-Data-Lineage-For-Structured-Streaming.html (20:14 - 22:15).
To be able to link parent-child lineages, source and destination offsets (read and write offsets) are required. Spark gives us source offsets out of the box; the problem lies with destination offsets, because Spark does not provide information about which offsets the data was written to. Getting destination offsets in a clean, pluggable way is the issue we are currently trying to resolve before we can move forward.
Linking by read/write offsets might not be the only way, so we are also investigating other types of lineage linking.
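To make the offset gap concrete, here is a rough sketch of what Spark's progress events do and do not expose, assuming a Kafka source; the OffsetLoggingListener name is hypothetical and the logging is only for illustration. Per-source startOffset/endOffset arrive with every micro-batch, while the sink side carries no offset information.

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Hypothetical name, for illustration only.
class OffsetLoggingListener extends StreamingQueryListener {

  override def onQueryStarted(event: QueryStartedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    // Source side: every micro-batch reports the offset range it read. For a Kafka
    // source, startOffset/endOffset are JSON maps of topic -> partition -> offset,
    // e.g. {"my-topic":{"0":42,"1":17}}.
    p.sources.foreach { s =>
      println(s"batch=${p.batchId} source=${s.description} start=${s.startOffset} end=${s.endOffset}")
    }
    // Sink side: SinkProgress only exposes a description (and row counts in newer
    // Spark versions) -- it does not say which offsets the data was written to,
    // which is the missing piece for offset-based linking described above.
    println(s"batch=${p.batchId} sink=${p.sink.description}")
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
}
```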
Spark Streaming support has been deprioritized, so I'm removing this feature from the active backlog.