AbsaOSS / spline

Data Lineage Tracking And Visualization Solution
https://absaoss.github.io/spline/
Apache License 2.0

Is there any jar that can be attached to a Databricks cluster instead of adding dependencies in the POM? #682

Closed: uday1409 closed this issue 3 years ago

uday1409 commented 4 years ago

Background [Optional]

A clear explanation of the reason for raising the question. This gives us a better understanding of your use cases and how we might accommodate them.

Question

A clear and concise inquiry

uday1409 commented 4 years ago

Also, in the REST gateway, we have added the below for the connection string, as we were not sure about passing it in through an argument.

<Environment name="spline/database/connectionUrl" type="java.lang.String" value="arangodb://root:@localhost:8529/spline" override="true" />
wajda commented 4 years ago

Absolutely! https://absaoss.github.io/spline/#tldr

wajda commented 4 years ago

https://search.maven.org/search?q=g:za.co.absa.spline.agent.spark The bundles are exactly what you are looking for. They are fat JARs containing all Spline agent dependencies and are pre-built for different Spark and Scala versions. You can include one in the submit command or put it directly into the /jars folder of your Spark distribution.
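
For illustration, a submit command along those lines might look like this (a sketch only; the bundle coordinates assume Spark 3.0 / Scala 2.12, and the version and producer URL are placeholders to adapt to your setup):

spark-submit \
  --packages za.co.absa.spline.agent.spark:spark-3.0-spline-agent-bundle_2.12:<version> \
  --conf "spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener" \
  --conf "spark.spline.producer.url=http://<gateway-host>:8080/producer" \
  your-application.jar

Alternatively, drop the downloaded bundle jar into the /jars folder of the Spark distribution and pass only the two --conf settings.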

wajda commented 4 years ago

Also, in the REST gateway, we have added the below for the connection string, as we were not sure about passing it in through an argument.

<Environment name="spline/database/connectionUrl" type="java.lang.String" value="arangodb://root:@localhost:8529/spline" override="true" />

Looks correct.
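
For reference, in a typical Tomcat deployment of the rest-gateway WAR this JNDI entry would sit inside a <Context> element, for example in conf/Catalina/localhost/spline.xml or the webapp's META-INF/context.xml (a sketch; the exact file location depends on how the WAR is deployed):

<Context>
  <Environment name="spline/database/connectionUrl"
               type="java.lang.String"
               value="arangodb://root:@localhost:8529/spline"
               override="true" />
</Context>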

uday1409 commented 4 years ago

Thanks a lot @wajda for the quick response, as always. We are doing a POC on this:

1) In an Azure VM, install ArangoDB, initialize the database using the admin jar (see the sketch below), and deploy the REST API gateway WAR file, with the connection string configured in the context.xml file.
2) Attach the fat jar to the cluster, set configs such as the query execution listener and the producer API URL (using the VM's IP address), and call the enableLineageTracking method in the notebook.

Please let me know if you see any issues with this approach.
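
For step 1, the database-initialization step would roughly be the following, assuming the Spline admin tool's db-init command (a sketch; the admin jar version and the ArangoDB host/credentials are placeholders for your own setup):

java -jar admin-<version>.jar db-init arangodb://root:@<arangodb-host>:8529/spline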

wajda commented 4 years ago

If you are doing it centrally and set the "listener" property for Spark, there is no need to call the enableLineageTracking() method. It's either one or the other.
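
For reference, the programmatic (per-notebook) alternative looks roughly like this in Scala, assuming the agent bundle jar is already attached to the cluster:

// Per-notebook initialization; redundant if the cluster-wide
// spark.sql.queryExecutionListeners property is already set.
import za.co.absa.spline.harvester.SparkLineageInitializer._

spark.enableLineageTracking()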

uday1409 commented 4 years ago

Thanks for the clarification @wajda, that really helps. I hope we are doing the rest of it right.

wajda commented 4 years ago

Basically, by setting spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener you enable lineage tracking for the entire cluster and all jobs.
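
On Databricks, for example, the equivalent entries can be added to the cluster's Spark config (the producer URL is a placeholder for your gateway's address):

spark.sql.queryExecutionListeners za.co.absa.spline.harvester.listener.SplineQueryExecutionListener
spark.spline.producer.url http://<gateway-vm-ip>:8080/producer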

uday1409 commented 4 years ago

https://search.maven.org/search?q=g:za.co.absa.spline.agent.spark The bundles are exactly what you are looking for. They are fat JARs containing all Spline agent dependencies and are pre-built for different Spark and Scala versions. You can include one in the submit command or put it directly into the /jars folder of your Spark distribution.

Do I need to attach all 3 jars to the cluster, or only the Spline agent bundle?

wajda commented 4 years ago

Only the one bundle that corresponds to the Spark and Scala versions in use.