AbsaOSS / spline-spark-agent

Spline agent for Apache Spark
https://absaoss.github.io/spline/
Apache License 2.0

Support setting the ArangoDB name in the configuration #772

Open · zacayd opened this issue 11 months ago

zacayd commented 11 months ago

Hi, I am using Spline to capture lineage from Databricks notebooks. I put the following on the cluster, in the advanced settings:

spark.spline.mode ENABLED
spark.spline.lineageDispatcher.http.producer.url http://10.0.19.4:8080/producer
spark.spline.lineageDispatcher http
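
For context, the agent can alternatively be initialized programmatically from the notebook instead of relying only on the codeless, cluster-config approach above. A minimal sketch based on the initializer API described in the agent README, assuming the agent bundle JAR is attached to the cluster; the dispatcher and producer URL are still taken from the spline.* / spark.spline.* configuration:

```scala
import org.apache.spark.sql.SparkSession
import za.co.absa.spline.harvester.SparkLineageInitializer._

// Reuses the active session in a Databricks notebook
val spark = SparkSession.builder().getOrCreate()

// Enables lineage tracking on this session; the HTTP dispatcher and
// producer URL come from the spark.spline.* properties shown above.
spark.enableLineageTracking()
```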

Since I have several customers, I don't want to keep all of their data in the same ArangoDB, so I am looking for a way to keep the lineage in a separate database per customer.

Can we also pass the ArangoDB database name as a parameter, so that the execution plan lineage data is kept in a different database for each cluster I use?

Thanks in advance.

wajda commented 11 months ago

No, this isn't possible. The database is an internal part of the system and is not something you can easily select on a request basis.

My recommendation for your use-case would be to simply augment your execution plan and event objects with the DBR cluster name stored as an extra parameter, or a tag, and filter the stuff on the UI based on that (the feature beta is available in the develop version of the server and the UI).
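
A rough sketch of what that could look like using the agent's user-extra-metadata post-processing filter, set as cluster Spark config alongside the properties above. The property names and rule schema are recalled from the agent README and should be verified against the version you build; "acme" is a placeholder value you would set differently on each customer's cluster:

```properties
# Hypothetical example -- check the agent README for the exact filter name and rule syntax
spark.spline.postProcessingFilter userExtraMeta
spark.spline.postProcessingFilter.userExtraMeta.rules {"executionPlan":{"extra":{"customer":"acme"}},"executionEvent":{"extra":{"customer":"acme"}}}
```

With something like this in place, every execution plan and event sent by that cluster carries a customer label in its extra metadata, which the UI filtering mentioned above can then use.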

Alternatively, you may augment the URIs for the input/output sources to include the cluster name as a part of the name. That is another way to logically separate the lineage data.
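
For example, in the notebook itself the customer (or cluster) name can simply be made part of the physical paths that are read and written, so the source URIs the agent captures are already separated per customer. A sketch, where the clusterUsageTags property is a Databricks-specific setting assumed to be available and the paths are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Databricks-provided property (assumed); fall back to a literal if absent.
val clusterName = spark.conf
  .getOption("spark.databricks.clusterUsageTags.clusterName")
  .getOrElse("unknown-cluster")

val df = spark.read.parquet("dbfs:/data/input/orders") // placeholder input path

// Baking the cluster name into the output path makes the captured
// write URI distinct per customer cluster.
df.write
  .mode("overwrite")
  .parquet(s"dbfs:/data/output/$clusterName/orders_enriched")
```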

If you absolutely want to use different DBs then you can run separate Spline instances, put a custom proxy gateway in front of the Spline Producer REST API (or implement a custom LineageDispatcher wrapper) and route your requests to different Spline instances based on your custom conditions.
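
If each customer already gets a dedicated cluster, the simplest form of that routing is to point each cluster's Spark config at its own Spline Producer instance, each backed by its own ArangoDB. A sketch with placeholder hostnames:

```properties
# Cluster for customer A -> Spline instance A
spark.spline.lineageDispatcher http
spark.spline.lineageDispatcher.http.producer.url http://spline-customer-a.internal:8080/producer

# Cluster for customer B -> Spline instance B
spark.spline.lineageDispatcher http
spark.spline.lineageDispatcher.http.producer.url http://spline-customer-b.internal:8080/producer
```

A custom proxy gateway or a custom LineageDispatcher wrapper is only needed if the routing decision cannot be made statically per cluster like this.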

zacayd commented 11 months ago

Regarding "augment your execution plan and event objects with the DBR cluster name stored as an extra parameter, or a tag, and filter the stuff on the UI based on that (the feature beta is available in the develop version of the server and the UI)":

Do you mean that the name of the cluster is on the execution plan?

zacayd commented 11 months ago

Is the feature beta available as a Maven artifact in Databricks?

wajda commented 11 months ago

No. You need to build and install it from the latest development branch.

zacayd commented 11 months ago

Any chance that it will be available on Databricks cloud soon? I am having trouble building and installing it.

wajda commented 11 months ago

No ETA, unfortunately. The team has no capacity and the business priorities have changed, so the project is on hold at the moment.

zacayd commented 10 months ago

Hi, I managed to compile the project, create a JAR, and load it via DBFS. But it seems that when I run the notebook, I get lineage data while the notebook info is missing. I built from the develop branch: https://github.com/AbsaOSS/spline-spark-agent/tree/develop. Can you advise?