Spline agent affecting databricks driver performance

ganeshnikumbh commented 1 year ago

Hi @wajda, @cerveada

We are using the Spline agent with Databricks and sending lineage via HTTP requests using the HTTP dispatcher. We use an Azure Function to collect the lineage. What we saw was that during high load (and therefore high response times) on the function, if the agent is not able to establish a connection to the gateway, it keeps retrying every 2 minutes. During this time all operations on the cluster hung. I am attaching the logs for your reference. We had to remove the Spline installation and restart the cluster to bring it back to normal. We are working on improving the Azure Function response time by sizing it correctly, but we would also like to know whether anything can be done in the Spline settings to stop the retries once the gateway connection has failed. We plan to install Spline on 100 clusters and do not want to lose the business team's trust. Please help!

log4j-2023-09-12-08 (1).log

cerveada commented 1 year ago

I don't know what is doing the retry, but the agent does not. It just initializes, tries to connect to the endpoint, and then fails; that's it. Something else is then running the whole job again, I guess?

You can disable the connection check at the http dispatcher initialization, but if the endpoint is not available when the lineage is supposed to be sent, it will still fail then.

What version of Databricks and Spark does this run on?

wajda commented 1 year ago

> We plan to install spline on 100 clusters and do not want to lose business team's trust. Please Help!

In production, you definitely want to decouple your main Spark jobs from any secondary dependencies. We recommend using a resilient messaging system for this purpose. For example, the Spline agent comes with an embedded KafkaDispatcher. Alternatively, you can set up a highly available HTTP gateway (maybe an Azure Function) that accepts connections from Spline and forwards the data to a messaging system, decoupling the agent from potentially expensive or unstable downstream processing of the lineage metadata. Such a technique allows your Spark jobs to send the lineage info and carry on with their main work, making the whole system more robust.
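
A minimal sketch of the "accept fast, process later" idea, assuming nothing about Spline's or Azure's actual APIs: the gateway only validates and enqueues the payload, answers immediately, and a background consumer does the slow work. `accept_lineage`, `drain`, and the in-memory `queue.Queue` are hypothetical stand-ins; in a real setup the buffer would be a durable messaging system such as Kafka.

```python
import json
import queue
import threading


def accept_lineage(body, buffer):
    """Hypothetical gateway handler: validate, enqueue, return an HTTP status."""
    try:
        json.loads(body)  # cheap validation only; no heavy work on the request path
    except ValueError:
        return 400  # malformed payload
    try:
        buffer.put_nowait(body)
    except queue.Full:
        return 503  # shed load quickly instead of keeping the caller waiting
    return 202  # accepted for later processing


def drain(buffer, sink, stop):
    """Background consumer: the slow processing happens here, off the request path."""
    while not (stop.is_set() and buffer.empty()):
        try:
            item = buffer.get(timeout=0.1)
        except queue.Empty:
            continue
        sink.append(json.loads(item))  # placeholder for expensive processing
        buffer.task_done()


if __name__ == "__main__":
    # In-memory stand-in for a messaging system; NOT resilient, illustration only.
    lineage_buffer = queue.Queue(maxsize=1000)
    processed = []
    stop = threading.Event()
    worker = threading.Thread(target=drain, args=(lineage_buffer, processed, stop))
    worker.start()
    print(accept_lineage('{"plan": "example"}', lineage_buffer))
    lineage_buffer.join()  # wait until the consumer has handled everything
    stop.set()
    worker.join()
    print(processed)
```

The point of the design is that the status code returned to the caller never depends on how long the downstream processing takes, only on whether the buffer can take one more message.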

> We had to remove spline installation and restart the cluster to make it normal.

To temporarily disable Spline Agent you can simply set the property spline.mode=DISABLED. No need to actually uninstall it.

ganeshnikumbh commented 1 year ago

> I don't know who is doing the retry, but the Agent does not. It just initializes and try to connect to the endpoint and then fails, that's it. Something else is then running the whole job again, I guess?

> You can disable the connection check at the http dispatcher initialization, but if the endpoint is not available when the lineage is supposed to be sent, it will still fail then.

> What version of Databricks and Spark this runs on?

Hi @cerveada, @wajda, we are using different DBR versions, such as 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12) and 12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12), and we use the Spline agent version that matches the Spark version.

We also see that, even in the normal scenario (no load on the Azure Function), Spline initialization happens twice when the cluster starts. Please see the attached logs from the time of cluster start. You will see the "Spark Lineage tracking is ENABLED" message twice, once at "23/10/05 07:48:21" and again at "23/10/05 07:48:37". Any idea why it is trying to enable itself twice?

log4j-clusterStart.txt

cerveada commented 1 year ago

Could you try to use programmatic initialization instead of codeless? https://github.com/AbsaOSS/spline-spark-agent#initialization

According to this guide, there were issues with codeless init: https://github.com/AbsaOSS/spline-getting-started/tree/main/spline-on-databricks

wajda commented 1 year ago

The init type is codeless; that is visible from the logs. Also, from what I can see, there must have been two independent Spark sessions, or even contexts, being created. I don't know why this is happening, but it has nothing to do with Spline. The Spline agent is just a Spark listener registered via the Spark public API, that's it. The listener doesn't contain any shared state, so if for some reason the Spark driver decides to create two instances of the same listener, there should be no impact (though we didn't test this scenario, as normally this doesn't happen and listeners are shared between sessions). In other words, I don't know why the agent is initialised twice in your setup, but that by itself is unlikely to cause further issues; you should get lineage normally.

Try switching the dispatcher from http to console or logging to remove the dependency on your Azure Function and see if it makes any difference. If it works and you see lineage JSON in the logs, then the issue is definitely in your Azure Function.

ganeshnikumbh commented 1 year ago

Sorry to bother you with this again, but receiving lineage is not an issue, even with the Azure Function; we are receiving lineage fine. Our only concerns are that Spline initializes twice when the cluster starts, and that when we had the function response issue, the agent went into a loop trying to connect even though the first connection attempt had failed. We would appreciate it if you could check this when you get some time.

wajda commented 1 year ago

As I tried to explain above, the only reason I see for multiple Spline inits is that there are multiple Spark inits. The Spark session might be repeatedly timing out and something re-runs your Spark job. Otherwise I cannot explain it. Try to enable DEBUG or even TRACE log level and see what's happening.
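
One way to start that investigation is to pull the timestamps of every init message out of the driver log and then look at what else the driver logged around those moments. The sketch below is only an illustration: it assumes each log line starts with a `yy/MM/dd HH:mm:ss` timestamp, and the sample lines are simplified and made up (only the message text and the two timestamps come from this thread).

```python
import re
from datetime import datetime

MARKER = "Spark Lineage tracking is ENABLED"
# Assumption: every relevant line begins with a `yy/MM/dd HH:mm:ss` timestamp.
TIMESTAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})")


def init_times(log_lines, marker=MARKER):
    """Return the timestamp of every line that contains the marker."""
    times = []
    for line in log_lines:
        match = TIMESTAMP.match(line)
        if match and marker in line:
            times.append(datetime.strptime(match.group(1), "%y/%m/%d %H:%M:%S"))
    return times


# Simplified, made-up sample; a real driver log carries more fields per line.
sample = [
    "23/10/05 07:48:21 INFO Spark Lineage tracking is ENABLED",
    "23/10/05 07:48:30 INFO some unrelated driver message",
    "23/10/05 07:48:37 INFO Spark Lineage tracking is ENABLED",
]
times = init_times(sample)
# Seconds between consecutive initializations.
gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
print(len(times), gaps)
```

With the init times in hand, the DEBUG or TRACE output between them should show whether a second Spark session or context was being created at that point.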

AbsaOSS / spline-spark-agent

Spline agent affecting databricks driver performance #747