LucaCanali / sparkMeasure

This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.

SparkMeasure isn't working with Databricks Unity Catalog and when using Spark Connect in general #62

Open · JonatanTorres opened this issue 1 month ago

JonatanTorres commented 1 month ago

I am trying to use sparkMeasure on Databricks, but unfortunately it does not work when the cluster is Unity Catalog-enabled (Runtime 14.3 LTS).

When running the following code:

from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)

It returns this error:

[JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `sparkContext` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session. Visit https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession for creating regular Spark Session in detail.
File <command-4176627973341233>, line 1
----> 1 stagemetrics = StageMetrics(spark)
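
From what I understand, the exception means that on Unity Catalog clusters the session exposed as spark is a Spark Connect session, which has no JVM-backed sparkContext. A quick way to confirm this (a minimal sketch, assuming PySpark 3.4+ where the pyspark.sql.connect module exists):

try:
    # Spark Connect sessions are a separate class in PySpark 3.4+
    from pyspark.sql.connect.session import SparkSession as ConnectSession
    is_connect = isinstance(spark, ConnectSession)
except ImportError:
    is_connect = False  # PySpark < 3.4: Spark Connect not available
print("Spark Connect session:", is_connect)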

I have already tried some session configuration changes, but nothing worked. When I create a cluster on the same runtime (14.3 LTS) but without Unity Catalog, the code works normally.

Is there any way to solve this? Thanks a lot!

LucaCanali commented 4 weeks ago

Hi @JonatanTorres, thank you for reporting this. sparkMeasure relies on Spark metrics being "transported" from the server side to the client side via a Spark Listener. This does not work with Spark Connect, which decouples the client from the server. I am not aware of a solution to this issue right now, but it is something I'd like to investigate further.
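
For reference, this is the usual flow on a classic (non-Connect) session, where the listener-based collection works; the workload query here is just a placeholder. Note that constructing StageMetrics touches spark.sparkContext, which is exactly the attribute Spark Connect refuses to expose:

from sparkmeasure import StageMetrics

stagemetrics = StageMetrics(spark)  # accesses spark.sparkContext, hence the failure under Spark Connect
stagemetrics.begin()                # start collecting stage-level metrics via the Spark Listener
spark.sql("select count(*) from range(1000) cross join range(1000)").show()
stagemetrics.end()                  # stop collecting
stagemetrics.print_report()         # print the aggregated stage metrics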

JonatanTorres commented 4 weeks ago

Thank you very much for the response, Luca! I will also try to ask Databricks directly!