capitalone / datacompy

Pandas, Polars, and Spark DataFrame comparison for humans and more!
https://capitalone.github.io/datacompy/
Apache License 2.0

SparkCompare fails on Databricks DBR Spark clusters with Unity Catalog enabled #312

Closed by lingeshr-db 3 months ago

lingeshr-db commented 3 months ago

We encountered an issue when using SparkCompare (datacompy 0.12.0) against tables registered with Unity Catalog on a Databricks DBR 13.3 (or later) cluster.

The error message is:

```
py4j.security.Py4JSecurityException: Method public boolean org.apache.spark.sql.internal.CatalogImpl.dropTempView(java.lang.String) is not whitelisted on class class org.apache.spark.sql.internal.CatalogImpl
```

This is a known limitation when working with Unity Catalog-registered tables on Databricks DBR: not all functions are whitelisted, in particular internal CatalogImpl APIs such as dropTempView(). If a function is not whitelisted by default, it may be considered unsafe, since table ACLs or other modes of access control cannot be enforced on it, and Unity Catalog therefore blocks the call.

As a workaround, the public API counterparts, such as spark.catalog.dropTempView(), should be used instead of the restricted internal APIs.
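
For illustration, a minimal sketch of the difference (the view name is hypothetical):

```python
# Minimal sketch contrasting the blocked internal call with the public
# API workaround. The view name "my_temp_view" is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(5).createOrReplaceTempView("my_temp_view")

# Blocked on Unity Catalog clusters: calling into the JVM-side
# org.apache.spark.sql.internal.CatalogImpl raises Py4JSecurityException.
# spark._jsparkSession.catalog().dropTempView("my_temp_view")

# Allowed: the whitelisted public Python API counterpart.
spark.catalog.dropTempView("my_temp_view")
```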

SparkCompare may depend on more such internal APIs and, at least as of today, may not work with Databricks Unity Catalog tables.

Please investigate this issue and consider updating SparkCompare to use the public API counterparts so it remains compatible with Unity Catalog-enabled tables and environments. Thank you!

P.S:

  1. LegacySparkCompare() works fine.
  2. SparkCompare() works fine for Hive Metastore-registered tables on any Databricks DBR version.
fdosani commented 3 months ago

Unfortunately, I don't have access to Databricks. DataComPy is pretty agnostic with its Spark code; assuming you are using the Pandas on Spark version, it's all just vanilla code in that sense.
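
For reference, the Pandas on Spark path looks roughly like this (a minimal sketch; the toy frames and the "id" join column are made up, assuming the 0.12.x SparkCompare mirrors the Compare signature):

```python
# Minimal sketch of the Pandas on Spark SparkCompare in datacompy 0.12.x.
# The toy frames and the "id" join column are hypothetical.
import pyspark.pandas as ps
from datacompy import SparkCompare

base = ps.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
compare = ps.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 31.0]})

comparison = SparkCompare(base, compare, join_columns="id")
print(comparison.report())
```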

I have a PySpark SQL PR which is in review (#310) if you want to try that out.

fdosani commented 3 months ago

Just checked the docs: https://docs.databricks.com/en/compute/access-mode-limitations.html#udf-limitations-for-unity-catalog-shared-access-mode

> In Databricks Runtime 13.3 LTS and above, Python scalar UDFs and Pandas UDFs are supported. Other Python UDFs, including UDAFs, UDTFs, and Pandas on Spark, are not supported.

Your env won't support the Pandas on Spark API implementation. You need to use either LegacySparkCompare or the new SparkSQLCompare once released (assuming that works ok).
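
In the meantime, the legacy path looks roughly like this (a minimal sketch; the import path varies by datacompy version and the example frames are made up):

```python
# Minimal sketch of LegacySparkCompare, which reportedly works on Unity
# Catalog clusters. Import path and example data are assumptions.
from pyspark.sql import SparkSession
from datacompy.legacy import LegacySparkCompare  # location varies by version

spark = SparkSession.builder.getOrCreate()
base_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
compare_df = spark.createDataFrame([(1, "a"), (2, "c")], ["id", "val"])

comparison = LegacySparkCompare(
    spark,
    base_df,
    compare_df,
    join_columns=["id"],
)
comparison.report()  # the legacy report prints to stdout by default
```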