G-Research / spark-extension

A library that provides useful extensions to Apache Spark and PySpark.
Apache License 2.0

Not able to use spark-extension package with Spark Connect server / Databricks 14.x runtime #246

Open shri-0509 opened 3 months ago

shri-0509 commented 3 months ago

Here is the code:

from gresearch.spark.diff import *
left = spark.createDataFrame([(1, "one"), (2, "two"), (3, "three")], ["id", "value"])
right = spark.createDataFrame([(1, "one"), (2, "Two"), (4, "four")], ["id", "value"])
print(spark.version)
left.diff(right).show()

Spark version: 3.5.0
Error: [[ATTRIBUTE_NOT_SUPPORTED](https://docs.microsoft.com/azure/databricks/error-messages/error-classes#attribute_not_supported)] Attribute `diff` is not supported.

I have added the Maven library uk.co.gresearch.spark:spark-extension_2.12:2.12.0-3.5.

EnricoMi commented 3 months ago

Looks like Databricks does not like how we extend DataFrame with .diff.

You can diff as follows:

diff(left, right)

EnricoMi commented 3 months ago

Maybe spark.createDataFrame does not return a pyspark.sql.dataframe.DataFrame but some Databricks-specific DataFrame.

Could you please execute the following on your side and share the output?

print(type(left))

shri-0509 commented 3 months ago

Yes, it returns a different DataFrame class: <class 'pyspark.sql.connect.dataframe.DataFrame'>.
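
That class name is consistent with the error. If the library attaches .diff to pyspark.sql.dataframe.DataFrame at import time, as the earlier comment about extending DataFrame suggests, the method would not appear on a separate Spark Connect class. The sketch below illustrates the mechanism with two stand-in classes; ClassicDataFrame, ConnectDataFrame, and the helper is_connect_dataframe are hypothetical names for illustration, not PySpark or spark-extension APIs.

```python
class ClassicDataFrame:
    """Stand-in for pyspark.sql.dataframe.DataFrame (hypothetical)."""


class ConnectDataFrame:
    """Stand-in for pyspark.sql.connect.dataframe.DataFrame (hypothetical)."""


# Pretend the stand-in lives in the Spark Connect module, for illustration only.
ConnectDataFrame.__module__ = "pyspark.sql.connect.dataframe"


def diff(self, other):
    """Placeholder for a module-level diff function."""
    return "diff result"


# Attaching the function to one class extends only that class.
ClassicDataFrame.diff = diff


def is_connect_dataframe(df) -> bool:
    """Return True if the object's type is defined under pyspark.sql.connect."""
    return type(df).__module__.startswith("pyspark.sql.connect")


classic, connect = ClassicDataFrame(), ConnectDataFrame()
print(hasattr(classic, "diff"))       # True
print(hasattr(connect, "diff"))       # False
print(is_connect_dataframe(connect))  # True
```

A check like is_connect_dataframe only identifies the session type; it does not make the diff itself work on Spark Connect.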

EnricoMi commented 3 months ago

I managed to reproduce the issue with a local Spark Connect server. It looks like diffing does not work with Spark Connect. I will investigate a fix.

shri-0509 commented 3 months ago

Sure, thank you. Right now I am using a left_anti join to get the added and deleted rows, and an inner join to get the modified and unchanged rows. I am thinking of using this library to do the same.
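
For reference, the join-based classification described above can be sketched as follows. This is a plain-Python illustration of the logic on lists of (id, value) tuples, not PySpark code and not how the library implements its diff; classify_rows is a hypothetical helper, and the sample rows are the ones from the original report.

```python
def classify_rows(left, right):
    """Classify ids as added, deleted, modified, or unchanged.

    Mirrors the join-based approach: an anti join in each direction finds
    rows present on one side only, and an inner join on id finds rows whose
    values can then be compared.
    """
    left_by_id = dict(left)
    right_by_id = dict(right)

    # "Left anti join" in each direction: ids present on one side only.
    deleted = sorted(left_by_id.keys() - right_by_id.keys())
    added = sorted(right_by_id.keys() - left_by_id.keys())

    # "Inner join" on id, then compare the values.
    common = sorted(left_by_id.keys() & right_by_id.keys())
    modified = [k for k in common if left_by_id[k] != right_by_id[k]]
    unchanged = [k for k in common if left_by_id[k] == right_by_id[k]]

    return {
        "added": added,
        "deleted": deleted,
        "modified": modified,
        "unchanged": unchanged,
    }


left = [(1, "one"), (2, "two"), (3, "three")]
right = [(1, "one"), (2, "Two"), (4, "four")]
result = classify_rows(left, right)
print(result)
# {'added': [4], 'deleted': [3], 'modified': [2], 'unchanged': [1]}
```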