G-Research / spark-extension

A library that provides useful extensions to Apache Spark and PySpark.
Apache License 2.0

Error: 'JavaPackage' object is not callable #242

Open rish-shar opened 3 months ago

rish-shar commented 3 months ago

Description

I have two PySpark DataFrames, source_df and target_df. I ran pip install pyspark-extension to install the diff extension.

Spark version: 3.4.1, Scala version: 2.12

When I run source_df.diff(target_df), I get the error below:

TypeError                                 Traceback (most recent call last)
File <command-2426417243632400>, line 1
----> 1 source_df.diff(target_df, )

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/gresearch/spark/diff/__init__.py:427, in diff(self, other, *id_columns)
    367 def diff(self: DataFrame, other: DataFrame, *id_columns: str) -> DataFrame:
    368     """
    369     Returns a new DataFrame that contains the differences between this and the other DataFrame.
    370     Both DataFrames must contain the same set of column names and data types.
   (...)
    425     :rtype DataFrame
    426     """
--> 427     return Differ().diff(self, other, *id_columns)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/gresearch/spark/diff/__init__.py:337, in Differ.diff(self, left, right, *id_columns)
    274 """
    275 Returns a new DataFrame that contains the differences between the two DataFrames.
    276 
   (...)
    334 :rtype DataFrame
    335 """
    336 jvm = left._sc._jvm
--> 337 jdiffer = self._to_java(jvm)
    338 jdf = jdiffer.diff(left._jdf, right._jdf, _to_seq(jvm, list(id_columns)))
    339 return DataFrame(jdf, left.session_or_ctx())

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/gresearch/spark/diff/__init__.py:270, in Differ._to_java(self, jvm)
    269 def _to_java(self, jvm: JVMView) -> JavaObject:
--> 270     jdo = self._options._to_java(jvm)
    271     return jvm.uk.co.gresearch.spark.diff.Differ(jdo)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/gresearch/spark/diff/__init__.py:245, in DiffOptions._to_java(self, jvm)
    235 def _to_java(self, jvm: JVMView) -> JavaObject:
    236     return jvm.uk.co.gresearch.spark.diff.DiffOptions(
    237         self.diff_column,
    238         self.left_column_prefix,
    239         self.right_column_prefix,
    240         self.insert_diff_value,
    241         self.change_diff_value,
    242         self.delete_diff_value,
    243         self.nochange_diff_value,
    244         jvm.scala.Option.apply(self.change_column),
--> 245         self.diff_mode._to_java(jvm),
    246         self.sparse_mode,
    247         self.default_comparator._to_java(jvm),
    248         self._to_java_map(jvm, self.data_type_comparators, key_to_java=self._to_java_data_type),
    249         self._to_java_map(jvm, self.column_name_comparators)
    250     )

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/gresearch/spark/diff/__init__.py:37, in DiffMode._to_java(self, jvm)
     36 def _to_java(self, jvm: JVMView) -> JavaObject:
---> 37     return jvm.uk.co.gresearch.spark.diff.DiffMode.withNameOption(self.name).get()

TypeError: 'JavaPackage' object is not callable

Any help would be appreciated.

liteart commented 2 months ago

The Python pip package only contains the stubs for code completion. Spark requires the Java package to be installed (the Python package is not necessary on Databricks).

Add a Maven library and pass uk.co.gresearch.spark:spark-extension_2.13:2.12.0-3.5 as the Maven coordinate, and the extension will load as expected.
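Outside Databricks, the same JVM dependency can be resolved from Maven when the Spark session is created. A minimal sketch, assuming a plain PySpark session rather than a Databricks cluster (the coordinate must match your Scala and Spark versions):

```python
from pyspark.sql import SparkSession

# Pull the spark-extension JVM artifact from Maven Central at session start.
# Adjust the Scala suffix (_2.12/_2.13) and the trailing Spark version
# (-3.4/-3.5) to match your environment.
spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "uk.co.gresearch.spark:spark-extension_2.13:2.12.0-3.5")
    .getOrCreate()
)
```

Equivalently, the same coordinate can be passed on the command line via `pyspark --packages …` or `spark-submit --packages …`.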

rish-shar commented 2 months ago

> The Python pip package only contains the stubs for code completion. Spark requires the Java package to be installed (the Python package is not necessary on Databricks).
>
> Add a Maven library and pass uk.co.gresearch.spark:spark-extension_2.13:2.12.0-3.5 as the Maven coordinate, and the extension will load as expected.

@liteart How do I achieve this on Databricks? Do I need to add the package at the cluster level, then?

EnricoMi commented 2 months ago

See:
https://docs.databricks.com/en/libraries/package-repositories.html#maven-or-spark-package
https://www.databricks.com/blog/2015/07/28/using-3rd-party-libraries-in-databricks-apache-spark-packages-and-maven-libraries.html

EnricoMi commented 2 months ago

> Add a Maven library and pass uk.co.gresearch.spark:spark-extension_2.13:2.12.0-3.5 as the Maven coordinate, ...

In your setup (Scala 2.12, Spark 3.4.1), this should be uk.co.gresearch.spark:spark-extension_2.12:2.12.0-3.4.
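The coordinate follows the pattern uk.co.gresearch.spark:spark-extension_&lt;Scala binary version&gt;:&lt;extension version&gt;-&lt;Spark major.minor&gt;. A small illustrative helper (hypothetical, not part of the library) that builds it from the versions in play:

```python
def spark_extension_coordinate(scala: str, spark: str,
                               ext_version: str = "2.12.0") -> str:
    """Build the Maven coordinate for spark-extension.

    scala: Scala binary version, e.g. "2.12"
    spark: Spark version, e.g. "3.4.1" (only major.minor is used)
    ext_version: spark-extension release version
    """
    spark_minor = ".".join(spark.split(".")[:2])
    return f"uk.co.gresearch.spark:spark-extension_{scala}:{ext_version}-{spark_minor}"

# For the reporter's setup (Scala 2.12, Spark 3.4.1):
print(spark_extension_coordinate("2.12", "3.4.1"))
# uk.co.gresearch.spark:spark-extension_2.12:2.12.0-3.4
```

A mismatched Scala suffix or Spark version leaves the JVM classes off the classpath, which is exactly what surfaces in Python as the 'JavaPackage' object is not callable error above.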