G-Research / spark-extension

A library that provides useful extensions to Apache Spark and PySpark.
Apache License 2.0

Comparators error when using pyspark #225

Closed hbashary closed 9 months ago

hbashary commented 9 months ago

I am trying to run the example from the documentation using PySpark, but I keep getting the following error: AttributeError: 'DiffOptions' object has no attribute 'withComparator'.

I am running this in a Glue notebook with Spark 3.3 and spark-extension_2.12-2.8.0. The same issue occurs after upgrading to spark-extension_2.13-2.11.0. Is this method supported in the Python API?

Create 2 dataframes

from pyspark.sql import Row

df_1 = spark.createDataFrame([
    Row(id=1, value=1.0),
    Row(id=2, value=2.0),
    Row(id=3, value=3.0),
])

df_2 = spark.createDataFrame([
    Row(id=1, value=1.0),
    Row(id=2, value=2.02),
    Row(id=3, value=3.05),
])

Run Comparator method

from pyspark.sql.types import DoubleType
from gresearch.spark.diff import DiffOptions, DiffMode, DiffComparators

options = DiffOptions().with_change_column("changes")\
                       .withComparator(DiffComparators.epsilon(0.01).asRelative().asInclusive(), DoubleType)

df_1.diff_with_options(df_2, options, "id").show()

Error - AttributeError: 'DiffOptions' object has no attribute 'withComparator'

EnricoMi commented 9 months ago

You are right, the Python example code in DIFF.md was wrong; it should read with_data_type_comparator(...).

Please modify your code as follows:

-.withComparator(DiffComparators.epsilon(0.01).asRelative().asInclusive(), DoubleType)
+.with_data_type_comparator(DiffComparators.epsilon(0.01).as_relative().as_inclusive(), DoubleType())

I have fixed the DIFF.md.
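For readers wondering what the relative, inclusive epsilon comparator should do with the two dataframes above, here is a minimal pure-Python sketch. It is not the library's code: the helper name `epsilon_equal` is hypothetical, and it assumes that a relative threshold scales epsilon by the larger of the two absolute values, which is my reading of DIFF.md.

```python
def epsilon_equal(left: float, right: float, epsilon: float,
                  relative: bool = True, inclusive: bool = True) -> bool:
    """Illustrative stand-in for an epsilon comparator (hypothetical helper)."""
    diff = abs(left - right)
    # Assumption: a relative threshold scales epsilon by the larger absolute value.
    threshold = epsilon * max(abs(left), abs(right)) if relative else epsilon
    # Inclusive treats a difference exactly at the threshold as equal.
    return diff <= threshold if inclusive else diff < threshold


# The value pairs from the two dataframes above, keyed by id.
pairs = {1: (1.0, 1.0), 2: (2.0, 2.02), 3: (3.0, 3.05)}
for key, (left, right) in pairs.items():
    same = epsilon_equal(left, right, epsilon=0.01)
    print(key, "unchanged" if same else "changed")
```

Under that assumption, ids 1 and 2 count as unchanged (0.02 is within 1% of 2.02) while id 3 counts as changed (0.05 exceeds 1% of 3.05).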

hbashary commented 9 months ago

Thanks for the quick response. One last question - the map attribute doesn't seem to be supported for python.

options = DiffOptions().with_change_column("changes")\
                       .with_data_type_comparator(DiffComparators.map[K,V](false))

Error - AttributeError: type object 'DiffComparators' has no attribute 'map'

EnricoMi commented 9 months ago

Right, the Python API does not support the Map comparator. I haven't yet figured out how to get the key and value types K and V from Python to Scala.

hbashary commented 9 months ago

Thanks Enrico.

EnricoMi commented 9 months ago

I have found a way to provide the MapDiffComparator to the Python API: #226

That fix allows for DiffComparators.map(IntegerType(), LongType()) in Python.

EnricoMi commented 6 months ago

This has been released.