capitalone / datacompy

Pandas, Polars, and Spark DataFrame comparison for humans and more!
https://capitalone.github.io/datacompy/
Apache License 2.0
420 stars 124 forks

Spark Comparison fails to detect similarities in unordered data #314

Open shreya-goddu opened 1 week ago

shreya-goddu commented 1 week ago

Found a few cases where datacompy returns mismatches with Spark dataframe comparisons when the data is not sorted (using v0.11.3)

Cases where it reports mismatches when it shouldn't:

  1. Column has an array data type and the dataframes have the same values but different orders

DF1
+---+------+
|  a|     b|
+---+------+
|  1|[1, 2]|
+---+------+

DF2
+---+------+
|  a|     b|
+---+------+
|  1|[2, 1]|
+---+------+

import datacompy

comparison = datacompy.SparkCompare(spark, df, df2, join_columns=['a'])
comparison.rows_both_mismatch.show()

+---+------+---------+-------+
|  a|b_base|b_compare|b_match|
+---+------+---------+-------+
|  1|[1, 2]|   [2, 1]|  false|
+---+------+---------+-------+

  2. Join columns contain non-unique values and the dataframes hold the same rows in different orders

DF1
+---+---+
|  a|  b|
+---+---+
|  1|  1|
|  1|  2|
+---+---+

DF2
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  1|  1|
+---+---+

import datacompy

comparison = datacompy.SparkCompare(spark, df, df2, join_columns=['a'])
comparison.rows_both_mismatch.show()

+---+------+---------+-------+
|  a|b_base|b_compare|b_match|
+---+------+---------+-------+
|  1|     1|        2|  false|
+---+------+---------+-------+

fdosani commented 1 week ago

@shreya-goddu Point 2 is a known issue. The legacy SparkCompare you are using drops duplicate rows before it does anything else. This was one of the reasons we wanted to align with the Pandas API: the legacy comparator is now deprecated, and it has been replaced by new Spark implementations (one on the Pandas-on-Spark API and one on Spark SQL).

Just to show you what I mean:

import pandas as pd
from datacompy.spark.legacy import LegacySparkCompare  # import path starting in v0.12.0
pdf1 = pd.DataFrame({'a': [1, 1], 'b': [1, 2]})
pdf2 = pd.DataFrame({'a': [1, 1], 'b': [2, 1]})
sdf1 = spark.createDataFrame(pdf1)
sdf2 = spark.createDataFrame(pdf2)

compare = LegacySparkCompare(spark, sdf1, sdf2, join_columns=["a"])
compare.rows_both_mismatch.show()  # gives the same issue you were experiencing
+---+------+---------+-------+
|  a|b_base|b_compare|b_match|
+---+------+---------+-------+
|  1|     1|        2|  false|
+---+------+---------+-------+

# under the hood one of the rows is actually dropped.
compare.base_df.show()
+---+---+
|  a|  b|
+---+---+
|  1|  1|
+---+---+
fdosani commented 1 week ago

For point 1:

For array types, I don't think our compare logic supports them, so it isn't surprising that [1, 2] is reported as unequal to [2, 1]. It also depends on your definition of equality: some would say those arrays are equal, others would not.
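If "same elements, any order" is the equality you want, one option is to normalize the array columns before running the comparison; in PySpark, `pyspark.sql.functions.array_sort` (Spark 2.4+) applied to both sides reduces order-insensitive equality to plain equality. A sketch of the same idea in pure Python (the function name is just for illustration):

```python
# Order-insensitive equality for array-valued cells: compare sorted copies.
def arrays_equal_unordered(xs, ys):
    return sorted(xs) == sorted(ys)

print(arrays_equal_unordered([1, 2], [2, 1]))     # True
print(arrays_equal_unordered([1, 2], [1, 2, 2]))  # False (element counts differ)
```

Note that sorting compares multisets, so duplicate elements are handled correctly, unlike a set-based comparison.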