Open shreya-goddu opened 1 week ago
@shreya-goddu This is a known issue (point 2). It happens mainly because the legacy SparkCompare you are using drops duplicates before it does anything else. That was one of the reasons we wanted to align with Pandas: we have deprecated the legacy class and added new Spark versions (Pandas on Spark API and also Spark SQL).
Just to show you what I mean:
import pandas as pd
from datacompy.spark.legacy import LegacySparkCompare  # starting v0.12.0
pdf1 = pd.DataFrame({'a': [1, 1], 'b': [1, 2]})
pdf2 = pd.DataFrame({'a': [1, 1], 'b': [2, 1]})
sdf1 = spark.createDataFrame(pdf1)
sdf2 = spark.createDataFrame(pdf2)
compare = LegacySparkCompare(spark, sdf1, sdf2, join_columns=["a"])
compare.rows_both_mismatch.show() # gives the same issue you were experiencing
+---+------+---------+-------+
|  a|b_base|b_compare|b_match|
+---+------+---------+-------+
|  1|     1|        2|  false|
+---+------+---------+-------+
# under the hood one of the rows is actually dropped.
compare.base_df.show()
+---+---+
|  a|  b|
+---+---+
|  1|  1|
+---+---+
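To make the effect concrete, here is a rough pandas-only sketch of what deduplicating on the join column before the join does to this data. It is an illustration and not the LegacySparkCompare code. It assumes the dedup is keyed on the join column, and it uses pandas' keep-first behaviour, whereas Spark's `dropDuplicates` does not guarantee which row survives.

```python
import pandas as pd

pdf1 = pd.DataFrame({"a": [1, 1], "b": [1, 2]})
pdf2 = pd.DataFrame({"a": [1, 1], "b": [2, 1]})

# Assumption: duplicates are dropped per join key before comparing.
# pandas keeps the first row per key; Spark may keep either one.
base = pdf1.drop_duplicates(subset=["a"])
compare = pdf2.drop_duplicates(subset=["a"])

# Join the deduplicated frames on the key and compare the value column.
joined = base.merge(compare, on="a", suffixes=("_base", "_compare"))
joined["b_match"] = joined["b_base"] == joined["b_compare"]

print(joined)
```

Only one row per side reaches the join, so the surviving rows (b=1 on the base side, b=2 on the compare side) are reported as a mismatch even though both frames contain the same set of rows.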
For point 1:
For array types, I don’t know if we support those in our compare logic, so it isn’t surprising that it says [1,2] doesn’t equal [2,1]. It would also depend on your definition of equality: some people would say they are equal, others would not.
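As a small illustration of why the definition matters, the plain-Python sketch below compares the same two arrays under three possible notions of equality. The helper names are made up for this example and are not part of datacompy.

```python
from collections import Counter


def equal_ordered(left, right):
    """Order-sensitive: same elements in the same positions."""
    return list(left) == list(right)


def equal_as_multiset(left, right):
    """Order-insensitive, but repeated elements are counted."""
    return Counter(left) == Counter(right)


def equal_as_set(left, right):
    """Order-insensitive, repeated elements ignored."""
    return set(left) == set(right)


print(equal_ordered([1, 2], [2, 1]))         # False
print(equal_as_multiset([1, 2], [2, 1]))     # True
print(equal_as_set([1, 1, 2], [2, 1]))       # True
print(equal_as_multiset([1, 1, 2], [2, 1]))  # False
```

If order should not matter for your data, one option might be to normalize the arrays on both sides before comparing, for example with `pyspark.sql.functions.sort_array`. Whether the compare logic then treats array columns the way you expect would still need to be checked.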
I found a few cases where datacompy reports mismatches in Spark DataFrame comparisons when the data is not sorted (using v0.11.3).
Cases where it reports mismatches when it shouldn't:
DF1
+---+------+
|  a|     b|
+---+------+
|  1|[1, 2]|
+---+------+
DF2
+---+------+
|  a|     b|
+---+------+
|  1|[2, 1]|
+---+------+
`import datacompy
comparison = datacompy.SparkCompare(spark, df, df2, join_columns=['a'])
comparison.rows_both_mismatch.show()`
+---+------+---------+-------+
|  a|b_base|b_compare|b_match|
+---+------+---------+-------+
|  1|[1, 2]|   [2, 1]|  false|
+---+------+---------+-------+
DF1
+---+---+
|  a|  b|
+---+---+
|  1|  1|
|  1|  2|
+---+---+
DF2
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  1|  1|
+---+---+
`import datacompy
comparison = datacompy.SparkCompare(spark, df, df2, join_columns=['a'])
comparison.rows_both_mismatch.show()`
+---+------+---------+-------+
|  a|b_base|b_compare|b_match|
+---+------+---------+-------+
|  1|     1|        2|  false|
+---+------+---------+-------+