capitalone / datacompy

Pandas, Polars, and Spark DataFrame comparison for humans and more!
https://capitalone.github.io/datacompy/
Apache License 2.0

`report` throws an exception when all columns match but no rows match #276

Closed SimonBFrank closed 6 months ago

SimonBFrank commented 6 months ago

When two Spark DataFrames have all columns in common but zero matching rows, the `report` method of `SparkCompare` throws an exception. Below is an example piece of code and the result.


Source Code:

import datacompy as dc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two DataFrames with identical schemas but no overlapping (id, label) keys
df1 = spark.createDataFrame(
    [
        (1, "foo", 1),
        (2, "bar", 1),
    ],
    ["id", "label", "tmp"],
)

df2 = spark.createDataFrame(
    [
        (3, "foo", 1),
        (4, "bar", 1),
    ],
    ["id", "label", "tmp"],
)

comparison_report = dc.SparkCompare(
    spark,
    base_df=df1,
    compare_df=df2,
    join_columns=["id", "label"],
    cache_intermediates=True,
)
comparison_report.report()

Result:

****** Column Summary ******
Number of columns in common with matching schemas: 3
Number of columns in common with schema differences: 0
Number of columns in base but not compare: 0
Number of columns in compare but not base: 0

****** Row Summary ******
Number of rows in common: 0
Number of rows in base but not compare: 2
Number of rows in compare but not base: 2
Number of duplicate rows found in base: 0
Number of duplicate rows found in compare: 0

****** Row Comparison ******
Number of rows with some columns unequal: 0
Number of rows with all columns equal: 0

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
----> 1 comparison_report.report()
/python3.8/site-packages/datacompy/spark.py in report(self, file)
    889         self._merge_dataframes()
    890         self._print_num_of_rows_with_column_equality(file)
--> 891         self._print_row_matches_by_column(file)

/lib/python3.8/site-packages/datacompy/spark.py in _print_row_matches_by_column(self, myfile)
    723             if self.columns_match_dict[key][MatchType.MISMATCH.value]
    724         }
--> 725         columns_fully_matching = {
    726             key: self.columns_match_dict[key]
    727             for key in self.columns_match_dict

/lib/python3.8/site-packages/datacompy/spark.py in <dictcomp>(.0)
    726             key: self.columns_match_dict[key]
    727             for key in self.columns_match_dict
--> 728             if sum(self.columns_match_dict[key])
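
From the traceback, the crash happens in the dict comprehension around line 728 of spark.py, where `sum(self.columns_match_dict[key])` is evaluated. A plausible reading (an assumption, not confirmed against the datacompy source) is that when zero rows join, one of the per-column match counts comes back as `None` instead of 0, and summing a sequence that mixes `int` and `None` raises exactly this error. A minimal sketch of that assumption, using a made-up `columns_match_dict` value:

# Hypothetical data shape, assumed for illustration only; not taken from datacompy internals.
# When no rows match, a per-MatchType count is presumed to be None rather than 0.
columns_match_dict = {"tmp": (0, None, 0)}

for key in columns_match_dict:
    sum(columns_match_dict[key])
    # TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

If that reading is right, guarding the sum with something like sum(v or 0 for v in self.columns_match_dict[key]) would avoid the crash, though the real question is why the counts are None when no rows match.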