capitalone / datacompy

Pandas, Polars, and Spark DataFrame comparison for humans and more!
https://capitalone.github.io/datacompy/
Apache License 2.0
420 stars 124 forks source link

Just going to add a note here for future, currently seeing a small difference in pandas vs spark report sample rows when there are rows only in one dataframe. #288

Open fdosani opened 3 months ago

fdosani commented 3 months ago

Just going to add a note here for future, currently seeing a small difference in pandas vs spark report sample rows when there are rows only in one dataframe.

  1. There is an additional _merge_right column which is not in the original dataframes, which could cause a bit of confusion for users.
  2. We're displaying the column names as their aliases, which could also be a bit confusing. It would be best to translate them back to their original names.

Not a blocker for this, but we should open a follow-up issue to keep track of this.

import pandas as pd
import pyspark.pandas as ps
pdf1 = pd.DataFrame.from_dict({"id": [1,2,3,4,5], "a": [2,3,2,3, 2], "b": ["a", "b", "c", "d", ""]})
pdf2 = pd.DataFrame.from_dict({"id": [1,2,3,4,5, 6], "a": [2,3,2,3, 2, np.nan], "b": ["a", "b", "c", "d", "", pd.NA]})
df1 = ps.DataFrame(pdf1)
df2 = ps.DataFrame(pdf2)

Spark

DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0       df1        3     5
1       df2        3     6

Column Summary
--------------

Number of columns in common: 3
Number of columns in df1 but not in df2: 0
Number of columns in df2 but not in df1: 0

Row Summary
-----------

Matched on: id
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in df1 but not in df2: 0
Number of rows in df2 but not in df1: 1

Number of rows with some compared columns unequal: 0
Number of rows with all compared columns equal: 5

Column Comparison
-----------------

Number of columns compared with some values unequal: 0
Number of columns compared with all values equal: 3
Total number of values which compare unequal: 0

Columns with Unequal Values or Types
------------------------------------

  Column df1 dtype df2 dtype  # Unequal  Max Diff  # Null Diff
0      a     int64   float64          0       0.0            0

Sample Rows with Unequal Values
-------------------------------

Sample Rows Only in df2 (First 10 Columns)
------------------------------------------

   id_df2  a_df2 b_df2  _merge_right
5       6    NaN  None          True

Pandas

DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0       df1        3     5
1       df2        3     6

Column Summary
--------------

Number of columns in common: 3
Number of columns in df1 but not in df2: 0
Number of columns in df2 but not in df1: 0

Row Summary
-----------

Matched on: id
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in df1 but not in df2: 0
Number of rows in df2 but not in df1: 1

Number of rows with some compared columns unequal: 0
Number of rows with all compared columns equal: 5

Column Comparison
-----------------

Number of columns compared with some values unequal: 0
Number of columns compared with all values equal: 3
Total number of values which compare unequal: 0

Columns with Unequal Values or Types
------------------------------------

  Column df1 dtype df2 dtype  # Unequal  Max Diff  # Null Diff
0      a     int64   float64          0       0.0            0

Sample Rows with Unequal Values
-------------------------------

Sample Rows Only in df2 (First 10 Columns)
------------------------------------------

   id   a     b
0   6 NaN  <NA>

Originally posted by @jdawang in https://github.com/capitalone/datacompy/pull/275#pullrequestreview-1958520393