Just going to add a note here for future, currently seeing a small difference in pandas vs spark report sample rows when there are rows only in one dataframe. #288
Adding a note here for the future: the pandas and Spark reports currently differ slightly in their sample rows when some rows exist in only one dataframe.
The Spark report includes an additional _merge_right column that is not in the original dataframes, which could confuse users.
The Spark report also displays the column names as their aliases (id_df2, a_df2, b_df2) rather than their original names, which could also be confusing. It would be best to translate them back to the original names.
Not a blocker for this, but we should open a follow-up issue to track it.
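As a rough illustration of the translation suggested above, here is a minimal sketch in plain Python. It is not DataComPy's actual implementation: the helper name `restore_column_names`, the `_df2` suffix default, and the idea of treating `_merge_right` as a droppable helper column are all assumptions taken from the Spark sample output in this issue.

```python
def restore_column_names(columns, suffix="_df2", helper_columns=("_merge_right",)):
    """Hypothetical helper, not part of DataComPy.

    Drops internal helper columns and strips the alias suffix so the
    remaining names match the original dataframe.
    Returns (kept, renamed): the aliased columns to keep, and the
    original names they map back to, in the same order.
    """
    # Remove columns that only exist because of the internal join.
    kept = [c for c in columns if c not in helper_columns]
    # Strip the alias suffix; leave names without the suffix untouched.
    renamed = [
        c[: -len(suffix)] if suffix and c.endswith(suffix) else c
        for c in kept
    ]
    return kept, renamed


# Column names as they appear in the Spark "Sample Rows Only in df2" output.
spark_sample_columns = ["id_df2", "a_df2", "b_df2", "_merge_right"]
kept, renamed = restore_column_names(spark_sample_columns)
print(renamed)  # ['id', 'a', 'b']
```

With a real dataframe, `kept` would be used to select the columns and `dict(zip(kept, renamed))` to rename them before the sample rows are written to the report. One caveat: an original column whose name already ends in the suffix would be renamed incorrectly, so a real fix would more likely track the alias mapping explicitly than infer it from the suffix.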
Spark
DataComPy Comparison
--------------------
DataFrame Summary
-----------------
  DataFrame  Columns  Rows
0       df1        3     5
1       df2        3     6
Column Summary
--------------
Number of columns in common: 3
Number of columns in df1 but not in df2: 0
Number of columns in df2 but not in df1: 0
Row Summary
-----------
Matched on: id
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in df1 but not in df2: 0
Number of rows in df2 but not in df1: 1
Number of rows with some compared columns unequal: 0
Number of rows with all compared columns equal: 5
Column Comparison
-----------------
Number of columns compared with some values unequal: 0
Number of columns compared with all values equal: 3
Total number of values which compare unequal: 0
Columns with Unequal Values or Types
------------------------------------
  Column df1 dtype df2 dtype  # Unequal  Max Diff  # Null Diff
0      a     int64   float64          0       0.0            0
Sample Rows with Unequal Values
-------------------------------
Sample Rows Only in df2 (First 10 Columns)
------------------------------------------
   id_df2  a_df2 b_df2  _merge_right
5       6    NaN  None          True
Pandas
DataComPy Comparison
--------------------
DataFrame Summary
-----------------
  DataFrame  Columns  Rows
0       df1        3     5
1       df2        3     6
Column Summary
--------------
Number of columns in common: 3
Number of columns in df1 but not in df2: 0
Number of columns in df2 but not in df1: 0
Row Summary
-----------
Matched on: id
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in df1 but not in df2: 0
Number of rows in df2 but not in df1: 1
Number of rows with some compared columns unequal: 0
Number of rows with all compared columns equal: 5
Column Comparison
-----------------
Number of columns compared with some values unequal: 0
Number of columns compared with all values equal: 3
Total number of values which compare unequal: 0
Columns with Unequal Values or Types
------------------------------------
  Column df1 dtype df2 dtype  # Unequal  Max Diff  # Null Diff
0      a     int64   float64          0       0.0            0
Sample Rows with Unequal Values
-------------------------------
Sample Rows Only in df2 (First 10 Columns)
------------------------------------------
   id   a     b
0   6 NaN  <NA>
Originally posted by @jdawang in https://github.com/capitalone/datacompy/pull/275#pullrequestreview-1958520393