capitalone / datacompy

Pandas, Polars, and Spark DataFrame comparison for humans and more!
https://capitalone.github.io/datacompy/
Apache License 2.0

It seems `SparkCompare` object has no attribute `sample_mismatch`? #283

Closed by pangjac 3 months ago

pangjac commented 6 months ago

Hi,

I am currently using 0.8.4. For a certain column, I am trying to print a `sample_mismatch` to check which values differ for this column between two PySpark DataFrames. It seems the `SparkCompare` object has no attribute `sample_mismatch`?

I'm wondering whether this is a version issue. However, the latest documentation does not list `sample_mismatch` in the `datacompy.spark` module either.

If confirmed, could you briefly explain why this method is not inherited? If there are no specific blockers, I'd be happy to contribute by developing this method under the spark module.

Thanks for this wonderful package!

fdosani commented 6 months ago

Hey @pangjac first off thank you for supporting the package!

`sample_mismatch` doesn't exist on the `SparkCompare` class in that version of datacompy. v0.8.4 is fairly old, so I'd highly recommend upgrading if you are able to. That old version of `SparkCompare` doesn't inherit from the base class because it was built separately from it. This has been bugging me for a while, which is why we have a branch awaiting review that shifts to pandas on PySpark and deprecates the old Spark class, if you are OK using that instead.

If you look at the new implementation (which aligns better with the pandas, polars, and fugue logic), we will have that function natively for Spark.

Alternatively, I wonder if the internal DataFrame `_all_rows_mismatched` would give you what you need. You can filter on the column you are interested in, since it's just a Spark DataFrame.
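A rough sketch of that filtering idea, in case it helps. The helper names (`mismatch_column_names`, `sample_mismatch_spark`) are hypothetical and not part of datacompy. It also assumes the mismatch DataFrame stores each compared column as `<name>_base` and `<name>_compare` and keeps the join columns under their plain names; check `df.columns` on your version before relying on that.

```python
def mismatch_column_names(column, base_suffix="_base", compare_suffix="_compare"):
    """Return the ASSUMED names of the base/compare copies of `column`.

    The suffixes are an assumption about the layout of the mismatch
    DataFrame; adjust them to whatever `df.columns` actually shows.
    """
    return f"{column}{base_suffix}", f"{column}{compare_suffix}"


def sample_mismatch_spark(mismatched_df, column, join_columns, sample_count=10):
    """Hypothetical helper: up to `sample_count` rows where `column` differs.

    `mismatched_df` is any Spark DataFrame holding both versions of each
    column, for example the comparison's internal `_all_rows_mismatched`.
    """
    # Imported lazily so the module loads even without pyspark installed.
    from pyspark.sql import functions as F

    base_col, compare_col = mismatch_column_names(column)
    # eqNullSafe treats two NULLs as equal, so only real differences remain.
    differs = ~F.col(base_col).eqNullSafe(F.col(compare_col))
    return (
        mismatched_df.where(differs)
        .select(*join_columns, base_col, compare_col)
        .limit(sample_count)  # first N rows, not a random sample
    )
```

Usage would look something like `sample_mismatch_spark(comparison._all_rows_mismatched, "amount", ["id"]).show()`, where `comparison`, `"amount"`, and `"id"` are placeholders for your own `SparkCompare` object, column, and join key. Note that `limit` returns the first rows Spark finds rather than a random sample.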

fdosani commented 5 months ago

@pangjac Just wanted to follow up and see if this was solved for you? Thanks!