MrPowers / chispa

PySpark test helper methods with beautiful error messages
https://mrpowers.github.io/chispa/
MIT License

Use eqNullSafe instead of collect #8

Open rragundez opened 3 years ago

rragundez commented 3 years ago

Since Spark 2.3, PySpark has the Column method eqNullSafe. This seems like a much better way to compare columns, and it can also be used to compare DataFrames.

Advantages:

- Null values are compared correctly out of the box, since eqNullSafe treats null == null as true.
- The comparison stays in Spark, so there is no need to collect the DataFrames to the driver.

For DataFrames this would mean some sort of loop over the columns, followed by a reduce to check that every value in the resulting column is true, as sketched below. I think the change is worth it for the two reasons given above.
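The loop-and-reduce idea is easy to sketch. The following is a minimal, hypothetical implementation, not chispa's actual code: the with_row_id helper and the _row_id and _2 names are made up for illustration, and rows are paired by position via zipWithIndex.

```python
from functools import reduce

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import LongType, StructField, StructType


def with_row_id(df: DataFrame) -> DataFrame:
    """Attach a positional row id so rows can be paired across DataFrames."""
    schema = StructType(df.schema.fields + [StructField("_row_id", LongType())])
    indexed = df.rdd.zipWithIndex().map(lambda pair: (*pair[0], pair[1]))
    return df.sparkSession.createDataFrame(indexed, schema)


def are_dfs_equal_null_safe(df1: DataFrame, df2: DataFrame) -> bool:
    """Null-safe, column-by-column DataFrame comparison that stays in Spark."""
    left = with_row_id(df1)
    right = with_row_id(df2.toDF(*[c + "_2" for c in df2.columns]))
    joined = left.join(right, "_row_id", "full_outer")
    # Loop over the columns and reduce the per-column eqNullSafe results
    # into a single boolean column; null == null counts as equal.
    all_equal = reduce(
        lambda acc, c: acc & F.col(c).eqNullSafe(F.col(c + "_2")),
        df1.columns,
        F.lit(True),
    )
    # Count mismatching rows instead of collecting the whole DataFrame.
    return joined.filter(~all_equal).count() == 0


spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, None), (2, "b")], ["id", "val"])
df2 = spark.createDataFrame([(1, None), (2, "b")], ["id", "val"])
print(are_dfs_equal_null_safe(df1, df2))  # True
```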

rragundez commented 3 years ago

Certainly this will be much clearer, and the main maintenance burden will fall on PySpark itself.

MrPowers commented 3 years ago

@rragundez - thanks for creating this issue.

I can see how eqNullSafe could be useful, especially for large column comparison operations. You could do something like df.withColumn("are_cols_equal", col1.eqNullSafe(col2)), then run a filtering operation and make sure are_cols_equal is always true. I did something similar in spark-fast-tests, but I don't really use that implementation of the method. I should do some more benchmarking to see if this approach is faster.
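For reference, a rough sketch of that column-comparison pattern, assuming a DataFrame df with columns col1 and col2 (the are_cols_equal name comes from the comment above):

```python
from pyspark.sql import functions as F

# Flag each row with a null-safe equality check (null == null is True).
flagged = df.withColumn("are_cols_equal", F.col("col1").eqNullSafe(F.col("col2")))

# Filter for mismatches and fail if any exist; this avoids collecting
# the full DataFrame to the driver.
mismatch_count = flagged.filter(~F.col("are_cols_equal")).count()
assert mismatch_count == 0, f"{mismatch_count} rows have unequal columns"
```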

Is this what you're suggesting?