MrPowers / chispa

PySpark test helper methods with beautiful error messages
https://mrpowers.github.io/chispa/
MIT License

Use eqNullSafe instead of collect #8

Open rragundez opened 3 years ago

rragundez commented 3 years ago

Since Spark 2.3, PySpark has the Column method eqNullSafe. This seems like a much better way to compare columns, and it can also be used to compare DataFrames.

Advantages:

- Null values are compared correctly out of the box, since eqNullSafe treats null == null as true.
- The comparison stays in Spark, so there is no need to collect the DataFrames to the driver.

For DataFrames this would mean some sort of loop over the columns, followed by a reduce to check that every value in the resulting column is true, as sketched below. I think the change is worth it for the two reasons given above.
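The loop-and-reduce idea is easy to sketch. The following is a minimal, hypothetical implementation, not chispa's actual code: the with_row_id helper and the _row_id and _2 names are made up for illustration, and rows are paired by position via zipWithIndex.

```python
from functools import reduce

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import LongType, StructField, StructType


def with_row_id(df: DataFrame) -> DataFrame:
    """Attach a positional row id so rows can be paired across DataFrames."""
    schema = StructType(df.schema.fields + [StructField("_row_id", LongType())])
    indexed = df.rdd.zipWithIndex().map(lambda pair: (*pair[0], pair[1]))
    return df.sparkSession.createDataFrame(indexed, schema)


def are_dfs_equal_null_safe(df1: DataFrame, df2: DataFrame) -> bool:
    """Null-safe, column-by-column DataFrame comparison that stays in Spark."""
    left = with_row_id(df1)
    right = with_row_id(df2.toDF(*[c + "_2" for c in df2.columns]))
    joined = left.join(right, "_row_id", "full_outer")
    # Loop over the columns and reduce the per-column eqNullSafe results
    # into a single boolean column; null == null counts as equal.
    all_equal = reduce(
        lambda acc, c: acc & F.col(c).eqNullSafe(F.col(c + "_2")),
        df1.columns,
        F.lit(True),
    )
    # Count mismatching rows instead of collecting the whole DataFrame.
    return joined.filter(~all_equal).count() == 0


spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, None), (2, "b")], ["id", "val"])
df2 = spark.createDataFrame([(1, None), (2, "b")], ["id", "val"])
print(are_dfs_equal_null_safe(df1, df2))  # True
```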

rragundez commented 3 years ago

Certainly this will be much clearer, and the main maintenance burden will fall on PySpark itself.

MrPowers commented 3 years ago

@rragundez - thanks for creating this issue.

I can see how eqNullSafe could be useful, especially for large column comparison operations. You could do something like df.withColumn("are_cols_equal", col1.eqNullSafe(col2)), then run a filtering operation and make sure are_cols_equal is always true. I did something similar in spark-fast-tests, but I don't really use that implementation of the method. I should do some more benchmarking to see if this approach is faster.
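For reference, a rough sketch of that column-comparison pattern, assuming a DataFrame df with columns col1 and col2 (the are_cols_equal name comes from the comment above):

```python
from pyspark.sql import functions as F

# Flag each row with a null-safe equality check (null == null is True).
flagged = df.withColumn("are_cols_equal", F.col("col1").eqNullSafe(F.col("col2")))

# Filter for mismatches and fail if any exist; this avoids collecting
# the full DataFrame to the driver.
mismatch_count = flagged.filter(~F.col("are_cols_equal")).count()
assert mismatch_count == 0, f"{mismatch_count} rows have unequal columns"
```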

Is this what you're suggesting?