Open · rragundez opened this issue 3 years ago
Since Spark 2.3 there is the PySpark function `eqNullSafe`. This seems a much better way to compare columns, and it can also be used to compare DataFrames.

Advantages:

1. The code will certainly be much clearer.
2. The main maintenance will be on PySpark itself.

For DataFrames it would mean some sort of loop over the columns, followed by a reduce to check that all members of the resulting boolean column are true. I think the change is worth it for the two reasons given above.
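A minimal sketch of that loop-and-reduce idea, assuming both DataFrames share the same schema and a unique join key (called `id` here); the function name, the key, and the `_r` suffix are just for illustration:

```python
from functools import reduce
from operator import and_
from pyspark.sql import DataFrame, functions as F

def dataframes_null_safe_equal(df1: DataFrame, df2: DataFrame, key: str = "id") -> bool:
    value_cols = [c for c in df1.columns if c != key]

    # Suffix the right-hand columns so the joined DataFrame has unique names
    right = df2.select(
        [F.col(key)] + [F.col(c).alias(c + "_r") for c in value_cols]
    )
    joined = df1.join(right, on=key, how="full_outer")

    # Loop over the columns and reduce the per-column null-safe
    # comparisons into a single boolean column
    all_equal = reduce(
        and_,
        [F.col(c).eqNullSafe(F.col(c + "_r")) for c in value_cols],
    )

    # The DataFrames match if no joined row fails the combined condition
    return joined.filter(~all_equal).count() == 0
```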
@rragundez - thanks for creating this issue.
I could see how `eqNullSafe` could be useful, especially for large column comparison operations. You could do something like `df.withColumn("are_cols_equal", col1.eqNullSafe(col2))`, then run a filtering operation and make sure `are_cols_equal` is always `true`. I did something similar to this in spark-fast-tests, but don't actually use this implementation of the method. I should do some more benchmarking to see if this approach is faster.
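Roughly, a sketch of that column-level pattern (the column names and sample data are just for illustration):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 1), (None, None), (2, 3)],
    ["col1", "col2"],
)

# eqNullSafe treats NULL <=> NULL as true, unlike a plain == comparison,
# which evaluates NULL == NULL to NULL
checked = df.withColumn(
    "are_cols_equal", F.col("col1").eqNullSafe(F.col("col2"))
)

# Filter for rows where the columns differ; only the (2, 3) row fails
checked.filter(~F.col("are_cols_equal")).show()
```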
Is this what you're suggesting?