dvgodoy / handyspark

HandySpark - bringing pandas-like capabilities to Spark dataframes
MIT License

How to compute the median value of a column #14

Closed alokgogate closed 2 years ago

alokgogate commented 5 years ago

Really nice work

How do I compute the correct median value of a HandySpark dataframe? When I compute the median of a column with pandas I get the correct value, but when I compute the median of the same column in the same dataset with HandySpark I get a different value. Any clue as to why this may happen?

Thanks!

dvgodoy commented 5 years ago

Thanks :-)

This behavior is expected: percentiles, the median, and quartiles are computed with Spark's approximate methods for performance, namely the approx_percentile SQL expression. You can tune this through the precision argument of the median method, which defaults to 0.01. If you set it to 0.0001 or smaller, the result will get closer to the exact value you got from pandas.

For small datasets (like the Titanic example), the difference can be quite big. For really huge datasets, though, the precision should have little impact.
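
To make the small-dataset case concrete, here is a rough sketch in plain Python (no Spark needed). It is not HandySpark's code. It assumes that precision acts like the relative rank error documented for Spark's DataFrame.approxQuantile, where the returned element has a rank between floor((p - err) * N) and ceil((p + err) * N). The helper name acceptable_median_window is hypothetical, and the ages list is made up for illustration, not taken from the Titanic data.

```python
import math
import statistics


def acceptable_median_window(values, precision):
    """Return the (lowest, highest) element an approximate median may return.

    Assumption: the approximate method returns an actual element x whose
    1-based rank r satisfies
        floor((0.5 - precision) * N) <= r <= ceil((0.5 + precision) * N)
    as documented for Spark's DataFrame.approxQuantile.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Clamp the rank bounds to the valid range 1..N.
    low_rank = max(1, math.floor((0.5 - precision) * n))
    high_rank = min(n, math.ceil((0.5 + precision) * n))
    return ordered[low_rank - 1], ordered[high_rank - 1]


# Made-up sample with an even number of values.
ages = [2, 4, 22, 24, 26, 28, 30, 35, 54, 58]

# Exact median, computed the way pandas does for an even count:
# the mean of the two middle values (26 and 28).
exact = statistics.median(ages)

# Range of elements that satisfy the rank bound at precision=0.01.
window = acceptable_median_window(ages, precision=0.01)

print("exact median:", exact)
print("approximate median may fall anywhere in:", window)
```

With only ten values, the rank bound allows any element from 24 to 28. Note also that an approximate percentile returns an element of the data, while the exact median of an even-sized sample is the average of the two middle values, so the two results can differ even when the rank is spot on.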

Hope this helps!