How to compute the median value of a column

dvgodoy / handyspark

HandySpark - bringing pandas-like capabilities to Spark dataframes

MIT License


Thanks :-)

This behavior is expected: percentiles, the median, and quartiles are computed with Spark's approximate methods for performance, namely the approx_percentile SQL expression. You can tune the accuracy through the precision argument of the median method, which defaults to 0.01. If you set it to 0.0001 or smaller, the result will get closer to the exact value you got from pandas.

For small datasets (like the Titanic example), the difference can be quite big. For really huge datasets, though, the precision should have little impact.
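
To get a feel for what the precision setting allows, here is a minimal pure-Python sketch, not HandySpark code. It assumes precision behaves like the relative error documented for PySpark's DataFrame.approxQuantile, where the returned value's exact rank falls between floor((p - err) * N) and ceil((p + err) * N). The function name rank_window and the row count of 891 are made up for illustration.

```python
import math


def rank_window(n_rows, quantile, relative_error):
    """Return the (lowest, highest) rank an approximate quantile may come from.

    Assumption: the bound is the one documented for PySpark's
    DataFrame.approxQuantile; HandySpark's precision is assumed to play
    the role of relative_error here.
    """
    low = math.floor((quantile - relative_error) * n_rows)
    high = math.ceil((quantile + relative_error) * n_rows)
    # Clamp to the valid range of ranks.
    return max(low, 0), min(high, n_rows)


# Illustrative row count only; compare the default precision with a tighter one.
for err in (0.01, 0.0001):
    low, high = rank_window(891, 0.5, err)
    print(f"precision={err}: median may come from ranks {low}..{high}")
```

Under that assumption, a tighter precision shrinks the window of sorted positions the reported median can come from, which is why the result moves toward the exact value.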

Hope this helps!

How to compute the median value of a column #14