Closed alokgogate closed 2 years ago
Thanks :-)
This behavior is expected: percentiles, the median, and quartiles are computed using Spark's approximate methods for performance, namely the `approx_percentile` SQL expression. You can tune this with the `precision` argument of the `median` method, which defaults to 0.01. If you set it to 0.0001 or smaller, the result will get closer to the exact value you got from Pandas.
For small datasets (like the Titanic example), the difference can be quite big. For really huge datasets, though, the precision should have little impact.
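To make the small-dataset effect concrete, here is a minimal pure-Python sketch. It assumes the rank-error bound that PySpark documents for `DataFrame.approxQuantile` (the returned value is an element of the data whose rank lies between `floor((p - err) * N)` and `ceil((p + err) * N)`); whether HandySpark's `precision` maps onto exactly this bound is an assumption. The helper name `acceptable_median_values` and the fare values are made up for illustration and are not the real Titanic data.

```python
import math
import statistics


def acceptable_median_values(values, precision):
    """Return the data points an approximate median is allowed to return.

    Assumes a rank-error bound: the result is an element of the data whose
    1-based rank r satisfies floor((0.5 - precision) * n) <= r
    <= ceil((0.5 + precision) * n). Hypothetical helper, not a HandySpark API.
    """
    ordered = sorted(values)
    n = len(ordered)
    lo_rank = max(1, math.floor((0.5 - precision) * n))
    hi_rank = min(n, math.ceil((0.5 + precision) * n))
    # Ranks are 1-based, list indices are 0-based.
    return ordered[lo_rank - 1:hi_rank]


# Ten illustrative values (an even count, so the exact median is an average).
fares = [7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 51.86, 21.08, 11.13, 30.07]

exact = statistics.median(fares)  # averages the two middle values, like Pandas
candidates = acceptable_median_values(fares, precision=0.01)

print("exact median:", exact)
print("values an approximate median may return:", candidates)
```

With only ten rows, the bound admits three different data points, and none of them equals the exact median, because the exact median of an even-sized column is the average of the two middle values while the approximation (under this assumption) returns an actual element of the column. As the row count grows, the admissible values sit much closer together, which is consistent with the precision mattering less on large datasets.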
Hope this helps!
Really nice work
How do I compute the correct median value of a HandySpark DataFrame? I computed the median of a column with pandas and got the correct value, but when I compute the median of the same column in the same dataset with HandySpark, I get a different value. Any clue as to why this happens?
Thanks!