Open lampretl opened 2 months ago
This has nothing to do with DataFrames. You want a median algorithm that runs in O(n), for example one built on the median-of-medians idea. The implementation in Statistics.jl does not seem to be O(n), but I could be wrong.
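To illustrate what a linear-time median looks like, here is a minimal sketch in Python. It uses randomized quickselect, which runs in expected O(n) time; it is not the median-of-medians algorithm, which picks pivots deterministically to guarantee O(n) in the worst case. The names `quickselect` and `median` are hypothetical helpers written for this example, not functions from any library mentioned in this thread.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element (0-based) in expected O(n) time.

    Hypothetical helper for illustration; works on a copy of the input.
    """
    items = list(values)
    lo, hi = 0, len(items) - 1
    while True:
        if lo == hi:
            return items[lo]
        pivot = items[random.randint(lo, hi)]
        # Three-way partition of items[lo..hi] around the pivot value.
        lt, i, gt = lo, lo, hi
        while i <= gt:
            if items[i] < pivot:
                items[lt], items[i] = items[i], items[lt]
                lt += 1
                i += 1
            elif items[i] > pivot:
                items[i], items[gt] = items[gt], items[i]
                gt -= 1
            else:
                i += 1
        # Now items[lo:lt] < pivot, items[lt:gt+1] == pivot, items[gt+1:hi+1] > pivot.
        if k < lt:
            hi = lt - 1      # answer lies in the "smaller" part
        elif k > gt:
            lo = gt + 1      # answer lies in the "larger" part
        else:
            return pivot     # k falls inside the block equal to the pivot


def median(values):
    """Median via selection instead of a full sort."""
    n = len(values)
    if n == 0:
        raise ValueError("median of empty sequence")
    if n % 2 == 1:
        return quickselect(values, n // 2)
    # Even length: average the two middle order statistics.
    return (quickselect(values, n // 2 - 1) + quickselect(values, n // 2)) / 2
```

For example, `median([5, 1, 4, 2, 3])` selects the element at index 2 of the sorted order without sorting the whole list.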
@stensmo I was hoping for a function from DataFrames.jl that would be comparable in performance to the polars one. For a new user migrating from Python to Julia, what is the equivalent or recommended way of obtaining quantiles?
In Julia DataFrames, you can apply (almost) any function, including your own. The median function does not belong to DataFrames; it is a standard function in Julia. Writing your own functions and applying them to a DataFrame is super easy in Julia. That is why you will love it, but it takes some time to get used to. You can apply the standard Julia median function to a DataFrame, or a superfast implementation that someone else wrote, or one you write yourself.
@nalimilan - this issue should be migrated to Statistics.jl but I do not have privileges to do so. Could you please do it? Thank you!
I can't either. Apparently that's only possible between repos of the same org.
Anyway, it seems this is already a known problem, and you had even made a PR for it? https://github.com/JuliaStats/Statistics.jl/pull/91
EDIT: specifically, computing the IQR seems very similar to https://github.com/JuliaStats/Statistics.jl/issues/84
I'd like to efficiently and in parallel compute the median = q_0.5 and IQR = q_0.75 - q_0.25 of each column in a dataframe. Let's compare the 3 most used libraries:
pandas:
polars:
DataFrames.jl + Julia:
Is there a better, more efficient way to compute medians and IQRs in Julia?
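To make the quantities being compared concrete, here is a minimal sketch in plain Python of computing the median and IQR of every column, with columns handled concurrently. It is not any of the three snippets compared above and not a benchmark candidate: it uses only the standard library, sorts each column (so O(n log n) per column), and the thread pool gives little real speedup for pure-Python work. The names `median_and_iqr` and `summarize_columns` and the dict-of-lists table layout are assumptions made for illustration; linear interpolation between order statistics (`method="inclusive"`) is also an assumed quantile definition.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def median_and_iqr(values):
    """Return (median, IQR) where IQR = q_0.75 - q_0.25.

    quantiles(..., n=4) yields the three quartile cut points; the
    "inclusive" method interpolates linearly between order statistics.
    """
    q25, q50, q75 = quantiles(values, n=4, method="inclusive")
    return q50, q75 - q25


def summarize_columns(columns, max_workers=None):
    """Map each column name to its (median, IQR), one task per column.

    `columns` is a hypothetical dict of column name -> list of numbers.
    """
    names = list(columns)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(median_and_iqr, (columns[name] for name in names)))
    return dict(zip(names, results))


if __name__ == "__main__":
    table = {"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40]}
    print(summarize_columns(table))
```

The one-task-per-column structure is the part that carries over to the dataframe libraries; the per-column kernel is what a selection-based quantile would replace.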