JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

performance of `median` and `iqr` compared to python libraries #3462

Open lampretl opened 2 months ago

lampretl commented 2 months ago

I'd like to efficiently and in parallel compute the median = q_0.5 and IQR = q_0.75 - q_0.25 of each column in a dataframe. Let's compare the 3 most used libraries:

pandas:

import numpy as np, pandas as pd, scipy
n,m=10**8,10;   df = pd.DataFrame(np.random.rand(n,m))
%time df.median(axis=0)
%time df.quantile(0.5)
%time df.quantile(0.75)-df.quantile(0.25)
%time scipy.stats.iqr(df,axis=0)
CPU times: user 23.4 s, sys: 921 ms, total: 24.4 s
Wall time: 24.4 s
CPU times: user 20.3 s, sys: 830 ms, total: 21.1 s
Wall time: 21.2 s
CPU times: user 39.9 s, sys: 1.71 s, total: 41.6 s
Wall time: 41.6 s
CPU times: user 25.6 s, sys: 5.28 s, total: 30.9 s
Wall time: 31 s

polars:

import numpy as np, polars as pl
n,m=10**8,10;   df = pl.DataFrame(np.random.rand(n,m), schema=[f"x{k}" for k in range(m)])
%time df.median()
%time df.quantile(0.75,interpolation='linear')
%time df.quantile(0.75,interpolation='linear') - df.quantile(0.25,interpolation='linear')
CPU times: user 21.4 s, sys: 3.51 s, total: 24.9 s
Wall time: 2.95 s
CPU times: user 19.2 s, sys: 3.86 s, total: 23.1 s
Wall time: 2.95 s
CPU times: user 43.8 s, sys: 11.4 s, total: 55.2 s
Wall time: 6.44 s

DataFrames.jl + Julia:

using DataFrames, StatsBase
n,m=10^8,10;   df = DataFrame(rand(n,m), :auto);
function f1(df::DataFrame) ::Vector{Float64}  return map(median, eachcol(df)) end
function f2(df::DataFrame) ::Vector{Float64}  return map(iqr, eachcol(df)) end
function f3(df::DataFrame) ::Vector{Float64}  m=size(df,2);  res=fill(NaN,m);  Threads.@threads for j=1:m res[j] = median(df[:,j]) end; return res end
function f4(df::DataFrame) ::Vector{Float64}  m=size(df,2);  res=fill(NaN,m);  Threads.@threads for j=1:m res[j] = iqr(df[:,j]) end; return res end
@time f1(df);
@time f2(df);
@time f3(df);
@time f4(df);
14.686185 seconds (53 allocations: 14.901 GiB, 4.56% gc time)
86.758428 seconds (53 allocations: 7.451 GiB, 0.36% gc time)
8.259288 seconds (146 allocations: 22.352 GiB, 9.15% gc time)
50.395623 seconds (144 allocations: 14.901 GiB, 0.47% gc time)

Is there a better, more efficient way to compute medians and IQRs in Julia?

stensmo commented 2 months ago

This has nothing to do with DataFrames. You want an algo for median which runs in O(n). You need an algo which uses the median of medians concept. The implementation you use in Statistics.jl does not seem to be O(n), but I could be incorrect.
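
The O(n) selection idea mentioned above can be sketched in Python with NumPy, whose `np.partition` uses introselect (linear worst case) rather than a full sort. This is only an illustration of the technique, not what Statistics.jl, pandas, or polars do internally. The function names are hypothetical, and the sketch assumes a non-empty 1-D column with no NaN values.

```python
import numpy as np

def median_by_selection(col):
    """Median of a 1-D array using selection instead of a full sort."""
    a = np.asarray(col, dtype=float)
    n = a.size
    lo, hi = (n - 1) // 2, n // 2        # the two middle ranks (equal when n is odd)
    part = np.partition(a, [lo, hi])     # only ranks lo and hi are guaranteed in place
    return 0.5 * (part[lo] + part[hi])

def quantiles_by_selection(col, qs):
    """Linearly interpolated quantiles from only the order statistics needed."""
    a = np.asarray(col, dtype=float)
    h = (a.size - 1) * np.asarray(qs, dtype=float)   # fractional ranks
    lo = np.floor(h).astype(int)
    hi = np.ceil(h).astype(int)
    # One partition call places every required rank at its sorted position.
    part = np.partition(a, np.unique(np.concatenate([lo, hi])))
    return part[lo] + (h - lo) * (part[hi] - part[lo])

def iqr_by_selection(col):
    """IQR = q_0.75 - q_0.25, with both quartiles taken from one partition."""
    q25, q75 = quantiles_by_selection(col, [0.25, 0.75])
    return q75 - q25
```

Applied per column, e.g. `[iqr_by_selection(df[c].to_numpy()) for c in df.columns]` for a pandas frame, this computes both quartiles from a single partitioned copy instead of calling a quantile routine twice.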

lampretl commented 2 months ago

@stensmo I was hoping for a function from DataFrames.jl that would be comparable in performance to polars one. For a new user, migrating from Python to Julia, what is the equivalent or recommended way of obtaining quantiles?

stensmo commented 2 months ago

In Julia DataFrames, you can apply (almost) any function, including your own. The median function does not belong to DataFrames, but it is a standard function in Julia. Writing your own functions and applying them to a DataFrame is super easy in Julia. That is why you will love it, but it takes some time to get used to. You can apply the standard Julia median function to a DataFrame, a superfast implementation that someone else wrote, or one you write yourself.

bkamins commented 2 months ago

@nalimilan - this issue should be migrated to Statistics.jl but I do not have privileges to do so. Could you please do it? Thank you!

nalimilan commented 2 months ago

I can't either. Apparently that's only possible between repos of the same org.

Anyway it seems this is already a known problem, and you had even made a PR for it? https://github.com/JuliaStats/Statistics.jl/pull/91

EDIT: specifically, computing the IQR seems very similar to https://github.com/JuliaStats/Statistics.jl/issues/84