JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

`describe` is slow #3411

Closed jariji closed 9 months ago

jariji commented 9 months ago
julia> let xs = rand(10_000_000)
           @time describe(xs)
           end
Summary Stats:
Length:         10000000
Missing Count:  0
Mean:           0.499959
Minimum:        0.000000
1st Quartile:   0.249912
Median:         0.499909
3rd Quartile:   0.749991
Maximum:        1.000000
Type:           Float64
  2.162965 seconds (52 allocations: 76.298 MiB, 0.31% gc time)

julia> let xs = rand(10_000_000)
           @time maximum(xs)
           end
  0.006608 seconds

That `describe` call seems pretty slow to me: over 2 seconds, versus about 7 ms for `maximum` on an array of the same size.

DataFrames v1.6.1

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 3900XT 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
  Threads: 35 on 24 virtual cores

jariji commented 9 months ago

It's because `median` and `quantile` are slow, but perhaps they could be run in parallel so we don't have to wait for each of them sequentially.
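
To illustrate the idea, here is a minimal sketch using only the `Statistics` standard library. It is not how `describe` is implemented. The array size and variable names are arbitrary choices for the example. Option 1 runs the three order statistics as separate tasks, which only overlap if Julia was started with more than one thread. Option 2 requests all three probabilities in a single `quantile` call, which I believe lets one sorting pass serve all of them.

```julia
using Statistics

xs = rand(1_000_000)  # smaller than the array in the report, to keep the example quick

# Option 1: run the independent statistics as tasks.
# The non-mutating `quantile` and `median` work on a copy of `xs`,
# so the tasks only read the shared array.
t_q1  = Threads.@spawn quantile(xs, 0.25)
t_med = Threads.@spawn median(xs)
t_q3  = Threads.@spawn quantile(xs, 0.75)
q1, med, q3 = fetch(t_q1), fetch(t_med), fetch(t_q3)

# Option 2: ask for all probabilities in one call instead of three.
qs = quantile(xs, [0.25, 0.5, 0.75])
```

Both options should give the same values up to floating-point rounding. Which one is faster will depend on the thread count and the array size, so it is worth timing each with `@time` on the actual data.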

bkamins commented 9 months ago

This is an issue for StatsBase.jl:

julia> @which describe(xs)
describe(x)
     @ StatsBase ~\.julia\packages\StatsBase\WLz8A\src\scalarstats.jl:920

bkamins commented 9 months ago

I opened https://github.com/JuliaStats/StatsBase.jl/issues/912, because issues cannot be transferred across organizations.