JuliaData / MemPool.jl

High-performance parallel and distributed datastore for Julia

Very slow approx_size for DataFrames #48

Open DrChainsaw opened 3 years ago

DrChainsaw commented 3 years ago

When benchmarking a parallel application that uses Dagger, MemPool.approx_size appears to be the bottleneck because it falls back to Base.summarysize for DataFrames.

Here is a quick MWE:

julia> using BenchmarkTools, DataFrames, MemPool

julia> df = DataFrame(a=1:1_000_000, b=randn(1_000_000), c=repeat([:aa], 1_000_000));

julia> @benchmark MemPool.approx_size($df)
BenchmarkTools.Trial: 
  memory estimate:  61.03 MiB
  allocs estimate:  1999540
  --------------
  minimum time:     110.895 ms (4.59% GC)
  median time:      119.604 ms (2.47% GC)
  mean time:        122.978 ms (2.83% GC)
  maximum time:     146.009 ms (1.46% GC)
  --------------
  samples:          41
  evals/sample:     1

Here is a sketch of an alternative implementation which is much faster:

julia> function MemPool.approx_size(df::DataFrame)
       dsize = mapreduce(MemPool.approx_size, +, eachcol(df))
       namesize = mapreduce(MemPool.approx_size, +, names(df))
       return dsize + namesize
       end

julia> @benchmark MemPool.approx_size($df)
BenchmarkTools.Trial: 
  memory estimate:  704 bytes
  allocs estimate:  13
  --------------
  minimum time:     535.700 μs (0.00% GC)
  median time:      636.800 μs (0.00% GC)
  mean time:        664.967 μs (0.00% GC)
  maximum time:     1.525 ms (0.00% GC)
  --------------
  samples:          7499
  evals/sample:     1

The above implementation is not 100% correct, but I hope it shows that there is some potential for improvement.
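
To illustrate why summing per-column estimates can be so much cheaper, here is a minimal sketch using only Base. The helper name `fast_colsize` is hypothetical and this is not MemPool's actual implementation. It assumes that for a vector of bits-type elements the payload size can be computed arithmetically, while other element types still need the generic `Base.summarysize` traversal.

```julia
# Hypothetical helper for illustration only; not part of MemPool.
# Estimates the payload size of a column in bytes.
function fast_colsize(col::Vector)
    T = eltype(col)
    if isbitstype(T)
        # Elements are stored inline, so the payload is element size times length.
        # This ignores the small array header.
        return sizeof(T) * length(col)
    else
        # Elements are references (e.g. Symbol), so fall back to the generic traversal.
        return Base.summarysize(col)
    end
end

ints = collect(1:1_000)   # Vector{Int}: takes the arithmetic branch
syms = fill(:aa, 1_000)   # Vector{Symbol}: takes the fallback branch

fast_colsize(ints)
fast_colsize(syms)
```

The arithmetic branch does no per-element work, which is where most of the saving would come from for numeric columns.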

I don't know whether there is an interface, e.g. Tables.jl, that could be used to avoid a direct dependency on DataFrames.
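
One possible shape for a Tables.jl-based version is sketched below. The function name `table_approx_size` and its `colsize` keyword are hypothetical, and nothing here is existing MemPool API. It relies only on `Tables.istable`, `Tables.columns`, `Tables.columnnames` and `Tables.getcolumn`. The per-column estimator defaults to `Base.summarysize`; in MemPool one would presumably pass `MemPool.approx_size` instead.

```julia
using Tables

# Hypothetical sketch; not part of MemPool or Tables.jl.
# Sums a per-column size estimate over any Tables.jl-compatible table.
function table_approx_size(tbl; colsize = Base.summarysize)
    Tables.istable(tbl) || throw(ArgumentError("expected a Tables.jl-compatible table"))
    cols = Tables.columns(tbl)
    total = 0
    for name in Tables.columnnames(cols)
        # Size estimate for the column data.
        total += colsize(Tables.getcolumn(cols, name))
        # Column names are Symbols; count the bytes of their string form.
        total += sizeof(String(name))
    end
    return total
end

# A NamedTuple of vectors is a valid Tables.jl column table.
tbl = (a = collect(1:1_000), b = randn(1_000))

# `sizeof` is used here as a cheap estimator for bits-type vectors.
table_approx_size(tbl; colsize = sizeof)
```

Like the DataFrame-specific sketch, this ignores per-object overhead and any memory shared between columns, so it is an approximation rather than an exact replacement for `Base.summarysize`.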