JuliaML / MLDataUtils.jl

Utility package for generating, loading, splitting, and processing Machine Learning datasets
http://mldatautilsjl.readthedocs.io/

WIP: center for DataFrames #31

Closed abieler closed 7 years ago

abieler commented 7 years ago

Hey, me again..

I have just (really just) started on centering for DataFrames. As last time, I figured early input would be beneficial.

Thanks! Andre

Evizero commented 7 years ago

I like the idea, but I think we need to make sure that it is done efficiently. The code looks like it is allocating a lot of temporary memory.

https://github.com/JuliaCI/BenchmarkTools.jl is a great tool for looking into performance aspects

abieler commented 7 years ago

Yes, I am aware of the memory allocation. However, it is necessary, for example, when a column of Ints has to be converted to Floats for centering and rescaling. Hence, to be on the safe side, I allocate the full column and replace the old one in the DataFrame. Do you see a way around this? Of course, this could be done only for columns of type Int...

Evizero commented 7 years ago

Well, one approach you could play with is to specialize the methods on whether the eltype has to change or not.
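
To make that suggestion concrete, here is a minimal sketch of dispatching on the column's eltype. It is written for plain vectors in current Julia syntax, and `center_column!` is a hypothetical name, not an existing MLDataUtils or DataFrames function. Float columns are centered in place; any other real eltype falls back to allocating a new floating-point vector.

```julia
# Hypothetical helper for illustration; not part of MLDataUtils or DataFrames.

# Floating-point columns can hold the centered values, so update in place.
function center_column!(col::AbstractVector{T}, mu) where {T<:AbstractFloat}
    col .-= mu                # fused in-place broadcast, no temporary array
    return col
end

# Any other real eltype (e.g. Int) cannot hold the centered values,
# so a new floating-point vector is allocated and returned instead.
function center_column!(col::AbstractVector{T}, mu) where {T<:Real}
    return float.(col) .- mu  # one allocation for the result
end

a = [1.0, 2.0, 3.0]
b = [1, 2, 3]
ca = center_column!(a, 2.0)   # same array as `a`, modified in place
cb = center_column!(b, 2.0)   # new Vector{Float64}; `b` is left untouched
```

Because the fallback method cannot mutate its argument, the caller would have to assign the returned vector back to the column in both cases. Only the Int-like columns then pay for an allocation.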

abieler commented 7 years ago

Surprisingly to me:

the lazy implementation center_a!() runs within a factor of 2 of the same computation on a plain Matrix, center_m!().

center_b!() (what I thought would be the most efficient way) takes about a factor of 100 longer.

Do you happen to know the reason for this? Otherwise I'll keep looking and ask the DataFrames people.

function center_a!(D, colname, mu)
    D[colname] = D[colname] .- mu
end

function center_b!(D, colname, mu)
    nobs = size(D, 1)
    for i in 1:nobs
        D[i, colname] = D[i, colname] - mu
    end
end

function center_m!(M, icol, mu)
    nobs = size(M, 1)
    for i in 1:nobs
        M[i, icol] = M[i, icol] - mu
    end
end

using DataFrames, BenchmarkTools

N = round(Int, 1e5)
df1 = DataFrame(A=rand(N), B=collect(1:N))
df2 = copy(df1)
M = convert(Matrix{Float64}, df1)
mu = 2.2

@benchmark center_a!(df1, :A, mu)
BenchmarkTools.Trial: 
  memory estimate:  793.84 KiB
  allocs estimate:  10
  --------------
  minimum time:     163.202 μs (0.00% GC)
  median time:      175.629 μs (0.00% GC)
  mean time:        189.919 μs (6.26% GC)
  maximum time:     636.381 μs (66.89% GC)
  --------------
  samples:          10000
  evals/sample:     1

@benchmark center_b!(df2, :A, mu)
BenchmarkTools.Trial: 
  memory estimate:  12.18 MiB
  allocs estimate:  798468
  --------------
  minimum time:     19.044 ms (0.00% GC)
  median time:      20.047 ms (3.98% GC)
  mean time:        19.851 ms (2.46% GC)
  maximum time:     23.366 ms (3.87% GC)
  --------------
  samples:          252
  evals/sample:     1

@benchmark center_m!(M, 1, mu)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     71.911 μs (0.00% GC)
  median time:      72.049 μs (0.00% GC)
  mean time:        73.159 μs (0.00% GC)
  maximum time:     135.411 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Evizero commented 7 years ago

Mhm, unsure. Probably has to do with how DataArray works.
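
One possible contributor, stated as an assumption rather than a diagnosis: when a container does not expose the concrete type of its columns to the compiler, every element access inside the loop goes through dynamic dispatch and may allocate. A common workaround is a function barrier: look the column up once, then hand it to a kernel that is compiled for its concrete type. The sketch below uses a `Dict` as a stand-in for such a table; `center_kernel!` and `center_c!` are hypothetical names, and this is not how DataFrames or DataArrays is implemented.

```julia
# Stand-in for a table whose column types are hidden from the compiler.
# An assumption for illustration only, not the DataFrames internals.
table = Dict{Symbol,AbstractVector}(:A => [1.0, 2.0, 3.0], :B => [1, 2, 3])

# Inner kernel: Julia compiles a specialized version for the concrete
# type of `col`, so the loop body itself is type-stable.
function center_kernel!(col::AbstractVector, mu)
    for i in eachindex(col)
        col[i] -= mu
    end
    return col
end

# Outer function: one untyped lookup followed by a single dynamic dispatch,
# instead of one untyped access per row.
function center_c!(tbl, colname, mu)
    center_kernel!(tbl[colname], mu)
    return tbl
end

center_c!(table, :A, 2.0)
```

Whether this closes the gap for a real DataFrame would need to be benchmarked against center_a!() and center_b!().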

One thing to watch out for in general when benchmarking is that one should always interpolate the variables to get accurate estimates. In this case, however, it doesn't seem to make a difference.

@benchmark center_a!($(copy(df1)), :A, $(mu))
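
For reference, a small self-contained sketch of the two BenchmarkTools patterns that matter for mutating functions like these. It uses a plain vector and a hypothetical `center_vec!` so that it runs without DataFrames. `$(copy(x))` makes one copy when the benchmark is defined, and that copy is reused (and repeatedly mutated) across all samples. `setup` together with `evals=1` gives every sample a fresh copy.

```julia
using BenchmarkTools

# Hypothetical plain-vector stand-in for centering a column in place.
function center_vec!(v, mu)
    v .-= mu
    return v
end

x = rand(1000)
x0 = copy(x)   # kept only to show that `x` itself is never mutated
mu = 2.2

# Interpolation: `$` splices values in, so global-variable access is not timed.
# The copy is made once and then mutated by every evaluation.
t1 = @benchmark center_vec!($(copy(x)), $mu)

# Fresh input per sample: `setup` runs before each sample and `evals=1`
# ensures each copy is used for exactly one evaluation.
t2 = @benchmark center_vec!(y, $mu) setup=(y = copy($x)) evals=1
```

For centering, the timing does not depend on the values, so both forms should give similar numbers. The `setup` form matters more for functions such as `sort!`, whose cost depends on the state of the input.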