Closed abieler closed 7 years ago
I like the idea, but I think we need to make sure that it is done efficiently. The code looks like it is allocating a lot of temporary memory.
https://github.com/JuliaCI/BenchmarkTools.jl is a great tool for investigating performance.
Yes, I am aware of the memory allocation. However, it is necessary when, for example, a column of Ints has to be converted to Floats for centering and rescaling. Hence, to be on the safe side, I allocate the full column and replace the old one in the DataFrame. Do you see a way around this? Of course, this could be done only for columns of type Int...
Well, one approach you could play with is to specialize the methods on whether the eltype has to change or not.
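A minimal sketch of that suggestion, using plain vectors as stand-ins for DataFrame columns. The function name `centered` is hypothetical and not part of this PR; the point is only that dispatch on the element type lets float columns be updated in place while integer columns get a new Float64 vector.

```julia
# Hypothetical helper, not from the PR: dispatch decides whether the eltype must change.

# Float columns can hold the centered values, so update in place (no new array).
centered(x::AbstractVector{<:AbstractFloat}, mu) = (x .-= mu; x)

# Any other real eltype (e.g. Int) cannot hold fractional values,
# so convert to floating point and allocate a new vector.
centered(x::AbstractVector{<:Real}, mu) = float.(x) .- mu

a = [1.0, 2.0, 3.0]
b = [1, 2, 3]

ca = centered(a, 2.0)   # same array as `a`, mutated in place
cb = centered(b, 2.0)   # new Vector{Float64}; `b` is left untouched
```

With this split, the caller would only need to replace the column in the DataFrame when the returned vector is not the original one.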
Surprisingly to me:
the lazy implementation center_a!() performs within a factor of 2 of the computation on a plain Matrix, center_m!().
center_b!() (which I thought would be the most efficient way) takes about a factor of 100 longer.
Do you happen to know the reason for this? Otherwise I'll keep looking and ask the DataFrames people.
function center_a!(D, colname, mu)
    D[colname] = D[colname] .- mu
end
function center_b!(D, colname, mu)
    nobs = size(D, 1)
    for i in 1:nobs
        D[i, colname] = D[i, colname] - mu
    end
end
function center_m!(M, icol, mu)
    nobs = size(M, 1)
    for i in 1:nobs
        M[i, icol] = M[i, icol] - mu
    end
end
using BenchmarkTools, DataFrames
N = round(Int, 1e5)
df1 = DataFrame(A=rand(N), B=collect(1:N))
df2 = copy(df1)
M = convert(Matrix{Float64}, df1)
mu = 2.2
@benchmark center_a!(df1, :A, mu)
BenchmarkTools.Trial:
memory estimate: 793.84 KiB
allocs estimate: 10
--------------
minimum time: 163.202 μs (0.00% GC)
median time: 175.629 μs (0.00% GC)
mean time: 189.919 μs (6.26% GC)
maximum time: 636.381 μs (66.89% GC)
--------------
samples: 10000
evals/sample: 1
@benchmark center_b!(df2, :A, mu)
BenchmarkTools.Trial:
memory estimate: 12.18 MiB
allocs estimate: 798468
--------------
minimum time: 19.044 ms (0.00% GC)
median time: 20.047 ms (3.98% GC)
mean time: 19.851 ms (2.46% GC)
maximum time: 23.366 ms (3.87% GC)
--------------
samples: 252
evals/sample: 1
@benchmark center_m!(M, 1, mu)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 71.911 μs (0.00% GC)
median time: 72.049 μs (0.00% GC)
mean time: 73.159 μs (0.00% GC)
maximum time: 135.411 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Mhm, unsure. Probably has to do with how DataArray works.
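One plausible cause, not confirmed in this thread, is that the compiler cannot know the element type of a column looked up by name, so every `D[i, colname]` access in the loop is resolved dynamically. A common workaround is a function barrier: fetch the column once, then hand it to a kernel that is compiled for its concrete type. The sketch below uses a `Dict{Symbol,Any}` as a stand-in for a table with hidden column types; `center_kernel!` and `cols` are hypothetical names.

```julia
# Kernel: compiled separately for each concrete vector type it receives,
# so the loop body is type-stable.
function center_kernel!(col::AbstractVector, mu)
    for i in eachindex(col)
        col[i] -= mu
    end
    return col
end

# Stand-in for a table: the value type `Any` hides each column's eltype,
# much like a column lookup by name does.
cols = Dict{Symbol,Any}(:A => [1.0, 2.0, 3.0], :B => [1, 2, 3])

# One dynamic lookup, then the fast loop runs inside the kernel.
center_kernel!(cols[:A], 2.0)
```

Whether this closes the gap to center_m!() on an actual DataFrame would still need to be benchmarked.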
One thing to watch out for in general when benchmarking is that one should always interpolate the variables to get accurate estimates. In this case, however, it doesn't seem to make a difference:
@benchmark center_a!($(copy(df1)), :A, $(mu))
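For reference, a small sketch of both points on a plain vector. The helper `center!` and the variables `x`, `t_sum`, and `t_center` are made up for illustration. `$` interpolates global variables into the benchmark expression, and for mutating functions BenchmarkTools' `setup` keyword together with `evals=1` gives every sample a fresh copy instead of re-centering the same data over and over.

```julia
using BenchmarkTools

# Hypothetical in-place centering helper for a float vector.
center!(x, mu) = (x .-= mu; x)

x = rand(1000)
mu = 2.2

# Non-mutating expression: interpolating the globals is enough.
t_sum = @benchmark sum($x .- $mu)

# Mutating function: build a fresh copy in `setup` and evaluate once per sample,
# so each measurement starts from the same input.
t_center = @benchmark center!(y, $mu) setup=(y = copy($x)) evals=1
```

The copy made in `setup` is not included in the measured time.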
Hey, me again..
Just (really just) started on centering for DataFrames. As last time I figured early input is beneficial.
Thanks! Andre