gdkrmr / WeightedOnlineStats.jl

Weighted version of OnlineStats.jl
MIT License

Fit on WeightedCovMatrix converts data to Vector{T} #30

Closed annuges closed 4 years ago

annuges commented 4 years ago

I noticed that fitting an SVector to a WeightedCovMatrix allocates, while using a normal Vector does not.

using WeightedOnlineStats, StaticArrays, BenchmarkTools
s=WeightedCovMatrix()
@benchmark fit!($s,v,w) setup=(v=rand(3);w=rand())

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     18.736 ns (0.00% GC)
  median time:      18.937 ns (0.00% GC)
  mean time:        18.991 ns (0.00% GC)
  maximum time:     44.889 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

s=WeightedCovMatrix()
@benchmark fit!($s,v,w) setup=(v=rand(SVector{3,Float64});w=rand())

BenchmarkTools.Trial:
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     42.181 ns (0.00% GC)
  median time:      43.492 ns (0.00% GC)
  mean time:        54.512 ns (16.53% GC)
  maximum time:     35.818 μs (99.83% GC)
  --------------
  samples:          10000
  evals/sample:     991

Looking at the source code, I see that there is an explicit

xx = convert(Vector{T}, x)

which forces the SVector into a dynamic array, leading to an allocation. Perhaps there is a better way to do this? In my local code I have simply removed the convert to recover the performance, but I'm not sure whether letting the downstream code handle any type mismatches is sufficient for all use cases.
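For reference, the conversion on its own reproduces the allocation; a minimal sketch (not from the package, just illustrating the behavior of convert):

```julia
using StaticArrays

x = @SVector rand(3)
xx = convert(Vector{Float64}, x)  # materializes a new heap-allocated Vector{Float64}

# whereas converting a Vector to its own type is a no-op:
v = rand(3)
vv = convert(Vector{Float64}, v)  # vv === v, no allocation
```

So the cost only appears when the input container type differs from Vector{T}, which is exactly the SVector case above.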

annuges commented 4 years ago

OnlineStats seems to handle it just fine:

s=CovMatrix()
@benchmark fit!($s,v) setup=(v=rand(SVector{3,Float64}))

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     16.015 ns (0.00% GC)
  median time:      16.116 ns (0.00% GC)
  mean time:        16.242 ns (0.00% GC)
  maximum time:     40.439 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999
gdkrmr commented 4 years ago

We did this because we really cared about the precision of the calculation. Still, this should be possible without a memory allocation; thanks for noticing.

gdkrmr commented 4 years ago

I took a stab at solving this. @annuges could you please check if this works for you now? If you are doing a PCA, you also need the very latest StatsBase v0.32.1.

@meggart, I am not sure if a more elegant solution is possible, the following does not work:

function _fit!(o::WeightedCovMatrix{T}, x::V{U}, w) where T where V where U
   xx = convert(V{T}, x)
   ...
end
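The signature above fails because a method type variable cannot be applied as a parametric constructor (V{U}) in a dispatch signature. One possible workaround sketch, assuming StaticArrays' similar_type; this is not necessarily what the package ended up using:

```julia
using StaticArrays

# Keep static inputs static: similar_type(x, T) gives the same container
# shape with element type T, so the conversion stays on the stack.
widen_eltype(::Type{T}, x::SVector) where {T} =
    convert(similar_type(x, T), x)    # e.g. SVector{3,Float32} -> SVector{3,Float64}

# Fallback for generic vectors: the original behavior.
widen_eltype(::Type{T}, x::AbstractVector) where {T} =
    convert(Vector{T}, x)
```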
annuges commented 4 years ago

Thanks, that fixes the issue for me.

A different idea I had just now would be to use a helper function for the conversion:

vecconvert(::Type{T}, v::AbstractVector{VT}) where {T,VT} = convert(Vector{T}, v)
vecconvert(::Type{T}, v::AbstractVector{T})  where {T}    = v
vecconvert(::Type{T}, v::SVector{VN,VT})     where {T,VN,VT} = convert(SVector{VN,T}, v)

One could then use xx = vecconvert(T, x) instead of xx = convert(Vector{T}, x).
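A self-contained sketch of how that helper would behave (hypothetical usage; note that for an SVector whose eltype already equals T, the second and third methods may need an extra SVector{VN,T} method to avoid a dispatch ambiguity):

```julia
using StaticArrays

vecconvert(::Type{T}, v::AbstractVector{VT}) where {T,VT} = convert(Vector{T}, v)
vecconvert(::Type{T}, v::AbstractVector{T})  where {T}    = v
vecconvert(::Type{T}, v::SVector{VN,VT})     where {T,VN,VT} = convert(SVector{VN,T}, v)

vecconvert(Float64, rand(3))                    # matching eltype: passed through unchanged
vecconvert(Float64, SVector{3,Float32}(1, 2, 3))  # stays static: an SVector{3,Float64}
```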

gdkrmr commented 4 years ago

Then the package has to depend on StaticArrays; I will have to think about this.

One question: I have always worked under the assumption that for most applications you have to read the data from some slow connection (e.g. an HDD) and the bottleneck is therefore in the reading. Do you see an actual performance gain in a real-world application?

gdkrmr commented 4 years ago

Considering that covariance matrices are usually small, the entire type could be implemented using StaticArrays (maybe as a WeightedStaticCovMatrix; PR welcome), which would justify a dependency on StaticArrays. As it stands, though, there is a solution and the dependency is not needed.
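For the record, such a fully static accumulator could look roughly like the following. This is a hypothetical sketch of the suggested WeightedStaticCovMatrix, not the package's API: it keeps weighted sums as SMatrix/SVector fields, so fitting never touches the heap.

```julia
using StaticArrays

# Hypothetical fully static weighted covariance accumulator.
mutable struct WStaticCov{N,T}
    C::SMatrix{N,N,T}  # weighted sum of outer products x*x'
    b::SVector{N,T}    # weighted sum of observations
    W::T               # total weight
end
WStaticCov{N,T}() where {N,T} =
    WStaticCov{N,T}(zero(SMatrix{N,N,T}), zero(SVector{N,T}), zero(T))

function wfit!(o::WStaticCov{N,T}, x::SVector{N,T}, w::T) where {N,T}
    # SMatrix/SVector arithmetic rebinds immutable values: no heap allocation.
    o.C += w * (x * x')
    o.b += w * x
    o.W += w
    return o
end

# Weighted (population) covariance: E[x x'] - E[x] E[x]'.
wcov(o::WStaticCov) = o.C / o.W - (o.b / o.W) * (o.b / o.W)'
```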

annuges commented 4 years ago

Ah yes, I did not think of that. That's not really worth it then for such a minor thing. Your solution nicely resolves the issue when no conversion is necessary, and if someone needs allocation-free type conversion of static vectors they can always do it themselves beforehand.

In my application I do a multithreaded loop over a large number of online stats. I read in a batch of data, and then all the different stats work on the same data with some individual preprocessing. The fitting itself is a minor step, so the raw increase in time alone probably wouldn't be an issue, but the allocations and garbage collection do not seem to play well with the threaded loop.
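The pattern described above looks roughly like this (an illustrative sketch, with made-up batch sizes and stat count; per-fit allocations here would create GC pressure across all threads):

```julia
using WeightedOnlineStats, Base.Threads

batch   = [rand(3) for _ in 1:10_000]          # one shared batch of observations
weights = rand(10_000)
stats   = [WeightedCovMatrix() for _ in 1:4]   # e.g. one stat per preprocessing variant

@threads for i in eachindex(stats)
    for (v, w) in zip(batch, weights)
        fit!(stats[i], v, w)                   # allocation-free after the fix
    end
end
```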

gdkrmr commented 4 years ago

Ok, I will just merge this, thanks!