JuliaStats / StatsBase.jl

Basic statistics for Julia
Other
584 stars 194 forks source link

RunningMoments code: does it belong here? #96

Closed sglyon closed 10 years ago

sglyon commented 10 years ago

I have implemented some Julia code to compute the first four moments of an array (right now just Array{T, 1}) with just one pass. This code is pretty much Julia code based on the C++ version here.

This code also allows users to compute "running" versions of these moments. This means that if users need to track moments for data that is becoming available in smaller chunks, no computation will be repeated as new data is added to the sample. I did some testing against the code here and these are my results:

julia> moms(x) = [mean(x), std(x), var(x), skewness(x), kurtosis(x)]
moms (generic function with 1 method)

julia> x = randn(5000000);

julia> @time run_m = RunningMoments(); push!(run_m, x); moms(run_m);
elapsed time: 3.666e-6 seconds (128 bytes allocated)

julia> @time moms(x);
elapsed time: 0.032538688 seconds (67904 bytes allocated)

julia> moms(x) - moms(run_m)
5-element Array{Float64,1}:
-1.1601e-17
-2.85327e-14
-5.70655e-14
-2.08167e-17
 3.75255e-13

We can also see that adding data little by little results in the correct moments (this runs without any issues):

srand(42)
N = 70
data = rand(N)
run_m = RunningMoments()

push!(run_m, data[1])

for i=2:N
    push!(run_m, data[i])
    @test_approx_eq moms(data[1:i]) moms(run_m)
end

My question is if this code would belong in src/moments.jl in this package? Right now I have it sitting in a repo where I keep useful stuff that doesn't belong anywhere else. Some tests can be found here.

Let me know if anyone things this belongs here. If it does, we will probably need to ask John Cook for permission before including it, as I am not sure what license is attached to the content of his blog (link from above).

StefanKarpinski commented 10 years ago

Cool. I suspect that John will not mind us using this code under a reasonable (e.g. MIT) license.

johnmyleswhite commented 10 years ago

I already got John's ok when I translated his code a while back. Will send a link soon.

-- John

On Oct 11, 2014, at 11:18 AM, Stefan Karpinski notifications@github.com wrote:

Cool. I suspect that John will not mind us using this code under a reasonable (e.g. MIT) license.

— Reply to this email directly or view it on GitHub.

johnmyleswhite commented 10 years ago

Here's the original translation that John approved: https://github.com/johnmyleswhite/RunningStats.jl

I believe John said his code was under the public domain. Trying to find his e-mail.

sglyon commented 10 years ago

Haha nice, should have known somebody has already done this!

johnmyleswhite commented 10 years ago

I think it'd be good to make a clean repository for streaming analytics. We could pull in John Cook's code, some reservoir sampling code I have, HyperLogLog and a few others.

Probably best to do this in a separate repo so we can keep StatsBase minimal.

sglyon commented 10 years ago

Sounds good, closing.

Feel free to post a link to the streaming stats repo when it is settled.

StefanKarpinski commented 10 years ago

StreamStats would be a good package name. Having a good single-pass quantile estimator would be nice too. There are some quite good algorithms out there.

kmsquire commented 10 years ago

On Saturday, October 11, 2014, Stefan Karpinski notifications@github.com wrote:

Having a good single-pass quantile estimator would be nice too. There are some quite good algorithms out there.

Why not even make if more general and allow the desired percentile to be a parameter?

johnmyleswhite commented 10 years ago

I'm pretty sure that's possible with the probabilistic data structures Stefan has in mind.