JuliaStats / StatsBase.jl

Basic statistics for Julia

autocov incorrect? #273

Closed tpapp closed 1 year ago

tpapp commented 7 years ago

Consider an alternating sequence of 1 and -1, which has autocovariance 1 at even lags and -1 at odd lags.

using StatsBase
x = ones(20)
x[1:2:end] .= -1.0                                   # alternating -1, 1, -1, 1, ...
ac = autocov(x, (1:length(x)) .- 1; demean = false)  # all lags 0, 1, ..., 19

But

julia> showcompact(ac)
[1.0, -0.95, 0.9, -0.85, 0.8, -0.75, 0.7, -0.65, 0.6, -0.55, 0.5, -0.45, 0.4, -0.35, 0.3, -0.25, 0.2, -0.15, 0.1, -0.05]                                             

I wonder if one should divide by the number of products actually summed instead of by the length of the whole sequence.
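
For reference, a minimal sketch (not part of the original report) of the comparison I have in mind, computing each lag by hand with both denominators and without demeaning:

x = ones(20); x[1:2:end] .= -1.0
N = length(x)
acv_N  = [sum(x[1:N-k] .* x[1+k:N]) / N       for k in 0:N-1]  # divide by N (what autocov does)
acv_Nk = [sum(x[1:N-k] .* x[1+k:N]) / (N - k) for k in 0:N-1]  # divide by the number of products summed

acv_N reproduces the shrinking pattern shown above, while acv_Nk alternates between 1 and -1.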

tpapp commented 7 years ago

I have looked at some textbooks, and both forms are in use (i.e. with denominators N and N-k, where N is the length of the sequence and k is the lag). The form with N guarantees that the resulting autocovariance matrix is positive semidefinite, but it is not unbiased. Some texts argue that since k << N, the difference should not matter. The variogram literature and related fields tend to use N-k. Perhaps the documentation could clarify this.
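
For concreteness, the two forms as I read them (written for the demeaned case; with demean = false replace $\bar{x}$ by 0):

$$\hat\gamma_N(k) = \frac{1}{N}\sum_{t=1}^{N-k}(x_t - \bar{x})(x_{t+k} - \bar{x}), \qquad \hat\gamma_{N-k}(k) = \frac{1}{N-k}\sum_{t=1}^{N-k}(x_t - \bar{x})(x_{t+k} - \bar{x}).$$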

andreasnoack commented 7 years ago

This is on purpose. I think it is the standard thing to do, e.g. it is the estimator in Brockwell and Davis and it is what R does. Notice that the N-k version isn't unbiased either. In general, autocovariance estimators (or time series estimators more generally) aren't unbiased. PRs with improvements to the documentation are always welcome.

baggepinnen commented 4 years ago

I noticed this weird (to me) behavior just today. The normalization has a large effect on estimates of the spectrum derived from the estimated ACF. Compare the two ACF estimates below:

[figure: acf]

The white line is the StatsBase behavior and the orange one is normalized by (N-k+1). Estimating spectra based on these two ACFs, we obtain the following result:

[figure: spectrum]

I find the normalization by N as opposed to N-k odd, since the name of the function is autocov but it does not give you the covariance between the signal and its lagged version. It's also not mentioned in the docstrings, so I think this issue deserves to remain open.
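
For reference, a minimal sketch (an assumption about the rescaling being compared, not the exact code behind the plots above) of how one could obtain the pairs-based normalization from the StatsBase output:

using StatsBase

y = randn(100)                         # placeholder series; the actual data is given below
N = length(y)
lags = 0:N-1
acf_N  = autocov(y, lags)              # StatsBase: denominator N for every lag
acf_Nk = acf_N .* (N ./ (N .- lags))   # rescaled to denominator N - k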

andreasnoack commented 4 years ago

What is the data generating process and what is the sample size in those examples?

baggepinnen commented 4 years ago

T = 100                       # number of samples
fs = 10                       # sampling frequency
F = [1, 3]                    # sinusoid frequencies
t = range(0, step = 1/fs, length = T)
y = sum(sin.(2π*f .* t .+ 2π*rand()) for f in F) .+ 0.1 .* randn.()  # random-phase sinusoids plus white noise
quebbs commented 4 years ago

Brockwell and Davis do discuss this, and they propose the use of n in their covariance functions. They also mention n-k, which is still biased, but 'less' biased. For smaller datasets, this makes a significant difference. If you were doing this by hand, you would only have n-k pairs, so using n wouldn't make sense. The question is whether retaining n to guarantee a positive definite covariance matrix is critical. I'm having to roll my own covariance functions to allow the n-k approach.

ParadaCarleton commented 2 years ago

The n-k-1 estimator is unbiased. The current implementation shrinks large lags towards 0 by dividing by n, which usually reduces mean squared error. The major problem with the n-k estimator is that it is not even consistent -- as the dataset grows, the number of lags estimated grows as well, and the highest lags are always estimated from only a handful of pairs: the second-to-last lag is based on two products and the last lag on a single one, so their variance never shrinks. By contrast, dividing by n yields a consistent estimator whenever the true autocovariance goes to 0 (the usual case). It is especially bad when trying to estimate not particular lags, but instead the variance of a time series' mean. In that case, using n-k will frequently give inconsistent estimates, e.g. that a series' total autocorrelation exceeds 1.

The only real solution is something like #774, I think.
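
To make the consistency point concrete, here is a minimal sketch (white-noise assumption, not from the thread) showing that the variance of the largest-lag estimate does not shrink with n under the n-k denominator, but does under n:

using Statistics

function last_lag_variances(n; reps = 10_000)
    by_n, by_nk = Float64[], Float64[]
    for _ in 1:reps
        x = randn(n)
        k = n - 1                 # largest estimable lag: a single product x[1] * x[n]
        s = x[1] * x[end]
        push!(by_n,  s / n)       # denominator n
        push!(by_nk, s / (n - k)) # denominator n - k (= 1 here)
    end
    return var(by_n), var(by_nk)
end

for n in (50, 200, 800)
    vn, vnk = last_lag_variances(n)
    println("n = $n: var with /n = $(round(vn, sigdigits = 3)), var with /(n-k) = $(round(vnk, sigdigits = 3))")
end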

ParadaCarleton commented 1 year ago

Closed as duplicate of #774