tpapp closed this issue 1 year ago
I have looked at some textbooks, and both forms are in use (i.e. with denominators N and N-k, where N is the length of the sequence and k is the lag). The form with N has some nice definiteness properties when written as a matrix, but it is not unbiased. Some texts argue that since k << N, the difference should not matter. Variograms and related literature tend to use N-k. Perhaps the documentation could clarify.
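For concreteness, the two conventions differ only in the divisor. A minimal sketch (the helper names `acov_N` and `acov_Nk` are made up here, not StatsBase API):

```julia
# Two autocovariance conventions at lag k; only the divisor differs.
# (Helper names are hypothetical, not part of StatsBase.)
function acov_N(x, k)                    # divide by N
    N = length(x); m = sum(x) / N
    sum((x[i] - m) * (x[i+k] - m) for i in 1:N-k) / N
end

function acov_Nk(x, k)                   # divide by N - k
    N = length(x); m = sum(x) / N
    sum((x[i] - m) * (x[i+k] - m) for i in 1:N-k) / (N - k)
end

x = randn(100)
# The two estimates differ exactly by the factor (N - k) / N:
acov_N(x, 5) ≈ acov_Nk(x, 5) * (100 - 5) / 100   # true
```

Since the shared sum is identical, the choice of divisor is purely a rescaling of each lag, largest at the largest lags.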
This is on purpose. I think it is the standard thing to do, e.g. it is the estimator in Brockwell and Davis and it is what R does. Notice that the N-k version isn't unbiased either. In general, autocovariance (or any other time series) estimators aren't unbiased. PRs with improvements to the documentation are always welcome.
I noticed this (to me) weird behavior just today. The normalization has a large effect on estimates of the spectrum derived from the estimated ACF. Compare the two ACF estimates below:
The white line is the StatsBase behavior and the orange one is normalized by (N-k+1).
Estimating spectra from these two ACFs, we obtain the following result:
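One way to see why the tail normalization matters so much for the spectrum: a correlogram-style estimate sums the ACF against cosines, so the large lags (where the two conventions differ most) enter with full weight. A sketch, where `S` is an illustrative helper and not an API from StatsBase or DSP.jl:

```julia
# Correlogram spectral estimate from an ACF: S(ω) = γ(0) + 2 Σ γ(k) cos(ωk).
# Purely illustrative; any rescaling of the large-lag γ(k) feeds straight
# into the sum.
S(ω, acf) = acf[1] + 2 * sum(acf[k+1] * cos(ω * k) for k in 1:length(acf)-1)

# Sanity check: a white-noise ACF gives a flat spectrum.
S(0.3, [1.0, 0.0, 0.0, 0.0])   # ≈ 1.0
```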
I find the normalization by N as opposed to N-k odd, since the function is named autocov but it does not give you the cov between the signal and its lagged version. It's also not mentioned in the docstrings, so I think this issue deserves to remain open.
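The naming point can be made concrete: `Statistics.cov` of the signal against its lagged copy uses the slice means and a corrected N-k-1 divisor, so it does not reproduce an N-divisor autocovariance. A sketch on a deterministic ramp (the manual sum below mirrors the N-divisor convention, it is not the StatsBase implementation):

```julia
using Statistics

x = collect(1.0:10.0)
k = 2
# cov of the signal against its lagged copy (slice means, divisor N - k - 1):
c_pairs = cov(x[1:end-k], x[1+k:end])          # 6.0 for this ramp
# N-divisor autocovariance-style estimate (overall mean, divisor N):
m = mean(x)
c_lagN = sum((x[i] - m) * (x[i+k] - m) for i in 1:length(x)-k) / length(x)   # 3.4
```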
What is the data generating process and what is the sample size in those examples?
```julia
T = 100                # number of samples
fs = 10                # sampling frequency
F = [1, 3]             # sinusoid frequencies
t = range(0, step = 1/fs, length = T)
y = sum(sin.(2π*f .* t .+ 2π*rand()) for f in F) .+ 0.1 .* randn.()
```
Brockwell and Davis do discuss this, and they propose the use of n in their covariance functions. They also mention n-k, which is biased, but 'less' biased. For smaller datasets, this makes a significant difference. If you were doing this by hand, you would only have n-k pairs, so use of n wouldn't make sense.
The question is whether retaining n to guarantee a positive definite covariance matrix is critical. For now I'm having to roll my own cov functions to allow the n-k approach.
The n-k-1 estimator is unbiased. The current implementation shrinks large lags toward 0 by dividing by n, which usually reduces mean squared error. The major problem with the n-k estimator is that it is not even consistent: as the dataset grows, the number of lags estimated increases as well, and because the second-to-last lag is always estimated from only two pairs (and the last from one), the variance of those estimates never shrinks. By contrast, dividing by n yields a consistent estimator whenever the true autocovariance goes to 0 (the usual case). The n-k form is especially bad when trying to estimate not particular lags but the variance of a time series' mean; in that case it will frequently give incoherent estimates, e.g. that a series' total autocorrelation exceeds 1.
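The consistency point can be checked by simulation: with the n-k divisor, the lag n-1 estimate always rests on a single cross-product, so its sampling spread stays order one no matter how long the series gets. A quick sketch (`acov_nk` is a hypothetical helper, not the StatsBase API):

```julia
using Statistics

# n-k divisor autocovariance (hypothetical helper, not StatsBase)
function acov_nk(x, k)
    n = length(x); m = sum(x) / n
    sum((x[i] - m) * (x[i+k] - m) for i in 1:n-k) / (n - k)
end

# For white noise the true autocovariance at lag n-1 is 0, yet the
# estimate is a single product of two near-standard-normal terms, so
# its spread across replications does not shrink as n grows.
ests = [acov_nk(randn(200), 199) for _ in 1:1000]
std(ests)   # stays near 1 regardless of series length
```

Rerunning with `randn(2000)` and lag 1999 gives a similar spread, which is the inconsistency being described.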
The only real solution is something like #774, I think.
Closed as duplicate of #774
Consider an alternating sequence of 1 and -1, which has autocovariances 1 and -1 at even and odd lags. But I wonder if one should divide by the number of elements in the calculation instead of the length of the whole sequence.
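On that alternating example the two divisors can be compared directly: dividing by the number of elements in the calculation (n-k) recovers ±1 exactly at every lag, while dividing by the full length n shrinks the estimate toward 0 as the lag grows. A sketch with a hypothetical helper:

```julia
# Alternating ±1 sequence: true autocovariance is (-1)^k at lag k.
x = [(-1.0)^i for i in 1:8]          # mean is exactly 0
n = length(x)

# Hypothetical helper: lag-k sum with an arbitrary divisor d
# (mean subtraction omitted since the mean is exactly 0 here).
acov(x, k, d) = sum(x[i] * x[i+k] for i in 1:length(x)-k) / d

[acov(x, k, n - k) for k in 1:3]     # [-1.0, 1.0, -1.0]: n-k recovers ±1
[acov(x, k, n) for k in 1:3]         # shrunk toward 0: [-0.875, 0.75, -0.625]
```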