Closed MaxBenChrist closed 6 years ago
See this SO discussion for reference, too.
I think I found the reason for the different values, and I am confident about the result because I managed to calculate `np.corrcoef` and statsmodels' `acf` by hand.

`pd.Series.autocorr()` invokes `np.corrcoef` under the hood. Therefore, the calculations match the Pearson correlation coefficient:

$r_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$

Assume we have a raw time series $x$. For $X$ and $Y$, the subseries `x[:(len(x)-lag)]` and `x[lag:]` are passed to `np.corrcoef`. In this formula, the mean (used in $\operatorname{cov}$) and the standard deviation of the subseries $X$ and $Y$, respectively, are used, not the mean and standard deviation of the whole series $x$.
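To make the subseries behaviour concrete, here is a small sketch (the series values are my own, purely for illustration) showing that `pd.Series.autocorr()` matches `np.corrcoef` applied to the two shifted subseries:

```python
import numpy as np
import pandas as pd

x = pd.Series([2.0, 4.0, 6.0, 7.0, 5.0, 3.0])
lag = 2

# The two overlapping subseries that pandas effectively correlates
y1 = x.values[:len(x) - lag]
y2 = x.values[lag:]

# Pearson correlation via np.corrcoef; subseries means/stddevs are used
by_corrcoef = np.corrcoef(y1, y2)[0, 1]

print(by_corrcoef, x.autocorr(lag=lag))  # both calls yield the same value
```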
In contrast, the formula for estimating the autocorrelation from Wikipedia is

$\hat{R}(k) = \frac{1}{(n-k)\sigma^2} \sum_{t=1}^{n-k} (X_t - \mu)(X_{t+k} - \mu)$

This differs from the above correlation coefficient by the parameters $\mu$ and $\sigma$: one can use the sample mean and variance of the whole time series to estimate them. `statsmodels.tsa.stattools.acf` uses this.
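The difference can be sketched by hand (again with my own toy numbers): the Wikipedia-style estimate with the whole-series $\mu$ and $\sigma^2$ comes out different from the `np.corrcoef`-based value on the same lag:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 7.0, 5.0, 3.0])
lag = 2
n = len(x)

# Wikipedia-style estimate: whole-series mean and variance
mu, var = np.mean(x), np.var(x)
wiki = np.sum((x[:n - lag] - mu) * (x[lag:] - mu)) / ((n - lag) * var)

# corrcoef-style estimate: per-subseries means and stddevs
corr = np.corrcoef(x[:n - lag], x[lag:])[0, 1]

print(wiki, corr)  # the two estimates do not agree
```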
The results differ quite heavily. I present plots of the raw time series and the lags for one weak-sense stationary AR process with 15, 150 and 1500 sampled elements, and one example of stock data:

[Plot: AR with 15 observations]
[Plot: AR with 150 observations]
[Plot: AR with 1500 observations]
[Plot: GOOG, AMZN, MSFT stock data (definitely not stationary)]

`autocorr` and `acf` spread apart considerably at higher lags.
To summarize:

- `autocorr` does nothing more than passing subseries of the original series to `np.corrcoef`.
- `acf`, on the contrary, uses the overall series' sample mean and sample variance in the formula.
- The `autocorr` function from pandas does not mimic Matlab's `autocorr`, but seems to correspond to Matlab's `xcorr` (cross-correlation) function.
- Matlab's `autocorr` calculates the sample autocorrelation as shown in the second formula.
- `xcorr` is defined for 2 arbitrary series which can differ in mean and variance.

See a more detailed investigation in this notebook.
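The growing spread at higher lags can be reproduced without the notebook. This is a quick sketch using a random walk of my own (not the stock series from the plots), comparing the `autocorr` convention against a hand-rolled Wikipedia/statsmodels-style estimate over a range of lags:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(np.cumsum(rng.normal(size=200)))  # non-stationary random walk

mu, var, n = x.mean(), x.var(ddof=0), len(x)

spreads = []
for lag in range(1, 50):
    # corrcoef-style, what pd.Series.autocorr does
    by_autocorr = x.autocorr(lag=lag)
    # Wikipedia/statsmodels-style, with whole-series mean and variance
    by_acf = np.sum((x.values[:n - lag] - mu) * (x.values[lag:] - mu)) \
        / ((n - lag) * var)
    spreads.append(abs(by_autocorr - by_acf))

print(max(spreads))  # the disagreement grows well beyond rounding noise
```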
So the values of the tsfresh `autocorrelation` feature extractor are not the same as those used in the `agg_autocorrelation` feature extractor.

Shall we adapt the calculation in `autocorrelation`?
Great analysis, as always @nikhaas.
Regarding the stock data: which function seems more useful?
Which feature is relevant highly depends on the task itself. I find it hard to say which way of calculating the autocorrelation is "right".

I am tempted to add another parameter and then use both ways of calculating the autocorrelation. However, if we have to decide on one way, we should stick with the statsmodels implementation, as it corresponds to the equation that is also found on Wikipedia.
Well, the advantage of `autocorr` is the minimal API.
> I am tempted to add another parameter and then use both ways of calculating the autocorrelation.

I do not recommend that. At least for me, this would be quite confusing. Maybe one last example to support the decision. Given the following time series
```
[ 0.  1.  2.  3.]
mean: 1.5
var: 1.25
```
`autocorr` yields these correlation coefficients:

```
[1.0, 1.0, 0.99999999999999989, nan]
```

whereas `acf` yields

```
[ 1.          0.33333333 -0.6        -1.8       ]
```
I suggest not using statsmodels' `acf` function in the feature calculator `autocorrelation`, but "hard-coding" the calculation like this:
```python
import numpy as np
import pandas as pd

def acf_by_hand(x, lag):
    # This is important: if a Series is passed, the product below would be
    # aligned on the index, which corresponds to squaring the series
    if isinstance(x, pd.Series):
        x = x.values
    # Slice the relevant subseries based on the lag
    y1 = x[:(len(x) - lag)]
    y2 = x[lag:]
    # Subtract the mean of the whole series x to calculate Cov
    x_mean = np.mean(x)
    sum_product = np.sum((y1 - x_mean) * (y2 - x_mean))
    # Return the normalized covariance
    return sum_product / ((len(x) - lag) * np.var(x))
```
Statsmodels' implementation calculates all lags up to the given lag, which is unnecessary overhead. This way, the API stays the same, and only the lags of interest are calculated.
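As a quick sanity check (restating the function in compact form so the snippet is self-contained), the hand-rolled version reproduces the `acf` numbers from the `[0, 1, 2, 3]` example above:

```python
import numpy as np

def acf_by_hand(x, lag):
    # Whole-series mean and variance, per the Wikipedia estimator
    x = np.asarray(x, dtype=float)
    x_mean = np.mean(x)
    y1, y2 = x[:len(x) - lag], x[lag:]
    return np.sum((y1 - x_mean) * (y2 - x_mean)) / ((len(x) - lag) * np.var(x))

values = [acf_by_hand([0, 1, 2, 3], lag) for lag in range(4)]
print(values)  # 1.0, 0.3333..., -0.6, -1.8
```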
Alright, you convinced me!
I think this issue can be closed now.
Indeed.
See the discussion in #223.

@nikhaas found some inconsistencies regarding the `autocorrelation` feature calculator. Also, we are missing unit tests for it.

So I propose that we add some unit tests and make sure we use the right estimator (either statsmodels' `acf` or `np.corrcoef`).
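As a starting point for the missing unit tests, here is a sketch (the test names and the inlined estimator are my suggestions, not existing tsfresh code) that pins down the expected values from the `[0, 1, 2, 3]` example discussed above:

```python
import numpy as np

def autocorrelation(x, lag):
    # Proposed estimator: whole-series mean/variance (Wikipedia convention)
    x = np.asarray(x, dtype=float)
    mu = np.mean(x)
    return np.sum((x[:len(x) - lag] - mu) * (x[lag:] - mu)) \
        / ((len(x) - lag) * np.var(x))

def test_autocorrelation_known_values():
    x = [0, 1, 2, 3]
    np.testing.assert_allclose(autocorrelation(x, 1), 1.0 / 3.0)
    np.testing.assert_allclose(autocorrelation(x, 2), -0.6)
    np.testing.assert_allclose(autocorrelation(x, 3), -1.8)

def test_autocorrelation_constant_series_is_nan():
    # Zero variance leads to a 0/0 division, which should yield nan
    with np.errstate(invalid="ignore", divide="ignore"):
        assert np.isnan(autocorrelation([1.0, 1.0, 1.0, 1.0], 1))

test_autocorrelation_known_values()
test_autocorrelation_constant_series_is_nan()
```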