blue-yonder / tsfresh

Automatic extraction of relevant features from time series:
http://tsfresh.readthedocs.io
MIT License

Unit test + settings for autocorrelation #273

Closed MaxBenChrist closed 6 years ago

MaxBenChrist commented 6 years ago

See the discussion in #223

@nikhaas found some inconsistencies regarding the autocorrelation feature calculator

Also we are missing unit tests for it.

So I propose that we add some unit tests and make sure we use the right estimator (either statsmodels' ACF or np.corrcoef).

nikhase commented 6 years ago

See this SO discussion for reference, too.

nikhase commented 6 years ago

I think I found the reason for the different values and I am confident about the result because I managed to calculate np.corrcoef and statsmodels acf by hand.

pd.Series.autocorr() invokes np.corrcoef under the hood. Therefore, the calculations match Pearson's correlation coefficient: $r_{XY} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$

Assume we have a raw time series x. Basically, for X and Y, the time series x[:(len(x)-lag)] and x[lag:] are passed to np.corrcoef, which are subseries of x. In this formula, the mean (used in cov) and stddev of the subseries X and Y respectively are used, not the mean and stddev of the whole series x.
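A minimal sketch of this equivalence, using a made-up toy series (not data from the thread): pd.Series.autocorr and np.corrcoef applied to the two overlapping subseries give the same value.

```python
import numpy as np
import pandas as pd

# Toy series chosen for illustration only
x = pd.Series([0.0, 2.0, 1.0, 3.0, 2.0, 5.0])
lag = 2

# pd.Series.autocorr(lag) is the Pearson correlation between the series
# and a copy of itself shifted by `lag` ...
by_pandas = x.autocorr(lag=lag)

# ... which equals np.corrcoef applied to the two overlapping subseries
# x[:-lag] and x[lag:]
by_numpy = np.corrcoef(x.values[:-lag], x.values[lag:])[0, 1]

assert np.isclose(by_pandas, by_numpy)
```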

In contrast, the formula for the estimate of the autocorrelation from Wikipedia is: $\hat{R}(k) = \frac{1}{(n-k)\,\sigma^2} \sum_{t=1}^{n-k} (x_t - \mu)(x_{t+k} - \mu)$

This differs from the above correlation coefficient in the parameters $\mu$ and $\sigma$. One can use the sample mean and variance of the whole time series to estimate them. statsmodels.tsa.stattools.acf does this.
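The difference can be made concrete with a short numpy sketch on a toy series (invented for illustration, following the two formulas as written above): the Pearson coefficient centres and scales each subseries by its own statistics, while the ACF estimate uses the mean and variance of the whole series.

```python
import numpy as np

x = np.array([0.0, 2.0, 1.0, 3.0, 2.0])  # toy series, not from the thread
lag = 1
y1, y2 = x[:-lag], x[lag:]

# Pearson correlation coefficient: each subseries is centred and scaled
# by its OWN mean and standard deviation
pearson = np.corrcoef(y1, y2)[0, 1]

# Autocorrelation estimate: both subseries are centred by the mean of the
# WHOLE series and scaled by its overall variance
acf_hat = np.sum((y1 - x.mean()) * (y2 - x.mean())) / ((len(x) - lag) * x.var())

# The two estimators generally disagree
assert not np.isclose(pearson, acf_hat)
```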

Example Plots

The results differ quite heavily. I present plots of the raw time series and the lags for one weak-sense stationary AR-Process with 15, 150 and 1500 sampled elements and one example of Stock Data:

- AR with 15 observations (ar_15)
- AR with 150 observations (ar_150)
- AR with 1500 observations (ar_1500)
- GOOG, AMZN, MSFT stock data, definitely not stationary (stock_data)

In the stock data, autocorr and acf spread apart considerably at higher lags.

Conclusion

For a comparison with Matlab, see the more detailed investigation in this notebook.

So the values of the tsfresh autocorrelation feature extractor are not the same as those used in the agg_autocorrelation feature extractor. Shall we adapt the calculation in autocorrelation?


MaxBenChrist commented 6 years ago

Great analysis, as always @nikhaas.

Regarding the stock data: which function seems more useful?

Which feature is relevant depends highly on the task itself. I find it hard to say which way of calculating the autocorrelation is "right".

I am tempted to add another parameter and then use both ways of calculating the autocorrelation.

However, I feel that if we have to decide on one way, we should stick with the statsmodels implementation, as it corresponds to the equation that is also given on Wikipedia.

nikhase commented 6 years ago

Well, the advantage of autocorr is the minimal API.

I am tempted to add another parameter and then use both ways of calculating the autocorrelation.

I do not recommend that. At least for me, this would be quite confusing. Maybe one last example to support the decision: given the following time series

[ 0.  1.  2.  3.]
mean: 1.5
var: 1.25

autocorr yields these correlation coefficients:

[1.0, 1.0, 0.99999999999999989, nan]

whereas acf yields

[ 1.  0.33333333   -0.6   -1.8]
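The pandas side of this example can be reproduced directly (the acf values need statsmodels, which I leave out here). The nan at lag 3 appears because only a single overlapping pair of points remains, so a correlation is undefined:

```python
import numpy as np
import pandas as pd

x = pd.Series([0.0, 1.0, 2.0, 3.0])

# autocorr at lags 0-2 is (numerically close to) 1.0 because the lagged
# subseries of this linear ramp are perfectly correlated; at lag 3 only
# one pair of points overlaps, so pandas returns nan
autocorr_vals = [x.autocorr(lag) for lag in range(4)]
print(autocorr_vals)
```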

I suggest not using statsmodels' acf function in the feature calculator autocorrelation, but "hard-coding" the calculation like this:

import numpy as np
import pandas as pd

def acf_by_hand(x, lag):
    # This is important: if a pd.Series is passed, the product below would
    # be aligned on the index, which amounts to squaring the series
    # instead of computing the lagged product
    if isinstance(x, pd.Series):
        x = x.values
    # Slice the relevant subseries based on the lag
    y1 = x[:(len(x) - lag)]
    y2 = x[lag:]
    # Subtract the mean of the whole series x to calculate Cov
    x_mean = np.mean(x)
    sum_product = np.sum((y1 - x_mean) * (y2 - x_mean))
    # Return the normalized covariance
    return sum_product / ((len(x) - lag) * np.var(x))

Statsmodels' implementation calculates all lags up to the given lag, which is unnecessary overhead. This way, the API stays the same, and only the lags of interest are calculated.
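As a sanity check, the by-hand function (repeated here so the snippet is self-contained) reproduces the acf values for the [0, 1, 2, 3] example above:

```python
import numpy as np
import pandas as pd

def acf_by_hand(x, lag):
    # Work on a plain ndarray so the product below is not index-aligned
    if isinstance(x, pd.Series):
        x = x.values
    # Lagged subseries, centred by the mean of the WHOLE series
    y1 = x[:(len(x) - lag)]
    y2 = x[lag:]
    x_mean = np.mean(x)
    sum_product = np.sum((y1 - x_mean) * (y2 - x_mean))
    # Normalize by the overall variance and the number of summands
    return sum_product / ((len(x) - lag) * np.var(x))

x = pd.Series([0.0, 1.0, 2.0, 3.0])
values = [acf_by_hand(x, lag) for lag in range(4)]
print(values)  # [1.0, 0.3333..., -0.6, -1.8], matching the acf output above
```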

MaxBenChrist commented 6 years ago

Statsmodels' implementation calculates all lags up to the given lag, which is unnecessary overhead. This way, the API stays the same, and only the lags of interest are calculated.

Alright, you convinced me!

nikhase commented 6 years ago

I think this issue can be closed now.

MaxBenChrist commented 6 years ago

indeed