blue-yonder / tsfresh

Automatic extraction of relevant features from time series:
http://tsfresh.readthedocs.io
MIT License

Unit test + settings for autocorrelation #273

Closed MaxBenChrist closed 6 years ago

MaxBenChrist commented 6 years ago

See the discussion in #223

@nikhaas found some inconsistencies regarding the autocorrelation feature calculator

Also we are missing unit tests for it.

So I propose that we add some unit tests and make sure we use the right estimator (either statsmodels' ACF or np.corrcoef).

nikhase commented 6 years ago

See this SO discussion for reference, too.

nikhase commented 6 years ago

I think I found the reason for the different values and I am confident about the result because I managed to calculate np.corrcoef and statsmodels acf by hand.

pd.Series.autocorr() invokes np.corrcoef under the hood. Therefore, the calculations match Pearson's correlation coefficient: $r_{XY} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$

Assume we have a raw time series x. Basically, for X and Y, the time series x[:(len(x)-lag)] and x[lag:] are passed to np.corrcoef, which are subseries of x. In this formula, the mean (used in cov) and stddev of the subseries X and Y respectively are used, not the mean and stddev of the whole series x.
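A minimal sketch of this equivalence, using a made-up toy series (not data from the thread): pd.Series.autocorr and np.corrcoef applied to the two overlapping subseries give the same value.

```python
import numpy as np
import pandas as pd

# Toy series chosen for illustration only
x = pd.Series([0.0, 2.0, 1.0, 3.0, 2.0, 5.0])
lag = 2

# pd.Series.autocorr(lag) is the Pearson correlation between the series
# and a copy of itself shifted by `lag` ...
by_pandas = x.autocorr(lag=lag)

# ... which equals np.corrcoef applied to the two overlapping subseries
# x[:-lag] and x[lag:]
by_numpy = np.corrcoef(x.values[:-lag], x.values[lag:])[0, 1]

assert np.isclose(by_pandas, by_numpy)
```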

In contrast, the formula for the estimate of the autocorrelation from Wikipedia is: $\hat{R}(k) = \frac{1}{(n-k)\,\sigma^2} \sum_{t=1}^{n-k} (x_t - \mu)(x_{t+k} - \mu)$

This differs from the above correlation coefficient in the parameters $\mu$ and $\sigma$. One can use the sample mean and variance of the whole time series to estimate them. statsmodels.tsa.stattools.acf does this.
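The difference can be made concrete with a short numpy sketch on a toy series (invented for illustration, following the two formulas as written above): the Pearson coefficient centres and scales each subseries by its own statistics, while the ACF estimate uses the mean and variance of the whole series.

```python
import numpy as np

x = np.array([0.0, 2.0, 1.0, 3.0, 2.0])  # toy series, not from the thread
lag = 1
y1, y2 = x[:-lag], x[lag:]

# Pearson correlation coefficient: each subseries is centred and scaled
# by its OWN mean and standard deviation
pearson = np.corrcoef(y1, y2)[0, 1]

# Autocorrelation estimate: both subseries are centred by the mean of the
# WHOLE series and scaled by its overall variance
acf_hat = np.sum((y1 - x.mean()) * (y2 - x.mean())) / ((len(x) - lag) * x.var())

# The two estimators generally disagree
assert not np.isclose(pearson, acf_hat)
```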

Example Plots

The results differ quite heavily. I present plots of the raw time series and the lags for one weak-sense stationary AR-Process with 15, 150 and 1500 sampled elements and one example of Stock Data:

- AR with 15 observations (ar_15)
- AR with 150 observations (ar_150)
- AR with 1500 observations (ar_1500)
- GOOG, AMZN, MSFT stock data, definitely not stationary (stock_data)

In the stock data, autocorr and acf spread apart considerably at higher lags.

Conclusion

For a comparison with Matlab, see the more detailed investigation in this notebook.

So the values of the tsfresh autocorrelation feature extractor are not the same as those used in the agg_autocorrelation feature extractor. Shall we adapt the calculation in autocorrelation?


MaxBenChrist commented 6 years ago

Great analysis, as always @nikhaas.

Regarding the stock data: which function seems more useful?

Which feature is relevant depends highly on the task itself. I find it hard to say which way of calculating the autocorrelation is "right".

I am tempted to add another parameter and then use both ways of calculating the autocorrelation.

However, I feel that if we have to decide on one way, we should stick with the statsmodels implementation, as it corresponds to the equation that is also given on Wikipedia.

nikhase commented 6 years ago

Well, the advantage of autocorr is the minimal API.

I am tempted to add another parameter and then use both ways of calculating the autocorrelation.

I do not recommend that. At least for me, this would be quite confusing. Maybe one last example to support the decision: given the following time series

[ 0.  1.  2.  3.]
mean: 1.5
var: 1.25

autocorr yields these correlation coefficients:

[1.0, 1.0, 0.99999999999999989, nan]

whereas acf yields

[ 1.  0.33333333   -0.6   -1.8]
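The pandas side of this example can be reproduced directly (the acf values need statsmodels, which I leave out here). The nan at lag 3 appears because only a single overlapping pair of points remains, so a correlation is undefined:

```python
import numpy as np
import pandas as pd

x = pd.Series([0.0, 1.0, 2.0, 3.0])

# autocorr at lags 0-2 is (numerically close to) 1.0 because the lagged
# subseries of this linear ramp are perfectly correlated; at lag 3 only
# one pair of points overlaps, so pandas returns nan
autocorr_vals = [x.autocorr(lag) for lag in range(4)]
print(autocorr_vals)
```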

I suggest not using statsmodels' acf function in the feature calculator autocorrelation, but "hard-coding" the calculation like this:

import numpy as np
import pandas as pd

def acf_by_hand(x, lag):
    # This is important: if a pd.Series is passed, the product below would
    # be aligned on the index, which amounts to squaring the series
    # instead of computing the lagged product
    if isinstance(x, pd.Series):
        x = x.values
    # Slice the relevant subseries based on the lag
    y1 = x[:(len(x) - lag)]
    y2 = x[lag:]
    # Subtract the mean of the whole series x to calculate Cov
    x_mean = np.mean(x)
    sum_product = np.sum((y1 - x_mean) * (y2 - x_mean))
    # Return the normalized covariance
    return sum_product / ((len(x) - lag) * np.var(x))

Statsmodels' implementation calculates all lags up to the given lag, which is unnecessary overhead. This way, the API stays the same, and only the lags of interest are calculated.
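As a sanity check, the by-hand function (repeated here so the snippet is self-contained) reproduces the acf values for the [0, 1, 2, 3] example above:

```python
import numpy as np
import pandas as pd

def acf_by_hand(x, lag):
    # Work on a plain ndarray so the product below is not index-aligned
    if isinstance(x, pd.Series):
        x = x.values
    # Lagged subseries, centred by the mean of the WHOLE series
    y1 = x[:(len(x) - lag)]
    y2 = x[lag:]
    x_mean = np.mean(x)
    sum_product = np.sum((y1 - x_mean) * (y2 - x_mean))
    # Normalize by the overall variance and the number of summands
    return sum_product / ((len(x) - lag) * np.var(x))

x = pd.Series([0.0, 1.0, 2.0, 3.0])
values = [acf_by_hand(x, lag) for lag in range(4)]
print(values)  # [1.0, 0.3333..., -0.6, -1.8], matching the acf output above
```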

MaxBenChrist commented 6 years ago

Statsmodels' implementation calculates all lags up to the given lag, which is unnecessary overhead. This way, the API stays the same, and only the lags of interest are calculated.

Alright, you convinced me!

nikhase commented 6 years ago

I think this issue can be closed now.

MaxBenChrist commented 6 years ago

indeed