Why NO k-Fold Cross Validation?

alkaline-ml / pmdarima

A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function.

https://www.alkaline-ml.com/pmdarima

MIT License


Why NO k-Fold Cross Validation? #398

Status: Closed (TaherHabib closed this issue 3 years ago)

TaherHabib commented 4 years ago

Hello,

Great work with pmdarima! Thanks :)

However, I wonder why a K-fold cross-validation scheme is not provided in the package.

Since ARIMA is an autoregressive model requiring the data to be stationary, I don't understand why K-fold cross validation could not be used in this case.

tgsmith61591 commented 4 years ago

Hey there. The reason we can't use K-fold cross validation is that it would introduce data leakage.

[Image: KfoldCV]

For time series, all validation folds must be in the future relative to the data the model is trained on. If you consider your time series observations as being ordered training data, folds that omit a portion in the middle of the sequence actually give your model information it should not have a priori (namely, the values of future observations). By using the approaches to CV we've implemented in the package, you can prevent data leakage.
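To make the leakage argument concrete, here is a minimal sketch in plain Python that compares contiguous K-fold splits with rolling-origin splits on a series of 12 ordered observations. It does not call pmdarima (whose `model_selection` module offers classes such as `RollingForecastCV` and `SlidingWindowForecastCV`); the helper names `rolling_origin_splits`, `contiguous_kfold_splits`, and `leaks_future` are hypothetical and exist only for this illustration.

```python
def rolling_origin_splits(n_samples, initial, horizon=1, step=1):
    """Yield (train_idx, test_idx) where every test index follows all training indices."""
    end = initial
    while end + horizon <= n_samples:
        yield list(range(end)), list(range(end, end + horizon))
        end += step


def contiguous_kfold_splits(n_samples, k):
    """Yield (train_idx, test_idx) for plain K-fold on contiguous blocks (no shuffling)."""
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        stop = n_samples if i == k - 1 else start + fold_size
        test = list(range(start, stop))
        train = [j for j in range(n_samples) if j < start or j >= stop]
        yield train, test


def leaks_future(train_idx, test_idx):
    """True if any training observation occurs after the first test observation."""
    return max(train_idx) > min(test_idx)


n = 12  # illustrative series length (assumption, not from the thread)
kfold_leaks = [leaks_future(tr, te) for tr, te in contiguous_kfold_splits(n, k=4)]
rolling_leaks = [
    leaks_future(tr, te)
    for tr, te in rolling_origin_splits(n, initial=6, horizon=2, step=2)
]

print("K-fold folds that train on future data:", kfold_leaks)
print("Rolling-origin folds that train on future data:", rolling_leaks)
```

With these settings, three of the four K-fold folds train on observations that come after the held-out block, while none of the rolling-origin folds do.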

Does that answer your question?

TaherHabib commented 4 years ago

Thanks for the response!

I agree with your reasoning above for not using K-fold CV for time series data in order to preserve the temporal order in the training and testing procedures. However, my request for K-fold was motivated by the fact that ARIMA models are applied to stationary stochastic processes, where the statistical distribution of the stochastic process has an unchanging (over time) mean, variance, etc. As a result, it makes sense to me to apply K-fold CV – at least to the case of strongly stationary time series data. But, I am not so sure about this.
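One way to probe the stationarity argument is to check whether a stationary series has independent observations. The sketch below is not from the thread: it simulates a stationary AR(1) process with an assumed coefficient of 0.8 and measures its lag-1 autocorrelation. The function names and parameter values are illustrative assumptions.

```python
import random


def simulate_ar1(n, phi, seed=0, burn_in=200):
    """Simulate x_t = phi * x_{t-1} + e_t with standard normal noise; stationary when |phi| < 1."""
    rng = random.Random(seed)
    x = 0.0
    series = []
    for t in range(n + burn_in):
        x = phi * x + rng.gauss(0.0, 1.0)
        if t >= burn_in:  # discard the warm-up so the start value has little influence
            series.append(x)
    return series


def lag_autocorrelation(series, lag=1):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var


series = simulate_ar1(n=5000, phi=0.8)  # phi=0.8 is an assumed, illustrative value
r1 = lag_autocorrelation(series, lag=1)
print(f"lag-1 autocorrelation: {r1:.2f}")
```

The mean and variance of this process do not change over time, yet neighbouring observations are strongly correlated. A held-out block in the middle of the series is therefore correlated with the training observations on both sides of it, so stationarity alone does not make the folds independent.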

Please let me know if there's anything missing here :)