antoinecarme / pyaf

PyAF is an Open Source Python library for Automatic Time Series Forecasting built on top of popular pydata modules.
BSD 3-Clause "New" or "Revised" License

Investigate cross-validation methods for time series #53

Closed: antoinecarme closed this issue 5 years ago

antoinecarme commented 7 years ago

PyAF uses a simple separation of the total dataset into estimation/training and test/hold-out datasets (80% and 20% respectively by default, customizable).
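
To make the split above concrete, here is a minimal sketch of a chronological 80/20 hold-out split. It is an illustration only, not PyAF's actual code; the function name `holdout_split` and the parameter `train_ratio` are hypothetical.

```python
def holdout_split(series, train_ratio=0.8):
    """Split a series chronologically: the first part is used for
    estimation/training, the remainder is held out for testing."""
    n_train = int(len(series) * train_ratio)
    return series[:n_train], series[n_train:]


# Toy series of 100 observations (illustrative data only).
series = list(range(100))
train, test = holdout_split(series)
print(len(train), len(test))
```

The split is never shuffled: every test observation comes after every training observation in time.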

Try to evaluate the impact of using cross-validation: gain in model quality/stability/accuracy versus practical aspects (CPU time and memory usage).

Use the "rolling forecasting origin" method described here:

https://www.otexts.org/fpp/2/5
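
A minimal sketch of the rolling forecasting origin idea, assuming a naive last-value forecaster as a stand-in for a real model. The function `rolling_origin_errors` and its parameters are hypothetical names used for illustration; this is not PyAF code.

```python
def rolling_origin_errors(series, min_train, horizon=1):
    """Move the forecast origin forward one step at a time.

    At each origin, "fit" on series[:origin] and compare the forecast
    with the actual value `horizon` steps ahead.
    """
    errors = []
    for origin in range(min_train, len(series) - horizon + 1):
        train = series[:origin]
        actual = series[origin + horizon - 1]
        forecast = train[-1]  # naive model: repeat the last observed value
        errors.append(abs(actual - forecast))
    return errors


# Toy data (illustrative only).
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
errors = rolling_origin_errors(series, min_train=3)
mean_abs_error = sum(errors) / len(errors)
print(len(errors), mean_abs_error)
```

Each origin requires a new fit, which is where the extra CPU time comes from compared with a single hold-out split.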

antoinecarme commented 7 years ago

Nice document by MathWorks!

https://www.mathworks.com/help/econ/rolling-window-estimation-of-state-space-models.html

antoinecarme commented 7 years ago

Forecast evaluation with Stata

https://ideas.repec.org/p/boc/usug10/10.html

antoinecarme commented 7 years ago

Package 'forecastHybrid'

Nice R package that handles cross-validation for time series

https://cran.r-project.org/web/packages/forecastHybrid/forecastHybrid.pdf

Details

Cross validation of time series data is more complicated than regular k-folds or leave-one-out cross validation of datasets without serial correlation since observations x_t and x_{t+n} are not independent.

The cvts() function overcomes this obstacle using two methods:

rolling cross validation, where an initial training window is used along with a forecast horizon, and the initial window used for training grows by one observation each round until the training window and the forecast horizon capture the entire series, or

a non-rolling approach where a fixed training length is used that is shifted forward by the forecast horizon after each iteration.

For the rolling approach, training points are heavily recycled, both for fitting and for generating forecast errors at each of the forecast horizons from 1:maxHorizon.

In contrast, the models fit with the non-rolling approach share less overlap, and the predicted forecast values are also only compared to the actual values once. The former approach is similar to leave-one-out cross validation while the latter resembles k-fold cross validation.

As a result, rolling cross validation requires far more iterations and computationally takes longer to complete, but a disadvantage of the non-rolling approach is the greater variance and general instability of cross-validated errors.
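
The two schemes described above can be sketched as index generators. This is a Python interpretation of the quoted description, not the R implementation of cvts(); the function names and the `window` and `horizon` parameters are hypothetical.

```python
def rolling_splits(n, window, horizon):
    """Rolling scheme: the training window starts at `window` observations
    and grows by one each round until training plus horizon reach n."""
    end = window
    while end + horizon <= n:
        yield range(0, end), range(end, end + horizon)
        end += 1


def non_rolling_splits(n, window, horizon):
    """Non-rolling scheme: a fixed-length training window is shifted
    forward by the forecast horizon after each iteration."""
    start = 0
    while start + window + horizon <= n:
        train_end = start + window
        yield range(start, train_end), range(train_end, train_end + horizon)
        start += horizon


rolling = list(rolling_splits(n=12, window=6, horizon=2))
non_rolling = list(non_rolling_splits(n=12, window=6, horizon=2))
print(len(rolling), len(non_rolling))
```

With the same series length, window and horizon, the rolling scheme produces more folds (more model fits), while in the non-rolling scheme each test point is scored only once.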

antoinecarme commented 6 years ago

scikit-learn has a time series split cross-validator

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html#sklearn.model_selection.TimeSeriesSplit

From the scikit-learn user guide:

http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation

(image from the scikit-learn user guide)
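
A short usage sketch of `TimeSeriesSplit` on toy data, assuming scikit-learn and NumPy are installed. The data and the choice of `n_splits=3` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy feature matrix with 12 time-ordered samples (illustrative data).
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]

for train, test in splits:
    # Training indices always precede test indices, and the training
    # set grows from one split to the next.
    print("train:", train, "test:", test)
```

This corresponds to an expanding training window: each successive split trains on all samples that come before its test block.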

BenjaminLarrousse commented 6 years ago

Hey, are you working on a solution for cross-validation? The Facebook Prophet package has something implemented: https://github.com/facebook/prophet/blob/master/notebooks/diagnostics.ipynb

antoinecarme commented 6 years ago

Hey @BenjaminLarrousse

Yet another implementation for this feature. Thanks for the feedback. I will take a closer look at it.

PyAF is designed to be a standalone product and cannot reuse existing time series forecasting software (PyAF existed well before facebook/prophet was made public).

Are you interested in implementing it?

Cheers,

Antoine

BenjaminLarrousse commented 6 years ago

Yes, sure. I was thinking about a specific implementation in your package, but their code can help with that. I don't have much time right now to implement it, but if I manage to find some free time, why not!

Cheers

antoinecarme commented 5 years ago

Finished. See #105 for the selected implementation.