This is what I've been exploring over at pytimetk: https://github.com/business-science/pytimetk/blob/master/src/pytimetk/crossvalidation/time_series_cv.py
Hey Matt, first and foremost, thanks for your interest.
Just to make sure I am getting the request correctly — in the diagram below, = marks the training window, /// the gap, and *** the test window:
| ======= /// *** |
| =========== /// *** |
| =============== /// *** |
| ================== /// *** |
Reversing the order should be a fairly quick adjustment to the iterator, while an expanding-backward option may be more challenging.
For now, the lazy way to do it is to return the splits and reverse them manually.
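For illustration, a minimal sketch of that manual reversal using timebasedcv's TimeBasedSplit and its split() generator; the data and the splitter parameters below are made up for the example and are not from this thread:

import numpy as np
import pandas as pd
from timebasedcv import TimeBasedSplit

# Toy data: one value per day over four years.
dates = pd.Series(pd.date_range("2020-01-01", "2023-12-31", freq="D"))
X = np.random.default_rng(42).normal(size=len(dates))

# Illustrative configuration only: 1 year of training, 3 months of test,
# stepping forward 3 months between splits.
tbs = TimeBasedSplit(
    frequency="days",
    train_size=365,
    forecast_horizon=90,
    gap=0,
    stride=90,
    window="rolling",
)

# Materialize the splits and reverse them, so the first element
# is the most recent (train, test) pair.
splits_newest_first = list(tbs.split(X, time_series=dates))[::-1]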
Yep I think you have it. The top need is the reversed order with the first split being the most recent.
So which one of the two is the expected/desired behaviour for expanding window in "reverse" order?
Rolling window with most recent first.
Hey Matt, I am still not sure about what you are asking here.
If what you want is the same splits, just in a different order (see figures below), then I don't see why your validation score would change.
My rationale is that the most recent time series data has the most information.
I can agree with this, but rather than just having the splits in a different order, you could give different importance to different folds so that the most recent ones carry more weight in the final decision. The package gives the user enough flexibility to make this kind of decision afterwards.
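For example, something along the lines of this toy sketch; the scores and the decay factor are invented for illustration and are not part of either package:

import numpy as np

# Per-fold validation scores, ordered oldest -> newest (made-up numbers).
fold_scores = np.array([0.82, 0.79, 0.85, 0.88])

# Exponentially decaying weights so the newest fold counts the most.
decay = 0.5
weights = decay ** np.arange(len(fold_scores))[::-1]  # newest fold gets weight 1.0

# Weighted average used for the final model-selection decision.
weighted_score = np.average(fold_scores, weights=weights)
print(weighted_score)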
On the other hand, if you want to have a fixed test set, and a moving training set, then we can have a (somewhat separate) discussion on that, and why I don't think it is a good idea to support it.
Thanks for your message @FBruzzesi. This is what I'm planning to accomplish inside of pytimetk:
In timetk (a comparable R package), I have a function called time_series_cv() and some plotting utilities to help visualize the time series cross-validation sets. https://business-science.github.io/timetk/reference/time_series_cv.html
When the resampling is performed, the first set is always the most recent data. Here it's 24 months of data, but the user could also have specified 24 periods numerically since it's a monthly-frequency dataset. The initial argument is the window of training data, and skip is how many periods to shift between successive splits (the gap between them).
Your package essentially does the same thing, but in reverse. That's consistent with how Rob Hyndman does it, but in my experience it isn't the best way to do time series cross-validation (again, because newer information is typically more relevant, and in practice people just use the top N resamples, where N is 5 or so). That way, if they select slice_limit = 3, they get the 3 most recent splits.
resample_spec <- time_series_cv(data        = m750,
                                initial     = "6 years",
                                assess      = "24 months",
                                skip        = "24 months",
                                cumulative  = FALSE,
                                slice_limit = 3)
#> Using date_var: date
resample_spec
#> # Time Series Cross Validation Plan
#> # A tibble: 3 × 2
#>   splits          id
#>   <list>          <chr>
#> 1 <split [72/24]> Slice1
#> 2 <split [72/24]> Slice2
#> 3 <split [72/24]> Slice3
When visualized, the resulting sets look like this:
resample_spec %>%
    plot_time_series_cv_plan(date, value, .interactive = FALSE)
Like your package, it supports time series panels or groups so that all time series are split based on the sliding windows.
walmart_tscv <- walmart_sales_weekly %>%
    time_series_cv(
        date_var    = Date,
        initial     = "12 months",
        assess      = "3 months",
        skip        = "3 months",
        slice_limit = 4
    )
The only other thing I do is provide a cumulative argument, which simply extends the training data back to the first timestamp in the dataset.
# Cumulative TRUE
library(timetk)
library(tidyverse)

?time_series_cv

walmart_tscv <- walmart_sales_weekly %>%
    time_series_cv(
        date_var    = Date,
        initial     = "12 months",
        assess      = "3 months",
        skip        = "3 months",
        slice_limit = 4,
        cumulative  = TRUE
    )

walmart_tscv %>%
    plot_time_series_cv_plan(Date, Weekly_Sales, .interactive = FALSE)
Now I see what you mean. I need to double-check how easy it is to flip the logic without having to maintain two different algorithms. I will take a closer look over the weekend.
Regarding taking the first N splits (in any direction), that's very easy; it could be enough to write a helper function that wraps the CV with itertools.islice. I will add it as an issue.
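Something along these lines — a hypothetical helper, not part of timebasedcv's API; slice_limit just mirrors timetk's argument name:

from itertools import islice

def limit_splits(split_generator, slice_limit):
    # Keep only the first `slice_limit` splits from any CV split generator,
    # without materializing the rest.
    return islice(split_generator, slice_limit)

# Usage (assuming a TimeBasedSplit instance `tbs`, data `X`, and a `dates` series):
# first_three = list(limit_splits(tbs.split(X, time_series=dates), 3))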
Ok sounds good. Happy to help in any way I can.
Hey @mdancho84 , I just released v0.2.0 with the new feature. I added documentation regarding that in a dedicated paragraph. Let me know if you have any feedback on that
Excellent. I'll check it out this weekend.
Update: Docs look great. Will test out 0.2.0 for integration with pytimetk.
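For reference, a minimal sketch of what the new behaviour might look like from the user side, reusing the toy setup from the earlier sketch and assuming the option is exposed as a mode argument on TimeBasedSplit (check the v0.2.0 docs for the exact name and accepted values):

import numpy as np
import pandas as pd
from timebasedcv import TimeBasedSplit

dates = pd.Series(pd.date_range("2020-01-01", "2023-12-31", freq="D"))
X = np.random.default_rng(0).normal(size=len(dates))

# Assumed API: mode="backward" yields the most recent split first.
tbs = TimeBasedSplit(
    frequency="days",
    train_size=365,
    forecast_horizon=90,
    gap=0,
    stride=90,
    window="rolling",
    mode="backward",
)

for train, test in tbs.split(X, time_series=dates):
    # The first (train, test) pair here should be the most recent one.
    print(len(train), len(test))
    break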
I have finally got around to integrating timebasedcv into pytimetk. It works great!
I've created a TimeSeriesCV() and a TimeSeriesCVSplitter() based on yours. The major differences are a couple of diagnostic tools and making mode = backwards the default.
It's really coming together: https://business-science.github.io/pytimetk/reference/TimeSeriesCVSplitter.html#pytimetk.TimeSeriesCVSplitter
Hey @mdancho84 👋🏼 thanks a ton for the kind words! I am very happy to see that you are finding it useful 🙌🏼
First off, great package. I've been tinkering with CV for implementation in my pytimetk package, which aims to make it easier to do time series operations in Python. I'm considering integrating this package. One issue is that my preference is to have cross-validation start with the most recent data (meaning the first split should be the most recent data). My rationale is that the most recent time series data has the most information.
This may be a matter of preference, but it's a mistake to start with the oldest data and think that the CV results will mean much.
So is it possible to have an argument to begin the CV splits with the most recent data (basically inverting the default), for both the sliding window and the expanding (cumulative) window?
Thanks! -Matt