FBruzzesi / timebasedcv

Time based splits for cross validation
https://fbruzzesi.github.io/timebasedcv/
MIT License

CV Splits - Is it Possible to Start With Most Recent Data? #41

Closed mdancho84 closed 5 months ago

mdancho84 commented 6 months ago

First off, great package. I've been tinkering with CV for implementation in my pytimetk package, which aims to make it easier to do time series operations in Python.

I'm considering integrating this package. One issue is that my preference is to have cross validation start with the most recent data (meaning the first split should be the most recent data). My rationale is that the most recent time series data has the most information.

This may be a matter of preference, but I think it's a mistake to start with the oldest data and expect the CV results to mean much.

So is it possible to have an argument to begin CV splits with the most recent data (basically inverting the default) for both sliding window and expanding (cumulative window)?

Thanks! -Matt

mdancho84 commented 6 months ago

This is what I've been exploring over at pytimetk: https://github.com/business-science/pytimetk/blob/master/src/pytimetk/crossvalidation/time_series_cv.py

FBruzzesi commented 6 months ago

Hey Matt, first and foremost, thanks for your interest.

Just to make sure I am getting the request correctly:

Reversing the order should be a fairly quick adjustment to the iterator, while an expanding backward option may be more challenging.

For now the lazy way of doing it is to return the splits and reverse them manually.
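A minimal sketch of that manual reversal, assuming the splits come from a generator of (train, test) pairs. The forward_splits function below is a hypothetical stand-in written for illustration, not the timebasedcv API:

```python
from typing import Iterator, List, Tuple

Split = Tuple[range, range]


def forward_splits(
    n_samples: int, train_size: int, test_size: int, step: int
) -> Iterator[Split]:
    """Hypothetical splitter: yields (train, test) index ranges, oldest first."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += step


# Materialize the generator, then reverse it so the most recent split is first.
splits: List[Split] = list(forward_splits(n_samples=12, train_size=4, test_size=2, step=2))
recent_first = splits[::-1]
```

The splits themselves are unchanged; only the iteration order differs. Materializing the generator costs memory proportional to the number of splits, which is usually small.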

mdancho84 commented 6 months ago

Yep I think you have it. The top need is the reversed order with the first split being the most recent.

FBruzzesi commented 6 months ago

So which one of the two is the expected/desired behaviour for expanding window in "reverse" order?

mdancho84 commented 6 months ago

Rolling window with most recent first.

FBruzzesi commented 6 months ago

Hey Matt, I am still not sure about what you are asking here.

If the goal is to have the same splits, just in a different order (see figures below), then I don't see why your validation score would change.

"My rationale is that the most recent time series data has the most information."

I can agree with this, but rather than changing the order of the splits, you could give the folds different importance, so that the most recent ones carry more weight in the final decision.
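One way to sketch this fold weighting, assuming the per-fold scores are already computed and ordered oldest to newest. The function name, the geometric decay scheme, and the score values are all illustrative assumptions, not something the package provides:

```python
from typing import Sequence


def recency_weighted_score(scores: Sequence[float], decay: float = 0.5) -> float:
    """Weighted mean of fold scores, ordered oldest to newest.

    The newest fold gets weight 1; each step back in time multiplies
    the weight by `decay` (hypothetical weighting scheme).
    """
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


fold_scores = [0.60, 0.70, 0.80]  # made-up values, oldest to newest
weighted = recency_weighted_score(fold_scores, decay=0.5)
plain_mean = sum(fold_scores) / len(fold_scores)
```

With these made-up scores the weighted value sits closer to the newest fold than the plain mean does. The order in which the folds were generated plays no role.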

The package gives the user enough flexibility to make this kind of decision afterwards.

On the other hand, if you want to have a fixed test set, and a moving training set, then we can have a (somewhat separate) discussion on that, and why I don't think it is a good idea to support it.

Figures

Current behaviour: [figure: fig1-mini]

Reverse order: [figure: fig2-mini]

mdancho84 commented 6 months ago

Thanks for your message @FBruzzesi. This is what I'm planning to accomplish inside of pytimetk:

In timetk (comparable R package), I have a function called time_series_cv() and then some plotting utilities to help visualize the Time Series Cross Validation Sets. https://business-science.github.io/timetk/reference/time_series_cv.html

Creating the CV Sets:

When the resampling is performed, the first set is always the most recent data. Here it's 24 months of data, though the user could equally have specified 24 periods numerically, since it's a monthly dataset. The initial argument is the window of training data, and skip is how many periods the gap should be.

Your package essentially does the same thing but in reverse. That's consistent with how Rob Hyndman does it, but in my experience it isn't the best way to do time series cross validation. Newer information is typically more relevant, and what people usually do is take just the top N resamples, where N is 5 or so. This way, if they select slice_limit = 3, they get the 3 most recent splits.

resample_spec <- time_series_cv(data        = m750,
                                initial     = "6 years",
                                assess      = "24 months",
                                skip        = "24 months",
                                cumulative  = FALSE,
                                slice_limit = 3)
#> Using date_var: date

resample_spec
#> # Time Series Cross Validation Plan 
#> # A tibble: 3 × 2
#>   splits          id    
#>   <list>          <chr> 
#> 1 <split [72/24]> Slice1
#> 2 <split [72/24]> Slice2
#> 3 <split [72/24]> Slice3

When visualized the sets produced look like this:

resample_spec %>%
    plot_time_series_cv_plan(date, value, .interactive = FALSE)

image

Like your package, it supports time series panels or groups so that all time series are split based on the sliding windows.

walmart_tscv <- walmart_sales_weekly %>%
    time_series_cv(
        date_var    = Date,
        initial     = "12 months",
        assess      = "3 months",
        skip        = "3 months",
        slice_limit = 4
    )

image

The only other thing I do is provide a "cumulative" argument, which extends each training window back to the first timestamp in the data.

# Cumulative TRUE
library(timetk)
library(tidyverse)

?time_series_cv

walmart_tscv <- walmart_sales_weekly %>%
    time_series_cv(
        date_var    = Date,
        initial     = "12 months",
        assess      = "3 months",
        skip        = "3 months",
        slice_limit = 4,
        cumulative  = TRUE
    )

walmart_tscv %>%
    plot_time_series_cv_plan(Date, Weekly_Sales, .interactive = FALSE)

walmart_tscv
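
For readers more at home in Python, here is a rough sketch of the backward scheme described above, working on integer positions rather than dates. It is neither the timetk nor the timebasedcv implementation: the backward_splits function is hypothetical, its parameter names merely mirror the R arguments, and it assumes skip is the step between consecutive slices.

```python
from typing import Iterator, Optional, Tuple


def backward_splits(
    n_samples: int,
    initial: int,
    assess: int,
    skip: int,
    cumulative: bool = False,
    slice_limit: Optional[int] = None,
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges, most recent slice first (illustrative)."""
    if min(initial, assess, skip) < 1:
        raise ValueError("initial, assess and skip must be positive")
    produced = 0
    test_end = n_samples
    # Walk backwards while a full training window still fits before the test window.
    while test_end - assess - initial >= 0:
        if slice_limit is not None and produced >= slice_limit:
            return
        test_start = test_end - assess
        # cumulative=True stretches the training window back to the first sample.
        train_start = 0 if cumulative else test_start - initial
        yield range(train_start, test_start), range(test_start, test_end)
        produced += 1
        test_end -= skip  # assumption: skip is the step between slices


# Made-up monthly series of 168 points: 72 training and 24 test periods per slice.
plan = list(backward_splits(168, initial=72, assess=24, skip=24, slice_limit=3))
cumulative_plan = list(
    backward_splits(168, initial=72, assess=24, skip=24, cumulative=True, slice_limit=3)
)
```

Because the generator starts at the end of the series, limiting the number of slices always keeps the most recent ones.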

FBruzzesi commented 6 months ago

Now I see what you mean. I need to double check how easy it is to flip the logic without having to maintain two different algorithms. I will take a closer look during the weekend.

Regarding taking the first N splits (in any direction), that's very easy, and it could be enough to write a helper function that wraps the CV with itertools.islice. I will add it as an issue.
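
Such a helper could look roughly like the following. The name take_splits is hypothetical, and the demo generator only stands in for whatever iterable of splits the CV object produces:

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def take_splits(splits: Iterable[T], n: int) -> Iterator[T]:
    """Lazily yield at most the first n items of any iterable of splits."""
    return islice(splits, n)


# Stand-in for a split generator; a real splitter would yield arrays or frames.
demo_splits = ((f"train_{i}", f"test_{i}") for i in range(10))
first_three = list(take_splits(demo_splits, 3))
```

Since islice is lazy, splits beyond the first n are never computed.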

mdancho84 commented 6 months ago

Ok sounds good. Happy to help in any way I can.

FBruzzesi commented 5 months ago

Hey @mdancho84, I just released v0.2.0 with the new feature. I added documentation for it in a dedicated paragraph. Let me know if you have any feedback on that.

mdancho84 commented 5 months ago

Excellent. I'll check it out this weekend.

Update: Docs look great. Will test out 0.2.0 for integration with pytimetk.

mdancho84 commented 1 day ago

I have finally got around to integrating timebasedcv into pytimetk. It works great!

I've created a TimeSeriesCV() and TimeSeriesCVSplitter() based on yours. The major differences are a couple of diagnostic tools and the default being mode = backwards.

It's really coming together: https://business-science.github.io/pytimetk/reference/TimeSeriesCVSplitter.html#pytimetk.TimeSeriesCVSplitter

image

FBruzzesi commented 22 hours ago

Hey @mdancho84 👋🏼 thanks a ton for the kind words! I am very happy to see that you are finding it useful 🙌🏼