Feature request regarding time series with different lengths

KishManani commented 6 months ago

Currently skforecast only allows the following configuration for time series of different lengths:

Image taken from here.

It is possible to create the feature vectors for many of the observations in rows 2, 3, and 4 of the above table and pass them to a model for training.

Time series like rows 2, 3, and 4 are very common in online retail forecasting. For example, products may be online for certain periods and offline during other periods, potentially to go back online again later. During those offline periods the target variable is missing. It is still possible to use the periods where the product was online to train a model.
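
To illustrate the idea, here is a minimal sketch in plain Python of building training rows from a series with an offline period. It does not use skforecast. The function name `build_training_rows`, the two-lag window, and the sales numbers are all made up for illustration. It keeps only the windows where every lag and the target are observed.

```python
import math


def build_training_rows(series, n_lags):
    """Return (X, y), keeping only windows with no missing values.

    Hypothetical helper for illustration, not part of skforecast.
    """
    X, y = [], []
    for t in range(n_lags, len(series)):
        window = series[t - n_lags:t]  # lag values, oldest first
        target = series[t]
        if math.isnan(target) or any(math.isnan(v) for v in window):
            continue  # this window touches an offline period, skip it
        X.append(window)
        y.append(target)
    return X, y


nan = float("nan")
# Made-up product: online, offline for two steps, then online again
sales = [5.0, 6.0, 7.0, nan, nan, 9.0, 10.0, 11.0, 12.0]

X, y = build_training_rows(sales, n_lags=2)
print(X)  # [[5.0, 6.0], [9.0, 10.0], [10.0, 11.0]]
print(y)  # [7.0, 11.0, 12.0]
```

Three of the seven candidate windows survive here, one from before the offline period and two from after it.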

It would be great if skforecast could handle time series of different lengths when they look like rows 2, 3, and 4 above.

Thanks, Kishan

JavierEscobarOrtiz commented 6 months ago

Hello @KishManani,

Thank you very much for your comments 😄.

You're right, in some scenarios, it would be valuable to train the model also with the periods in which a certain series has values, even if this series will not be available again.

The main problem we face in this situation is that you cannot predict this series if it ends before the others, because you do not have enough values (last window) to create the predictors.
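
A rough sketch of that check, assuming all series are aligned to a common index so that a series that ended early has trailing NaN values. The helper `has_valid_last_window` and the example data are hypothetical and not part of skforecast.

```python
import math


def has_valid_last_window(series, n_lags):
    """True if the last n_lags values exist and none of them is missing."""
    last_window = series[-n_lags:]
    return len(last_window) == n_lags and not any(
        math.isnan(v) for v in last_window
    )


nan = float("nan")
still_active = [1.0, 2.0, 3.0, 4.0, 5.0]
ended_early = [1.0, 2.0, 3.0, nan, nan]  # no values for the last two steps

print(has_valid_last_window(still_active, n_lags=3))  # True
print(has_valid_last_window(ended_early, n_lags=3))   # False
```

The series that ended early can still contribute training rows from its observed period, but it cannot be forecast from the end of the common index.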

We will include it somehow in the next release. 😄

In a single-series forecasting problem, do you think the following type of series makes sense? I will first try to impute the values instead of removing them.

[0, 1, 2, 3, 4, NaN, 6, 7, 8, 9]

Best,

Javi

KishManani commented 6 months ago

Hi @JavierEscobarOrtiz!

> The main problem we face in this situation is that you cannot predict this series if it ends before the others, because you do not have enough values (last window) to create the predictors.

Yes, that is correct. In online retail forecasting this may not be a problem (depending on the goal), because you're interested in forecasting the current set of products that are online. The products that used to be online provide a lot of additional training data, which helps forecast new products that may have only been online for a short time.

> We will include it somehow in the next release. 😄

Great to hear! 😃

> In a single-series forecasting problem, do you think the following type of series makes sense? I will first try to impute the values instead of removing them.

I've not had this come up in my own experience; however, I think it could make sense. Consider a single time series with a large gap in the middle (comparable to the number of observations you have). You could impute the missing data, but this is likely to distort the time series given that the gap is large. An alternative to imputing the missing data and then training a model is to train the model only where you have valid training data. I think the skforecast documentation recommends a hack where you use sample weights to give zero weight to the imputed missing periods.

Best wishes, Kishan

JoaquinAmatRodrigo commented 6 months ago

Hi, I agree with both points.

For global multi-series models, we should allow learning from all available data, even if some series are shorter than others. This feature is expected in release 0.12.0.
For single-series models, I think the available strategy of imputing missing values but giving them a weight of 0, so that they do not influence the learning process, avoids the problems of having missing values in the training data set without the risk of imputed values distorting the model.
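
As a minimal sketch of the zero-weight idea, the example below takes the series from earlier in the thread, fills the NaN with a deliberately crude value (0) so the distortion is visible, and fits a one-lag linear model with and without weights. It uses plain Python rather than skforecast. The helper `weighted_line_fit` and the choice of imputation are assumptions made for illustration. In practice the weights would usually be passed to the regressor through its sample-weight mechanism.

```python
def weighted_line_fit(x, y, w):
    """Weighted least squares for y = slope * x + intercept (illustrative helper)."""
    total = sum(w)
    x_mean = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_mean = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - x_mean) * (yi - y_mean) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_mean) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# [0, 1, 2, 3, 4, NaN, 6, 7, 8, 9] with the NaN at position 5 filled with 0
series = [0.0, 1.0, 2.0, 3.0, 4.0, 0.0, 6.0, 7.0, 8.0, 9.0]
imputed = {5}

x = series[:-1]  # lag-1 predictor
y = series[1:]   # target
# Row i uses position i as the lag and position i + 1 as the target;
# it gets weight 0 if either of them was imputed.
weights = [0.0 if (i in imputed or i + 1 in imputed) else 1.0 for i in range(len(x))]

slope, intercept = weighted_line_fit(x, y, weights)
naive_slope, naive_intercept = weighted_line_fit(x, y, [1.0] * len(x))

print(weights)                   # [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0]
print(slope, intercept)          # close to 1 and 1: the true relation y = x + 1
print(naive_slope, naive_intercept)  # pulled away from 1 and 1 by the imputed value
```

Two rows get weight 0 because the imputed value appears once as a target and once as a lag.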

KishManani commented 5 months ago

Hi Joaquin,

> For single-series models, I think the available strategy of imputing missing values but giving them a weight of 0, so that they do not influence the learning process, avoids the problems of having missing values in the training data set without the risk of imputed values distorting the model.

Just wanted to highlight that this means creating a training matrix that is larger than required, because some rows will not contribute to training at all. A larger training matrix could slow down training, which in turn could slow down backtesting and hyperparameter tuning. An easy workaround would be to filter out the rows with zero weight from the training matrix in the skforecast backend.
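
The filtering step could look roughly like the sketch below. It is not skforecast's backend code, and `drop_zero_weight_rows` and the small matrix are hypothetical. For a loss that is a weighted sum over rows, zero-weight rows contribute nothing, so dropping them should leave the fit unchanged. Models that resample rows or count samples per leaf may behave slightly differently.

```python
def drop_zero_weight_rows(X, y, weights):
    """Keep only the rows whose sample weight is strictly positive."""
    keep = [i for i, w in enumerate(weights) if w > 0]
    return (
        [X[i] for i in keep],
        [y[i] for i in keep],
        [weights[i] for i in keep],
    )


# Made-up one-lag training matrix; two rows were given weight 0
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [0.0], [6.0], [7.0], [8.0]]
y = [1.0, 2.0, 3.0, 4.0, 0.0, 6.0, 7.0, 8.0, 9.0]
weights = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0]

X_small, y_small, w_small = drop_zero_weight_rows(X, y, weights)
print(len(X), "->", len(X_small))  # 9 -> 7
```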

Best wishes, Kishan

JoaquinAmatRodrigo commented 5 months ago

Good point! Training is by far the most time-consuming step, and therefore the main target for optimization. Removing 0-weighted rows can be done in just a few lines. We will implement this for the next release. Thanks for sharing your ideas!

JavierEscobarOrtiz commented 2 months ago

Hello @KishManani,

The functionality to include series with different lengths in ForecasterMultiSeries is now available in skforecast 0.12.0:

https://skforecast.org/latest/user_guides/multi-series-with-different-length-and-different_exog

Hope it helps!

JoaquinAmatRodrigo / skforecast

Feature request regarding time series with different lengths #621