How to handle data distribution shifts

luistelmocosta commented 3 years ago

Hello, I am using fbprophet to forecast the sales of a store. The structure of the data is hierarchical, and I am predicting at the highest level.

Example: The store can sell Books, Movies, Toys. Books can be Action or Adventure, and the same goes for Movies and Toys. Below that level you have the actual product.

I am forecasting the revenue at the Books, Movies, Toys level. I would like to know how to handle a scenario where a new book is added to the store and suddenly starts generating a high volume of sales, thus increasing the revenue for that unit.

Any research on this? Is there any common approach?

Thank you!

tcuongd commented 3 years ago

Hey there! That's a very interesting question - could I just confirm whether you:

1) Already have knowledge of the new books that are going to be added to the store that may have a big impact, or
2) Mean that this could happen randomly, and want to account for it in the forecast?

There's no standardized way to account for 1) (at least within Prophet), so that might require manual adjustment to the forecast.

However, if your question is more about 2), I think it depends on how much historical data you have in which spikes have occurred, e.g. whether a popular Toy was introduced in the past and caused a spike in sales. If this has occurred, it might be good to break down the sales volumes by category, then forecast the individual categories. Prophet assigns trend changepoints (see here) to past data, and incorporates this into the trend uncertainty. This is represented in the yhat_lower and yhat_upper values - you can think of yhat_upper as the best case scenario, where most of the new items introduced cause spikes in sales. Remember that the more historical data Prophet is given around past "spikes", the better it will be able to learn how often they occur and how big they might be.
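A minimal sketch of the per-category approach described above, assuming the raw data is available as (month, category, revenue) rows with ISO date strings. The helper names `split_by_category` and `forecast_category` are hypothetical and only for illustration; the Prophet calls (`interval_width`, `fit`, `make_future_dataframe`, `predict`) are the library's standard ones.

```python
from collections import defaultdict


def split_by_category(rows):
    """Group (month, category, revenue) rows into one monthly series per category."""
    totals = defaultdict(lambda: defaultdict(float))
    for month, category, revenue in rows:
        totals[category][month] += revenue
    # ISO date strings ("2021-04-01") sort chronologically.
    return {cat: sorted(months.items()) for cat, months in totals.items()}


def forecast_category(series, periods=12, interval_width=0.8):
    """Fit Prophet to one category and return yhat with its uncertainty bounds."""
    import pandas as pd
    try:
        from prophet import Prophet      # package name from version 1.0 on
    except ImportError:
        from fbprophet import Prophet    # older package name

    df = pd.DataFrame(series, columns=["ds", "y"])
    df["ds"] = pd.to_datetime(df["ds"])
    model = Prophet(interval_width=interval_width)
    model.fit(df)
    # "MS" = month start, matching monthly data points.
    future = model.make_future_dataframe(periods=periods, freq="MS")
    forecast = model.predict(future)
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]
```

Calling `forecast_category(split_by_category(rows)["Books"])` would then give one forecast per category, with `yhat_lower` and `yhat_upper` as the bounds of the uncertainty interval.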

If you're in a situation where there's not much historical data on spikes, one trick you could employ is to create synthetic (fake) data based on what kind of popular items could be introduced and how much they might improve sales. Keep in mind that this requires domain knowledge and your results will be heavily influenced by the assumptions made.
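As a rough illustration of the synthetic-data idea, the sketch below applies an assumed uplift to a revenue series from a chosen month onward, fading geometrically each month. The function name `add_synthetic_spike` and the `uplift` and `decay` values are made up for the example; in practice they would have to come from domain knowledge.

```python
def add_synthetic_spike(series, start, uplift=0.5, decay=0.7):
    """Return a copy of `series` with a decaying uplift applied from `start` on.

    series: list of (iso_month, revenue) tuples sorted by month.
    uplift: assumed relative increase in the first affected month (0.5 = +50%).
    decay:  assumed factor by which the uplift shrinks each following month.
    """
    out = []
    k = 0  # number of months since the synthetic launch
    for month, revenue in series:
        if month >= start:  # ISO date strings compare chronologically
            revenue = revenue * (1.0 + uplift * decay ** k)
            k += 1
        out.append((month, revenue))
    return out
```

The modified series can be fitted in the same way as the real one; comparing the two forecasts shows how sensitive the result is to the assumed spike.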

luistelmocosta commented 3 years ago

It is more aligned to 1. I only know that many new books have been added since April 2021, but I cannot quantify the number of new books, due to logistical limitations.

Some details on the issue: the time series has monthly data points starting from January 2018. The model works very well until March 2021, but new books have been added since April 2021. As a result, the revenue has shifted since April 2021 and the forecast error is higher.
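One possible way to encode a shift with a known start date, sketched here as an assumption rather than an established recipe, is a binary step regressor that is 0 before April 2021 and 1 from then on. `add_regressor` is part of Prophet's API; the helper names and the `new_books` column are hypothetical. Note that by default Prophet only places potential changepoints in the first 80% of the history (`changepoint_range=0.8`), so a shift near the end of the series may not be picked up by the trend alone.

```python
def step_indicator(months, shift_start="2021-04-01"):
    """Return 1.0 for months on or after `shift_start`, else 0.0 (ISO date strings)."""
    return [1.0 if m >= shift_start else 0.0 for m in months]


def fit_with_step_regressor(df, shift_start="2021-04-01", periods=12):
    """Fit Prophet with a step regressor marking the period after the shift.

    df: pandas DataFrame with a datetime 'ds' column and a numeric 'y' column.
    """
    try:
        from prophet import Prophet      # package name from version 1.0 on
    except ImportError:
        from fbprophet import Prophet    # older package name

    df = df.copy()
    df["new_books"] = step_indicator(df["ds"].dt.strftime("%Y-%m-%d"), shift_start)

    model = Prophet()
    model.add_regressor("new_books")
    model.fit(df)

    # The regressor must also be supplied for the future dates.
    future = model.make_future_dataframe(periods=periods, freq="MS")
    future["new_books"] = step_indicator(
        future["ds"].dt.strftime("%Y-%m-%d"), shift_start
    )
    return model.predict(future)
```

This assumes the higher revenue level persists over the forecast horizon, and with only a few months of data after the shift the estimated effect will be uncertain.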

facebook / prophet

How to handle data distribution shifts #1997