multiple time series (ex: locations, skus,stores) | prophet (many models) vs regression (single model)

rquintino commented 4 years ago

hi everyone!

I've been quite a fan of Prophet for a long time. This is more of a brainstorming question than an issue, really. Sorry if it ends up being a naive/dumb question (I'm not so strong on internals/stats...).

For datasets with multiple time series (hundreds or more, where we need both aggregate and drill-down values), we're currently running a hyperparameter search for each one. This can be costly in both time and compute, which got me to go back and revisit the benefits of Prophet versus a typical regression-based model, which could easily fit the whole dataset in a single run (faster).

So the question on my mind is: why should I use Prophet for multiple time series, with all the extra compute load, and not just a regular xgboost model plus manual features on a single dataset?

- The advantages of Prophet seem to be very low-effort feature prep/code (significant for me) and better trend support.

Is this right? Wrong? Are there others? Could one-hot regressors work?

Any other options for multiple time series?

thanks!!

PS: some notes on the many-models pattern (one model per time series):

(a little bit overkill with AutoML probably, but overall the same pattern I'm testing with Prophet now) https://github.com/microsoft/solution-accelerator-many-models

(doesn't do grid search per time series, but it's possible) https://pages.databricks.com/rs/094-YMS-629/images/Fine-Grained-Time-Series-Forecasting.html

bletham commented 4 years ago

Multivariate forecasting of a large number of time series is definitely a problem that Prophet wasn't really designed for and for which there are other options, including ML approaches with appropriate feature engineering. I don't have much personal experience with large-scale multivariate forecasting so I won't be able to comment much on those alternatives, but I did want to say that with a very large number of time series I think it may be possible to at least make the hyperparameter tuning faster.

Here is what I would suggest for hyperparameter tuning:

1. Do a big sweep over the same set of hyperparameter configurations for all of the time series. For each time series, keep track of the performance of each configuration.
2. From that big set, I suspect you will be able to identify a small set of configurations (maybe 5?) such that for every time series, one of those 5 is pretty good. Maybe not the best, but probably not too far from it. I expect this to be the case because in a large collection of related time series, there will probably be clusters of time series that are similar and that have similar HPO optima (the exact optimum may not be the same, but there will be a point that is close to optimal for all of them). You'd have to come up with a strategy for finding this covering set of configurations, but it could be something like: find the configuration that is within 10% of optimal for the largest number of time series; remove those time series as covered; then find the next configuration that is within 10% of optimal for the most remaining series; and so on until nearly every series is covered.
3. In the future, when you have either a new time series or an updated time series and need to do HPO, you don't do a big sweep; you just sweep over that small set.

That way you would only need to do the big sweep once, and the quality of the sweep should not be degraded too much (to the extent that there is a small set of configurations that provides a good configuration for the bulk of the time series).
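
The covering-set strategy described above is essentially a greedy set cover. Here is a minimal sketch, assuming the big sweep has already produced a matrix of non-negative errors with one row per time series and one column per configuration. The function name `greedy_cover` and the `target_coverage` stopping threshold are made up for illustration; only the 10% tolerance comes from the suggestion itself.

```python
import numpy as np

def greedy_cover(errors, tolerance=0.10, target_coverage=0.95):
    """Pick a small set of configurations that is near-optimal for most series.

    errors: array-like of shape (n_series, n_configs); lower is better, values >= 0.
    Returns the chosen config indices, in the order they were picked.
    """
    errors = np.asarray(errors, dtype=float)
    best = errors.min(axis=1, keepdims=True)
    # good[i, j] is True when config j is within `tolerance` of series i's best error.
    good = errors <= best * (1.0 + tolerance)
    uncovered = np.ones(errors.shape[0], dtype=bool)
    chosen = []
    while uncovered.mean() > 1.0 - target_coverage:
        # Count how many still-uncovered series each config would cover.
        gains = good[uncovered].sum(axis=0)
        j = int(gains.argmax())
        chosen.append(j)
        # Series for which config j is good enough are now covered.
        uncovered &= ~good[:, j]
    return chosen

# Toy sweep results: 4 series x 3 configurations (made-up numbers).
errors = [
    [1.00, 1.05, 2.00],
    [1.00, 1.08, 3.00],
    [5.00, 2.00, 2.10],
    [4.00, 1.00, 3.00],
]
small_set = greedy_cover(errors)
```

In this toy example the second configuration is within 10% of the best for all four series, so the covering set contains just that one index. Every series is always covered by its own best configuration, so the loop terminates.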

rquintino commented 4 years ago

Many thanks @bletham, awesome insights! We're already doing the first step as you described, also guided by the HPO hints on the diagnostics page (sample grid below, though we rarely use it in full). Very interesting tip on the small set of configurations, makes sense. Adding to the backlog!

Another thing I was about to try next is to assume the parameters are independent and tune them in order of typical importance (based on the diagnostics page hints, for example) to reduce the grid size. E.g., first tune changepoint_prior_scale for all locations. Then pick the best value per location and tune the second most important parameter for each location, and so on. This should reduce the search a bit, hopefully while still being competitive.
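
A rough sketch of that one-parameter-at-a-time idea, to be run once per location. The `evaluate` callable is a placeholder for whatever cross-validation error you compute for a given parameter dict; here it is a toy function, and the candidate values are only illustrative.

```python
def tune_sequentially(evaluate, grid, defaults):
    """Tune one parameter at a time, in the order the keys appear in `grid`.

    evaluate: callable taking a dict of parameters, returning an error (lower is better).
    grid: dict of parameter name -> list of candidate values, most important first.
    defaults: dict with a starting value for every parameter in `grid`.
    Returns (best_params, number_of_evaluations).
    """
    params = dict(defaults)
    n_evals = 0
    for name, candidates in grid.items():
        scores = []
        for value in candidates:
            trial = {**params, name: value}
            scores.append((evaluate(trial), value))
            n_evals += 1
        # Freeze the best value for this parameter before moving to the next one.
        params[name] = min(scores, key=lambda s: s[0])[1]
    return params, n_evals

# Illustrative candidate values (assumptions, ordered by presumed importance).
grid = {
    "changepoint_prior_scale": [0.001, 0.01, 0.1, 0.5],
    "seasonality_prior_scale": [0.01, 0.1, 1.0, 10.0],
}
defaults = {"changepoint_prior_scale": 0.05, "seasonality_prior_scale": 10.0}

# Toy stand-in for a real cross-validation error.
def toy_error(p):
    return abs(p["changepoint_prior_scale"] - 0.1) + abs(p["seasonality_prior_scale"] - 1.0)

best_params, n_evals = tune_sequentially(toy_error, grid, defaults)
```

With two parameters of four candidates each, this costs 4 + 4 = 8 evaluations per location instead of the 4 x 4 = 16 of the full grid. The saving grows with every extra parameter, at the price of missing interactions between parameters.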

Another option would be to cluster the time series and grid search per cluster, while still fitting each series individually afterwards.

Also, if the overall aggregate error matters, and very often it does (ex: the sum of trips across all locations), an additional consideration is that the time series will have different weights, and the compute spent on HPO should probably take that into account (more weight = more tries).
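
One simple way to turn "more weight = more tries" into numbers is to split a fixed trial budget in proportion to each series' weight, with a floor so small series still get tuned. This is only a sketch: the function name, the `min_trials` floor, and the example weights are all made up for illustration.

```python
import math

def allocate_trials(weights, total_trials, min_trials=1):
    """Split an HPO trial budget across series in proportion to their weights.

    weights: dict of series name -> non-negative weight (e.g. share of total trips).
    Every series gets at least `min_trials`; the rest is split proportionally.
    """
    names = list(weights)
    if total_trials < min_trials * len(names):
        raise ValueError("budget too small to give every series min_trials")
    spare = total_trials - min_trials * len(names)
    total_weight = sum(weights.values())
    shares = {n: spare * weights[n] / total_weight for n in names}
    trials = {n: min_trials + math.floor(shares[n]) for n in names}
    # Hand out the trials lost to rounding, largest fractional part first.
    leftover = total_trials - sum(trials.values())
    by_remainder = sorted(names, key=lambda n: shares[n] - math.floor(shares[n]), reverse=True)
    for n in by_remainder[:leftover]:
        trials[n] += 1
    return trials

# Hypothetical locations weighted by their total trips.
budget = allocate_trials({"A": 700, "B": 250, "C": 50}, total_trials=20, min_trials=2)
```

The allocation always sums to the full budget, and the heaviest location gets the most configurations to try.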

Regarding the more general topic of large numbers of time series, I'm starting to feel there is currently a gap: something that lets us avoid so much custom feature engineering (ex: trending/detrending keeps coming to mind) while keeping the great experience, analyst-in-the-loop workflow, and interpretability we have in Prophet. I wonder what would be needed for that, or whether something similar already exists.

BTW, for anyone interested in the topic, I'm also leaving a link to a similar question I created on the Microsoft many-models accelerator, which I'd bet is even more compute-intensive for these workloads (https://github.com/microsoft/solution-accelerator-many-models/issues/106).

PS: sample grid search; we typically run this on Databricks with a parallel map (so we opt not to use Prophet's CV parallel options).
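
As a rough illustration of this pattern (every series scored against every configuration, with the parallelism living in an outer map rather than in Prophet's cross-validation), here is a minimal sketch. The grid values, the 30-day horizon, and the helper names `expand_grid`, `score_config`, and `sweep` are all assumptions for illustration, not the grid or code used here.

```python
import itertools

# Illustrative grid only; the values are assumptions, not the grid used in this thread.
PARAM_GRID = {
    "changepoint_prior_scale": [0.001, 0.01, 0.1, 0.5],
    "seasonality_prior_scale": [0.01, 0.1, 1.0, 10.0],
}

def expand_grid(grid):
    """Return every combination of the grid as a list of keyword dicts."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in itertools.product(*(grid[k] for k in keys))]

def score_config(task):
    """Fit one Prophet model for one (series_id, df, params) task and return its CV RMSE."""
    # Imported lazily so the grid helpers work without Prophet installed.
    from prophet import Prophet
    from prophet.diagnostics import cross_validation, performance_metrics

    series_id, df, params = task
    model = Prophet(**params).fit(df)  # df needs 'ds' and 'y' columns
    # parallel=None: parallelism comes from the outer map, not from Prophet's CV.
    df_cv = cross_validation(model, horizon="30 days", parallel=None)
    rmse = performance_metrics(df_cv, rolling_window=1)["rmse"].values[0]
    return series_id, params, rmse

def sweep(series, grid, map_fn=map):
    """Score every (series, config) pair; pass a parallel map as map_fn to scale out."""
    tasks = [(sid, df, params) for sid, df in series.items() for params in expand_grid(grid)]
    return list(map_fn(score_config, tasks))

all_configs = expand_grid(PARAM_GRID)
```

`sweep` takes a dict of series id to DataFrame. Swapping `map_fn` for a cluster-backed or process-pool map distributes the (series, configuration) tasks without touching the scoring code.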

Again, thanks a lot for your superb work, precious time and useful thoughts! :)

bletham commented 4 years ago

Those other directions for HPO make sense to me!

facebook / prophet

multiple time series (ex: locations, skus,stores) | prophet (many models) vs regression (single model) | scalability #1687