Closed djrscally closed 5 years ago
That's a good question. There isn't anything built into the package to look at generalizability of hyperparameters, but I think there are some things you could try that wouldn't require too much effort.
It sounds like there is significant value in allowing different time series to have different hyperparameters, as opposed to a fixed set for all of them. In that case, there is an intermediate option between one set of hyperparameters for all (bad performance) and a full sweep for each (too slow): come up with a small set of hyperparameter values that cover most of your use cases pretty well, and then just do the sweep over that smaller set in the future. Here's what that might look like:

1. Do the full hyperparameter sweep once, for each time series (or a representative sample of them).
2. From those results, pick a small set of hyperparameter values that, between them, perform well on most of the series.
3. In future retrains, sweep each series over only that small set.
Step 2 there is a combinatorial problem but I'd expect it wouldn't be too challenging to come up with a heuristic that works pretty well at selecting a diverse set of hyperparameter values that covers most of the time series pretty well.
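One such heuristic is a greedy set cover; a minimal sketch, assuming the full sweep has already produced a `scores` mapping (all names here are illustrative, and `scores[s][c]` is assumed to hold the CV error of hyperparameter config `c` on series `s`):

```python
# Greedy heuristic for step 2: pick a small set of hyperparameter configs
# that "covers" most series, where a config covers a series if its CV error
# is within `tol` of that series' best achievable error.

def pick_covering_configs(scores, tol=0.05, max_configs=5):
    """scores: {series_id: {config_id: cv_error}}; assumes every series
    was scored against the same set of configs."""
    best = {s: min(per_cfg.values()) for s, per_cfg in scores.items()}
    covers = {
        c: {s for s, per_cfg in scores.items() if per_cfg[c] <= best[s] + tol}
        for c in next(iter(scores.values()))
    }
    chosen, uncovered = [], set(scores)
    while uncovered and len(chosen) < max_configs:
        # Pick the config that covers the most still-uncovered series.
        c = max(covers, key=lambda c: len(covers[c] & uncovered))
        if not covers[c] & uncovered:
            break  # no remaining config helps; stop early
        chosen.append(c)
        uncovered -= covers[c]
    return chosen
```

Greedy set cover is not optimal in general, but it is simple and tends to get close, which is all that's needed here.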
Does that seem reasonable?
@bletham thanks for the suggestion; it's certainly a reasonable one. I am trying an alternative first - training, cross-validating, and then re-tuning only if the CV score is outside an acceptable range - as I'm hoping that the re-tuning won't prove necessary every time we retrain. If that doesn't result in an acceptable runtime I'll go with your idea.
Thanks again for your help.
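That retrain-and-check loop could be sketched generically as below; `fit_and_cv` and `tune` are hypothetical placeholders standing in for, say, a Prophet fit plus its cross-validation diagnostics and a full hyperparameter sweep:

```python
def retrain(series, params, fit_and_cv, tune, max_cv_error=0.10):
    """Refit with the last known-good params; only fall back to a full
    hyperparameter sweep when the CV error drifts out of range."""
    model, cv_error = fit_and_cv(series, params)
    if cv_error > max_cv_error:
        # CV score unacceptable: re-tune, then refit with the new params.
        params = tune(series)
        model, cv_error = fit_and_cv(series, params)
    return model, params, cv_error
```

If the series drift slowly, most retrains take the cheap first branch and the expensive sweep only runs when it is actually needed.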
I have a similar problem. Have ~5000 time series with me.
I tried clustering my 5000 time series using the Dynamic Time Warping (DTW) algorithm. I took a sample series from each of the clusters and deduced the best combination of hyperparameters based on SMAPE over the last 12 months, then applied the respective combination to each of the cluster members.
While it's true that I am not using the "best" combination that tuning each member separately would give, the error rates are within my acceptable limits (5%-10%) and this speeds things up a lot! Hope that made sense.
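A minimal sketch of that clustering step, assuming plain 1-D series: a naive O(n·m) DTW plus SciPy's hierarchical clustering (dedicated libraries such as tslearn have much faster DTW implementations):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dtw(a, b):
    """Plain dynamic-time-warping distance between two 1-D series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m]

def cluster_series(series_list, n_clusters):
    """Hierarchical clustering on the pairwise DTW distance matrix."""
    k = len(series_list)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            dist[i, j] = dist[j, i] = dtw(series_list[i], series_list[j])
    Z = linkage(squareform(dist), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

Tuning then happens once per cluster (on one representative series) rather than once per series, which is where the speed-up comes from.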
Would love to hear your thoughts on this approach @bletham, @djrscally !!
@chinmaytuw 5000 good grief. Clustering them to find similar ts plots for tuning is a really good idea; I'll try that today and report back how the performance seems. Ninja Edit: I'm using the clustering methodology found here, takes a bit of time itself!
5-10% seems really good; my error rates by that metric are much higher, but I'm pretty sure it's just because of the amount of noise (which is 0 mean and Gaussian, so the errors actually cancel out well over the whole of the predicted period.). What process are you using to deduce the best combination of hyperparameters?
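For reference, the SMAPE-based selection described earlier in the thread might look like the following sketch; the names are illustrative, and `forecasts_by_config` is assumed to map each candidate hyperparameter combination to its forecast over the holdout window (e.g. the last 12 months):

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric MAPE as a percentage: mean of 2|F-A| / (|A|+|F|)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(2.0 * np.abs(forecast - actual)
                           / (np.abs(actual) + np.abs(forecast)))

def best_config(holdout, forecasts_by_config):
    """Pick the config whose forecast minimises SMAPE on the holdout."""
    return min(forecasts_by_config,
               key=lambda c: smape(holdout, forecasts_by_config[c]))
```

Note that this definition is undefined when actual and forecast are both zero at the same point; zero-heavy series need special handling.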
This is a pretty awesome resource. DTW sure is computationally intense. I used this resource, which covers various algorithms. The only catch - it's in R.
@chinmaytuw Yeah, R is beyond me at present :p
Anyway, I clustered the data using DTW and then tuned on a randomly chosen series per cluster. This hasn't yet got the error rates down to what I'd call acceptable.
Today I'm trying a similar line of thought, which is to build a companion dataset with features describing the things that Prophet uses: overall trend, slope delta at each of the 25 changepoints, min, max and mean of the seasonal components, and so on. I shall report back once it's done.
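A rough sketch of such a feature vector, using a plain linear-trend fit and simple detrending as generic stand-ins (the changepoint slope deltas and exact seasonal components mentioned above would instead be read off the fitted Prophet model, which isn't shown here):

```python
import numpy as np

def series_features(y, season_length=12):
    """Crude per-series feature vector for clustering: overall trend slope
    plus min/max/mean of a simple seasonal profile after detrending."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t, y, 1)       # overall linear trend
    detrended = y - (slope * t + intercept)
    # Average detrended values position-by-position within each season.
    n_full = len(y) // season_length
    seasonal = (detrended[:n_full * season_length]
                .reshape(n_full, season_length)
                .mean(axis=0))
    return np.array([slope, seasonal.min(), seasonal.max(), seasonal.mean()])
```

Feature vectors like this can then be clustered with ordinary Euclidean k-means or hierarchical methods, sidestepping the DTW cost entirely.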
@chinmaytuw What clustering method did you use eventually? PAM? Or Hierarchical? This unfortunately isn't really helping me so far, although it really feels like it should!
@djrscally - I am using hierarchical clustering. PAM is similar to K-means, and this explains why not to choose K-means for time series clustering.
My workflow: cluster hierarchically > assign a cluster to each member by cutting the tree > evaluate the summary statistics for each cut > pick the best combination.
You can evaluate your cluster size by a variety of metrics. I am using the Silhouette score primarily.
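That cut-and-score loop could be sketched as follows, assuming a precomputed pairwise distance matrix (e.g. from DTW) is already in hand:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score

def best_cut(dist, k_range=range(2, 8)):
    """Cut the dendrogram at each candidate cluster count and keep the
    cut with the highest silhouette score on the distance matrix."""
    Z = linkage(squareform(dist), method="average")
    scored = {}
    for k in k_range:
        labels = fcluster(Z, t=k, criterion="maxclust")
        if len(set(labels)) > 1:      # silhouette needs >= 2 clusters
            scored[k] = silhouette_score(dist, labels, metric="precomputed")
    k = max(scored, key=scored.get)
    return k, fcluster(Z, t=k, criterion="maxclust")
```

Silhouette is one of several reasonable criteria here; gap statistic or Davies-Bouldin would slot into the same loop.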
Sorry to rip open an old thread, but @djrscally @chinmaytuw can you tell me more info about what features you used for each time series for clustering? Did you construct seasonal data using something like STL decomposition, or did you fit a model once with some estimated seasonality first?
Hello
Awesome software first of all; kudos and thanks!
I am tasked with generating forecasts for revenue each quarter by sales rep and I'd like to use prophet for that. As far as I can see there's no alternative to training a separate model per sales rep to do that, which is not a significant problem. What is a significant problem is that the profile of the time-series data for the reps seems to be sufficiently different that the values of tuned hyperparameters for one rep may produce poor results for others. For example when tuned for one rep on two years of data, the model predicts very accurately (within 1% of the true figure for the whole of Q1, though the individual dates had larger errors). The same hyperparameter values result in an error of over 60% for another rep.
Although I could build some automated tuning, the nature of my problem is going to require me to re-train the model regularly and I suspect that this will mean it takes a very long time, which I'd rather avoid. Is there a way that I can optimise the parameter values for "generalisability" easily?
Cheers,
Dan