Open. JustinKurland opened this issue 3 years ago.
Thanks for this @JustinKurland. I am working on a side project now that I believe will make it much easier for users to implement production model selection from modeltime using yardstick accuracy metrics. However, I think the novel piece here is the ranking procedure, which I had not planned to tackle; I had only planned model comparison based on a metric the user picks.
Automating accuracy-based model selection is challenging.
So I'll keep this open. I'd need to see some agreement on a procedure that users want.
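To make the "model comparison based on a metric the user picks" piece concrete, here is a minimal sketch of single-metric selection on a modeltime accuracy table. It assumes `acc_tbl` is the tibble returned by `modeltime_accuracy()` and `models_tbl` is the corresponding `modeltime_table()`; `rmse` is just the example metric.

```r
# Minimal sketch: pick a "best" model by a single user-chosen metric.
# Assumes acc_tbl comes from modeltime_accuracy() (columns .model_id,
# .model_desc, mae, mape, mase, smape, rmse, rsq) and models_tbl is the
# matching modeltime_table().
library(dplyr)

best_id <- acc_tbl %>%
  slice_min(rmse, n = 1) %>%   # swap rmse for mae, mape, mase, smape, ...
  pull(.model_id)

best_model_tbl <- models_tbl %>%
  filter(.model_id == best_id)
```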
I like the idea of the two approaches.
rDEA: https://github.com/jaak-s/rDEA

For the ensemble side, there is modeltime.ensemble. I cover these stacked approaches in the Time Series Course, Module 14: Ensembles.

My pleasure @mdancho84. Great to hear about yardstick, that will be a fantastic addition.
There may even be an interesting way that DEA can be leveraged to help optimise ensembling: perhaps a 'best in error' ensemble, but also a 'top DEA score' ensemble based on, say, the top 3 models. To begin, though, as you suggested, if there is demand among users for the ranking procedure (and I suspect there will be), it would really help modeltime stand apart, since this has not been tackled elsewhere in an analytical way and remains an issue we all have to deal with.
I just wanted to note that there are some other DEA packages worth considering, to provide more options. I like the first one, only because it just sounds cool to say you leverage yardstick to do benchmarking, but I really have no actual preference for any of the particular DEA packages. 😀
In the simplest terms, for those who are unfamiliar with DEA (a non-parametric, linear-programming-based method) and are pondering its potential value here, a high-level explanation: in the context of the time-series problem, each model (e.g., ARIMA, Prophet, STL, etc.) is treated as a single 'DMU' (decision-making unit), and each of the time-series error metrics (e.g., RMSE, MAE, MAPE, sMAPE, MASE) is treated as one of its 'outputs' (with a common 'input', e.g. a unit input, assigned to every model). So in the modeltime context we have multiple units, each with multiple outputs. The objective is to evaluate performance so that the best performing model across all error metrics can be determined, which is accomplished by calculating each unit's efficiency ratio (output / input).
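To make the efficiency idea concrete, here is a minimal sketch (not an existing modeltime feature) of a DEA-style score per model. Assumptions: each model (DMU) gets a single unit input, error metrics are inverted so that larger is better, and the standard CCR multiplier LP is solved with the lpSolve package; the accuracy numbers are made up.

```r
# Minimal sketch: DEA-style efficiency score per model from its error metrics.
# Each model is a DMU with a unit input; inverted errors are the outputs.
library(lpSolve)

# Hypothetical accuracy table, e.g. the shape returned by modeltime_accuracy()
acc <- data.frame(
  .model_desc = c("ARIMA", "PROPHET", "DHR", "STL", "ETS"),
  mae  = c(10.2, 11.5,  9.8, 12.1, 10.9),
  rmse = c(14.0, 12.9, 15.2, 13.3, 14.8),
  mape = c(0.081, 0.085, 0.078, 0.090, 0.083)
)

# Outputs: invert errors so that "more is better"
Y <- 1 / as.matrix(acc[, c("mae", "rmse", "mape")])

# With a unit input per DMU, the CCR multiplier model for DMU o reduces to:
#   maximize u' y_o   subject to   u' y_j <= 1 for all j,   u >= 0
dea_score <- function(o, Y) {
  n <- nrow(Y)
  sol <- lp(
    direction    = "max",
    objective.in = Y[o, ],
    const.mat    = Y,
    const.dir    = rep("<=", n),
    const.rhs    = rep(1, n)
  )
  sol$objval
}

acc$dea_efficiency <- vapply(seq_len(nrow(Y)), dea_score, numeric(1), Y = Y)
acc[order(-acc$dea_efficiency), ]   # efficiency of 1 = best across all metrics
```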
There are really two interrelated points worth considering adding to the modeltime ecosystem that I think would be valuable contributions to an already excellent package. Importantly, they may help improve the modelling-to-application pipeline, and I know this is something that has probably given all of us who generate time series forecasts pause on occasion: how to select the best performing model when your candidate models vary across different error metrics. For the sake of example, let's say you have five candidate models, (1) ARIMA, (2) Prophet, (3) DHR, (4) STL, and (5) ETS; you use the modeltime/tidyverse ecosystem and generate a modeltime_table(). The table is fantastic, but you end up with a challenge: the ARIMA has the lowest MAE, Prophet has the lowest RMSE, DHR has the lowest MAPE, STL has the lowest sMAPE, and ETS has the lowest MASE. Of course, this is an exaggerated case meant to outline the problem that modeltime might consider handling along two potential branches.
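For readers who want to reproduce the kind of table being described, here is a minimal sketch of a modeltime workflow that fits a few candidate models and returns the accuracy table where the conflict shows up. It uses the m4_monthly dataset from timetk and only three model specs for brevity; everything here is illustrative rather than a proposed feature.

```r
# Minimal sketch: fit a few candidate models and inspect the accuracy table.
library(tidymodels)
library(modeltime)
library(timetk)

data   <- m4_monthly %>% filter(id == "M750")
splits <- initial_time_split(data, prop = 0.9)

model_arima <- arima_reg() %>%
  set_engine("auto_arima") %>%
  fit(value ~ date, data = training(splits))

model_prophet <- prophet_reg() %>%
  set_engine("prophet") %>%
  fit(value ~ date, data = training(splits))

model_ets <- exp_smoothing() %>%
  set_engine("ets") %>%
  fit(value ~ date, data = training(splits))

models_tbl <- modeltime_table(model_arima, model_prophet, model_ets)

# One row per model with mae, mape, mase, smape, rmse, rsq columns;
# different models can "win" on different columns, which is the dilemma above.
models_tbl %>%
  modeltime_calibrate(new_data = testing(splits)) %>%
  modeltime_accuracy()
```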
I think it is safe to assume, among those who will be interested in contributing to modeltime, that there is no single error measure that is universally best for all accuracy-assessment objectives, and in truth different accuracy measures can yield different conclusions about our models. This leaves us with a difficult decision: which model to pick, and how to justify one over another when the different error metrics suggest differences in performance. Here are the two potential branches.
The first is to implement a ranking procedure for error that establishes the best overall performer when all measures are considered. There has been much discussion of this topic in the broader time-series forecasting literature, but very little consensus or progress (IMHO), with the notable exception of a Data Envelopment Analysis approach: https://www.sciencedirect.com/science/article/pii/S0040162516301482#bb0160. Along this vein, the associated DEA rank would help identify, in an analytic way, the most idealized model. This could be tacked on at the end of the modeltime_table().
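As a sketch of what "tacked on at the end" could feel like for a user, below is a hypothetical rank_by_dea() verb (not part of modeltime) that takes the tibble returned by modeltime_accuracy() and appends a DEA efficiency score and rank, using the same unit-input / inverted-error formulation as the earlier lpSolve sketch.

```r
# Hypothetical pipe verb (not a modeltime function): append a DEA efficiency
# score and rank to a modeltime_accuracy() tibble.
library(dplyr)
library(lpSolve)

rank_by_dea <- function(acc_tbl,
                        metrics = c("mae", "mape", "mase", "smape", "rmse")) {
  Y <- 1 / as.matrix(acc_tbl[, metrics])   # invert errors: larger = better
  score <- vapply(seq_len(nrow(Y)), function(o) {
    lp("max", Y[o, ], Y, rep("<=", nrow(Y)), rep(1, nrow(Y)))$objval
  }, numeric(1))
  acc_tbl %>%
    mutate(.dea_efficiency = score,
           .dea_rank       = dense_rank(desc(.dea_efficiency))) %>%
    arrange(.dea_rank)
}

# Usage, continuing the workflow sketch above:
# models_tbl %>%
#   modeltime_calibrate(new_data = testing(splits)) %>%
#   modeltime_accuracy() %>%
#   rank_by_dea()
```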
The second vein, as far as I am aware, is not something in the literature and is more an idea in my head that makes intuitive sense, though I may be off base. Consider the example above, and now consider making an ensemble model built around the top performing model under each respective error measure. Think of an h2o 'best of family' model for a classification problem, where the best RF, XGBoost, GBM, etc. by AUC are selected to make an ensemble, except here the ensemble considers the range of error measures. This is where it gets tricky, and perhaps some simulation work would need to be done, as it may be that a particular weighting regime is ideal for some forecasting problems and not others. For example, MAE might be more important than RMSE; I do not really know, but this might be something that could be 'tunable' (and perhaps even automated), where different weights could be given to the associated errors. Ideally this ensemble would be assembled and tuned in an automated way under the hood, again much in the way h2o works.
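Here is a rough sketch of the weighted-ensemble idea, assuming the modeltime.ensemble package and its ensemble_weighted() function (the loadings argument takes one weight per model). It continues from the models_tbl and splits objects built in the earlier workflow sketch; the weights are illustrative stand-ins for whatever tuned or DEA-derived weighting regime ends up being used.

```r
# Rough sketch of a "best of family"-style weighted ensemble. Continues from
# the models_tbl / splits objects created in the earlier workflow sketch.
library(modeltime.ensemble)

# Illustrative weights, e.g. scaled DEA efficiency scores or inverse errors;
# in practice these could be tuned or automated.
wts <- c(3, 2, 1)

ensemble_fit <- models_tbl %>%
  ensemble_weighted(loadings = wts, scale_loadings = TRUE)

# Evaluate the ensemble like any other model
modeltime_table(ensemble_fit) %>%
  modeltime_calibrate(new_data = testing(splits)) %>%
  modeltime_accuracy()
```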
Apologies for the length of this, but I wanted you all to have a complete picture. Both of these would, I think, be incredibly useful and would take some of the guesswork out of model selection in situations that I know we have probably all faced!