facebook / prophet

Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.
https://facebook.github.io/prophet
MIT License
18.37k stars 4.52k forks

Which metric should I use to choose the best model? #1341

Closed nescobar closed 4 years ago

nescobar commented 4 years ago

I'm using Prophet with MLflow on Spark to run hundreds of models. In many cases I've found models with the lowest RMSE but not the lowest MAPE. There are also cases where the MAPE at a one-day horizon is the best but the MAPE at 30 days is not. Then there is the MAPE averaged across all horizons, where the best model could again differ from the others.

Here is an example from three runs: [image: metrics table from three runs]

The question is: are there any criteria for selecting which metric to use when choosing the best model?

bletham commented 4 years ago

I don't know that there's a "right" answer, and it probably depends on what the metrics are ultimately used for.

MAPE has an advantage in being interpretable (people have a general idea of what 5% error means). But it does have an issue that it doesn't work well for time series with values close to 0, where a very small difference can turn into a huge % error.

RMSE is nice because it doesn't have that problem (it is translation invariant, and it doesn't matter if values are close to 0). But it has downsides too; I think the interpretability is lower (if you tell someone the forecast has RMSE of 3.5, do they have a sense for what that means?). Also, unlike MAPE, it isn't scale invariant. So time series that take values in different ranges might have very different RMSEs just because of the differences in scale, which makes it more difficult to understand performance across time series.
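Both issues above can be seen with a few illustrative numbers (plain Python, values chosen by me, not from the thread):

```python
import math

def mape(actual, pred):
    # Mean absolute percentage error: each error is scaled by its actual value.
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, pred)) / len(actual)

def rmse(actual, pred):
    # Root mean squared error, in the original units of the series.
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

# MAPE blows up near zero: an absolute miss of 0.1 on an actual of 0.01
# is a 1000% error.
print(mape([0.01], [0.11]))  # 10.0, i.e. 1000%

# RMSE is not scale invariant: the same relative errors on a series
# 100x larger give an RMSE 100x larger, while MAPE is unchanged.
small, small_pred = [1.0, 2.0], [1.1, 2.2]
big, big_pred = [100.0, 200.0], [110.0, 220.0]
print(rmse(small, small_pred))  # ~0.158
print(rmse(big, big_pred))      # ~15.8
print(mape(small, small_pred), mape(big, big_pred))  # both ~0.10
```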

Every metric has some issue like that. There are many posts online comparing performance metrics and ultimately it will depend on the properties of your time series and how you want to use the metric.

nescobar commented 4 years ago

Thanks @bletham, I was aware of some of these differences and I like to use MAPE most of the time, if possible.

However, my confusion is about the difference in rankings across metrics. I expected the best model for a given time series to be the best across all metrics. For example, shouldn't the best model by RMSE also be the best by MAPE?

benmwhite commented 4 years ago

@nescobar MAPE scales each error by the actual value before averaging, RMSE averages squared errors, and MAE averages absolute errors in the original units. Since the scalings all happen before the averaging, we can come up with some squirrely examples.

Ex: Your actual values are 0.5 and 1.5. If you predict 1 for both then your error terms are

e_1 = |0.5 - 1| = 1/2 
e_2 = |1.5 - 1| = 1/2
e_1^2 = e_2^2 = 1/4

aggregated error metrics:

RMSE = sqrt(mean([1/4, 1/4])) = sqrt(1/4) = 1/2
MAPE = mean([0.5/0.5, 0.5/1.5]) = mean([1, 1/3]) = 2/3
MAE = mean([1/2, 1/2]) = 1/2

If you predict 0.75 for both:

e_1 = |0.5 - 0.75| = 1/4
e_2 = |1.5 - 0.75| = 3/4
e_1^2 = 1/16
e_2^2 = 9/16

aggregated error metrics:

RMSE = sqrt(mean([1/16, 9/16])) = sqrt(5/16) ≈ 0.559
MAPE = mean([0.25/0.5, 0.75/1.5]) = mean([1/2, 1/2]) = 1/2
MAE = mean([1/4, 3/4]) = 1/2

Using RMSE the first prediction has lower error, using MAPE the second prediction has lower error, and using MAE the errors are identical.
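The arithmetic above can be checked in a few lines of plain Python (the helper functions are mine, not part of Prophet):

```python
import math

def rmse(actual, pred):
    # Root mean squared error.
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

def mape(actual, pred):
    # Mean absolute percentage error.
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, pred)) / len(actual)

def mae(actual, pred):
    # Mean absolute error.
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

actual = [0.5, 1.5]

# Predicting 1 for both points vs. 0.75 for both points:
for pred in ([1.0, 1.0], [0.75, 0.75]):
    print(pred, rmse(actual, pred), mape(actual, pred), mae(actual, pred))

# pred [1.0, 1.0]:   RMSE = 0.5,    MAPE = 2/3,  MAE = 0.5
# pred [0.75, 0.75]: RMSE ~ 0.559,  MAPE = 0.5,  MAE = 0.5
```

The first prediction wins on RMSE, the second on MAPE, and MAE can't tell them apart, so the metrics genuinely disagree about which model is "best".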

EDIT: formatting, added MAE to the example