Hey @kaijennissen, you can refer to the code @verdimrc provided here. I think for the MAAPE this would be (please double-check the math):
```python
import json
from typing import Dict, Tuple, Union

import numpy as np
import pandas as pd

from gluonts.evaluation import Evaluator
from gluonts.model.forecast import Forecast
from gluonts.dataset.repository.datasets import get_dataset
from gluonts.evaluation.backtest import make_evaluation_predictions
from gluonts.model.simple_feedforward import SimpleFeedForwardEstimator
from gluonts.mx.trainer import Trainer


class MyEvaluator(Evaluator):
    @staticmethod
    def maape(target, forecast):
        # MAAPE = mean(arctan(|target - forecast| / |target|));
        # the `flag` term keeps the ratio defined (and equal to 0) where the target is 0
        denominator = np.abs(target)
        flag = denominator == 0
        maape = np.mean(
            np.arctan((np.abs(target - forecast) * (1 - flag)) / (denominator + flag))
        )
        return maape

    def get_metrics_per_ts(
        self, time_series: Union[pd.Series, pd.DataFrame], forecast: Forecast
    ) -> Dict[str, Union[float, str, None]]:
        metrics = super().get_metrics_per_ts(time_series, forecast)

        pred_target = np.array(self.extract_pred_target(time_series, forecast))
        pred_target = np.ma.masked_invalid(pred_target)
        median_fcst = forecast.quantile(0.5)

        metrics["MAAPE"] = self.maape(pred_target, median_fcst)
        return metrics

    def get_aggregate_metrics(
        self, metric_per_ts: pd.DataFrame
    ) -> Tuple[Dict[str, float], pd.DataFrame]:
        totals, metric_per_ts = super().get_aggregate_metrics(metric_per_ts)

        agg_funs = {
            "MAAPE": "mean",
        }
        assert (
            set(metric_per_ts.columns) >= agg_funs.keys()
        ), "Some of the requested item metrics are missing."

        my_totals = {
            key: metric_per_ts[key].agg(agg) for key, agg in agg_funs.items()
        }
        totals.update(my_totals)
        return totals, metric_per_ts


dataset = get_dataset("m4_hourly", regenerate=True)

estimator = SimpleFeedForwardEstimator(
    num_hidden_dimensions=[10],
    prediction_length=dataset.metadata.prediction_length,
    context_length=100,
    freq=dataset.metadata.freq,
    trainer=Trainer(
        ctx="cpu", epochs=5, learning_rate=1e-5, num_batches_per_epoch=100
    ),
)
predictor = estimator.train(dataset.train)

forecast_it, ts_it = make_evaluation_predictions(
    dataset=dataset.test,  # test dataset
    predictor=predictor,  # predictor
    num_samples=100,  # number of sample paths we want for evaluation
)
forecasts = list(forecast_it)
tss = list(ts_it)

my_evaluator = MyEvaluator(quantiles=[0.1, 0.5, 0.9])
agg_metrics, item_metrics = my_evaluator(
    iter(tss), iter(forecasts), num_series=len(dataset.test)
)
print(json.dumps(agg_metrics, indent=4))
```
```
{
    "MSE": 840010519.580835,
    "abs_error": 67564732.9116211,
    "abs_target_sum": 145558863.59960938,
    "abs_target_mean": 7324.822041043147,
    "seasonal_error": 336.9046924038302,
    "MASE": 34.24076941447534,
    "MAPE": 1.0149239948968949,
    "sMAPE": 0.75409841468702,
    "OWA": NaN,
    "MSIS": 400.1144964103775,
    "QuantileLoss[0.1]": 43845412.664644055,
    "Coverage[0.1]": 0.03205515297906603,
    "QuantileLoss[0.5]": 67564732.66281605,
    "Coverage[0.5]": 0.3377113526570052,
    "QuantileLoss[0.9]": 37684940.976142496,
    "Coverage[0.9]": 0.8675523349436396,
    "RMSE": 28982.93497182152,
    "NRMSE": 3.956810801603309,
    "ND": 0.4641746386360385,
    "wQuantileLoss[0.1]": 0.3012211800804532,
    "wQuantileLoss[0.5]": 0.46417463692672967,
    "wQuantileLoss[0.9]": 0.25889829065856784,
    "mean_absolute_QuantileLoss": 49698362.10120087,
    "mean_wQuantileLoss": 0.3414313692219169,
    "MAE_Coverage": 0.08756038647342974,
    "MAAPE": 0.4854620804344365
}
```
If you are using the `MultivariateEvaluator`, I think you can define a `class MyMultivariateEvaluator(MultivariateEvaluator)` with the same methods as above, but I have not tried that, so let me know in case it does not work.
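For reference, a rough and untested sketch of that subclass, reusing the code above (whether `MultivariateEvaluator` picks these overrides up end-to-end is exactly the part I have not verified):

```python
import numpy as np

from gluonts.evaluation import MultivariateEvaluator


class MyMultivariateEvaluator(MultivariateEvaluator):
    # reuse the static MAAPE helper defined on MyEvaluator above
    maape = staticmethod(MyEvaluator.maape)

    def get_metrics_per_ts(self, time_series, forecast):
        metrics = super().get_metrics_per_ts(time_series, forecast)
        pred_target = np.ma.masked_invalid(
            np.array(self.extract_pred_target(time_series, forecast))
        )
        metrics["MAAPE"] = self.maape(pred_target, forecast.quantile(0.5))
        return metrics

    def get_aggregate_metrics(self, metric_per_ts):
        totals, metric_per_ts = super().get_aggregate_metrics(metric_per_ts)
        totals["MAAPE"] = metric_per_ts["MAAPE"].mean()
        return totals, metric_per_ts
```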
Thanks. That looks promising.
I've looked into the code and came up with a generic solution that would only require minor changes. It would require a new argument `custom_eval_fns`, and then updating the `metrics` and `agg_funs` dictionaries, e.g.

```python
custom_eval_fns = {"MAAPE": [maape_fn, "mean"], "RMSE": [rmse_fn, "mean"]}
```

I think this is easier than working with inheritance. I'm happy to contribute the code if you are interested.
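Roughly, hooking this into the existing methods would look something like the following sketch (just the proposed interface, not existing GluonTS code):

```python
import numpy as np

from gluonts.evaluation import Evaluator


class CustomEvaluator(Evaluator):
    """Sketch of the proposed interface, not the actual GluonTS API."""

    def __init__(self, *args, custom_eval_fns=None, **kwargs):
        super().__init__(*args, **kwargs)
        # name -> [metric_fn(target, forecast), aggregation across time series]
        self.custom_eval_fns = custom_eval_fns or {}

    def get_metrics_per_ts(self, time_series, forecast):
        metrics = super().get_metrics_per_ts(time_series, forecast)
        pred_target = np.ma.masked_invalid(
            np.array(self.extract_pred_target(time_series, forecast))
        )
        for name, (fn, _agg) in self.custom_eval_fns.items():
            metrics[name] = fn(pred_target, forecast.quantile(0.5))
        return metrics

    def get_aggregate_metrics(self, metric_per_ts):
        totals, metric_per_ts = super().get_aggregate_metrics(metric_per_ts)
        for name, (_fn, agg) in self.custom_eval_fns.items():
            totals[name] = metric_per_ts[name].agg(agg)
        return totals, metric_per_ts
```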
@kaijennissen I think that's a very nice idea! One question about that: should the statistic next to the function (e.g. `"mean"`) be interpreted as the aggregation that needs to be used, or should it indicate what statistic of the predicted distribution should be used? Because the `Evaluator` class currently uses e.g. the mean forecast to compute MSE, but the median forecast to compute MAPE. The other dimension is how the metric is aggregated over a dataset: the absolute error is summed, while almost everything else is averaged. I guess it would make sense to specify both in some way?
Another potential issue here: `Callable` arguments to the `Evaluator` class will probably not be serializable. This, however, is easily addressed by using what you propose through subclassing:

```python
class MyEvaluator(Evaluator):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs, custom_eval_fns={"MAAPE": [maape_fn, "mean"]})
```

which should work with the current serialization mechanism, provided that you deserialize an object of type `MyEvaluator` in an environment where `MyEvaluator` is appropriately defined.
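For instance, the intended round trip would be something like this (assuming the serde helpers `dump_json` / `load_json` from `gluonts.core.serde`; the exact helper names may differ between versions):

```python
from gluonts.core.serde import dump_json, load_json

my_evaluator = MyEvaluator(quantiles=[0.1, 0.5, 0.9])
serialized = dump_json(my_evaluator)

# deserialization only works where MyEvaluator is importable / defined
restored = load_json(serialized)
```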
@lostella
> @kaijennissen I think that's a very nice idea! One question about that: should the statistic next to the function (e.g. `"mean"`) be interpreted as the aggregation that needs to be used, or should it indicate what statistic of the predicted distribution should be used? Because the `Evaluator` class currently uses e.g. the mean forecast to compute MSE, but the median forecast to compute MAPE. The other dimension is how the metric is aggregated over a dataset: the absolute error is summed, while almost everything else is averaged. I guess it would make sense to specify both in some way?
The statistic should specify how the aggregation of the metric across time series is performed. I've not thought about this before, but it makes sense to add the option to switch between mean and median point predictions. I guess I can add this option too.
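One way to specify both at once would be to extend the dictionary values, for example (again only a sketch of the proposed interface; `maape_fn` / `mse_fn` are placeholders):

```python
# name -> [metric_fn, aggregation across time series, which point forecast to feed in]
custom_eval_fns = {
    "MAAPE": [maape_fn, "mean", "median"],
    "MSE": [mse_fn, "mean", "mean"],
}

# inside get_metrics_per_ts, the third entry would then select the point forecast, e.g.
# point_fcst = forecast.mean if fcst_type == "mean" else forecast.quantile(0.5)
```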
> Another potential issue here: `Callable` arguments to the `Evaluator` class will probably not be serializable. This, however, is easily addressed by using what you propose through subclassing: `class MyEvaluator(Evaluator): def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs, custom_eval_fns={"MAAPE": [maape_fn, "mean"]})` which should work with the current serialization mechanism, provided that you deserialize an object of type `MyEvaluator` in an environment where `MyEvaluator` is appropriately defined.
I'm not quite sure I understand how this would solve the problem. Or is this code intended to be used by the user in case they want to serialize the `Evaluator`?
> Or is this code intended to be used by the user in case they want to serialize the `Evaluator`?

Yes, that's what I meant.
@lostella I started writing tests, but I'm struggling with a smart way to test whether the mean or median forecast was correctly chosen. Any idea besides comparing mean / median from a deterministic sample?
> @lostella I started writing tests, but I'm struggling with a smart way to test whether the mean or median forecast was correctly chosen. Any idea besides comparing mean / median from a deterministic sample?

I think if the mean and median are different, then the accuracy metrics associated with them will likely be different. Or am I missing something? What kind of test do you have in mind? I'm thinking about "given a forecast & ground truth, assert that metrics evaluate to this and that value".
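For example, something along these lines, using the `MyEvaluator.maape` defined above (the expected values follow directly from the MAAPE definition):

```python
import numpy as np


def test_maape_known_values():
    # perfect forecast: arctan(0) == 0 at every step
    assert np.isclose(MyEvaluator.maape(np.array([1.0, 2.0]), np.array([1.0, 2.0])), 0.0)

    # |target - forecast| / |target| == 1 everywhere, so MAAPE == arctan(1) == pi / 4
    assert np.isclose(MyEvaluator.maape(np.array([1.0, 1.0]), np.array([2.0, 2.0])), np.pi / 4)

    # zero targets contribute arctan(0) == 0 via the flag term
    assert np.isclose(
        MyEvaluator.maape(np.array([0.0, 1.0]), np.array([5.0, 2.0])), np.arctan(1.0) / 2
    )
```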
I was thinking more about the technical part. Currently the tests inside `test_evaluator.py` use the `naive_forecast` function, so the samples have zero variance and therefore mean and median are equal. Maybe you are aware of another function which returns deterministic forecasts where the variance is not zero.
@kaijennissen We could use something that outputs the latest, say, 100 observations as samples for each time step in the prediction range: this would be like assuming independent data points and outputting the empirical CDF as the predicted distribution.
However, changing the "model" there would require re-writing some test cases, I guess. Maybe it's fine to keep using `naive_forecast` for now, and then separately think about how to improve the test script in general, what do you think?
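A rough, untested sketch of such a helper (the `SampleForecast` constructor arguments vary a bit across GluonTS versions, so treat the `start_date` / `freq` part as a placeholder):

```python
import numpy as np
import pandas as pd

from gluonts.model.forecast import SampleForecast


def empirical_forecast(
    ts: pd.Series, prediction_length: int, num_samples: int = 100
) -> SampleForecast:
    """Use the last `num_samples` observed values as the sample set for every forecast step."""
    history = ts.values[:-prediction_length]
    tail = history[-num_samples:]
    # shape (num_samples, prediction_length): the same empirical distribution at each
    # step, so the forecast mean and median generally differ
    samples = np.tile(tail.reshape(-1, 1), (1, prediction_length))
    return SampleForecast(
        samples=samples,
        start_date=ts.index[-prediction_length],  # first timestamp of the forecast window
        freq=ts.index.freqstr,  # dropped in newer GluonTS versions
    )
```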
@kaijennissen thanks for bringing this up, and for fixing this!
Is it possible to add a custom evaluation metric? I am working on an intermittent demand forecasting problem and found the MAAPE to be a metric superior to MAPE and MASE for comparing point predictions.