awslabs / gluonts

Probabilistic time series modeling in Python
https://ts.gluon.ai
Apache License 2.0

Is it possible to add a custom evaluation metric? #1106

Closed kaijennissen closed 4 years ago

kaijennissen commented 4 years ago

Is it possible to add a custom evaluation metric? I am working on an intermittent demand forecasting problem and found the MAAPE to be a metric superior to MAPE and MASE for comparing point predictions.

PascalIversen commented 4 years ago

Hey @kaijennissen, you can refer to the code @verdimrc provided here. I think for the MAAPE this would be (please double-check the math):

import json
from typing import Dict, Tuple, Union

import numpy as np
import pandas as pd

from gluonts.dataset.repository.datasets import get_dataset
from gluonts.evaluation import Evaluator
from gluonts.evaluation.backtest import make_evaluation_predictions
from gluonts.model.forecast import Forecast
from gluonts.model.simple_feedforward import SimpleFeedForwardEstimator
from gluonts.mx.trainer import Trainer

class MyEvaluator(Evaluator):

    @staticmethod
    def maape(target, forecast):
        # Mean Arctangent Absolute Percentage Error: `flag` marks zero targets,
        # which contribute arctan(0) instead of causing a division by zero.
        denominator = np.abs(target)
        flag = denominator == 0

        return np.mean(
            np.arctan((np.abs(target - forecast) * (1 - flag)) / (denominator + flag))
        )

    def get_metrics_per_ts(
        self, time_series: Union[pd.Series, pd.DataFrame], forecast: Forecast
    ) -> Dict[str, Union[float, str, None]]:
        metrics = super().get_metrics_per_ts(time_series, forecast)

        pred_target = np.array(self.extract_pred_target(time_series, forecast))
        pred_target = np.ma.masked_invalid(pred_target)
        median_fcst = forecast.quantile(0.5)

        metrics["MAAPE"] = self.maape(
            pred_target, median_fcst
        ) 
        return metrics

    def get_aggregate_metrics(
        self, metric_per_ts: pd.DataFrame
    ) -> Tuple[Dict[str, float], pd.DataFrame]:
        totals, metric_per_ts = super().get_aggregate_metrics(metric_per_ts)

        agg_funs = {
            "MAAPE": "mean",
        }
        assert (
            set(metric_per_ts.columns) >= agg_funs.keys()
        ), "Some of the requested item metrics are missing."
        my_totals = {
            key: metric_per_ts[key].agg(agg) for key, agg in agg_funs.items()
        }

        totals.update(my_totals)
        return totals, metric_per_ts

dataset = get_dataset("m4_hourly", regenerate=True)

estimator = SimpleFeedForwardEstimator(
    num_hidden_dimensions=[10],
    prediction_length=dataset.metadata.prediction_length,
    context_length=100,
    freq=dataset.metadata.freq,
    trainer=Trainer(
        ctx="cpu", epochs=5, learning_rate=1e-5, num_batches_per_epoch=100
    ),
)

predictor = estimator.train(dataset.train)

forecast_it, ts_it = make_evaluation_predictions(
    dataset=dataset.test,  # test dataset
    predictor=predictor,  # predictor
    num_samples=100,  # number of sample paths we want for evaluation
)

forecasts = list(forecast_it)
tss = list(ts_it)

my_evaluator = MyEvaluator(quantiles=[0.1, 0.5, 0.9])
agg_metrics, item_metrics = my_evaluator(
    iter(tss), iter(forecasts), num_series=len(dataset.test)
)

print(json.dumps(agg_metrics, indent=4))
{
    "MSE": 840010519.580835,
    "abs_error": 67564732.9116211,
    "abs_target_sum": 145558863.59960938,
    "abs_target_mean": 7324.822041043147,
    "seasonal_error": 336.9046924038302,
    "MASE": 34.24076941447534,
    "MAPE": 1.0149239948968949,
    "sMAPE": 0.75409841468702,
    "OWA": NaN,
    "MSIS": 400.1144964103775,
    "QuantileLoss[0.1]": 43845412.664644055,
    "Coverage[0.1]": 0.03205515297906603,
    "QuantileLoss[0.5]": 67564732.66281605,
    "Coverage[0.5]": 0.3377113526570052,
    "QuantileLoss[0.9]": 37684940.976142496,
    "Coverage[0.9]": 0.8675523349436396,
    "RMSE": 28982.93497182152,
    "NRMSE": 3.956810801603309,
    "ND": 0.4641746386360385,
    "wQuantileLoss[0.1]": 0.3012211800804532,
    "wQuantileLoss[0.5]": 0.46417463692672967,
    "wQuantileLoss[0.9]": 0.25889829065856784,
    "mean_absolute_QuantileLoss": 49698362.10120087,
    "mean_wQuantileLoss": 0.3414313692219169,
    "MAE_Coverage": 0.08756038647342974,
    "MAAPE": 0.4854620804344365
}

If you are using the MultivariateEvaluator, I think you can define a class MyMultivariateEvaluator(MultivariateEvaluator) with the same methods as above, but I have not tried that, so let me know in case it does not work.
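
An untested sketch of what that could look like, assuming MyEvaluator from the snippet above is in scope and that MultivariateEvaluator evaluates each target dimension through the same per-series hooks:

import numpy as np

from gluonts.evaluation import MultivariateEvaluator

class MyMultivariateEvaluator(MultivariateEvaluator):

    # Reuse the MAAPE computation defined on MyEvaluator above.
    maape = staticmethod(MyEvaluator.maape)

    def get_metrics_per_ts(self, time_series, forecast):
        metrics = super().get_metrics_per_ts(time_series, forecast)
        pred_target = np.ma.masked_invalid(
            np.array(self.extract_pred_target(time_series, forecast))
        )
        metrics["MAAPE"] = self.maape(pred_target, forecast.quantile(0.5))
        return metrics

    def get_aggregate_metrics(self, metric_per_ts):
        totals, metric_per_ts = super().get_aggregate_metrics(metric_per_ts)
        totals["MAAPE"] = metric_per_ts["MAAPE"].mean()
        return totals, metric_per_ts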

kaijennissen commented 4 years ago

Thanks. That looks promising. I've looked into the code and came up with a generic solution that would only require minor changes in the code. It would require a new argument custom_eval_fns and then updating the metrics and agg_funs dictionaries, e.g. custom_eval_fns = {"MAAPE": [maape_fn, "mean"], "RMSE": [rmse_fn, "mean"]}
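
For illustration, usage of the proposed argument might look roughly like this (maape_fn and rmse_fn are plain per-series functions; the argument name and the list layout are only the proposal, not an existing API):

import numpy as np

from gluonts.evaluation import Evaluator

def maape_fn(target, forecast):
    denominator = np.abs(target)
    flag = denominator == 0
    return np.mean(
        np.arctan((np.abs(target - forecast) * (1 - flag)) / (denominator + flag))
    )

def rmse_fn(target, forecast):
    return np.sqrt(np.mean((target - forecast) ** 2))

# Proposed interface: metric name -> [per-series function, aggregation across series]
evaluator = Evaluator(
    quantiles=[0.1, 0.5, 0.9],
    custom_eval_fns={"MAAPE": [maape_fn, "mean"], "RMSE": [rmse_fn, "mean"]},
)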

I think this is easier than working with inheritance. I'm happy to contribute the code if you are interested.

lostella commented 4 years ago

@kaijennissen I think that's a very nice idea! One question about that: the statistic next to the function (e.g. "mean") is to be interpreted as the aggregation that needs to be used, or should it indicate what statistic of the predicted distribution should be used?

Because the Evaluator class currently uses e.g. the mean forecast to compute MSE, but the median forecast to compute MAPE. The other dimension is how the metric is aggregated over a dataset: the absolute error is summed, while almost everything else is averaged. I guess it would make sense to specify both in some way?

Another potential issue here: Callable arguments to the Evaluator class will probably not be serializable. This however is easily addressed by using what you propose through subclassing:

class MyEvaluator(Evaluator):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs, custom_eval_fns={"MAAPE": [maape_fn, "mean"]})

which should work with the current serialization mechanism, provided that you deserialize an object of type MyEvaluator in an environment where MyEvaluator is appropriately defined.
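
A hedged sketch of that round-trip, assuming the Evaluator serializes cleanly through gluonts' serde helpers and that MyEvaluator is importable on both ends:

from gluonts.core import serde

# Serialize on one side ...
my_evaluator = MyEvaluator(quantiles=[0.1, 0.5, 0.9])
blob = serde.dump_json(my_evaluator)

# ... and deserialize on the other, where MyEvaluator must be defined as well.
restored_evaluator = serde.load_json(blob)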

kaijennissen commented 4 years ago

@lostella

@kaijennissen I think that's a very nice idea! One question about that: the statistic next to the function (e.g. "mean") is to be interpreted as the aggregation that needs to be used, or should it indicate what statistic of the predicted distribution should be used?

Because the Evaluator class currently uses e.g. the mean forecast to compute MSE, but the median forecast to compute MAPE. The other dimension is how the metric is aggregated over a dataset: the absolute error is summed, while almost everything else is averaged. I guess it would make sense to specify both in some way?

The statistic should specify how the metric is aggregated across time series. I've not thought about this before, but it makes sense to add the option to switch between mean and median point predictions. I guess I can add this option too.
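
Purely as an illustration (not an existing interface), the dictionary entries could then carry both choices, the aggregation across series and the point forecast to evaluate against:

custom_eval_fns = {
    "MAAPE": [maape_fn, "mean", "median"],  # aggregate with mean, evaluate the median forecast
    "RMSE": [rmse_fn, "mean", "mean"],      # aggregate with mean, evaluate the mean forecast
}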

Another potential issue here: Callable arguments to the Evaluator class will probably not be serializable. This however is easily addressed by using what you propose through subclassing:

class MyEvaluator(Evaluator):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs, custom_eval_fns={"MAAPE": [maape_fn, "mean"]})

which should work with the current serialization mechanism, provided that you deserialize an object of type MyEvaluator in an environment where MyEvaluator is appropriately defined.

I'm not quite sure I understand how this would solve the problem. Or is this code intended to be used by the user in case he wants to serialize the Evaluator?

lostella commented 4 years ago

Or is this code intended to be used by the user in case he wants to serialize the Evaluator?

Yes, that's what I meant

kaijennissen commented 4 years ago

@lostella I started writing tests, but I'm struggling with a smart way to test whether the mean or median forecast was correctly chosen. Any idea besides comparing mean / median from a deterministic sample?

lostella commented 4 years ago

@lostella I started writing tests, but I'm struggling with a smart way to test whether the mean or median forecast was correctly chosen. Any idea besides comparing mean / median from a deterministic sample?

I think if the mean and median are different, then the accuracy metrics associated with them will likely be different. Or am I missing something? What kind of test do you have in mind? I'm thinking about "given a forecast & ground truth, assert that the metrics evaluate to this and that value".

kaijennissen commented 4 years ago

I was thinking more about the technical part. Currently the tests inside test_evaluator.py use the naive_forecast function so that the samples have zero variance and therefore mean and median are equal. Maybe you are aware of another function which returns deterministic forecasts where the variance is not zero.

lostella commented 4 years ago

@kaijennissen We could use something that outputs the latest, say, 100 observations as samples for each time step in the prediction interval: this would be like assuming independent data points, and outputting the empirical CDF as the predicted distribution.
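
Something along those lines, as an untested sketch (the SampleForecast signature shown here takes an explicit freq; newer gluonts versions drop it in favour of a pd.Period start date):

import numpy as np
import pandas as pd

from gluonts.model.forecast import SampleForecast

def empirical_cdf_forecast(
    target: np.ndarray,
    start_date: pd.Timestamp,
    freq: str,
    prediction_length: int,
    num_samples: int = 100,
) -> SampleForecast:
    # Repeat the last `num_samples` observations as the sample set for every
    # step of the horizon: deterministic, but with nonzero variance, so the
    # mean and median forecasts differ whenever the history is skewed.
    history = target[-num_samples:]
    samples = np.tile(history.reshape(-1, 1), (1, prediction_length))
    return SampleForecast(samples=samples, start_date=start_date, freq=freq)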

However, changing the "model" there would require re-writing some test cases, I guess. Maybe it's fine to keep using naive_forecast for now, and then separately think about how to improve the test script in general. What do you think?

lostella commented 4 years ago

@kaijennissen thanks for bringing this up, and for fixing this!