Nixtla / utilsforecast

https://nixtlaverse.nixtla.io/utilsforecast
Apache License 2.0

MAPE calculation might need an update #86

Closed iamyihwa closed 3 months ago

iamyihwa commented 3 months ago

Hello, the current MAPE calculation discounts the error too much when the true value is 0.

In the example below, only one of the 13 data points has a non-zero true value. In the current error calculation, the APE is set to 0 whenever the true value is 0, so the mean is taken over an array of the form [0, 0, 0, 0, X, ..., 0, 0], which definitely underestimates the true error.

I do understand that the error is not defined when the denominator is 0; still, it might be better to handle this differently, e.g. by calculating the error only where the true value is non-zero, or by adding a small number to the denominator.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ds': ['2023-12-30T00:00:00.000000', '2023-12-20T00:00:00.000000',
           '2024-12-23T00:00:00.000000', '2023-12-04T00:00:00.000000',
           '2024-12-27T00:00:00.000000', '2024-12-20T00:00:00.000000',
           '2023-12-07T00:00:00.000000', '2023-12-09T00:00:00.000000',
           '2024-12-24T00:00:00.000000', '2023-12-12T00:00:00.000000',
           '2023-12-02T00:00:00.000000', '2023-12-17T00:00:00.000000',
           '2023-12-15T00:00:00.000000'],
    'y': [0., 0., 0., 0., 0., 0., 0., 74., 0., 0., 0., 0., 0.],
    'model_1': [301.92727857, 42.21299589, 250.2662998, 37.2813642,
                280.4204068, 232.06773359, 36.23961508, 57.4852432,
                265.91963952, 33.2913518, 320.13083426, 39.60095495,
                34.05477703],
})
df['unique_id'] = '1'

## In the MAPE calculation, all values with a 0 denominator are set to NaN, the NaNs are then filled with 0,
## and the mean is applied, which creates lots of 0 errors in the rows.
res = (
    df[models]
    .sub(df[target_col], axis=0)
    .abs()
    .div(_zero_to_nan(df[target_col].abs()), axis=0)
    .fillna(0)
    .groupby(df[id_col], observed=True)
    .mean()
)
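(For reference, _zero_to_nan is a private helper of utilsforecast; a minimal sketch of the behavior described above, assuming it simply maps exact zeros to NaN so the division produces NaN instead of inf:)

def _zero_to_nan_sketch(serie):
    # Hypothetical stand-in for the private _zero_to_nan helper:
    # rows where the actual is exactly 0 become NaN.
    return serie.mask(serie == 0)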

from utilsforecast.losses import mape

mape(df, models=['model_1'])

This gives an error of 0.017167: the single non-zero APE (~0.223) is averaged over all 13 rows, i.e. 0.223 / 13 ≈ 0.0172.

Whereas if I filter to keep only the finite values (i.e. drop the rows where the division was by 0),

ape_arr = np.abs((df['model_1'] - df['y']) / df['y'])
finite_mask = np.isfinite(ape_arr)
mean_value = np.mean(ape_arr[finite_mask])
print(mean_value)

I get an error of 0.223.
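For comparison, the other option mentioned above, adding a small number to the denominator, would look like this (eps is an arbitrary choice here, not a utilsforecast parameter):

eps = 1e-8  # arbitrary smoothing constant, not part of utilsforecast
ape_eps = np.abs((df['model_1'] - df['y']) / (df['y'].abs() + eps))
print(ape_eps.mean())

With zero actuals and predictions in the hundreds, this blows up (values on the order of 1e10), so for this data the filtering approach seems preferable.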

I agree that neither approach is complete, but setting the error to 0 when the true value is 0 might bias the error too much.

jmoralez commented 3 months ago

Hey @iamyihwa, thanks for raising this. Makes sense. I believe that if we remove the fillna step we'll compute the errors only on the values that weren't zero. I'll work on changing it.
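A minimal sketch of that change, relying on pandas' default skipna=True in mean() (the function name is hypothetical, not the actual patch):

import numpy as np
import pandas as pd

def mape_skip_zeros(df, models, id_col='unique_id', target_col='y'):
    # Zero actuals become NaN, so the division yields NaN for those rows.
    abs_y = df[target_col].abs().replace(0, np.nan)
    res = (
        df[models]
        .sub(df[target_col], axis=0)
        .abs()
        .div(abs_y, axis=0)
        # no .fillna(0): mean() skips NaNs by default, so rows with a
        # zero actual are excluded instead of counted as perfect.
        .groupby(df[id_col], observed=True)
        .mean()
    )
    return res

On the example above, mape_skip_zeros(df, ['model_1']) gives ~0.2232 for unique_id '1', matching the filtered computation.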