JoaquinAmatRodrigo / skforecast

Time series forecasting with machine learning models
https://skforecast.org
BSD 3-Clause "New" or "Revised" License

Interpretation of graph outputs #205

Closed kaionwong closed 1 year ago

kaionwong commented 1 year ago

Can someone help explain, or point me to documentation on, how to interpret these graphs?

[Screenshot_4: predicted vs. actual values over the test period]

I am using an XGBRegressor to make predictions over the test period, with each step producing a 4-week forecast. What can explain the large spikes between Aug and Nov?

[Screenshot_5: predictions with the 5-95 prediction interval]

Similarly, for the prediction interval, what accounts for these large spikes (I set interval=[5, 95]), and why does the interval extend only above the predicted values?
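
For context, my understanding (which may be wrong) is that skforecast builds these intervals by adding bootstrapped in-sample residuals to the point forecast and taking percentiles. A minimal, simplified sketch of that idea (a hypothetical helper, not skforecast's actual code):

import numpy as np

def bootstrap_interval(point_forecast, residuals, interval=(5, 95), n_boot=500, seed=None):
    # Simulate n_boot forecast paths by adding resampled in-sample residuals
    # to the point forecast, then take the requested percentiles per step.
    rng = np.random.default_rng(seed)
    point_forecast = np.asarray(point_forecast)
    sims = point_forecast + rng.choice(np.asarray(residuals),
                                       size=(n_boot, len(point_forecast)),
                                       replace=True)
    lower = np.percentile(sims, interval[0], axis=0)
    upper = np.percentile(sims, interval[1], axis=0)
    # If the residuals are skewed (e.g., the model mostly under-predicts),
    # the band is asymmetric and can sit largely above the point forecast.
    return lower, upper

Under that reading, a band that extends only above the predictions would suggest the in-sample residuals are mostly positive, i.e., the model tends to under-predict.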


Edits

The problem is to predict the case count (the number of occurrences of an event) in a particular location over the following period. The case-count time series (weekly time unit) is split into training, validation, and test sets.

[image: weekly case counts split into training, validation, and test sets]

Sample data is as follows:

            YEAR  EPIWEEK YEAR_WEEK ADM1_NAME  REPORTED_CASES  BEGINNING    ENDING       DATE
DATE_INDEX
2017-01-07  2017        1   2017 W1         A            89.0   20170101  20170107 2017-01-07
2017-01-14  2017        2   2017 W2         A            84.0   20170108  20170114 2017-01-14
2017-01-21  2017        3   2017 W3         A            87.0   20170115  20170121 2017-01-21
2017-01-28  2017        4   2017 W4         A            51.0   20170122  20170128 2017-01-28
2017-02-04  2017        5   2017 W5         A            41.0   20170129  20170204 2017-02-04

            POPULATION_2020  POPULATION_DENSITY  X_AVG_PREC_STD_SCALER  X_AVG_TEMP_STD_SCALER  X_AVG_REL_HUM_STD_SCALER
DATE_INDEX
2017-01-07          8348151          106.215978              -0.730876              -1.459063                 -0.661750
2017-01-14          8348151          106.215978              -0.730876              -1.459063                 -0.661750
2017-01-21          8348151          106.215978              -0.730876              -1.459063                 -0.661750
2017-01-28          8348151          106.215978              -0.730876              -1.459063                 -0.661750
2017-02-04          8348151          106.215978              -0.727494              -1.654968                 -0.711259

            X_AVG_SP_HUM_STD_SCALER X_MONTH  X_MA_60    X_MA_30  X_MA_2  X_SMOOTH_MA_2
DATE_INDEX
2017-01-07                -0.960500      01   201.55  97.433333    86.5            NaN
2017-01-14                -0.960500      01   201.55  97.433333    86.5           86.5
2017-01-21                -0.960500      01   201.55  97.433333    85.5           85.5
2017-01-28                -0.960500      01   201.55  97.433333    69.0           69.0
2017-02-04                -1.036021      02   201.55  97.433333    46.0           46.0

The test period runs from 2020-07-01 to 2021-08-21. A 4-week rolling prediction (future_prediction_n_step = 4) looks like this:

[image: 4-week rolling predictions vs. actuals over the test period]

An 8-week rolling prediction (future_prediction_n_step = 8) looks like this:

[image: 8-week rolling predictions vs. actuals over the test period]

My questions are:

1) Why are there spikes (large swings up and down) that cannot be explained by seasonality, given that the seasonality is clearly annual rather than within-year?

2) When the trained algorithm produces the 4-week forecast, does it "see" the test data at all, or is it exposed to the actual test data in a rolling fashion? In other words, during prediction over the test set (2020-07-01 to 2021-08-21), does the algorithm use actual values from the test set, or only actual data prior to 2020-07-01 plus the values it predicted during the test period? (See the sketch below.)
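
To make question 2 concrete, here is a conceptual sketch of what I believe a rolling backtest with refit=False does (an illustration of the fold logic, not skforecast's actual source):

import numpy as np

def recursive_forecast(model, history, n_lags, steps):
    # Inside a fold: forecast recursively, feeding each prediction back in
    # as a lag, so no actual values beyond the fold start are used.
    lags = list(history[-n_lags:])
    preds = []
    for _ in range(steps):
        y_hat = model.predict(np.array(lags[-n_lags:]).reshape(1, -1))[0]
        preds.append(y_hat)
        lags.append(y_hat)
    return preds

def backtest_no_refit(model, y, initial_train_size, n_lags, steps):
    # model is assumed to be already fit on y[:initial_train_size].
    # Between folds the window advances over the actual observed series,
    # so test-period actuals do become lags for later folds.
    predictions = []
    for start in range(initial_train_size, len(y), steps):
        history = y[:start]
        n = min(steps, len(y) - start)
        predictions.extend(recursive_forecast(model, history, n_lags, n))
    return np.array(predictions)

If that reading is right, within each 4-week fold the model only sees data prior to the fold plus its own predictions, and compounding recursive errors inside a fold are one plausible source of the spikes.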

My code

import sys
import math
import pandas as pd
import config
import data_import
import feature_engineering as feat_eng
import glob
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import hts
import correlation as corr
from lightgbm import LGBMRegressor
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.ForecasterAutoregMultiOutput import ForecasterAutoregMultiOutput
from skforecast.model_selection import grid_search_forecaster
from skforecast.model_selection import backtesting_forecaster

# Modelling and Forecasting
# ==============================================================================
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

from joblib import dump, load

# Print/graph output setting
save_switch = False  # WARNING: Will overwrite existing files with the same filename
print_switch = True
graph_switch = True
sole_graph_switch = False
if sole_graph_switch:
    graph_switch = False  # if sole_graph_switch == `True`, automatically disable `graph_switch`
prediction_interval_switch = True
log_transform_outcome = False
# Forecasting parameters
autoregressive_n_lag = 104  # number of lags (weeks) used for autoregression
horizontal_threshold_n_case = 100  # horizontal threshold line; disabled when set to 0
curr_random_state = 888
lags_grid = [1, 4, 8, 26, 55, 120]   # lags used as autoregressive predictor
curr_performance_metric = 'mean_absolute_error'  # options: 'mean_absolute_error', 'mean_squared_error', 'mean_absolute_percentage_error'
selected_predictor_option_set = 1
param_grid_option = 0  # search grid for hyperparameter optimization
# Moving average setting
slow_ma_n_wk = 60
medium_ma_n_wk = 30
fast_ma_n_wk = 2
smoothing_ma_n_wk = 2
# Date setting
outer_start_date = '2017-01-07'
outer_end_date = '2021-08-21'
end_train = '2019-06-01'  # alternative: '2019-02-28'
end_validation = '2020-07-01'
restrict_outer_date_switch = True  # if `True`, apply the `outer_start_date` and `outer_end_date` to slice and form the "total" data
# Climate lag setting
avg_precipitation_lag = 0  # positive value, i.e., +4, means the current value will move 4 weeks into the future
avg_temp_lag = 0
avg_relative_humidity_lag = 0
avg_specific_humidity_lag = 0
climate_lag_switch = True  # if `True`, apply the above lags to the corresponding climate var
climate_moving_avg_n_wk = 4
climate_moving_avg_switch = True  # if `True`, turn the climate variables into smooth average

def log_transform(curr_num):
    # Log-transform positive values; non-positive values pass through
    # unchanged (note this mixes log and raw scales for values <= 0).
    if curr_num > 0:
        return math.log(curr_num)
    else:
        return curr_num

def main(location, future_prediction_n_step):
    # Setup
    # =====================================================================================================
    # Initial setup
    plt.cla()

    # Data import and preprocessing
    df = data_import.import_single_country_ref_data(input_filename=data_filename)
    df = df.asfreq('W-Sat')  # set an explicit weekly (Saturday) frequency on the index
    df = df.sort_index()
    assert len(df) > 0, 'Error: DataFrame has no data'

    # Feature set
    # =====================================================================================================
    # Log transformation of outcome (case count)
    if log_transform_outcome:
        df['REPORTED_CASES'] = df['REPORTED_CASES'].apply(log_transform)

    # Data preparation for predictive modeling
    feature_column = ['AVG_PREC', 'AVG_TEMP', 'AVG_REL_HUM', 'AVG_SP_HUM']
    for col in feature_column:
        df.rename(columns={col: 'X_' + col}, inplace=True)  # the 'X_' prefix marks predictor columns

    # Standardize the features
    tagged_feature_column = ['X_AVG_PREC', 'X_AVG_TEMP', 'X_AVG_REL_HUM', 'X_AVG_SP_HUM']
    std_scaler = StandardScaler()
    for col in tagged_feature_column:
        df[col + '_STD_SCALER'] = std_scaler.fit_transform(df[[col]])
    df.drop(tagged_feature_column, axis=1, inplace=True)  # remove unnecessary columns

    # Feature engineering - extract month
    df['X_MONTH'] = df['DATE_INDEX_COPY'].dt.strftime('%m')
    df['X_MONTH'] = df['X_MONTH'].astype('category')

    # Feature engineering - moving averages
    df[f'X_MA_{slow_ma_n_wk}'] = df['REPORTED_CASES'].rolling(window=slow_ma_n_wk).mean()
    df[f'X_MA_{medium_ma_n_wk}'] = df['REPORTED_CASES'].rolling(window=medium_ma_n_wk).mean()
    df[f'X_MA_{fast_ma_n_wk}'] = df['REPORTED_CASES'].rolling(window=fast_ma_n_wk).mean()
    df[f'X_SMOOTH_MA_{smoothing_ma_n_wk}'] = df['REPORTED_CASES'].rolling(window=smoothing_ma_n_wk).mean()

    # Feature engineering - case proportion
    df['CASE_PER_100K_POPULATION'] = df[f'X_SMOOTH_MA_{smoothing_ma_n_wk}'] / df['POPULATION_2020'] * 100000

    # Add the REPORTED_CASES time-series (imputed missing) of the adjacent states and map to `df`
    adjacent_adm1_list = config.adjacent_states_per_adm1[location]
    adjacent_adm1_varname_list = []

    for curr_adm1 in adjacent_adm1_list:
        adm1_varname = f'X_REPORTED_CASES_{curr_adm1.upper()}'
        adjacent_adm1_varname_list.append(adm1_varname)
        df_adjacent_adm1 = df_mexico_full[df_mexico_full['ADM1_NAME'] == curr_adm1]
        df_adjacent_adm1 = df_adjacent_adm1[['REPORTED_CASES']]
        df = df.merge(df_adjacent_adm1, left_index=True, right_index=True)
        df.rename({'REPORTED_CASES_y': adm1_varname, 'REPORTED_CASES_x': 'REPORTED_CASES'}, axis=1, inplace=True)

    # Smooth the climate variables with a moving average
    if climate_moving_avg_switch:
        df['X_AVG_PREC_STD_SCALER'] = df['X_AVG_PREC_STD_SCALER'].rolling(window=climate_moving_avg_n_wk).mean()
        df['X_AVG_TEMP_STD_SCALER'] = df['X_AVG_TEMP_STD_SCALER'].rolling(window=climate_moving_avg_n_wk).mean()
        df['X_AVG_REL_HUM_STD_SCALER'] = df['X_AVG_REL_HUM_STD_SCALER'].rolling(window=climate_moving_avg_n_wk).mean()
        df['X_AVG_SP_HUM_STD_SCALER'] = df['X_AVG_SP_HUM_STD_SCALER'].rolling(window=climate_moving_avg_n_wk).mean()

    # Predictor groups
    climate_varname_list = ['X_AVG_PREC_STD_SCALER', 'X_AVG_TEMP_STD_SCALER', 'X_AVG_REL_HUM_STD_SCALER',
                            'X_AVG_SP_HUM_STD_SCALER']
    ma_varname_list = [f'X_MA_{slow_ma_n_wk}', f'X_MA_{medium_ma_n_wk}', f'X_MA_{fast_ma_n_wk}']

    # Apply lags (in weeks) to the climate variables
    if climate_lag_switch:
        df['X_AVG_PREC_STD_SCALER'] = df['X_AVG_PREC_STD_SCALER'].shift(avg_precipitation_lag)
        df['X_AVG_TEMP_STD_SCALER'] = df['X_AVG_TEMP_STD_SCALER'].shift(avg_temp_lag)
        df['X_AVG_REL_HUM_STD_SCALER'] = df['X_AVG_REL_HUM_STD_SCALER'].shift(avg_relative_humidity_lag)
        df['X_AVG_SP_HUM_STD_SCALER'] = df['X_AVG_SP_HUM_STD_SCALER'].shift(avg_specific_humidity_lag)

    # Selection of predictor combinations
    if selected_predictor_option_set == 1:
        exogenous_varname_list = []
    elif selected_predictor_option_set == 2:
        exogenous_varname_list = climate_varname_list
    elif selected_predictor_option_set == 3:
        exogenous_varname_list = adjacent_adm1_varname_list
    elif selected_predictor_option_set == 4:
        exogenous_varname_list = ma_varname_list
    elif selected_predictor_option_set == 5:
        exogenous_varname_list = climate_varname_list + adjacent_adm1_varname_list
    elif selected_predictor_option_set == 6:
        exogenous_varname_list = climate_varname_list + ma_varname_list + adjacent_adm1_varname_list

    # Impute missing values in the exogenous vars using the nearest available value
    for varname in (climate_varname_list + ma_varname_list + adjacent_adm1_varname_list):
        df[varname] = df[varname].ffill()  # forward fill: impute using the closest past value
        df[varname] = df[varname].bfill()  # backward fill: impute using the closest future value

    # Dataset splitting
    # =====================================================================================================
    # Splitting datasets into train-val-test
    if restrict_outer_date_switch:
        df = df.loc[outer_start_date: outer_end_date]
    df_train = df.loc[: end_train, :]
    df_val = df.loc[end_train:end_validation, :]
    df_test = df.loc[end_validation:, :]

    # Time series plot
    if graph_switch:
        fig, ax = plt.subplots(figsize=(12, 4))
        df_train.REPORTED_CASES.plot(ax=ax, label='train', linewidth=1)
        df_val.REPORTED_CASES.plot(ax=ax, label='validation', linewidth=1)
        df_test.REPORTED_CASES.plot(ax=ax, label='test', linewidth=1)
        ax.set_title(f'Reported cases in {location}')
        ax.legend()

        if log_transform_outcome:
            plt.ylabel('log scale')

        plt.show()

    # Boxplot for annual seasonality
    if graph_switch:
        fig, ax = plt.subplots(figsize=(7, 3.5))
        df['MONTH'] = df.index.month  # WARNING: coarse way to extract month, testing for now
        df.boxplot(column='REPORTED_CASES', by='MONTH', ax=ax, )
        df.groupby('MONTH')['REPORTED_CASES'].median().plot(style='o-', linewidth=0.8, ax=ax)
        ax.set_ylabel('Reported cases')
        ax.set_title(f'Reported cases in {location} by month')
        fig.suptitle('')

        plt.show()

    # Autocorrelation plot
    if graph_switch:
        fig, ax = plt.subplots(figsize=(7, 3))
        plot_acf(df.REPORTED_CASES, ax=ax, lags=180)
        plt.show()

    # Partial autocorrelation plot
    if graph_switch:
        fig, ax = plt.subplots(figsize=(7, 3))
        plot_pacf(df.REPORTED_CASES, ax=ax, lags=112, method='ywm')
        plt.show()

    if print_switch:
        print(f"Train dates : {df_train.index.min()} --- {df_train.index.max()}")
        print(f"Validation dates : {df_val.index.min()} --- {df_val.index.max()}")
        print(f"Test dates : {df_test.index.min()} --- {df_test.index.max()}")
        print(f"Mean outcome: {df['REPORTED_CASES'].mean()}")

    # Forecaster setup (without grid search)
    # =====================================================================================================
    print('=====================================================================================================')
    print('Start of forecasting (without grid search)')

    # Initialize forecaster
    forecaster = ForecasterAutoreg(
        regressor=XGBRegressor(random_state=curr_random_state),
        lags=autoregressive_n_lag
    )

    # Declare predictors and outcome
    forecaster.fit(
        y=df.loc[:end_validation, 'REPORTED_CASES'],
        exog=df.loc[:end_validation, exogenous_varname_list]
    )

    # Backtest
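    # NOTE (my understanding of the backtest semantics, worth verifying):
    # with refit=False the forecaster is trained once on data up to
    # initial_train_size; each fold then predicts `steps` ahead recursively,
    # and between folds the window advances over the actual observed series,
    # so test-period actuals become lags for later folds.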
    metric, predictions = backtesting_forecaster(
        forecaster=forecaster,
        y=df.REPORTED_CASES,
        initial_train_size=len(df.loc[:end_validation]),
        fixed_train_size=False,
        steps=future_prediction_n_step,
        metric=curr_performance_metric,
        refit=False,
        verbose=False
    )

    # Plot backtesting and print results
    if graph_switch:
        fig, ax = plt.subplots(figsize=(12, 3.5))
        df.loc[predictions.index, 'REPORTED_CASES'].plot(ax=ax, linewidth=2, label='real')
        predictions.plot(linewidth=2, label='prediction', ax=ax)
        ax.set_title(f'Prediction vs real reported cases in {location} ({future_prediction_n_step}-week)')
        ax.legend()

        if log_transform_outcome:
            plt.ylabel('log scale')

        plt.show()

    if print_switch:
        print(f'Backtest error metric: {metric}')
        print('Predictor importance:', forecaster.get_feature_importance())

    print('End of forecasting (without grid search)')

    # Forecaster setup (with grid search)
    # =====================================================================================================
    print('=====================================================================================================')
    print('Start of forecasting (with grid search)')

    # Hyperparameter Grid search
    forecaster = ForecasterAutoreg(
        regressor=XGBRegressor(random_state=curr_random_state),
        lags=autoregressive_n_lag  # This value will be replaced in the grid search
    )

    # Regressor's hyperparameters
    if param_grid_option == 0:
        param_grid = {
            'n_estimators': [50, 100, 200],
            'max_depth': [5, 10, 30],
            'learning_rate': [0.01, 0.1]
        }

    elif param_grid_option == 1:
        param_grid = {
            'n_estimators': [10, 50, 100, 250, 500],
            'max_depth': [3, 5, 10, 25, 30, 45],
            'learning_rate': [0.01, 0.05, 0.1]
        }

    elif param_grid_option == 2:
        param_grid = {
            'n_estimators': [10, 25, 50, 100, 250, 500, 1000],
            'max_depth': [3, 5, 8, 10, 25, 30, 45],
            'learning_rate': [0.005, 0.01, 0.05, 0.1, 0.3]
        }

    results_grid = grid_search_forecaster(
        forecaster=forecaster,
        y=df.loc[:end_validation, 'REPORTED_CASES'],
        exog=df.loc[:end_validation, exogenous_varname_list],
        param_grid=param_grid,
        lags_grid=lags_grid,
        steps=future_prediction_n_step,
        metric=curr_performance_metric,
        refit=False,
        initial_train_size=len(df[:end_train]),
        fixed_train_size=False,
        return_best=True,
        verbose=False
    )

    if print_switch:
        print(f'Grid optimization results: {results_grid}')
        print(f'Trainer attributes: {forecaster}')

    # Backtest with test data and prediction intervals
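    # NOTE (assumption about how the interval is built): interval=[5, 95]
    # requests the 5th and 95th percentiles of bootstrapped forecasts
    # generated from in-sample residuals (n_boot resamples); skewed
    # residuals produce a band that is asymmetric around the point forecast.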
    metric, predictions = backtesting_forecaster(
        forecaster=forecaster,
        y=df.REPORTED_CASES,
        initial_train_size=len(df.REPORTED_CASES[:end_validation]),
        fixed_train_size=False,
        steps=future_prediction_n_step,
        metric=curr_performance_metric,
        interval=[5, 95],
        n_boot=500,
        in_sample_residuals=True,
        verbose=False
    )

    if print_switch:
        print('Backtesting error metric:', metric)
        print(predictions.head(5))

    if graph_switch:
        fig, ax = plt.subplots(figsize=(12, 3.5))
        df.loc[predictions.index, 'REPORTED_CASES'].plot(linewidth=2, label='real', ax=ax)
        ax.set_title(f'Prediction vs real reported cases in {location} ({future_prediction_n_step}-week)')
        predictions.iloc[:, 0].plot(linewidth=2, label='prediction', ax=ax)
        if prediction_interval_switch:
            ax.fill_between(
                predictions.index,
                predictions.iloc[:, 1],
                predictions.iloc[:, 2],
                alpha=0.3,
                color='red',
                label='prediction interval'
            )
        ax.legend()

        if log_transform_outcome:
            plt.ylabel('log scale')

        plt.show()

    if sole_graph_switch:  # mapping the predicted values to the entire time-series
        if prediction_interval_switch:
            fig, ax = plt.subplots(figsize=(18, 6))
            df_train['REPORTED_CASES'].plot(ax=ax, label='train')
            df_val['REPORTED_CASES'].plot(ax=ax, label='validation')
            df_test['REPORTED_CASES'].plot(ax=ax, label='test')
            predictions.plot(ax=ax, label='pred')
            ax.set_title(f'Prediction vs real reported cases in {location} ({future_prediction_n_step}-week)')
            ax.legend()

            if log_transform_outcome:
                plt.ylabel('log scale')

            if save_switch:
                plt.savefig(
                    output_dir + f'/predicted_graph_{location}_{future_prediction_n_step}week_with_interval.png',
                    dpi=300)

            else:
                plt.show()

        else:
            fig, ax = plt.subplots(figsize=(18, 6))
            df_train['REPORTED_CASES'].plot(ax=ax, label='train')
            df_val['REPORTED_CASES'].plot(ax=ax, label='validation')
            df_test['REPORTED_CASES'].plot(ax=ax, label='test')
            predictions.iloc[:, 0].plot(ax=ax, label='pred')
            ax.set_title(f'Prediction vs real reported cases in {location} ({future_prediction_n_step}-week)')
            ax.legend()

            if log_transform_outcome:
                plt.ylabel('log scale')

            if save_switch:
                plt.savefig(output_dir + f'/predicted_graph_{location}_{future_prediction_n_step}week.png', dpi=300)

            else:
                plt.show()

    # Predicted interval coverage
    inside_interval = np.where(
        (df.loc[end_validation:, 'REPORTED_CASES'] >= predictions["lower_bound"]) & \
        (df.loc[end_validation:, 'REPORTED_CASES'] <= predictions["upper_bound"]),
        True,
        False
    )

    coverage = inside_interval.mean()

    if print_switch:
        print(f"Predicted interval coverage: {round(100 * coverage, 2)} %")
        print('Predictor importance:', forecaster.get_feature_importance())

if __name__ == '__main__':
    loop_location_switch = False  # set to True to loop over all locations and horizons

    if loop_location_switch:
        all_adm1_list = config.adjacent_states_per_adm1.keys()
        n_forecast_week_list = [12, 16, 20, 24, 32, 52]

        for adm1 in all_adm1_list:
            for n_week in n_forecast_week_list:
                main(location=adm1, future_prediction_n_step=n_week)

    else:
        adm1 = config.selected_adm1
        main(location=adm1, future_prediction_n_step=24)
JavierEscobarOrtiz commented 1 year ago

Hello @kaionwong,

Without seeing the code, it is difficult to picture the problem and answer your questions. The spikes can be caused by seasonal patterns in past behavior, but that is not always the reason.

If you provide us with more information, we may be able to help you.

Javier

spike8888 commented 1 year ago

I am also interested in the answer to this question. I sometimes have similar issues.

kaionwong commented 1 year ago

@JavierEscobarOrtiz I have updated my OP. Thanks.

JoaquinAmatRodrigo commented 1 year ago

It has been 60 days since the last activity on this GitHub issue. Since we have not received any updates or progress reports, we will be closing this issue.

If this issue is still relevant and requires attention, please feel free to reopen it and provide an update. We appreciate your contributions and would love to see this issue resolved if it is still relevant.

Thank you for your participation and cooperation!