In-sample predictions are not back transformed in ForecasterAutoreg()

schoulten commented 2 weeks ago

The problem

To generate in-sample predictions (aka fitted values), you need to create the training matrices with .create_train_X_y() and pass them to .predict() of the internal regressor, as described in the docs. But when a transformer is passed to ForecasterAutoreg(), the in-sample predictions do not appear to be reverted to the original scale of the data.

Is there something I am missing, or is there a way to revert the transformation?

Reproducible example

# Libraries
# ==============================================================================
import pandas as pd
import numpy as np
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.datasets import fetch_dataset
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import PowerTransformer

# Download data
# ==============================================================================
data = fetch_dataset(name = "h2o_exog", raw = False)["y"]

# Split train-test
# ==============================================================================
steps = 36
data_train = data[:-steps]
data_test  = data[-steps:]

# Plot
# ==============================================================================
pd.concat([data_train.rename("train"), data_test.rename("test")], axis = "columns").plot()

# Create and fit forecaster without transformer
# ==============================================================================
forecaster_notrans = ForecasterAutoreg(
    regressor = RandomForestRegressor(random_state = 123),
    lags = 12
    )

forecaster_notrans.fit(y = data_train)

forecaster_notrans

# Create training matrices
# ==============================================================================
X_train, y_train = forecaster_notrans.create_train_X_y(data_train)
X_train.head()

# Predict using the internal regressor
# ==============================================================================
predictions_1 = forecaster_notrans.regressor.predict(X_train)
predictions_1[:4]

# Plot predictions
# ==============================================================================
pd.concat([
    pd.Series(predictions_1, name = "fitted", index = data_train.index[forecaster_notrans.max_lag:]),
    data_train.rename("train")
], axis = "columns").plot(title = "No transformer");

# Create and fit forecaster with transformer
# ==============================================================================
forecaster_trans = ForecasterAutoreg(
    regressor = RandomForestRegressor(random_state = 123),
    lags = 12,
    transformer_y = PowerTransformer()
    )

forecaster_trans.fit(y = data_train)

forecaster_trans

# Create training matrices
# ==============================================================================
X_train, y_train = forecaster_trans.create_train_X_y(data_train)
X_train.head()

# Predict using the internal regressor
# ==============================================================================
predictions_1 = forecaster_trans.regressor.predict(X_train)
predictions_1[:4]

# Plot predictions
# ==============================================================================
pd.concat([
    pd.Series(predictions_1, name = "fitted", index = data_train.index[forecaster_trans.max_lag:]),
    data_train.rename("train")
], axis = "columns").plot(title = "With transformer");

# Out of sample predictions are OK
# ==============================================================================
predictions_3 = forecaster_trans.predict(steps = steps)
predictions_3.head(3)

# Plot predictions
# ==============================================================================
pd.concat([
    pd.Series(predictions_1, name = "fitted", index = data_train.index[forecaster_trans.max_lag:]),
    data_train.rename("train"),
    data_test.rename("test"),
    predictions_3.rename("forecast")
], axis = "columns").plot(title = "With transformer");

Session information

-----
matplotlib          3.7.1
numpy               1.25.2
pandas              2.0.3
session_info        1.0.0
skforecast          0.12.1
sklearn             1.2.2
-----
Modules imported as dependencies
PIL                 9.4.0
backcall            0.2.0
certifi             2024.06.02
cffi                1.16.0
cloudpickle         2.2.1
cycler              0.12.1
cython_runtime      NA
dateutil            2.8.2
debugpy             1.6.6
decorator           4.4.2
defusedxml          0.7.1
google              NA
httplib2            0.22.0
ipykernel           5.5.6
ipython_genutils    0.2.0
joblib              1.4.2
kiwisolver          1.4.5
matplotlib_inline   0.1.7
mpl_toolkits        NA
numexpr             2.10.0
packaging           24.1
pexpect             4.9.0
pickleshare         0.7.5
pkg_resources       NA
platformdirs        4.2.2
portpicker          NA
prompt_toolkit      3.0.47
psutil              5.9.5
ptyprocess          0.7.0
pyarrow             14.0.2
pydev_ipython       NA
pydevconsole        NA
pydevd              2.9.5
pydevd_file_utils   NA
pydevd_plugins      NA
pydevd_tracing      NA
pygments            2.16.1
pyparsing           3.1.2
pytz                2023.4
scipy               1.11.4
setuptools          67.7.2
sitecustomize       NA
six                 1.16.0
socks               1.7.1
sphinxcontrib       NA
storemagic          NA
threadpoolctl       3.5.0
tornado             6.3.3
traitlets           5.7.1
typing_extensions   NA
wcwidth             0.2.13
zmq                 24.0.1
zoneinfo            NA
-----
IPython             7.34.0
jupyter_client      6.1.12
jupyter_core        5.7.2
notebook            6.5.5
-----
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
Linux-6.1.85+-x86_64-with-glibc2.35
-----
Session information updated at 2024-06-14 11:16

JavierEscobarOrtiz commented 2 weeks ago

Hello @schoulten,

Thanks for using skforecast and opening the issue.

What happens is that the regressor inside the Forecaster is fitted on the transformed data, so when you call its predict method directly, the predictions are also on the transformed scale.
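The effect can be reproduced without skforecast. The following is only a minimal sketch of the mechanism, not skforecast's internal code: it uses a synthetic series, a single lag, and a LinearRegression (all assumptions made for illustration) to show that a regressor trained on scaled data predicts on the scaled scale until inverse_transform is applied.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic series far from zero, so the two scales are easy to tell apart
y = 100.0 + 10.0 * np.sin(np.arange(60) / 3.0)

# Scale the series, mimicking what a transformer_y does before fitting
scaler = StandardScaler()
y_scaled = scaler.fit_transform(y.reshape(-1, 1)).ravel()

# One-lag training matrix built from the scaled series (illustrative only)
X_train = y_scaled[:-1].reshape(-1, 1)
y_train = y_scaled[1:]

regressor = LinearRegression().fit(X_train, y_train)

# Predictions from the regressor are on the scaled scale ...
fitted_scaled = regressor.predict(X_train)

# ... and need inverse_transform to return to the original units
fitted = scaler.inverse_transform(fitted_scaled.reshape(-1, 1)).ravel()

print("mean on scaled scale:  ", fitted_scaled.mean())
print("mean on original scale:", fitted.mean())
```

The first mean is close to 0 and the second close to 100, which is the same mismatch seen in the "With transformer" plot above.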

Two possible solutions:

Use the Forecaster's transformer_y attribute, which stores the fitted transformer, and call its inverse_transform method:

# Libraries
# ==============================================================================
import pandas as pd
import matplotlib.pyplot as plt
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.datasets import fetch_dataset
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Download data
# ==============================================================================
data = fetch_dataset(
    name="h2o", raw=True, kwargs_read_csv={"names": ["y", "datetime"], "header": 0}
)

# Data preprocessing
# ==============================================================================
data['datetime'] = pd.to_datetime(data['datetime'], format='%Y-%m-%d')
data = data.set_index('datetime')
data = data.asfreq('MS')
data = data['y']
data = data.sort_index()

# Split train-test
# ==============================================================================
steps = 36
data_train = data[:-steps]
data_test  = data[-steps:]

# Create forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor     = RandomForestRegressor(random_state=123),
                 lags          = 15,
                 transformer_y = StandardScaler()
             )

forecaster.fit(y=data_train)

# Create training matrices
# ==============================================================================
X_train, y_train = forecaster.create_train_X_y(data_train)

# Predict using the internal regressor
# ==============================================================================
predictions_training = forecaster.regressor.predict(X_train)
predictions_training = pd.Series(data=predictions_training, 
                                 index=data_train.iloc[-len(predictions_training):].index)

# Inverse transform
# ==============================================================================
predictions_training_inverse = forecaster.transformer_y.inverse_transform(predictions_training.to_numpy().reshape(-1, 1))
predictions_training_inverse = pd.Series(data=predictions_training_inverse.flatten(), 
                                         index=data_train.iloc[-len(predictions_training):].index)

# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 3))
data_train.plot(ax=ax, label='train')
predictions_training_inverse.plot(ax=ax, label='fitted')
ax.legend();
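If this is needed in several places, the steps above can be wrapped in a small function. This is a sketch only: in_sample_predictions is a hypothetical helper, not part of skforecast, and it assumes an already fitted forecaster that exposes create_train_X_y, regressor and transformer_y as used above, with no differentiation applied.

```python
import numpy as np
import pandas as pd


def in_sample_predictions(forecaster, y):
    """Hypothetical helper: fitted values on the original scale of `y`.

    Assumes `forecaster` is already fitted and exposes `create_train_X_y`,
    `regressor` and `transformer_y`, as in the example above.
    """
    X_train, y_train = forecaster.create_train_X_y(y)

    # The internal regressor predicts on the transformed scale
    fitted = np.asarray(forecaster.regressor.predict(X_train))

    # Revert the transformation only if a transformer was supplied
    if forecaster.transformer_y is not None:
        fitted = forecaster.transformer_y.inverse_transform(
            fitted.reshape(-1, 1)
        ).ravel()

    # y_train carries the index of the observations that were predicted
    return pd.Series(fitted, index=y_train.index, name="fitted")
```

With the objects defined above, in_sample_predictions(forecaster, data_train) should give the same values as predictions_training_inverse.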

Use backtesting_forecaster to predict training data as described in the user guide:

# Libraries
# ==============================================================================
import pandas as pd
import matplotlib.pyplot as plt
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.datasets import fetch_dataset
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from skforecast.model_selection import backtesting_forecaster

# Download data
# ==============================================================================
data = fetch_dataset(
    name="h2o", raw=True, kwargs_read_csv={"names": ["y", "datetime"], "header": 0}
)

# Data preprocessing
# ==============================================================================
data['datetime'] = pd.to_datetime(data['datetime'], format='%Y-%m-%d')
data = data.set_index('datetime')
data = data.asfreq('MS')
data = data['y']
data = data.sort_index()

# Split train-test
# ==============================================================================
steps = 36
data_train = data[:-steps]
data_test  = data[-steps:]

# Create and fit forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor     = RandomForestRegressor(random_state=123),
                 lags          = 15,
                 transformer_y = StandardScaler()
             )

forecaster.fit(y=data_train)

# Backtesting on training data
# ==============================================================================
metric, predictions_training = backtesting_forecaster(
                                   forecaster         = forecaster,
                                   y                  = data_train,
                                   steps              = 1,
                                   metric             = 'mean_squared_error',
                                   initial_train_size = None,
                                   refit              = False,
                                   verbose            = False,
                                   show_progress      = True
                               )

print(f"Backtest training error: {metric}")
predictions_training.head(3)

# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 3))
data_train.plot(ax=ax, label='train')
predictions_training['pred'].plot(ax=ax, label='fitted')
ax.legend();

We will update the user guide to warn about this issue. Thank you very much!

schoulten commented 2 weeks ago

Hi @JavierEscobarOrtiz ,

Thanks for the fast reply! I ended up using the first suggested solution, which plays nice in my existing workflow.

JoaquinAmatRodrigo / skforecast