schoulten opened this issue 2 weeks ago
Hello @schoulten,
Thanks for using skforecast and opening the issue.
What happens is that the regressor inside the forecaster is fitted on the transformed data, so when you call its predict method directly, the predictions are also on the transformed scale.
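To see why the raw predictions come back standardized, here is a minimal sketch with plain scikit-learn (a hypothetical `LinearRegression` on a toy series, not skforecast's internals):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy target far from standardized units
y = np.array([100.0, 102.0, 104.0, 106.0, 108.0, 110.0])
X = np.arange(len(y)).reshape(-1, 1)

# Fit the regressor on the *scaled* target, as the forecaster does internally
scaler = StandardScaler()
y_scaled = scaler.fit_transform(y.reshape(-1, 1)).ravel()
reg = LinearRegression().fit(X, y_scaled)

pred_scaled = reg.predict(X)  # predictions in standardized units (mean ~0)
pred = scaler.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()  # back to original units
```

Calling `inverse_transform` on the raw predictions recovers the original scale.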
Two possible solutions:

1. Use the `transformer_y` attribute, which stores the fitted transformer, and apply its `inverse_transform` method:

```python
# Libraries
# ==============================================================================
import pandas as pd
import matplotlib.pyplot as plt
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.datasets import fetch_dataset
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Download data
# ==============================================================================
data = fetch_dataset(
    name="h2o", raw=True, kwargs_read_csv={"names": ["y", "datetime"], "header": 0}
)

# Data preprocessing
# ==============================================================================
data['datetime'] = pd.to_datetime(data['datetime'], format='%Y-%m-%d')
data = data.set_index('datetime')
data = data.asfreq('MS')
data = data['y']
data = data.sort_index()

# Split train-test
# ==============================================================================
steps = 36
data_train = data[:-steps]
data_test = data[-steps:]

# Create and fit forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
    regressor     = RandomForestRegressor(random_state=123),
    lags          = 15,
    transformer_y = StandardScaler()
)
forecaster.fit(y=data_train)

# Create training matrices
# ==============================================================================
X_train, y_train = forecaster.create_train_X_y(data_train)

# Predict using the internal regressor (output is on the transformed scale)
# ==============================================================================
predictions_training = forecaster.regressor.predict(X_train)
predictions_training = pd.Series(
    data  = predictions_training,
    index = data_train.iloc[-len(predictions_training):].index
)

# Inverse transform back to the original scale
# ==============================================================================
predictions_training_inverse = forecaster.transformer_y.inverse_transform(
    predictions_training.to_numpy().reshape(-1, 1)
)
predictions_training_inverse = pd.Series(
    data  = predictions_training_inverse.flatten(),
    index = predictions_training.index
)

# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 3))
data_train.plot(ax=ax, label='train')
predictions_training_inverse.plot(ax=ax, label='predictions')
ax.legend();
```
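The `reshape(-1, 1)` / `flatten()` round trip is needed because scikit-learn transformers expect 2-D arrays. The same pattern on a self-contained toy Series:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy monthly series standing in for the training data
s = pd.Series(
    [10.0, 20.0, 30.0, 40.0],
    index=pd.date_range("2020-01-01", periods=4, freq="MS"),
)

# Transformers take 2-D input, so reshape the 1-D values into a column
scaler = StandardScaler().fit(s.to_numpy().reshape(-1, 1))
scaled = scaler.transform(s.to_numpy().reshape(-1, 1)).flatten()

# Invert: reshape to a column, inverse_transform, flatten back to 1-D
restored = pd.Series(
    scaler.inverse_transform(scaled.reshape(-1, 1)).flatten(),
    index=s.index,
)
```

After the round trip, `restored` matches the original series and keeps its DatetimeIndex.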
2. Use `backtesting_forecaster` to predict the training data, as described in the user guide:

```python
# Libraries
# ==============================================================================
import pandas as pd
import matplotlib.pyplot as plt
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.datasets import fetch_dataset
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from skforecast.model_selection import backtesting_forecaster

# Download data
# ==============================================================================
data = fetch_dataset(
    name="h2o", raw=True, kwargs_read_csv={"names": ["y", "datetime"], "header": 0}
)

# Data preprocessing
# ==============================================================================
data['datetime'] = pd.to_datetime(data['datetime'], format='%Y-%m-%d')
data = data.set_index('datetime')
data = data.asfreq('MS')
data = data['y']
data = data.sort_index()

# Split train-test
# ==============================================================================
steps = 36
data_train = data[:-steps]
data_test = data[-steps:]

# Create and fit forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
    regressor     = RandomForestRegressor(random_state=123),
    lags          = 15,
    transformer_y = StandardScaler()
)
forecaster.fit(y=data_train)

# Backtesting on training data
# ==============================================================================
metric, predictions_training = backtesting_forecaster(
    forecaster         = forecaster,
    y                  = data_train,
    steps              = 1,
    metric             = 'mean_squared_error',
    initial_train_size = None,
    refit              = False,
    verbose            = False,
    show_progress      = True
)
print(f"Backtest training error: {metric}")
predictions_training.head(3)

# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 3))
data_train.plot(ax=ax, label='train')
predictions_training.plot(ax=ax, label='predictions')
ax.legend();
```
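Conceptually, with `steps=1`, `refit=False`, and `initial_train_size=None`, the backtest yields one-step-ahead predictions over the training set from a model fitted once. A rough stand-in with a hand-built lag matrix and a `LinearRegression` (a sketch of the idea, not skforecast's implementation; 3 lags instead of the forecaster's 15):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy series; a sine wave satisfies an exact linear recurrence in its lags
y = pd.Series(np.sin(np.arange(40) / 4.0))
n_lags = 3

# Lag matrix: row t holds y[t-1], y[t-2], y[t-3]; drop the first rows with NaNs
X = np.column_stack([y.shift(k) for k in range(1, n_lags + 1)])[n_lags:]
target = y.to_numpy()[n_lags:]

# Fit once on the whole training window, then predict each point one step ahead
model = LinearRegression().fit(X, target)
in_sample_preds = pd.Series(model.predict(X), index=y.index[n_lags:])
```

The first `n_lags` observations cannot be predicted, which is why in-sample predictions are shorter than the training series.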
We will update the user guide to warn about this issue. Thank you very much!
Hi @JavierEscobarOrtiz,
Thanks for the fast reply! I ended up using the first suggested solution, which fits nicely into my existing workflow.
The problem
To generate in-sample predictions (a.k.a. fitted values), you need to create the training matrices with `.create_train_X_y()` and pass them to the internal regressor's `.predict()`, as described in the docs. But when a transformation is given to `ForecasterAutoreg()`, it appears that the in-sample predictions are not reverted to the original scale of the data. Is there something I am missing, or is there a way to revert the transformation?