Closed: fsenzel closed this issue 1 year ago
Using xgboost directly (with categorical features enabled) gives the following, expected behaviour:
import pandas as pd
import numpy as np
import xgboost as xgb  # missing in the original snippet; needed for xgb.XGBRegressor
from mlforecast import MLForecast

# Three series (ts1, ts2, ts3) with four 15-minute observations each.
df = (
    pd.DataFrame(
        {
            "ds": 3 * [pd.Timestamp("2017-01-01 00:15:00"), pd.Timestamp("2017-01-01 00:30:00"), pd.Timestamp("2017-01-01 00:45:00"), pd.Timestamp("2017-01-01 01:00:00")],
            "unique_id": 4 * ["ts1"] + 4 * ["ts2"] + 4 * ["ts3"],
            "y": np.arange(0, 12),
        },
    )
    .assign(unique_id=lambda df_: pd.Categorical(df_.unique_id))
)

fcst = MLForecast(
    models=[
        xgb.XGBRegressor(),
    ],
    freq='15T',
    date_features=["hour", "minute", "day", "month"],
)

# Use MLForecast only to build the feature matrices; the last timestamp is held out.
X_train, y_train = fcst.preprocess(df.query("ds < '2017-01-01 01:00'"), return_X_y=True)
X_test, y_test = fcst.preprocess(df.query("ds == '2017-01-01 01:00'"), return_X_y=True)

# Fit xgboost directly with categorical support (tree_method="gpu_hist" requires a GPU).
(
    df
    .query("ds == '2017-01-01 01:00'")
    .assign(
        XGBRegressor=xgb.XGBRegressor(tree_method="gpu_hist", enable_categorical=True).fit(X_train.drop(columns=["ds"]), y_train).predict(X_test.drop(columns=["ds"]))
    )
)
Hi @fsenzel, thanks for using mlforecast. If you want to use the id column as a feature you have to be explicit by setting static_features=['unique_id'] in the fit call.
Please let us know if this helps.
Hi @jmoralez,
thanks for your reply. Indeed, explicitly passing the "unique_id" field as a static feature resolves the problem and passes the id into the model as a categorical feature. One additional question: if the unique_id column is not used as a categorical feature by default, where else is it used during processing?
It's mainly used for computing the lag features. Internally we build an array with the series' values and use the id to know where each series begins and ends.
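To make that concrete, here is a rough pure-Python sketch of the idea. It is an illustration only, not mlforecast's actual implementation, and the names indptr and lag are hypothetical. All series are stored back to back in one flat list, and the ids are used only to derive the offsets where each series starts and ends, so that a lag never reaches into the previous series.

```python
from itertools import accumulate

# Toy data mirroring the issue: three series of four observations each,
# stored back to back in one flat list (sorted by id, then by time).
ids = ["ts1"] * 4 + ["ts2"] * 4 + ["ts3"] * 4
values = list(range(12))

# Count the rows of each series in order of appearance, then turn the
# counts into boundary offsets: series i occupies values[indptr[i]:indptr[i + 1]].
counts = {}
for uid in ids:
    counts[uid] = counts.get(uid, 0) + 1
indptr = [0, *accumulate(counts.values())]


def lag(values, indptr, k):
    """Shift each series by k steps without crossing series boundaries."""
    out = [None] * len(values)
    for start, end in zip(indptr[:-1], indptr[1:]):
        # The first k positions of each series have no lagged value.
        for i in range(start + k, end):
            out[i] = values[i - k]
    return out


lag1 = lag(values, indptr, 1)
print(indptr)  # [0, 4, 8, 12]
print(lag1)    # [None, 0, 1, 2, None, 4, 5, 6, None, 8, 9, 10]
```

The first observation of each series gets no lag value, because the boundaries stop the shift from borrowing the last value of the preceding series. In this sketch the id plays a purely structural role and never appears as a model feature.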
Thanks for the clarification, that makes sense!
What happened + What you expected to happen
When using MLForecast's fit/predict or cross-validation routines with a regression model such as LightGBM or XGBoost, the id_col ('unique_id') is not used as a feature during fitting, so entries with different unique ids get the same predicted values. This happens even if the column's dtype is set to the pandas category dtype. I would expect either that a separate model is trained and used for scoring per time series ('local model'), or that all time series are used to train a single model with unique_id as a categorical feature ('global model').
Versions / Dependencies
mlforecast==0.7.4 xgboost==1.7.5
Reproduction script
Result
Issue Severity
Medium: It is a significant difficulty but I can work around it.