[Core] same values are scored although different time series

fsenzel commented 1 year ago

What happened + What you expected to happen

When using MLForecast's fit/predict or cross-validation routines with a regression model such as LightGBM or XGBoost, the id_col ('unique_id') is not used as a feature during fitting, so series with different unique ids receive identical predictions. This happens even when the column's dtype is set to pandas' category dtype. I would expect either that a separate model is trained and used for scoring each time series ('local model'), or that all time series are used to train a single model with unique_id as a categorical feature ('global model').

Versions / Dependencies

mlforecast==0.7.4 xgboost==1.7.5

Reproduction script

import pandas as pd
import numpy as np
import xgboost as xgb  # needed for xgb.XGBRegressor below
from mlforecast import MLForecast

df = (
    pd.DataFrame(
        {
            "ds" : 3*[pd.Timestamp("2017-01-01 00:15:00"), pd.Timestamp("2017-01-01 00:30:00"), pd.Timestamp("2017-01-01 00:45:00"), pd.Timestamp("2017-01-01 01:00:00")],
            "unique_id" : 4*["ts1"] + 4*["ts2"] + 4*["ts3"],
            "y" : np.arange(0,12)
        },
    )
    .assign(unique_id = lambda df_: pd.Categorical(df_.unique_id))
)

fcst = MLForecast(
    models=[
        xgb.XGBRegressor(),
    ],
    freq='15T',
    date_features=["hour", "minute", "day", "month"]
)

scores = (
    fcst.fit(
        df,
    )
    .predict(horizon=1)
)

Result: [Screenshot 2023-07-24 at 12:00:54]

Issue Severity

Medium: It is a significant difficulty but I can work around it.

fsenzel commented 1 year ago

When using xgboost directly (with categorical features enabled), one obtains the following, expected behaviour:

import pandas as pd
import numpy as np
import xgboost as xgb  # needed for xgb.XGBRegressor below
from mlforecast import MLForecast

df = (
    pd.DataFrame(
        {
            "ds" : 3*[pd.Timestamp("2017-01-01 00:15:00"), pd.Timestamp("2017-01-01 00:30:00"), pd.Timestamp("2017-01-01 00:45:00"), pd.Timestamp("2017-01-01 01:00:00")],
            "unique_id" : 4*["ts1"] + 4*["ts2"] + 4*["ts3"],
            "y" : np.arange(0,12)
        },
    )
    .assign(unique_id = lambda df_: pd.Categorical(df_.unique_id))
)

fcst = MLForecast(
    models=[
        xgb.XGBRegressor(),
    ],
    freq='15T',
    date_features=["hour", "minute", "day", "month"]
)

X_train, y_train = fcst.preprocess(df.query("ds < '2017-01-01 01:00'"), return_X_y=True)
X_test, y_test = fcst.preprocess(df.query("ds == '2017-01-01 01:00'"), return_X_y=True)

(
    df
    .query("ds == '2017-01-01 01:00'")
    .assign(
        XGBRegressor=xgb.XGBRegressor(tree_method="gpu_hist", enable_categorical=True).fit(X_train.drop(columns=["ds"]), y_train).predict(X_test.drop(columns=["ds"]))
    )
)

[Screenshot 2023-07-24 at 13:08:04]

jmoralez commented 1 year ago

Hi @fsenzel, thanks for using mlforecast. If you want to use the id column as a feature you have to be explicit by setting static_features=['unique_id'] in the fit call.

Please let us know if this helps.
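
To illustrate why the forecasts coincide when the id is not a feature: the reproduction script configures only date features and no lags, so at the forecast timestamp every series presents the same feature row to the model. The sketch below uses only the standard library. The helpers `date_features` and `n_distinct` are hypothetical names introduced for illustration, and the code mimics the idea rather than mlforecast's internals.

```python
from datetime import datetime, timedelta


def date_features(ts):
    """Hypothetical helper: the four date features from the reproduction script."""
    return {"hour": ts.hour, "minute": ts.minute, "day": ts.day, "month": ts.month}


def n_distinct(rows):
    """Count distinct feature rows (dicts) in a list."""
    return len({tuple(sorted(r.items())) for r in rows})


series_ids = ["ts1", "ts2", "ts3"]
last_seen = datetime(2017, 1, 1, 1, 0)       # last timestamp of every series
next_ts = last_seen + timedelta(minutes=15)  # horizon=1 at 15-minute frequency

# Without the id, each series gets exactly the same feature row,
# so a deterministic model must return the same prediction for all of them.
rows_without_id = [date_features(next_ts) for _ in series_ids]

# With the id added as a static feature, the rows become distinguishable.
rows_with_id = [{"unique_id": uid, **date_features(next_ts)} for uid in series_ids]

print(n_distinct(rows_without_id))  # 1
print(n_distinct(rows_with_id))     # 3
```

With lag features configured, the rows would usually differ across series even without the id, because each series contributes its own history.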

fsenzel commented 1 year ago

Hi @jmoralez,

thanks for your reply. Indeed, explicitly passing the "unique_id" field as a static feature resolves the problem and passes the id into the model as a categorical feature. One additional question: if the unique_id column is not used as a categorical feature by default, where else is it used during processing?

jmoralez commented 1 year ago

It's mainly used for computing the lag features. Internally we build an array with the series' values and use the id to know where each series begins and ends.
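
A minimal sketch of that idea, assuming the rows are already sorted by id and time: all values go into one flat list, and a list of offsets (called `indptr` here) records where each series starts and ends, so lags never leak across series. The function names `build_flat_series` and `lag` are hypothetical, and this is not mlforecast's actual implementation.

```python
from itertools import accumulate


def build_flat_series(rows):
    """rows: (unique_id, y) pairs sorted by id and time.

    Returns the distinct ids, the flat list of values, and the offsets
    (indptr) such that series i occupies values[indptr[i]:indptr[i + 1]].
    """
    ids, values, counts = [], [], []
    for uid, y in rows:
        if not ids or ids[-1] != uid:
            ids.append(uid)
            counts.append(0)
        values.append(y)
        counts[-1] += 1
    indptr = [0, *accumulate(counts)]
    return ids, values, indptr


def lag(values, indptr, k):
    """Lag-k feature computed within each series; None where there is no history."""
    out = []
    for start, end in zip(indptr[:-1], indptr[1:]):
        segment = values[start:end]
        n_missing = min(k, len(segment))
        out.extend([None] * n_missing + segment[: len(segment) - n_missing])
    return out


# Same layout as the reproduction data: three series of four points, y = 0..11.
rows = [(uid, y) for y, uid in enumerate(4 * ["ts1"] + 4 * ["ts2"] + 4 * ["ts3"])]
ids, values, indptr = build_flat_series(rows)

print(indptr)                  # [0, 4, 8, 12]
print(lag(values, indptr, 1))  # [None, 0, 1, 2, None, 4, 5, 6, None, 8, 9, 10]
```

The first value of each series has no lag-1 predecessor, which is why a `None` appears at every series boundary instead of the last value of the previous series.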

fsenzel commented 1 year ago

Thanks for the clarification, that makes sense!

Nixtla / mlforecast