Nixtla / mlforecast

Scalable machine learning 🤖 for time series forecasting.
https://nixtlaverse.nixtla.io/mlforecast
Apache License 2.0

[forecast] fit / predict outputs different results when using the pipeline for multiple timeseries VS one at a time #216

Closed Cokral closed 10 months ago

Cokral commented 11 months ago

What happened + What you expected to happen

I am trying MLForecast with multiple time-series (using the unique_id variable in my data).

In this example, I have a dataset with 2 unique_id values ["a", "b"]. I want to use a single MLForecast to model / fit / predict. The models are very simple and only have one feature: a lag of 1. But I noticed the following behaviour: I get different results if I fit one MLForecast per unique_id vs one MLForecast on both.

My code is the following:

  1. Case where I fit / predict the timeseries with one MLForecast
mlf = MLForecast(
    models=[LGBMRegressor()],
    freq="W",
    num_threads=10,
    lags=[1]
)

mlf = mlf.fit(df)

mlf.predict(1)

Outputs:

  unique_id         ds  LGBMRegressor
0         a 2023-08-06         1.6875
1         b 2023-08-06         1.6875
  2. Case where I fit / predict the timeseries with two MLForecast objects, one per series
mlf = MLForecast(
    models=[LGBMRegressor()],
    freq="W",
    num_threads=10,
    lags=[1]
)

mlf = mlf.fit(df[df.unique_id == "a"]) # I filter on the unique_id a

mlf.predict(1)

Outputs:

  unique_id         ds  LGBMRegressor
0         a 2023-08-06            2.9
mlf = MLForecast(
    models=[LGBMRegressor()],
    freq="W",
    num_threads=10,
    lags=[1]
)

mlf = mlf.fit(df[df.unique_id == "b"]) # I filter on the unique_id b

mlf.predict(1)

Outputs:

  unique_id         ds  LGBMRegressor
0         b 2023-08-06          0.475

As you can see in the example above, I get different results if I fit one MLForecast on both series versus one per series. My expectation was that both approaches would produce the same results, with each unique series treated independently.

Versions / Dependencies

python=3.9.10 mlforecast=0.9.1 pandas=1.5.3 lightgbm=4.0.0

Reproduction script

import pandas as pd
from lightgbm import LGBMRegressor
from mlforecast import MLForecast

# Generate the example data
df_a = pd.DataFrame({
    "ds": pd.to_datetime(["2023-07-02", "2023-07-09", "2023-07-16", "2023-07-23", "2023-07-30"]),
    "y": [2.8, 2.9, 3.0, 2.9, 2.8],
    "unique_id": ["a", "a", "a", "a", "a"]
})
df_b = pd.DataFrame({
    "ds": pd.to_datetime(["2023-07-02", "2023-07-09", "2023-07-16", "2023-07-23", "2023-07-30"]),
    "y": [0.29, 0.26, 0.67, 0.56, 0.41],
    "unique_id": ["b", "b", "b", "b", "b"]
})
df = pd.concat([df_a, df_b])

# Case 1, both timeseries
mlf = MLForecast(
    models=[LGBMRegressor()],
    freq="W",
    num_threads=10,
    lags=[1]
)

mlf = mlf.fit(df)

print(mlf.predict(1))

# Case 2
mlf = MLForecast(
    models=[LGBMRegressor()],
    freq="W",
    num_threads=10,
    lags=[1]
)

mlf = mlf.fit(df_a)

print(mlf.predict(1))

mlf = MLForecast(
    models=[LGBMRegressor()],
    freq="W",
    num_threads=10,
    lags=[1]
)

mlf = mlf.fit(df_b)

print(mlf.predict(1))

Issue Severity

Medium: It is a significant difficulty but I can work around it.

jmoralez commented 11 months ago

Hey @Cokral, thanks for the detailed report. mlforecast trains a single (global) model on all series, so in the two-series case the model learns its parameters from both series combined. This is because a global model usually performs better, and it's faster to train. If your series aren't very similar you may want to try statsforecast, which trains one model per series.

Please let us know if you have further doubts.

Cokral commented 11 months ago

Thanks for your answer. I realise that my previous code example didn't show the exact issue, so I updated it to use LGBMRegressor (which is the model we are also using in my case).

Is this behaviour still expected, then? So your advice is to either use StatsForecast or separate MLForecast objects if the time series differ a lot?

brdeleeuw commented 11 months ago

Thanks for the answer @jmoralez. Is this achieved by using the unique_id as a categorical feature in the underlying regressors, or in some other way?

jmoralez commented 11 months ago

@Cokral Yes, that's expected. I'd suggest trying a couple of things with the global model first:

  1. Using the unique_id as a categorical feature. You can achieve this by giving it a categorical data type and listing it in static_features when fitting, e.g.

    df['unique_id'] = df['unique_id'].astype('category')
    mlf = MLForecast(...)
    mlf.fit(df, static_features=['unique_id'])

    This way the id is used as a categorical feature by LightGBM, which can improve performance.

  2. Scaling your data. Having your series on a similar scale could help the model learn common patterns that are otherwise obscured by scale differences; you may find this guide useful.

If that doesn't work for you then you could use either approach, one StatsForecast object or several MLForecast objects.

@brdeleeuw the unique_id isn't directly used as a categorical feature unless you do what I described in step 1 above. By default the model is trained only on the features you provide (lags, lag_transforms, date_features, etc). It's more of an empirical task to figure out whether using it helps or not; sometimes the other features are enough to get a good result.

github-actions[bot] commented 10 months ago

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one.