Nixtla / mlforecast

Scalable machine 🤖 learning for time series forecasting.
https://nixtlaverse.nixtla.io/mlforecast
Apache License 2.0
789 stars 74 forks

[date features]: dayofweek_cat - day of week as a one hot encoding feature #315

Open nelsoncardenas opened 4 months ago

nelsoncardenas commented 4 months ago

Description

While searching the documentation for how dayofweek is used in date_features, I noticed that dayofweek is treated as an ordinal feature. However, for models such as linear regression, representing it as a one-hot encoded feature could be more effective.
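
To make the ordinal vs. one-hot distinction concrete, here is a minimal sketch using only the Python standard library. The helper name `one_hot_dayofweek` and the `is_<day>` column names are hypothetical and not part of mlforecast; the sketch only illustrates the two encodings for one week of dates.

```python
from datetime import date, timedelta

DAY_NAMES = ["monday", "tuesday", "wednesday", "thursday",
             "friday", "saturday", "sunday"]

def one_hot_dayofweek(d):
    """Return one 0/1 indicator per weekday (hypothetical helper)."""
    dow = d.weekday()  # Monday == 0 ... Sunday == 6, same convention as pandas dayofweek
    return {f"is_{name}": int(i == dow) for i, name in enumerate(DAY_NAMES)}

start = date(2024, 1, 1)  # a Monday
week = [start + timedelta(days=i) for i in range(7)]

# Ordinal encoding: a single integer column, which a linear model
# treats as a magnitude (Sunday = 6 "counts" six times more than Tuesday = 1).
ordinal = [d.weekday() for d in week]

# One-hot encoding: seven indicator columns, so each day gets its own coefficient.
one_hot = [one_hot_dayofweek(d) for d in week]

print(ordinal)
print(one_hot[0])
```

With the ordinal column a linear regression can only fit one slope across the whole week, whereas the indicator columns let it learn a separate effect per day.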

Here are some suggestions I've considered:

Use case

A user wants to:

My test

import pandas as pd
from mlforecast import MLForecast
from mlforecast.utils import generate_daily_series

def dayofweek_cat(dates):
    num_to_text = {
        0: "monday",
        1: "tuesday",
        2: "wednesday",
        3: "thursday",
        4: "friday",
        5: "saturday",
        6: "sunday",
    }
    df_dayofweek_cat = pd.get_dummies(dates.dayofweek).astype("uint8")
    # Rename the numeric dummy columns (0-6) to readable names such as "is_monday".
    df_dayofweek_cat.columns = [f"is_{num_to_text[col]}" for col in df_dayofweek_cat.columns]
    return df_dayofweek_cat

series = generate_daily_series(1, min_length=6, max_length=6)
print(f"output dayofweek_cat function: {dayofweek_cat(series['ds'].dt).columns}")

fcst = MLForecast([], freq="D", date_features=["dayofweek", "dayofyear", dayofweek_cat])
fcst.preprocess(series)

jmoralez commented 4 months ago

Hey @nelsoncardenas, thanks for using mlforecast and for the detailed report. I think the easiest way to achieve this is with a scikit-learn pipeline. Here's an example:

import pandas as pd
from mlforecast import MLForecast
from mlforecast.utils import generate_daily_series
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

series = generate_daily_series(1, min_length=7, max_length=7)
model = make_pipeline(
    ColumnTransformer(
        [('encoder', OneHotEncoder(drop='first'), ['dayofweek'])],
        remainder='passthrough'
    ),
    LinearRegression()
)
fcst = MLForecast(models={'lr': model}, freq="D", date_features=["dayofweek"])
fcst.fit(series)
print(fcst.models_['lr'].named_steps['linearregression'].n_features_in_)  # 6

The available attributes are:

If you have time and would like to do it, we'd appreciate a PR that explicitly lists the supported ones.

nelsoncardenas commented 4 months ago

Thank you @jmoralez I'd like to help with that PR.

What would be the suggested steps?

jmoralez commented 4 months ago

I think you could add two lists (one for pandas and one for polars) in the nbs/core.ipynb notebook. We have this file with some contributing guidelines, but the first step should be to fork this repository and work on your fork instead (I'll fix that soon). Let me know if you have any questions.

nelsoncardenas commented 4 months ago

@jmoralez Thank you. During the week I will dedicate some free time to it.