JoaquinAmatRodrigo / skforecast

Time series forecasting with machine learning models
https://skforecast.org
BSD 3-Clause "New" or "Revised" License

Good First Issue: Allow `predict` method to accept date values as `steps` #811

Open JavierEscobarOrtiz opened 1 week ago

JavierEscobarOrtiz commented 1 week ago

Use branch 0.14.x as base.

Summary

Currently, the steps parameter in all Forecasters' predict methods only accepts an integer value. This integer defines how many observations to forecast into the future. We would like to extend this functionality so that steps can also accept a date (e.g., '2020-01-01'). If a date is provided, the function should calculate the appropriate number of observations corresponding to the time window between the last observation in the last window and the given date.

Task

  1. Create an auxiliary function, _preprocess_steps_as_date(last_window: pd.Series, steps) in the utils module:
# Expected behavior
# ==============================================================================
last_window = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2020-01-01', periods=5, freq='D'))
_preprocess_steps_as_date(last_window, '2020-01-07') # expected output: 2

last_window = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2020-01-01', periods=5, freq='D'))
_preprocess_steps_as_date(last_window, 2) # expected output: 2

last_window = pd.Series([1, 2, 3, 4, 5], index=pd.RangeIndex(start=0, stop=5, step=1))
_preprocess_steps_as_date(last_window, '2020-01-07') # expected output: TypeError
  2. Integrate this function in the predict method of the ForecasterAutoreg class.

Acceptance Criteria

Full Example

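# Expected behavior
# ==============================================================================
from lightgbm import LGBMRegressor
from skforecast.datasets import fetch_dataset
from skforecast.ForecasterAutoreg import ForecasterAutoreg

data = fetch_dataset(name="h2o", kwargs_read_csv={"names": ["y", "datetime"], "header": 0})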

steps = 36
data_train = data[:-steps]
data_test  = data[-steps:]

forecaster = ForecasterAutoreg(
                 regressor = LGBMRegressor(random_state=123, verbose=-1),
                 lags      = 15 
             )
forecaster.fit(y=data_train['y'])

predictions = forecaster.predict(steps='2005-09-01') # As steps=3

2005-07-01    1.020833
2005-08-01    1.021721
2005-09-01    1.093488
Freq: MS, Name: pred, dtype: float64
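The comment "As steps=3" can be checked with plain pandas. The sketch below assumes the training series ends on 2005-06-01, that is, one month-start period before the first predicted date shown in the output; the variable names are illustrative only.

```python
import pandas as pd

# Assumption: last training observation, one "MS" period before 2005-07-01.
last_train_date = pd.Timestamp("2005-06-01")
freq = pd.offsets.MonthBegin(1)

# Dates to predict: from the period after the last observation up to the
# requested date, inclusive.
future_index = pd.date_range(
    start=last_train_date + freq, end="2005-09-01", freq="MS"
)
steps = len(future_index)
print(list(future_index.strftime("%Y-%m-%d")))
print(steps)  # 3
```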

JavierEscobarOrtiz commented 1 week ago

@astrojuanlu

imMoya commented 6 days ago

Hello, I took part in the PyData event (although working on another repository) and just had a quick look at this. Here is my progress so far, along with some questions about the full example.

def test_preprocess_steps_as_date_with_int():
    last_window = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2020-01-01', periods=5, freq='D'))
    assert preprocess_steps_as_date(last_window, 2) == 2

def test_preprocess_steps_as_date_with_date():
    last_window = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2020-01-01', periods=5, freq='D'))
    assert preprocess_steps_as_date(last_window, '2020-01-07') == 2

def test_preprocess_steps_as_date_with_date_before():
    last_window = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2020-01-01', periods=5, freq='D'))
    with pytest.raises(ValueError):
        preprocess_steps_as_date(last_window, '2020-01-04')

def test_preprocess_steps_as_date_with_rangeindex():
    last_window = pd.Series([1, 2, 3, 4, 5], index=pd.RangeIndex(start=0, stop=5, step=1))
    with pytest.raises(TypeError):
        preprocess_steps_as_date(last_window, '2020-01-07')

Should an error be raised if the date format is different from YYYY-MM-DD?

* FullExample
``` python
from skforecast.utils import preprocess_steps_as_date
from skforecast.datasets import fetch_dataset
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from lightgbm import LGBMRegressor
import pandas as pd

if __name__ == "__main__":
    # Expected behavior
    # ==============================================================================
    data = fetch_dataset(name="h2o", kwargs_read_csv={"names": ["y", "datetime"], "header": 0})

    # New lines to set the index of the dataframe
    data['datetime'] = pd.to_datetime(data['datetime'])
    data.set_index('datetime', inplace=True)

    steps = 36
    data_train = data[:-steps]
    data_test  = data[-steps:]

    forecaster = ForecasterAutoreg(
                    regressor = LGBMRegressor(random_state=123, verbose=-1),
                    lags      = 15 
                )
    forecaster.fit(y=data_train['y'])
    predictions = forecaster.predict(steps='2005-09-01') # As steps=3
    print(predictions)
```

Output:

/Users/*/opt/anaconda3/envs/skf_env/bin/python /Users/*/Library/CloudStorage/OneDrive-Personal/obsidian/01_projects/skforecast/dates.py
h2o
---
Monthly expenditure ($AUD) on corticosteroid drugs that the Australian health
system had between 1991 and 2008.
Hyndman R (2023). fpp3: Data for Forecasting: Principles and Practice(3rd
Edition). http://pkg.robjhyndman.com/fpp3package/,https://github.com/robjhyndman
/fpp3package, http://OTexts.com/fpp3.
Shape of the dataset: (204, 2)
/Users/*/Library/CloudStorage/OneDrive-Personal/obsidian/01_projects/skforecast/skforecast/utils/utils.py:1190: UserWarning: Series has DatetimeIndex index but no frequency. Index is overwritten with a RangeIndex of step 1.
  warnings.warn(
/Users/*/Library/CloudStorage/OneDrive-Personal/obsidian/01_projects/skforecast/skforecast/utils/utils.py:1190: UserWarning: Series has DatetimeIndex index but no frequency. Index is overwritten with a RangeIndex of step 1.
  warnings.warn(
/Users/*/Library/CloudStorage/OneDrive-Personal/obsidian/01_projects/skforecast/skforecast/utils/utils.py:1252: UserWarning: `last_window` has DatetimeIndex index but no frequency. Index is overwritten with a RangeIndex of step 1.
  warnings.warn(
15     1.020833
16     1.021721
17     1.093488
18     1.145198
19     1.131161
         ...   
102    1.160739
103    1.144070
104    1.164184
105    1.137746
106    0.719586

As you can see, the index of the output does not contain the dates. For the data provided, the functions preprocess_y and preprocess_last_window see .index.freq as None. Should the frequency be set when defining the dataframe, or should we adapt these methods? If you can guide me, I can look into it a bit further. Cheers! :)
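One possible way to address the "DatetimeIndex index but no frequency" warning is to set the frequency on the dataframe before fitting. The sketch below uses a small made-up dataframe rather than the h2o dataset, and assumes month-start ("MS") data. It only shows the pandas behavior: an index built with set_index has no frequency until asfreq is applied.

```python
import pandas as pd

# Made-up stand-in data: month-start dates stored as strings.
raw = pd.DataFrame({
    "datetime": ["1991-07-01", "1991-08-01", "1991-09-01", "1991-10-01"],
    "y": [1.0, 2.0, 3.0, 4.0],
})
raw["datetime"] = pd.to_datetime(raw["datetime"])
raw = raw.set_index("datetime")
print(raw.index.freq)  # None: set_index does not infer a frequency

# asfreq reindexes to a regular monthly grid and attaches the frequency.
data = raw.asfreq("MS")
print(data.index.freqstr)  # MS

# Positional slicing keeps the frequency, so a train split keeps it too.
data_train = data[:-1]
print(data_train.index.freqstr)  # MS
```

Note that asfreq inserts rows of NaN for any month missing from the original index, so the result should be checked for gaps before fitting.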

JavierEscobarOrtiz commented 6 days ago

Hello @imMoya,

Thank you very much for your contribution; your code seems to be aligned with the library. Actually, the problem is a bit deeper than what was originally explained in the issue, so if you would like to go further with this implementation, we would be more than happy to help. Let me explain:

  1. Instead of creating a new function, our vision is to modify the utils.expand_index function to preprocess the steps argument (as you did in your code) and then return it as an integer. I have included some TODOs in the code:
def expand_index(
    index: Union[pd.Index, None], 
    steps: int
) -> pd.Index:
    """
    Create a new index of length `steps` starting at the end of the index.

    Parameters
    ----------
    index : pandas Index, None
        Original index.
    steps : int
        Number of steps to expand.

    Returns
    -------
    new_index : pandas Index
        New index.

    """
    # TODO: Update function docstring and typing
    # TODO: include the code needed to preprocess steps if it is a date

    if isinstance(index, pd.Index):

        if isinstance(index, pd.DatetimeIndex):
            new_index = pd.date_range(
                            start   = index[-1] + index.freq,
                            periods = steps,
                            freq    = index.freq
                        )
        elif isinstance(index, pd.RangeIndex):
            new_index = pd.RangeIndex(
                            start = index[-1] + 1,
                            stop  = index[-1] + 1 + steps
                        )
        else:
            raise TypeError(
                "Argument `index` must be a pandas DatetimeIndex or RangeIndex."
            )
    else:
        new_index = pd.RangeIndex(
                        start = 0,
                        stop  = steps
                    )

    # TODO: add `steps` as a return value
    return new_index
  2. In skforecast 0.14.0, this function is called in the _create_predict_inputs method of the forecaster (see ForecasterAutoreg). So the idea is also to include steps as a return value of this method, so that it can be passed on to the predict methods.
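The two points above could be sketched as follows. This is only an illustration of the TODOs, not the final skforecast code: the name expand_index_sketch, the accepted date types (str or pandas Timestamp), and the ValueError for a date that is not after the end of the index are assumptions.

```python
from typing import Optional, Tuple, Union

import pandas as pd


def expand_index_sketch(
    index: Optional[pd.Index],
    steps: Union[int, str, pd.Timestamp],
) -> Tuple[pd.Index, int]:
    """Expand `index` and return it together with `steps` as an integer."""
    # Date-like `steps`: build the future index up to that date and count it.
    if isinstance(steps, (str, pd.Timestamp)):
        if not isinstance(index, pd.DatetimeIndex):
            raise TypeError(
                "`steps` can only be a date when `index` is a DatetimeIndex."
            )
        new_index = pd.date_range(
            start=index[-1] + index.freq,
            end=pd.Timestamp(steps),
            freq=index.freq,
        )
        if len(new_index) == 0:
            # Assumption: a date not after the end of the index is an error.
            raise ValueError("`steps` must be later than the end of `index`.")
        return new_index, len(new_index)

    # Integer `steps`: same logic as the original function.
    if isinstance(index, pd.DatetimeIndex):
        new_index = pd.date_range(
            start=index[-1] + index.freq, periods=steps, freq=index.freq
        )
    elif isinstance(index, pd.RangeIndex):
        new_index = pd.RangeIndex(
            start=index[-1] + 1, stop=index[-1] + 1 + steps
        )
    elif isinstance(index, pd.Index):
        raise TypeError(
            "Argument `index` must be a pandas DatetimeIndex or RangeIndex."
        )
    else:
        new_index = pd.RangeIndex(start=0, stop=steps)

    return new_index, steps


index = pd.date_range("2005-04-01", periods=3, freq="MS")  # ends 2005-06-01
new_index, n_steps = expand_index_sketch(index, "2005-09-01")
print(n_steps)  # 3
```

Returning the integer alongside the new index lets the caller (for example _create_predict_inputs) forward a plain integer to the rest of the prediction code, whatever type the user passed.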

Some recommendations that we can give you:

If you have any questions, we will be happy to help!

Best,

Javi

imMoya commented 5 days ago

Hi Javi, I forgot to reference the issue in my commit. I'm making some changes in my fork, on a new branch. I'm having trouble setting the correct index frequency on the dataframe so that I can test the methods properly, but I will post an update when I make progress. Cheers, Nacho