Use branch 0.14.x as base.

Summary

Currently, the steps parameter in all Forecasters' predict methods only accepts an integer value. This integer defines how many observations to forecast into the future. We would like to extend this functionality so that steps can also accept a date (e.g., '2020-01-01'). If a date is provided, the function should calculate the appropriate number of observations corresponding to the time window between the last observation in the last window and the given date.

Task

Create an auxiliary function, _preprocess_steps_as_date(last_window: pd.Series, steps) in the utils module:

last_window is the last window of the series used to forecast the future. This is an argument of the predict method in all Forecasters.
steps can be an integer or any datetime format that pandas allows to be passed to a pd.DatetimeIndex (e.g., string, pandas timestamp...).
If the Forecaster was not fitted using a pd.DatetimeIndex, raise a TypeError with the message: "If the Forecaster was not fitted using a pd.DatetimeIndex, steps must be an integer."
If the Forecaster was fitted using a pd.DatetimeIndex, this function will return the length of the time window between the last observation in the last window and the given date as an integer value.
If the input steps is an integer, return the same integer.
Create unit tests using pytest in the utils.tests folder.

# Expected behavior
# ==============================================================================
last_window = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2020-01-01', periods=5, freq='D'))
_preprocess_steps_as_date(last_window, '2020-01-07') # expected output: 2

last_window = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2020-01-01', periods=5, freq='D'))
_preprocess_steps_as_date(last_window, 2) # expected output: 2

last_window = pd.Series([1, 2, 3, 4, 5], index=pd.RangeIndex(start=0, stop=5, step=1))
_preprocess_steps_as_date(last_window, '2020-01-07') # expected output: TypeError

Integrate this function in the predict method of the ForecasterAutoreg class.

Acceptance Criteria

[ ] The steps parameter accepts both integer and date formats.
[ ] The function correctly calculates the number of steps when a date is provided.
[ ] Existing tests continue to pass.
[ ] New test cases are added to verify the correct behavior for both int and date inputs.

Full Example

# Expected behavior
# ==============================================================================
data = fetch_dataset(name="h2o", kwargs_read_csv={"names": ["y", "datetime"], "header": 0})

steps = 36
data_train = data[:-steps]
data_test  = data[-steps:]

forecaster = ForecasterAutoreg(
                 regressor = LGBMRegressor(random_state=123, verbose=-1),
                 lags      = 15 
             )
forecaster.fit(y=data_train['y'])

predictions = forecaster.predict(steps='2005-09-01') # As steps=3

2005-07-01    1.020833
2005-08-01    1.021721
2005-09-01    1.093488
Freq: MS, Name: pred, dtype: float64
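The `# As steps=3` comment can be checked by hand. This is a minimal sketch, assuming the training series ends at 2005-06-01 (inferred from the first predicted date, 2005-07-01, and the monthly-start frequency in the output):

```python
import pandas as pd

# Assumption: last training observation, inferred from the forecast output.
last_observation = pd.Timestamp("2005-06-01")
target_date = pd.Timestamp("2005-09-01")

# Dates from the last observation up to the target, at monthly-start frequency.
window = pd.date_range(start=last_observation, end=target_date, freq="MS")

# The last observation itself is not a forecast step, so subtract one.
steps = len(window) - 1
print(steps)  # 3
```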

Hello, I took part in the PyData event (although working on another repository) and just had a quick look at this. Here is my progress, along with some doubts regarding the Full Example.

Function in utils.py (I removed the underscore to be able to import the function from the test script, don't really know if there is a better way...)

from typing import Union

import pandas as pd


def preprocess_steps_as_date(
    last_window: pd.Series,
    steps: Union[int, str, pd.Timestamp]
) -> int:
    # Without a DatetimeIndex, only an integer number of steps makes sense.
    if not isinstance(last_window.index, pd.DatetimeIndex):
        if isinstance(steps, int):
            return steps
        raise TypeError("If the Forecaster was not fitted using a pd.DatetimeIndex, steps must be an integer.")

    if isinstance(steps, (str, pd.Timestamp)):
        target_date = pd.to_datetime(steps)
        last_date = last_window.index[-1]
        if target_date <= last_date:
            raise ValueError("The provided date is earlier than or equal to the last observation date.")

        # Count the periods from the last observation up to the target date;
        # the range includes last_date itself, hence the -1.
        steps_diff = pd.date_range(start=last_date, end=target_date, freq=last_window.index.freq)
        return len(steps_diff) - 1

    # Integer steps are returned unchanged.
    return steps

test_preprocess_steps_as_date


# Unit test preprocess_steps_as_date
# ==============================================================================
import re
import pytest
import numpy as np
import pandas as pd
from skforecast.exceptions import IgnoredArgumentWarning
from skforecast.utils import preprocess_steps_as_date

def test_preprocess_steps_as_date_with_int():
    last_window = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2020-01-01', periods=5, freq='D'))
    assert preprocess_steps_as_date(last_window, 2) == 2


def test_preprocess_steps_as_date_with_date():
    last_window = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2020-01-01', periods=5, freq='D'))
    assert preprocess_steps_as_date(last_window, '2020-01-07') == 2


def test_preprocess_steps_as_date_with_date_before():
    last_window = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2020-01-01', periods=5, freq='D'))
    with pytest.raises(ValueError):
        preprocess_steps_as_date(last_window, '2020-01-04')


def test_preprocess_steps_as_date_with_rangeindex():
    last_window = pd.Series([1, 2, 3, 4, 5], index=pd.RangeIndex(start=0, stop=5, step=1))
    with pytest.raises(TypeError):
        preprocess_steps_as_date(last_window, '2020-01-07')

Raise error if format is different to YYYY-MM-DD?
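On the format question, it may help to see what `pd.to_datetime` already does with different spellings of the same date. The sketch below is illustrative only and not part of the proposed function. It suggests that several formats resolve to the same timestamp, that slash-separated day/month strings are read month-first by default, and that unparseable strings already raise an error.

```python
import pandas as pd

reference = pd.Timestamp(2020, 1, 7)

# Different spellings of the same date, plus an existing Timestamp.
candidates = ["2020-01-07", "2020/01/07", "January 7, 2020", pd.Timestamp("2020-01-07")]
parsed = [pd.to_datetime(candidate) for candidate in candidates]
all_same = all(value == reference for value in parsed)
print(all_same)  # True

# Caution: with the default dayfirst=False, this is read as July 1, not January 7.
ambiguous = pd.to_datetime("07/01/2020")
print(ambiguous)  # 2020-07-01 00:00:00

# Strings that cannot be parsed raise a ValueError (or a subclass of it).
try:
    pd.to_datetime("not a date")
    invalid_raises = False
except ValueError:
    invalid_raises = True
print(invalid_raises)  # True
```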

* FullExample
``` python
from skforecast.utils import preprocess_steps_as_date
from skforecast.datasets import fetch_dataset
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from lightgbm import LGBMRegressor
import pandas as pd

if __name__ == "__main__":
    # Expected behavior
    # ==============================================================================
    data = fetch_dataset(name="h2o", kwargs_read_csv={"names": ["y", "datetime"], "header": 0})

    # New lines to set the index of the dataframe
    data['datetime'] = pd.to_datetime(data['datetime'])
    data.set_index('datetime', inplace=True)

    steps = 36
    data_train = data[:-steps]
    data_test  = data[-steps:]

    forecaster = ForecasterAutoreg(
                    regressor = LGBMRegressor(random_state=123, verbose=-1),
                    lags      = 15 
                )
    forecaster.fit(y=data_train['y'])
    predictions = forecaster.predict(steps='2005-09-01') # As steps=3
    print(predictions)
```

Output:

/Users/*/opt/anaconda3/envs/skf_env/bin/python /Users/*/Library/CloudStorage/OneDrive-Personal/obsidian/01_projects/skforecast/dates.py
h2o
---
Monthly expenditure ($AUD) on corticosteroid drugs that the Australian health
system had between 1991 and 2008.
Hyndman R (2023). fpp3: Data for Forecasting: Principles and Practice(3rd
Edition). http://pkg.robjhyndman.com/fpp3package/,https://github.com/robjhyndman
/fpp3package, http://OTexts.com/fpp3.
Shape of the dataset: (204, 2)
/Users/*/Library/CloudStorage/OneDrive-Personal/obsidian/01_projects/skforecast/skforecast/utils/utils.py:1190: UserWarning: Series has DatetimeIndex index but no frequency. Index is overwritten with a RangeIndex of step 1.
  warnings.warn(
/Users/*/Library/CloudStorage/OneDrive-Personal/obsidian/01_projects/skforecast/skforecast/utils/utils.py:1190: UserWarning: Series has DatetimeIndex index but no frequency. Index is overwritten with a RangeIndex of step 1.
  warnings.warn(
/Users/*/Library/CloudStorage/OneDrive-Personal/obsidian/01_projects/skforecast/skforecast/utils/utils.py:1252: UserWarning: `last_window` has DatetimeIndex index but no frequency. Index is overwritten with a RangeIndex of step 1.
  warnings.warn(
15     1.020833
16     1.021721
17     1.093488
18     1.145198
19     1.131161
         ...   
102    1.160739
103    1.144070
104    1.164184
105    1.137746
106    0.719586

As you can see, the index of the output does not contain the dates. For the data provided, the functions preprocess_y and preprocess_last_window treat .index.freq as None. Should the frequency be set when defining the dataframe, or should we adapt these methods? If you can guide me, I can look into it a bit further. Cheers! :)
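One possible explanation, offered as a sketch rather than a confirmed diagnosis: an index built with `pd.to_datetime` from a column of dates carries no frequency, and `asfreq` is one way to attach it before fitting. The data below is synthetic, and the monthly-start frequency `'MS'` is an assumption based on the expected output shown in the issue.

```python
import pandas as pd

# Synthetic monthly data standing in for the real dataset.
df = pd.DataFrame({
    "datetime": ["2005-01-01", "2005-02-01", "2005-03-01", "2005-04-01"],
    "y": [1.0, 1.1, 1.2, 1.3],
})

# Same two steps as in the example: parse the column and set it as index.
df["datetime"] = pd.to_datetime(df["datetime"])
df = df.set_index("datetime")
print(df.index.freq)  # None

# Attach an explicit monthly-start frequency to the index.
df_with_freq = df.asfreq("MS")
print(df_with_freq.index.freqstr)  # MS
```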

Hello @imMoya,

Thank you very much for your contribution; your code seems to be aligned with the library. The problem is actually a bit deeper than what was originally explained in the issue, so if you would like to go further with this implementation, we would be more than happy for you to do so. Let me explain:

Instead of creating a new function, our vision is to modify the utils.expand_index function to preprocess the steps argument (as you did in your code) and then return it as an integer. I have included some TODOs in the code:

def expand_index(
    index: Union[pd.Index, None], 
    steps: int
) -> pd.Index:
    """
    Create a new index of length `steps` starting at the end of the index.

    Parameters
    ----------
    index : pandas Index, None
        Original index.
    steps : int
        Number of steps to expand.

    Returns
    -------
    new_index : pandas Index
        New index.

    """
    # TODO: Update function docstring and typing
    # TODO: include the code needed to preprocess steps if it is a date

    if isinstance(index, pd.Index):

        if isinstance(index, pd.DatetimeIndex):
            new_index = pd.date_range(
                            start   = index[-1] + index.freq,
                            periods = steps,
                            freq    = index.freq
                        )
        elif isinstance(index, pd.RangeIndex):
            new_index = pd.RangeIndex(
                            start = index[-1] + 1,
                            stop  = index[-1] + 1 + steps
                        )
        else:
            raise TypeError(
                "Argument `index` must be a pandas DatetimeIndex or RangeIndex."
            )
    else:
        new_index = pd.RangeIndex(
                        start = 0,
                        stop  = steps
                    )

    # TODO: add `steps` as a return value
    return new_index

This function, in skforecast 0.14.0, is called in the _create_predict_inputs method of the forecaster (see ForecasterAutoreg). So the idea is also to return steps from this method so that it can be passed on to the predict methods.

Some recommendations that we can give you:

Integrating your function with expand_index allows you to work directly with an index as an argument. In the Forecaster _create_predict_inputs method, we pass the last_window_index as this argument.
For the first error, I would change the message to something not tied to the Forecaster, because the user can call this function outside of a Forecaster object. In the expand_index function it would say something like "If index is not a pd.DatetimeIndex, steps must be an integer."
It would be good to raise an error if the index is a pd.DatetimeIndex with no freq, since pd.date_range might fail otherwise.
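The recommendations above could be put together roughly as follows. This is only a sketch under stated assumptions, not the library's implementation: the name `expand_index_sketch`, the tuple return, and the wording of the two `ValueError` messages are hypothetical, while the `TypeError` messages follow the texts quoted in this thread.

```python
from typing import Optional, Tuple, Union

import pandas as pd


def expand_index_sketch(
    index: Optional[pd.Index],
    steps: Union[int, str, pd.Timestamp],
) -> Tuple[pd.Index, int]:
    """Hypothetical variant of expand_index that also accepts a date as `steps`."""
    if isinstance(steps, (str, pd.Timestamp)):
        if not isinstance(index, pd.DatetimeIndex):
            raise TypeError(
                "If index is not a pd.DatetimeIndex, steps must be an integer."
            )
        if index.freq is None:
            # Hypothetical message: a frequency is needed to count periods.
            raise ValueError("`index` must have a frequency to use a date as `steps`.")
        target_date = pd.to_datetime(steps)
        if target_date <= index[-1]:
            # Hypothetical message.
            raise ValueError("The date must be later than the last value of `index`.")
        # The range includes index[-1] itself, hence the -1.
        steps = len(pd.date_range(start=index[-1], end=target_date, freq=index.freq)) - 1

    if isinstance(index, pd.DatetimeIndex):
        new_index = pd.date_range(
            start=index[-1] + index.freq, periods=steps, freq=index.freq
        )
    elif isinstance(index, pd.RangeIndex):
        new_index = pd.RangeIndex(start=index[-1] + 1, stop=index[-1] + 1 + steps)
    elif isinstance(index, pd.Index):
        raise TypeError(
            "Argument `index` must be a pandas DatetimeIndex or RangeIndex."
        )
    else:
        new_index = pd.RangeIndex(start=0, stop=steps)

    # `steps` is returned as an integer so the caller can pass it on.
    return new_index, steps


daily_index = pd.date_range("2020-01-01", periods=5, freq="D")
new_index, n_steps = expand_index_sketch(daily_index, "2020-01-07")
print(n_steps)          # 2
print(list(new_index))  # [Timestamp('2020-01-06 00:00:00'), Timestamp('2020-01-07 00:00:00')]
```

Note that in this sketch a date that does not fall on the index frequency is effectively rounded down to the previous period, and a date earlier than the next period yields zero steps; whether those cases should raise an error is left open here.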

If you have any questions, we will be happy to help!

Best,

Javi

Hi Javi, I forgot to reference the issue in my commit. I'm making some changes in my fork and have created a new branch. I'm having trouble defining the correct index frequency for the dataframe so that I can test the methods properly, but I will post an update if I make progress. Cheers, Nacho

JoaquinAmatRodrigo / skforecast

Good First Issue: Allow `predict` method to accept date values as `steps` #811
