Open JavierEscobarOrtiz opened 1 week ago
@astrojuanlu
Hello, I took part in the PyData event (although working on another repository) and just had a quick look at this issue. Here is my progress so far, plus some doubts regarding the FullExample.
* Function in utils.py
(I removed the leading underscore so that the function can be imported from the test script. I don't really know if there is a better way...)
def preprocess_steps_as_date(last_window: pd.Series, steps: Union[int, str, pd.Timestamp]) -> int:
    if not isinstance(last_window.index, pd.DatetimeIndex):
        if isinstance(steps, int):
            return steps
        raise TypeError("If the Forecaster was not fitted using a pd.DatetimeIndex, steps must be an integer.")
    if isinstance(steps, (str, pd.Timestamp)):
        target_date = pd.to_datetime(steps)
        last_date = last_window.index[-1]
        if target_date <= last_date:
            raise ValueError("The provided date is earlier than or equal to the last observation date.")
        steps_diff = pd.date_range(start=last_date, end=target_date, freq=last_window.index.freq)
        return len(steps_diff) - 1
    return steps
* test_preprocess_steps_as_date
# Unit test preprocess_steps_as_date
# ==============================================================================
import re
import pytest
import numpy as np
import pandas as pd
from skforecast.exceptions import IgnoredArgumentWarning
from skforecast.utils import preprocess_steps_as_date
def test_preprocess_steps_as_date_with_int():
    last_window = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2020-01-01', periods=5, freq='D'))
    assert preprocess_steps_as_date(last_window, 2) == 2

def test_preprocess_steps_as_date_with_date():
    last_window = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2020-01-01', periods=5, freq='D'))
    assert preprocess_steps_as_date(last_window, '2020-01-07') == 2

def test_preprocess_steps_as_date_with_date_before():
    last_window = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2020-01-01', periods=5, freq='D'))
    with pytest.raises(ValueError):
        preprocess_steps_as_date(last_window, '2020-01-04')

def test_preprocess_steps_as_date_with_rangeindex():
    last_window = pd.Series([1, 2, 3, 4, 5], index=pd.RangeIndex(start=0, stop=5, step=1))
    with pytest.raises(TypeError):
        preprocess_steps_as_date(last_window, '2020-01-07')
* FullExample
``` python
from skforecast.utils import preprocess_steps_as_date
from skforecast.datasets import fetch_dataset
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from lightgbm import LGBMRegressor
import pandas as pd

if __name__ == "__main__":
    # Expected behavior
    # ==============================================================================
    data = fetch_dataset(name="h2o", kwargs_read_csv={"names": ["y", "datetime"], "header": 0})

    # New lines to set the index of the dataframe
    data['datetime'] = pd.to_datetime(data['datetime'])
    data.set_index('datetime', inplace=True)

    steps = 36
    data_train = data[:-steps]
    data_test = data[-steps:]

    forecaster = ForecasterAutoreg(
        regressor = LGBMRegressor(random_state=123, verbose=-1),
        lags = 15
    )
    forecaster.fit(y=data_train['y'])

    predictions = forecaster.predict(steps='2005-09-01')  # As steps=3
    print(predictions)
```
Output:
/Users/*/opt/anaconda3/envs/skf_env/bin/python /Users/*/Library/CloudStorage/OneDrive-Personal/obsidian/01_projects/skforecast/dates.py
h2o
---
Monthly expenditure ($AUD) on corticosteroid drugs that the Australian health
system had between 1991 and 2008.
Hyndman R (2023). fpp3: Data for Forecasting: Principles and Practice(3rd
Edition). http://pkg.robjhyndman.com/fpp3package/,https://github.com/robjhyndman
/fpp3package, http://OTexts.com/fpp3.
Shape of the dataset: (204, 2)
/Users/*/Library/CloudStorage/OneDrive-Personal/obsidian/01_projects/skforecast/skforecast/utils/utils.py:1190: UserWarning: Series has DatetimeIndex index but no frequency. Index is overwritten with a RangeIndex of step 1.
warnings.warn(
/Users/*/Library/CloudStorage/OneDrive-Personal/obsidian/01_projects/skforecast/skforecast/utils/utils.py:1190: UserWarning: Series has DatetimeIndex index but no frequency. Index is overwritten with a RangeIndex of step 1.
warnings.warn(
/Users/*/Library/CloudStorage/OneDrive-Personal/obsidian/01_projects/skforecast/skforecast/utils/utils.py:1252: UserWarning: `last_window` has DatetimeIndex index but no frequency. Index is overwritten with a RangeIndex of step 1.
warnings.warn(
15 1.020833
16 1.021721
17 1.093488
18 1.145198
19 1.131161
...
102 1.160739
103 1.144070
104 1.164184
105 1.137746
106 0.719586
As you can see, the index of the output does not contain the dates. For the data provided, the functions `preprocess_y` and `preprocess_last_window` consider `.index.freq` to be `None`. Should the frequency be set when defining the dataframe, or should these methods be adapted?
If you can guide me, I can check a little bit further.
Cheers! :)
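On the frequency question: the warnings suggest the index was built without a frequency, which is what happens when a column is converted with `pd.to_datetime` and then set as the index. One possible fix, sketched here on a small synthetic monthly series (the values are made up and only stand in for the real data), is to call `asfreq` before fitting so that `.index.freq` is set. This assumes the observations really are regularly spaced at month start.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series standing in for the real data (values are made up).
dates = pd.to_datetime(["2005-01-01", "2005-02-01", "2005-03-01", "2005-04-01"])
y = pd.Series(np.arange(4, dtype=float), index=dates, name="y")

# An index built from a list of timestamps carries no frequency.
freq_before = y.index.freq

# asfreq conforms the series to a regular month-start grid and sets index.freq.
y_ms = y.asfreq("MS")
freq_after = y_ms.index.freqstr

print(freq_before, freq_after)
```

If the series has missing months, `asfreq` inserts them as `NaN`, so it is worth checking for gaps afterwards.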
Hello @imMoya,
Thank you very much for your contribution. Your code seems to be aligned with the library. Actually, the problem is a bit deeper than what was originally explained in the issue, so if you would like to go further with this implementation, we would be more than happy to help. Let me explain:
* The idea is to modify the `utils.expand_index` function to preprocess the `steps` argument (as you did in your code) and then return it as an integer. I have included some TODOs in the code:

def expand_index(
    index: Union[pd.Index, None],
    steps: int
) -> pd.Index:
    """
    Create a new index of length `steps` starting at the end of the index.

    Parameters
    ----------
    index : pandas Index, None
        Original index.
    steps : int
        Number of steps to expand.

    Returns
    -------
    new_index : pandas Index
        New index.

    """

    # TODO: Update function docstring and typing
    # TODO: include the code needed to preprocess steps if it is a date

    if isinstance(index, pd.Index):

        if isinstance(index, pd.DatetimeIndex):
            new_index = pd.date_range(
                            start   = index[-1] + index.freq,
                            periods = steps,
                            freq    = index.freq
                        )
        elif isinstance(index, pd.RangeIndex):
            new_index = pd.RangeIndex(
                            start = index[-1] + 1,
                            stop  = index[-1] + 1 + steps
                        )
        else:
            raise TypeError(
                "Argument `index` must be a pandas DatetimeIndex or RangeIndex."
            )
    else:
        new_index = pd.RangeIndex(
                        start = 0,
                        stop  = steps
                    )

    # TODO: add `steps` as a return value
    return new_index
* `expand_index` is called inside the `_create_predict_inputs` method of the forecaster (see `ForecasterAutoreg`). So the idea is also to include `steps` as a return value of this method and pass it on to the `predict` methods.

Some recommendations that we can give you:
* `expand_index` allows you to work directly with an index as an argument. In the Forecaster `_create_predict_inputs` method, we pass the `last_window_index` as this argument.
* Inside the `expand_index` function, the error message will say something like "If `index` is not a pd.DatetimeIndex, `steps` must be an integer."
* [...] `pd.date_range` might fail.

If you have any questions, we will be happy to help!
Best,
Javi
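For reference, the TODOs in `expand_index` above could be resolved roughly as follows. This is only a sketch under assumptions, not the library's implementation: the name `expand_index_sketch`, the tuple return value, and the explicit check for a missing frequency are all hypothetical choices made for illustration.

```python
from typing import Optional, Tuple, Union

import pandas as pd


def expand_index_sketch(
    index: Optional[pd.Index],
    steps: Union[int, str, pd.Timestamp],
) -> Tuple[pd.Index, int]:
    """Hypothetical variant of expand_index that also accepts a date as `steps`."""
    if not isinstance(steps, int):
        # A date only makes sense when the original index is a DatetimeIndex.
        if not isinstance(index, pd.DatetimeIndex):
            raise TypeError(
                "If `index` is not a pd.DatetimeIndex, `steps` must be an integer."
            )
        # Assumption: refuse to guess when the index has no frequency.
        if index.freq is None:
            raise ValueError("`index` must have a frequency to use a date as `steps`.")
        target_date = pd.to_datetime(steps)
        last_date = index[-1]
        if target_date <= last_date:
            raise ValueError(
                "The provided date is earlier than or equal to the last observation date."
            )
        # Count the periods between the last observation and the target date.
        steps = len(pd.date_range(start=last_date, end=target_date, freq=index.freq)) - 1

    if isinstance(index, pd.Index):
        if isinstance(index, pd.DatetimeIndex):
            new_index = pd.date_range(
                start=index[-1] + index.freq, periods=steps, freq=index.freq
            )
        elif isinstance(index, pd.RangeIndex):
            new_index = pd.RangeIndex(start=index[-1] + 1, stop=index[-1] + 1 + steps)
        else:
            raise TypeError(
                "Argument `index` must be a pandas DatetimeIndex or RangeIndex."
            )
    else:
        new_index = pd.RangeIndex(start=0, stop=steps)

    # Return the (possibly converted) integer steps alongside the new index.
    return new_index, steps


# Usage with a made-up monthly index whose last observation is 2005-06-01.
last_window_index = pd.date_range("2005-01-01", periods=6, freq="MS")
new_index, n_steps = expand_index_sketch(last_window_index, "2005-09-01")
print(n_steps, list(new_index))
```

Returning `steps` together with the index keeps the date handling in a single place, so a caller such as `_create_predict_inputs` would only ever see an integer.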
Hi Javi,
I forgot to mention the issue in the commit I made. I'm making some changes in my fork and have created a new branch. I'm having a problem defining the correct index frequency for the dataframe in order to test the methods properly, but I will post an update if I make progress.
Cheers,
Nacho
Use branch 0.14.x as base.
Summary
Currently, the `steps` parameter in all Forecasters' `predict` methods only accepts an integer value. This integer defines how many observations to forecast into the future. We would like to extend this functionality so that `steps` can also accept a date (e.g., `'2020-01-01'`). If a date is provided, the function should calculate the appropriate number of observations corresponding to the time window between the last observation in the last window and the given date.
Task
* Create a function `_preprocess_steps_as_date(last_window: pd.Series, steps)` in the `utils` module:
    * `last_window` is the last window of the series used to forecast the future. This is an argument of the `predict` method in all Forecasters.
    * `steps` can be an integer or any datetime format that pandas allows to be passed to a `pd.DatetimeIndex` (e.g., string, pandas timestamp...).
    * If the index of the last window is not a `pd.DatetimeIndex`, raise a `TypeError` with the message: "If the Forecaster was not fitted using a pd.DatetimeIndex, `steps` must be an integer."
    * If the index is a `pd.DatetimeIndex`, this function will return the length of the time window between the last observation in the last window and the given date as an integer value.
    * If `steps` is an integer, return the same integer.
* Add unit tests for this function in the `utils.tests` folder.
* Integrate this function into the `predict` method of the `ForecasterAutoreg` class.
Acceptance Criteria
* The `steps` parameter accepts both integer and date formats.
Full Example
2005-07-01    1.020833
2005-08-01    1.021721
2005-09-01    1.093488
Freq: MS, Name: pred, dtype: float64
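The step count behind this expected output can be checked by hand. The sketch below assumes the last training observation is 2005-06-01, as the first forecast date implies, and counts the month-start periods up to the requested date. It also shows an edge case: when no frequency is available, `pd.date_range` called with only `start` and `end` falls back to daily frequency, so the same arithmetic would count days instead of observations.

```python
import pandas as pd

# Assumed last training observation (one month before the first forecast date).
last_date = pd.Timestamp("2005-06-01")
target_date = pd.Timestamp("2005-09-01")

# With a month-start frequency the range holds 4 points, i.e. 3 steps ahead.
steps_monthly = len(pd.date_range(start=last_date, end=target_date, freq="MS")) - 1

# With freq=None pandas defaults to daily frequency, so days are counted instead.
steps_no_freq = len(pd.date_range(start=last_date, end=target_date, freq=None)) - 1

print(steps_monthly, steps_no_freq)
```

A series without a frequency therefore needs either an explicit frequency or a clear error before a date is converted into `steps`.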