alkaline-ml / pmdarima

A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function.
https://www.alkaline-ml.com/pmdarima
MIT License

`fit_predict` in `base.py` uses the whole X to fit, instead of withholding extra rows for predictions, producing ValueError #514

Open clumdee opened 2 years ago

clumdee commented 2 years ago

Describe the bug

Please correct me if I have misunderstood the concept.

Should this part https://github.com/alkaline-ml/pmdarima/blob/4869c43796a37f7a83ea56525803c797be3693d9/pmdarima/base.py#L47 be adjusted to this?

self.fit(y, X[:-n_periods], **fit_args)

# TODO: remove kwargs from call
return self.predict(n_periods=n_periods, X=X[-n_periods:], **fit_args)

I can create a PR if this makes sense.

Thank you.

To Reproduce

import pmdarima as pm
from pmdarima import model_selection
from pmdarima import preprocessing as ppc
from pmdarima import arima

import pandas as pd

# Load the data and split it into separate pieces
data = pm.datasets.load_wineind()
train, test = model_selection.train_test_split(data, train_size=150)

fourier = ppc.FourierFeaturizer(12, 4)
_, X = fourier.fit_transform(train)
_, X_pred = fourier.transform(train, n_periods=10)

m = arima.AutoARIMA(stepwise=True, trace=1, error_action="ignore", 
                    seasonal=False,  # because we use Fourier
                    suppress_warnings=True)

# produces error
m.fit_predict(train, X=pd.concat([X, X_pred]).reset_index(drop=True))
# ValueError: Found input variables with inconsistent numbers of samples: [160, 150]

# this works
# m.fit(train, X=X)
# m.predict(10, X_pred)

Versions

System:
    python: 3.9.11 (main, Apr 30 2022, 16:45:04)  [Clang 13.0.0 (clang-1300.0.27.3)]
executable: /Users/xxxxxxxxxxxxxxxxxxxx/bin/python
   machine: macOS-12.4-arm64-arm-64bit

Python dependencies:
        pip: 22.0.4
 setuptools: 62.1.0
    sklearn: 1.1.1
statsmodels: 0.13.2
      numpy: 1.22.4
      scipy: 1.8.1
     Cython: 0.29.30
     pandas: 1.4.2
     joblib: 1.1.0
   pmdarima: 2.0.0
macOS-12.4-arm64-arm-64bit
Python 3.9.11 (main, Apr 30 2022, 16:45:04) 
[Clang 13.0.0 (clang-1300.0.27.3)]
pmdarima 2.0.0
NumPy 1.22.4
SciPy 1.8.1
Scikit-Learn 1.1.1
Statsmodels 0.13.2

Expected Behavior

The method should split the exogenous features between the fit and predict steps, so that it executes as the docstring describes:

    X : array-like, shape=[n_obs, n_vars], optional (default=None)
        An optional 2-d array of exogenous variables. If provided, these
        variables are used as additional features in the regression
        operation. This should not include a constant or trend. Note that
        if an ``ARIMA`` is fit on exogenous features, it must be provided
        exogenous features for making predictions.

Actual Behavior

The method passes all exogenous features to the fit method, which raises ValueError: Found input variables with inconsistent numbers of samples.
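
The row counts behind that error can be checked without pmdarima. The snippet below is only a sketch: it uses plain Python lists as stand-ins for `train` and for the concatenated `X` and `X_pred` from the reproduction, and the variable names are illustrative assumptions rather than anything from the library.

```python
# Row counts from the report: 150 training observations, 10 forecast periods.
n_obs, n_periods = 150, 10

# Stand-ins (assumptions for illustration, not real data):
y = [0.0] * n_obs                                        # plays the role of `train`
X_all = [[float(i)] for i in range(n_obs + n_periods)]   # plays the role of concat(X, X_pred)

# Passing every row of X to fit gives lengths that disagree with y.
print(len(X_all), len(y))         # 160 150

# The split proposed in this issue: leading rows for fit, trailing rows for predict.
X_fit, X_future = X_all[:-n_periods], X_all[-n_periods:]
print(len(X_fit), len(X_future))  # 150 10
```

Note that the slice `X_all[:-n_periods]` returns an empty list when `n_periods` is 0, so a real implementation would presumably need to guard that case.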

Additional Context

No response

tgsmith61591 commented 1 year ago

Hi @clumdee thanks for the issue. This is an interesting one that I think comes down to the user's intent when calling .fit_predict. If the intention is to predict in-sample values, then your approach is correct and the existing implementation is wrong. However, if the user's intention is to fit a model and forecast n_periods ahead, then the existing implementation is correct, and withholding n_periods from training could create some confusion.

For instance:

import pmdarima as pm
y = pm.datasets.load_wineind()
next_10 = pm.AutoARIMA(seasonal=True, m=12).fit_predict(y)
print(next_10)
# array([21833.71615189, 26239.84853621, 30813.84738283, 35970.36202699,
#        13683.27930437, 20482.58814877, 22439.71347295, 24738.3241369 ,
#        22838.44936401, 25000.73827201])

Given the .predict function is used to forecast future values, my instinct here would be to clear up the confusion with a better docstring that explains the intended behavior of the function.
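
One way to see that the two readings need not conflict: the change proposed at the top of this issue slices only X, so y is still used in full for fitting and the forecast still runs n_periods ahead. The sketch below illustrates this with a stand-in estimator. `StubModel` and `fit_predict_forecast` are hypothetical names invented for this example, the error messages are placeholders, and none of it is pmdarima's actual implementation.

```python
class StubModel:
    """Stand-in estimator that only checks row counts (not pmdarima)."""

    def fit(self, y, X=None):
        # Exogenous rows must line up one-to-one with the observations.
        if X is not None and len(X) != len(y):
            raise ValueError(f"inconsistent samples: [{len(X)}, {len(y)}]")
        self.n_fit_ = len(y)
        return self

    def predict(self, n_periods, X=None):
        # Future exogenous rows must match the forecast horizon.
        if X is not None and len(X) != n_periods:
            raise ValueError("X rows != n_periods")
        return [0.0] * n_periods  # placeholder forecasts


def fit_predict_forecast(model, y, X=None, n_periods=10):
    """Fit on all of y; treat the trailing n_periods rows of X as future rows."""
    if X is None:
        return model.fit(y).predict(n_periods)
    X_fit, X_future = X[:-n_periods], X[-n_periods:]
    return model.fit(y, X_fit).predict(n_periods, X_future)


y = [1.0] * 150
X = [[float(i)] for i in range(160)]  # 150 in-sample rows + 10 future rows
model = StubModel()
forecast = fit_predict_forecast(model, y, X, n_periods=10)
print(model.n_fit_, len(forecast))  # 150 10
```

Under this convention no observations are withheld from training; the caller is simply expected to supply len(y) + n_periods rows of X, which is something a docstring could state explicitly.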

clumdee commented 1 year ago

Thanks @tgsmith61591 for taking a look.

Let me try to reframe our discussion a bit.

Please take a look at an adapted and expanded version of your example below. Please kindly share your thoughts.

import pmdarima as pm
y = pm.datasets.load_wineind()

# Ex1. this works -- basically the same as your example, I adjusted their placement to help us compare with other setups
m = pm.AutoARIMA(seasonal=True, m=12)
next_10 = m.fit_predict(y, n_periods=10)
print(next_10)

# Ex2. this does not work -- I added dummy exogenous variables to make the case
m = pm.AutoARIMA(seasonal=True, m=12)
next_10 = m.fit_predict(y, X=y.reshape(-1, 1), n_periods=10)
print(next_10)
# ValueError: X array dims (n_rows) != n_periods

# Ex3. this does not work -- it is the same as Ex2, broken down into the steps that fit_predict in base.py performs
m = pm.AutoARIMA(seasonal=True, m=12)
m.fit(y, X=y.reshape(-1, 1))
next_10 = m.predict(n_periods=10, X=y.reshape(-1, 1))
print(next_10)
# ValueError: X array dims (n_rows) != n_periods

# Ex4. this works because we supply the correct amount of exogenous variables for the target n_periods
m = pm.AutoARIMA(seasonal=True, m=12)
m.fit(y, X=y.reshape(-1, 1))
next_10 = m.predict(n_periods=10, X=y.reshape(-1, 1)[:10])
print(next_10)