`fit_predict` in `base.py` uses the whole X to fit, instead of withholding extra rows for predictions, producing ValueError

clumdee commented 2 years ago

Describe the bug

Please correct me if I have misunderstood the concept.

Should this part https://github.com/alkaline-ml/pmdarima/blob/4869c43796a37f7a83ea56525803c797be3693d9/pmdarima/base.py#L47 be adjusted to this?

self.fit(y, X[:-n_periods], **fit_args)

# TODO: remove kwargs from call
return self.predict(n_periods=n_periods, X=X[-n_periods:], **fit_args)

I can create a PR if this makes sense.

Thank you.
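A minimal sketch of the split proposed above, assuming `X` stacks the training rows first and the `n_periods` future rows last. The helper name `split_exog` is hypothetical and not part of pmdarima. A bare `X[:-n_periods]` returns an empty slice when `n_periods` is 0 and fails when `X` is None, so the sketch guards both cases.

```python
def split_exog(X, n_periods):
    """Split exogenous rows into (rows for fit, rows for predict).

    Hypothetical helper, not part of pmdarima. Assumes X stacks the
    training rows first and the n_periods future rows last.
    """
    if X is None:
        # No exogenous features were supplied, so there is nothing to split.
        return None, None
    if n_periods <= 0:
        # X[:-0] would silently return an empty slice, so reject it early.
        raise ValueError("n_periods must be a positive integer")
    if len(X) <= n_periods:
        raise ValueError("X needs more than n_periods rows")
    return X[:-n_periods], X[-n_periods:]


# 150 training rows followed by 10 future rows, as in the report below.
rows = [[float(i)] for i in range(160)]
X_fit, X_future = split_exog(rows, n_periods=10)
print(len(X_fit), len(X_future))  # 150 10
```

The same slicing works on NumPy arrays and pandas DataFrames, since both slice rows with this syntax.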

To Reproduce

import pmdarima as pm
from pmdarima import model_selection
from pmdarima import preprocessing as ppc
from pmdarima import arima

import pandas as pd

# Load the data and split it into separate pieces
data = pm.datasets.load_wineind()
train, test = model_selection.train_test_split(data, train_size=150)

fourier = ppc.FourierFeaturizer(12, 4)
_, X = fourier.fit_transform(train)
_, X_pred = fourier.transform(train, n_periods=10)

m = arima.AutoARIMA(stepwise=True, trace=1, error_action="ignore",
                    seasonal=False,  # because we use Fourier
                    suppress_warnings=True)

# produces error
m.fit_predict(train, X=pd.concat([X, X_pred]).reset_index(drop=True))
# ValueError: Found input variables with inconsistent numbers of samples: [160, 150]

# this works
# m.fit(train, X=X)
# m.predict(10, X_pred)

Versions

System:
    python: 3.9.11 (main, Apr 30 2022, 16:45:04)  [Clang 13.0.0 (clang-1300.0.27.3)]
executable: /Users/xxxxxxxxxxxxxxxxxxxx/bin/python
   machine: macOS-12.4-arm64-arm-64bit

Python dependencies:
        pip: 22.0.4
 setuptools: 62.1.0
    sklearn: 1.1.1
statsmodels: 0.13.2
      numpy: 1.22.4
      scipy: 1.8.1
     Cython: 0.29.30
     pandas: 1.4.2
     joblib: 1.1.0
   pmdarima: 2.0.0

Expected Behavior

The method should split the exogenous features between fit and predict so that it runs as the docstring describes:

    X : array-like, shape=[n_obs, n_vars], optional (default=None)
        An optional 2-d array of exogenous variables. If provided, these
        variables are used as additional features in the regression
        operation. This should not include a constant or trend. Note that
        if an ``ARIMA`` is fit on exogenous features, it must be provided
        exogenous features for making predictions.

Actual Behavior

The method passes every exogenous row to the fit method, which raises ValueError: Found input variables with inconsistent numbers of samples.
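The numbers in the error message follow from the reproduction: `train` has 150 observations, while the concatenated `X` has 150 + 10 = 160 rows. Below is a minimal sketch of that kind of length check, written in plain Python as an illustration rather than the code pmdarima actually runs. The helper name `check_same_length` is hypothetical, and its message is modeled on the error shown in the report.

```python
def check_same_length(X, y):
    """Raise if X and y do not have the same number of rows.

    Illustrative stand-in for the validation that rejects the call.
    The name and message format are assumptions, not library code.
    """
    lengths = [len(X), len(y)]
    if len(set(lengths)) > 1:
        raise ValueError(
            "Found input variables with inconsistent numbers of "
            f"samples: {lengths}"
        )


y = [0.0] * 150            # 150 training observations
X = [[0.0]] * (150 + 10)   # 150 fit rows plus 10 forecast rows

try:
    check_same_length(X, y)
except ValueError as err:
    message = str(err)
    print(message)
```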

Additional Context

No response

tgsmith61591 commented 1 year ago

Hi @clumdee, thanks for the issue. This is an interesting one that I think comes down to the user's intent when calling .fit_predict. If the intention is to predict in-sample values, then your approach is correct and the existing implementation is wrong. However, if the intention is to fit a model and forecast n_periods ahead, then the existing implementation is correct, and withholding n_periods from training could cause confusion.

For instance:

import pmdarima as pm
y = pm.datasets.load_wineind()
next_10 = pm.AutoARIMA(seasonal=True, m=12).fit_predict(y)
print(next_10)
# array([21833.71615189, 26239.84853621, 30813.84738283, 35970.36202699,
#        13683.27930437, 20482.58814877, 22439.71347295, 24738.3241369 ,
#        22838.44936401, 25000.73827201])

Given that the .predict function is used to forecast future values, my instinct here is to clear up the confusion with a better docstring that explains the function's intended behavior.

clumdee commented 1 year ago

Thanks @tgsmith61591 for taking a look.

Let me try to reframe our discussion a bit.

Your point on user intention is very interesting. I agree that it could be cleared up with a better docstring or explanation.
I agree that the natural intention is to fit a model and forecast n_periods ahead.
The function works as we expect when there are no exogenous variables, as in your example.
However, the main issue is that the function does not work properly with exogenous variables. That was the point of my reproduction example.

Please take a look at the adapted and expanded version of your example below and share your thoughts.

import pmdarima as pm
y = pm.datasets.load_wineind()

# Ex1. This works. It is basically your example, restructured so it is easier to compare with the other setups.
m = pm.AutoARIMA(seasonal=True, m=12)
next_10 = m.fit_predict(y, n_periods=10)
print(next_10)

# Ex2. This does not work. I added dummy exogenous variables to make the case.
m = pm.AutoARIMA(seasonal=True, m=12)
next_10 = m.fit_predict(y, X=y.reshape(-1, 1), n_periods=10)
print(next_10)
# ValueError: X array dims (n_rows) != n_periods

# Ex3. This does not work. It is Ex2 broken down into the steps that fit_predict in base.py performs.
m = pm.AutoARIMA(seasonal=True, m=12)
m.fit(y, X=y.reshape(-1, 1))
next_10 = m.predict(n_periods=10, X=y.reshape(-1, 1))
print(next_10)
# ValueError: X array dims (n_rows) != n_periods

# Ex4. This works because we supply the correct number of exogenous rows for the target n_periods.
m = pm.AutoARIMA(seasonal=True, m=12)
m.fit(y, X=y.reshape(-1, 1))
next_10 = m.predict(n_periods=10, X=y.reshape(-1, 1)[:10])
print(next_10)
