alkaline-ml / pmdarima

A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function.
https://www.alkaline-ml.com/pmdarima
MIT License
1.58k stars 232 forks

Input contains NaN for a non NaN data #573

Open tifa64 opened 6 months ago

tifa64 commented 6 months ago

Describe the question you have

Hello maintainers, I want to understand why the following scenario happens. I have this time series:

import pandas as pd
data = {
    'date': pd.date_range(start='2023-01-01', periods=10, freq='MS'),
    'value': [1, 3, 3, 4, 3, 2, 1, 1, 3, 2]
}
df = pd.DataFrame(data)
df.set_index('date', inplace=True)

which yields this time series:

            value
date             
2023-01-01      1
2023-02-01      3
2023-03-01      3
2023-04-01      4
2023-05-01      3
2023-06-01      2
2023-07-01      1
2023-08-01      1
2023-09-01      3
2023-10-01      2

When I fit the model:

from pmdarima import auto_arima

fitted_model = auto_arima(
    y=df['value'],
    max_iter=15,
    max_d=1,
    method='nm',
    seasonal=False)
fitted_model

it yields this information:

ARIMA(2,0,2)(0,0,0)[0]          

Then I try to predict:

fitted_model.predict(
    n_periods=2,
    return_conf_int=False)

and it raises the error below:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [1047], line 1
----> 1 fitted_model.predict(
      2                     n_periods=2,
      3                     return_conf_int=False)

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/pmdarima/arima/arima.py:791, in ARIMA.predict(self, n_periods, X, return_conf_int, alpha, **kwargs)
    788 arima = self.arima_res_
    789 end = arima.nobs + n_periods - 1
--> 791 f, conf_int = _seasonal_prediction_with_confidence(
    792     arima_res=arima,
    793     start=arima.nobs,
    794     end=end,
    795     X=X,
    796     alpha=alpha)
    798 if return_conf_int:
    799     # The confidence intervals may be a Pandas frame if it comes from
    800     # SARIMAX & we want Numpy. We will to duck type it so we don't add
    801     # new explicit requirements for the package
    802     return f, check_array(conf_int, force_all_finite=False)

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/pmdarima/arima/arima.py:203, in _seasonal_prediction_with_confidence(arima_res, start, end, X, alpha, **kwargs)
    199     conf_int[:, 0] = f - q * np.sqrt(var)
    200     conf_int[:, 1] = f + q * np.sqrt(var)
    202 return check_endog(f, dtype=None, copy=False), \
--> 203     check_array(conf_int, copy=False, dtype=None)

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/sklearn/utils/validation.py:899, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    893         raise ValueError(
    894             "Found array with dim %d. %s expected <= 2."
    895             % (array.ndim, estimator_name)
    896         )
    898     if force_all_finite:
--> 899         _assert_all_finite(
    900             array,
    901             input_name=input_name,
    902             estimator_name=estimator_name,
    903             allow_nan=force_all_finite == "allow-nan",
    904         )
    906 if ensure_min_samples > 0:
    907     n_samples = _num_samples(array)

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/sklearn/utils/validation.py:146, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    124         if (
    125             not allow_nan
    126             and estimator_name
   (...)
    130             # Improve the error message on how to handle missing values in
    131             # scikit-learn.
    132             msg_err += (
    133                 f"\n{estimator_name} does not accept missing values"
    134                 " encoded as NaN natively. For supervised learning, you might want"
   (...)
    144                 "#estimators-that-handle-nan-values"
    145             )
--> 146         raise ValueError(msg_err)
    148 # for object dtype data, we only check for NaNs (GH-13254)
    149 elif X.dtype == np.dtype("object") and not allow_nan:

ValueError: Input contains NaN.

However, when I add one more data point:

data = {
    'date': pd.date_range(start='2023-01-01', periods=11, freq='MS'),
    'value': [1, 3, 3, 4, 3, 2, 1, 1, 3, 2, 2]
}

or when I change the values to:

data = {
    'date': pd.date_range(start='2023-01-01', periods=10, freq='MS'),
    'value': [5, 8, 11, 4, 6, 6, 6, 5, 6, 9]
}

or when I set the seasonal parameter to True for exactly the same data,

the model returned is ARIMA(0,0,0)(0,0,0)[0] intercept and the predictions work without errors.

Another workaround is to add a guardrail that caps the maximum p, q, and d at 1, which also works.
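
For reference, a minimal sketch of the guardrail described above. The helper names `apply_guardrails` and `fit_with_guardrails` are hypothetical, and the `start_p`/`start_q` entries are an added assumption (so the starting orders never exceed the caps) that may not be needed; `max_p`, `max_q` and `max_d` are existing `auto_arima` parameters.

```python
# Caps described above: maximum p, q and d of 1. The start_p/start_q entries
# are an added assumption so the starting orders never exceed the caps.
GUARDRAILS = {"start_p": 1, "start_q": 1, "max_p": 1, "max_q": 1, "max_d": 1}


def apply_guardrails(kwargs):
    """Return a copy of kwargs with the order caps forced on top (hypothetical helper)."""
    return {**kwargs, **GUARDRAILS}


def fit_with_guardrails(y, **kwargs):
    """Call auto_arima with the capped orders (hypothetical helper)."""
    # Imported here so the helpers above can be used without pmdarima installed.
    from pmdarima import auto_arima

    return auto_arima(y=y, **apply_guardrails(kwargs))
```

With the data frame from above, this would be called as `fit_with_guardrails(df['value'], max_iter=15, method='nm', seasonal=False)`. The caps only restrict which orders the search may consider; they do not explain why the larger model produces NaN.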

Can you help me understand why this happens? Is placing a guardrail the correct way to fix this?
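
To narrow down where the NaN in the traceback comes from, one option might be to query the underlying statsmodels results object directly (the `arima_res_` attribute visible in the traceback) and check which forecast outputs are non-finite. This is only a diagnostic sketch: `non_finite_positions` and `inspect_forecast` are hypothetical helper names, and it assumes `get_forecast` exposes `predicted_mean` and `se_mean` as one-dimensional sequences.

```python
import math


def non_finite_positions(values):
    """Return the indices of entries that are NaN or +/-inf."""
    return [i for i, v in enumerate(values) if not math.isfinite(float(v))]


def inspect_forecast(fitted_model, n_periods=2):
    """Report which forecast outputs contain non-finite values (hypothetical helper)."""
    # arima_res_ is the statsmodels results object that pmdarima wraps.
    forecast = fitted_model.arima_res_.get_forecast(steps=n_periods)
    return {
        # Point forecasts for the next n_periods steps.
        "predicted_mean": non_finite_positions(forecast.predicted_mean),
        # Standard errors; a NaN here would propagate into the confidence interval.
        "se_mean": non_finite_positions(forecast.se_mean),
    }
```

Calling `inspect_forecast(fitted_model)` on the failing ARIMA(2,0,2) fit should show whether the NaN sits in the point forecast or only in the standard errors that feed the confidence interval.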

Thank you in advance :)

Here is a video of a cute otter as a digital bribe: https://www.youtube.com/watch?v=8O8iEz2p7rQ

Versions (if necessary)

System:
    python: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:35:26) [GCC 10.4.0]
executable: /home/trusted-service-user/cluster-env/clonedenv/bin/python
   machine: Linux-4.15.0-1174-azure-x86_64-with-glibc2.27

Python dependencies:
        pip: 23.3
 setuptools: 65.5.1
    sklearn: 1.1.3
statsmodels: 0.14.0
      numpy: 1.23.4
      scipy: 1.10.1
     Cython: 0.29.32
     pandas: 1.5.3
     joblib: 1.3.2
   pmdarima: 1.8.5
Linux-4.15.0-1174-azure-x86_64-with-glibc2.27
Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:35:26) [GCC 10.4.0]
pmdarima 1.8.5
NumPy 1.23.4
SciPy 1.10.1
Scikit-Learn 1.1.3
Statsmodels 0.14.0
/home/trusted-service-user/cluster-env/clonedenv/lib/python3.10/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")