[BUG] ARIMA MLE convergence failure in a case where it is not obvious why it would fail to converge

fkiraly commented 2 years ago

Describe the bug

Pasting a minimal failing example, reduced from the sktime code posted by @garrus990 in https://github.com/alan-turing-institute/sktime/issues/1871

A seemingly innocuous application of ARIMA leads to what looks like an MLE convergence failure.

To Reproduce

from pmdarima.arima.arima import ARIMA
import numpy as np
import pandas as pd
from io import StringIO

txt = '   val\n3066.3\n3260.2\n3573.7\n3423.6\n3598.5\n3802.8\n3353.4\n4026.1\n4684.0\n4099.1\n3883.1\n3801.5\n3104.0\n3574.0\n3397.2\n3092.9\n3083.8\n3106.7\n2939.6'
arima_input_data = pd.read_csv(StringIO(txt))

ARIMA(order=(0, 1, 5), with_intercept=False).fit(arima_input_data).predict()

Versions

most recent

Expected Behavior

The call produces a forecast instead of crashing.

Actual Behavior

The call raises the following traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
c:\Workspace\sktime\sktime\utils\_testing\estimator_checks.py in <module>
----> 1 ARIMA(order=(0, 1, 5), with_intercept=False).fit(arima_input_data).predict()

C:\ProgramData\Anaconda3\envs\sktime-dl\lib\site-packages\pmdarima\arima\arima.py in predict(self, n_periods, X, return_conf_int, alpha, **kwargs)
    789         end = arima.nobs + n_periods - 1
    790 
--> 791         f, conf_int = _seasonal_prediction_with_confidence(
    792             arima_res=arima,
    793             start=arima.nobs,

C:\ProgramData\Anaconda3\envs\sktime-dl\lib\site-packages\pmdarima\arima\arima.py in _seasonal_prediction_with_confidence(arima_res, start, end, X, alpha, **kwargs)
    201 
    202     return check_endog(f, dtype=None, copy=False), \
--> 203         check_array(conf_int, copy=False, dtype=None)
    204 
    205 

C:\ProgramData\Anaconda3\envs\sktime-dl\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    798 
    799         if force_all_finite:
--> 800             _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
    801 
    802     if ensure_min_samples > 0:

C:\ProgramData\Anaconda3\envs\sktime-dl\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
    112         ):
    113             type_err = "infinity" if allow_nan else "NaN, infinity"
--> 114             raise ValueError(
    115                 msg_err.format(
    116                     type_err, msg_dtype if msg_dtype is not None else X.dtype

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Additional Context

Workaround: apparently, appending any three values to the series makes it work. Mysterious.

I also do not see an obvious "degrees of freedom" problem here, e.g., why the MLE would be singular.

tgsmith61591 commented 2 years ago

Taking a look at this @fkiraly, I can reproduce.

tgsmith61591 commented 2 years ago

Just a quick follow-up, RCA still in progress. This is ultimately the model that is fit, and it fails at the statsmodels level as well.

Distilled, and setting disp to a high value to get LBFGS debugging:

import numpy as np
from statsmodels import api as sm

X = None
y = np.array([
    3066.3, 3260.2, 3573.7, 3423.6, 3598.5, 3802.8, 3353.4, 4026.1,
    4684. , 4099.1, 3883.1, 3801.5, 3104. , 3574. , 3397.2, 3092.9,
    3083.8, 3106.7, 2939.6
])

arima = sm.tsa.statespace.SARIMAX(
    endog=y,
    exog=X,
    order=(0, 1, 5),
    seasonal_order=(0, 0, 0, 0),
    trend=None,
)

arima_results = arima.fit(
    start_params=None,
    method="lbfgs",
    maxiter=50,
    disp=5,
)

This is the error I am seeing:

RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            6     M =           10
 This problem is unconstrained.

At X0         0 variables are exactly at the bounds

At iterate    0    f=          NaN    |proj g|=          NaN

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    6      1     21      1     0     0         NaN         NaN
  F =                       NaN

ABNORMAL_TERMINATION_IN_LNSRCH

 Line search cannot locate an adequate point after MAXLS
  function and gradient evaluations.
  Previous x, f and g restored.
 Possible causes: 1 error in function or gradient evaluation;
                  2 rounding error dominate computation.

Seems potentially related to the way the statsmodels fit routine is computing the start params:

In [91]: arima.start_params
/opt/miniconda3/envs/ml/lib/python3.7/site-packages/statsmodels/tsa/statespace/sarimax.py:902: RuntimeWarning: Mean of empty slice.
  params_variance = (residuals[k_params_ma:] ** 2).mean()
/opt/miniconda3/envs/ml/lib/python3.7/site-packages/numpy/core/_methods.py:170: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/opt/miniconda3/envs/ml/lib/python3.7/site-packages/statsmodels/tsa/statespace/sarimax.py:978: UserWarning: Non-invertible starting MA parameters found. Using zeros as starting parameters.
  warn('Non-invertible starting MA parameters found.'
Out[91]: array([-0.,  0., -0., -0.,  0., nan])
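
The `Mean of empty slice` warning above lines up with the last entry of `start_params` (the variance term) being `nan`. Here is a minimal numpy sketch of that failure mode, assuming the residual vector that reaches the flagged line is no longer than the number of MA terms. The helper name `start_variance`, the made-up residual values, and the fallback to using all residuals are illustrative only, not statsmodels code:

```python
import warnings

import numpy as np


def start_variance(residuals, k_params_ma):
    """Same expression as the flagged line, plus a hypothetical guard."""
    residuals = np.asarray(residuals, dtype=float)
    tail = residuals[k_params_ma:]
    if tail.size == 0:
        # Hypothetical fallback: use every residual instead of an empty slice.
        tail = residuals
    return float((tail ** 2).mean())


# Unguarded: slicing past the end yields an empty array, and its mean is NaN.
short_residuals = np.array([10.0, -20.0, 5.0])  # made-up values, shorter than 5
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    unguarded = (short_residuals[5:] ** 2).mean()

print(np.isnan(unguarded))                  # the unguarded expression is NaN
print(start_variance(short_residuals, 5))   # the guarded helper stays finite
```

If that assumption holds, it would also be consistent with the observation that appending a few values to the series makes the fit succeed, since a longer series leaves a non-empty residual tail.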

tgsmith61591 commented 2 years ago

Opened a new bug with statsmodels:

https://github.com/statsmodels/statsmodels/issues/8232

tgsmith61591 commented 2 years ago

Just an update here. The statsmodels issue was closed about 2 weeks ago, but we are still awaiting a new statsmodels release before we bump our dependency version and mark this resolved.

noahberhe commented 2 years ago

Thanks. So it'll be in the next autoarima release? Any idea when? Keen to pick up the fix.

tgsmith61591 commented 2 years ago

@noahberhe Statsmodels still hasn't cut a new release yet, so even if we shipped today the issue would still be present. We're waiting until their next release to pin the updated version
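
In the meantime, one possible stopgap (a sketch, not verified against this series) is to skip the automatic start-parameter computation and pass explicit values through the `start_params` argument of `fit`. The helper name `fallback_start_params` is hypothetical. It assumes that for `order=(0, d, q)` with no trend the parameter vector is the `q` MA coefficients followed by the variance, which matches the six-element `start_params` array printed earlier in this thread:

```python
import numpy as np


def fallback_start_params(y, d, q):
    """Zeros for the q MA coefficients, then the variance of the d-times differenced series."""
    diffed = np.diff(np.asarray(y, dtype=float), n=d)
    return np.r_[np.zeros(q), diffed.var()]


# The series from the original report.
y = np.array([
    3066.3, 3260.2, 3573.7, 3423.6, 3598.5, 3802.8, 3353.4, 4026.1,
    4684.0, 4099.1, 3883.1, 3801.5, 3104.0, 3574.0, 3397.2, 3092.9,
    3083.8, 3106.7, 2939.6,
])

start = fallback_start_params(y, d=1, q=5)
print(start.shape, np.isfinite(start).all())
```

The resulting array could then be passed as `start_params=start` in the `arima.fit(...)` call from the distilled example. Whether L-BFGS then converges on this particular series has not been checked here.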

Mohamed-Abdellahi commented 5 months ago

Hello, I'm working on an electricity forecasting project, and I still have a convergence problem. I use autoarima to find the optimal (p, d, q). What did you guys do?

alkaline-ml / pmdarima