alkaline-ml / pmdarima

A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function.
https://www.alkaline-ml.com/pmdarima
MIT License
1.57k stars 231 forks source link

Getting nans as output for predict #101

Closed MelissaBain closed 4 years ago

MelissaBain commented 5 years ago

Description

The predict function is returning nans despite the values the model fits on having no missing data. The output changes if I include a few more values in the fitted data.

Steps/Code to Reproduce

Example:

from numpy.linalg import LinAlgError
from pmdarima.arima import auto_arima

trainValues = [144.5, 142. , 139. , 137. , 135. , 135. , 134. , 134. , 133. ,
       132. , 133. , 134. , 137. , 135. , 134. , 134. , 137. , 136. ,
       136. , 125. , 120. , 120. , 121. , 120. , 116. , 119. , 125. ,
       134. , 124. , 113. , 110. , 109. , 110. , 111. , 106. , 106. ,
       102. , 101. , 102. , 105. , 105. , 108. , 110. , 109. , 113. ,
       119. , 120. , 118. ]

fitValues = [115., 126., 142., 146., 165., 175., 184., 192., 200., 208., 215.,
       222., 227., 231., 233., 235., 234., 231., 229., 232., 234., 232.,
       223., 212., 207., 200., 191., 191., 194., 195.]

model = auto_arima(trainValues, seasonal=False)

#this is where the issue occurs.
fittedModel = model.fit(fitValues[0:27]) #this fits successfully 
print(fittedModel.predict(n_periods=2)) #returns nans

#I would have expected either some sort of error to be thrown or for predicted values 
#to be returned. I can get these expected types of output to occur jus by subsetting a 
#larger amount of the data I'm fitting on. (See below for examples that produce more 
#expected outcomes)

try:
    print(model.fit(fitValues[0:28]).predict(n_periods=2))
except LinAlgError:
    print("LinAlgError")

print(model.fit(fitValues[0:29]).predict(n_periods=2))

Expected Results

I would expect either an error to be thrown if there's an issue with SVD, stationarity, etc. or if the model was fit successfully (which I believe it was), I would expect valid numeric predictions.

Actual Results

[nan nan]

Versions

Linux-4.4.0-139-generic-x86_64-with-debian-stretch-sid Python 3.5.6 |Anaconda, Inc.| (default, Aug 26 2018, 21:41:56) [GCC 7.3.0] pmdarima 1.1.0 NumPy 1.15.2 SciPy 1.1.0 Scikit-Learn 0.20.0 Statsmodels 0.9.0

aaronreidsmith commented 5 years ago

Hey, thanks for the well-formatted issue! On my end, I get a lot of errors thrown from running your example. Are you by any chance using error_action="ignore" or suppressing warnings in some other way?

>>> from numpy.linalg import LinAlgError
>>> from pmdarima.arima import auto_arima

>>> trainValues = [144.5, 142. , 139. , 137. , 135. , 135. , 134. , 134. , 133. ,
...        132. , 133. , 134. , 137. , 135. , 134. , 134. , 137. , 136. ,
...        136. , 125. , 120. , 120. , 121. , 120. , 116. , 119. , 125. ,
...        134. , 124. , 113. , 110. , 109. , 110. , 111. , 106. , 106. ,
...        102. , 101. , 102. , 105. , 105. , 108. , 110. , 109. , 113. ,
...        119. , 120. , 118. ]

>>> fitValues = [115., 126., 142., 146., 165., 175., 184., 192., 200., 208., 215.,
...        222., 227., 231., 233., 235., 234., 231., 229., 232., 234., 232.,
...        223., 212., 207., 200., 191., 191., 194., 195.]

>>> model = auto_arima(trainValues, seasonal=False)
/Users/asmith/envs/pmdarima/lib/python3.5/site-packages/pmdarima/arima/auto.py:870: ModelFitWarning: Unable to fit ARIMA for order=(1, 1, 2); data is likely non-stationary. (if you do not want to see these warnings, run with error_action="ignore")
  ModelFitWarning)
/Users/asmith/envs/pmdarima/lib/python3.5/site-packages/statsmodels/base/model.py:488: HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
  'available', HessianInversionWarning)

>>> fittedModel = model.fit(fitValues[0:27]) #this fits successfully
/Users/asmith/envs/pmdarima/lib/python3.5/site-packages/statsmodels/tsa/tsatools.py:674: RuntimeWarning: invalid value encountered in double_scalars
  tmp[kiter] = (macoefs[kiter]-b *macoefs[j-kiter-1])/(1-b**2)
/Users/asmith/envs/pmdarima/lib/python3.5/site-packages/statsmodels/tsa/tsatools.py:676: RuntimeWarning: divide by zero encountered in true_divide
  invmacoefs = -np.log((1-macoefs)/(1+macoefs))
/Users/asmith/envs/pmdarima/lib/python3.5/site-packages/statsmodels/tsa/tsatools.py:650: RuntimeWarning: invalid value encountered in true_divide
  newparams = ((1-np.exp(-params))/(1+np.exp(-params))).copy()
/Users/asmith/envs/pmdarima/lib/python3.5/site-packages/statsmodels/tsa/tsatools.py:651: RuntimeWarning: invalid value encountered in true_divide
  tmp = ((1-np.exp(-params))/(1+np.exp(-params))).copy()
/Users/asmith/envs/pmdarima/lib/python3.5/site-packages/statsmodels/tools/numdiff.py:243: RuntimeWarning: invalid value encountered in add
  **kwargs)).imag/2./hess[i, j]
/Users/asmith/envs/pmdarima/lib/python3.5/site-packages/statsmodels/tools/numdiff.py:243: RuntimeWarning: invalid value encountered in multiply
  **kwargs)).imag/2./hess[i, j]
/Users/asmith/envs/pmdarima/lib/python3.5/site-packages/statsmodels/base/model.py:488: HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
  'available', HessianInversionWarning)

>>> print(fittedModel.predict(n_periods=2))
[nan nan]
MelissaBain commented 5 years ago

I didn’t have error_action set to ignore, but I realized early in my notebook I had a broader warning suppression turned on because a different package kept giving me future warnings. My understanding was that auto-arima tries a bunch of different permutations of parameters, and returns the models in the order of the lowest AIC. For the first warning where it says ModelFitWarning due to stationarity and there’s an issue with the hessian, it just won’t return that as a valid fitted model, right? (There’s nothing intrinsically bad about it trying parameter combinations that don’t fit, right?). I had trace turned on for a while and saw that some of the parameter permutations didn’t work. Is that warning a problem? I thought it was natural that some permutations wouldn’t work.

As for the second bunch of warnings (the runtime warnings), what’s the difference between when it throws a warning versus and error? Right now I have code that is cycling through the models from best to worse until it finds one that doesn’t throw an error. In this case, since it just threw a warning message, it was accepted (where as the example below that has the SVD LinAlg Error would have been caught).

Sorry to have bothered you, given that it turned out to be an issue with my code and not your package. This was actually the first time I’ve ever filed a git issue (I’m a recent college grad that didn’t major in CS), so I really appreciate your kindness! I didn’t mean to turn this into a stack overflow type problem instead of finding a true bug!

On Feb 16, 2019, at 9:46 AM, Aaron Smith notifications@github.com wrote:

Hey, thanks for the well-formatted issue! On my end, I get a lot of errors thrown from running your example. Are you by any chance using error_action="ignore"

from numpy.linalg import LinAlgError from pmdarima.arima import auto_arima

trainValues = [144.5, 142. , 139. , 137. , 135. , 135. , 134. , 134. , 133. , ... 132. , 133. , 134. , 137. , 135. , 134. , 134. , 137. , 136. , ... 136. , 125. , 120. , 120. , 121. , 120. , 116. , 119. , 125. , ... 134. , 124. , 113. , 110. , 109. , 110. , 111. , 106. , 106. , ... 102. , 101. , 102. , 105. , 105. , 108. , 110. , 109. , 113. , ... 119. , 120. , 118. ]

fitValues = [115., 126., 142., 146., 165., 175., 184., 192., 200., 208., 215., ... 222., 227., 231., 233., 235., 234., 231., 229., 232., 234., 232., ... 223., 212., 207., 200., 191., 191., 194., 195.]

model = auto_arima(trainValues, seasonal=False) /Users/asmith/envs/pmdarima/lib/python3.5/site-packages/pmdarima/arima/auto.py:870: ModelFitWarning: Unable to fit ARIMA for order=(1, 1, 2); data is likely non-stationary. (if you do not want to see these warnings, run with error_action="ignore") ModelFitWarning) /Users/asmith/envs/pmdarima/lib/python3.5/site-packages/statsmodels/base/model.py:488: HessianInversionWarning: Inverting hessian failed, no bse or cov_params available 'available', HessianInversionWarning)

fittedModel = model.fit(fitValues[0:27]) #this fits successfully /Users/asmith/envs/pmdarima/lib/python3.5/site-packages/statsmodels/tsa/tsatools.py:674: RuntimeWarning: invalid value encountered in double_scalars tmp[kiter] = (macoefs[kiter]-b *macoefs[j-kiter-1])/(1-b2) /Users/asmith/envs/pmdarima/lib/python3.5/site-packages/statsmodels/tsa/tsatools.py:676: RuntimeWarning: divide by zero encountered in true_divide invmacoefs = -np.log((1-macoefs)/(1+macoefs)) /Users/asmith/envs/pmdarima/lib/python3.5/site-packages/statsmodels/tsa/tsatools.py:650: RuntimeWarning: invalid value encountered in true_divide newparams = ((1-np.exp(-params))/(1+np.exp(-params))).copy() /Users/asmith/envs/pmdarima/lib/python3.5/site-packages/statsmodels/tsa/tsatools.py:651: RuntimeWarning: invalid value encountered in true_divide tmp = ((1-np.exp(-params))/(1+np.exp(-params))).copy() /Users/asmith/envs/pmdarima/lib/python3.5/site-packages/statsmodels/tools/numdiff.py:243: RuntimeWarning: invalid value encountered in add kwargs)).imag/2./hess[i, j] /Users/asmith/envs/pmdarima/lib/python3.5/site-packages/statsmodels/tools/numdiff.py:243: RuntimeWarning: invalid value encountered in multiply **kwargs)).imag/2./hess[i, j] /Users/asmith/envs/pmdarima/lib/python3.5/site-packages/statsmodels/base/model.py:488: HessianInversionWarning: Inverting hessian failed, no bse or cov_params available 'available', HessianInversionWarning)

print(fittedModel.predict(n_periods=2)) [nan nan] — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/tgsmith61591/pmdarima/issues/101#issuecomment-464357199, or mute the thread https://github.com/notifications/unsubscribe-auth/AWHs7XW8oaF_Tu7Avs6GEnYDtP6LhSt8ks5vOCfggaJpZM4a91ja.

tgsmith61591 commented 5 years ago

Hey Melissa, welcome to the world of open source 😄 I think Aaron is correct that these are a product of the HessianInversionWarnings in the statsmodels package under the hood, but here are some more clarifying points:

My understanding was that auto-arima tries a bunch of different permutations of parameters, and returns the models in the order of the lowest AIC

This is correct. In addition to AIC, however, you could specify BIC or AICc if you wanted. Not super relevant to your issue, but just in case you weren't aware.

For the first warning where it says ModelFitWarning due to stationarity and there’s an issue with the hessian, it just won’t return that as a valid fitted model, right?

This is correct. When pmdarima detects some of these warnings or errors, it will raise its own warning with a (hopefully) more relevant, helpful message. There is nothing intrinsically bad about these warnings. They often mean your data doesn't pass some test of stationarity with the current candidate parameters so we can't perform time series analysis on them, or that your data may follow a different pattern, like a simple polynomial.

As for the second bunch of warnings (the runtime warnings), what’s the difference between when it throws a warning versus and error?

Unfortunately, this is a design choice made by lower-level package authors. Pmdarima depends on NumPy and statsmodels, both great packages, but which handle errors differently. Consider the following numpy code:

>>> import numpy as np
>>> np.linalg.inv(np.zeros((5, 5)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda3/envs/.../site-packages/numpy/linalg/linalg.py", line 532, in inv
    ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
  File "/anaconda3/envs.../site-packages/numpy/linalg/linalg.py", line 89, in _raise_linalgerror_singular
    raise LinAlgError("Singular matrix")
numpy.linalg.linalg.LinAlgError: Singular matrix

Numpy will fail out with an error for an operation that cannot be completed, while in many cases statsmodels chooses to warn instead. In fact, you can see the actual code that handles this error handling in auto arima as of v1.1.0 here:

def _fit_arima(x, xreg, order, seasonal_order, start_params, trend,
               method, transparams, solver, maxiter, disp, callback,
               fit_params, suppress_warnings, trace, error_action,
               out_of_sample_size, scoring, scoring_args, with_intercept):
    start = time.time()
    try:
        fit = ARIMA(order=order, seasonal_order=seasonal_order,
                    start_params=start_params, trend=trend, method=method,
                    transparams=transparams, solver=solver, maxiter=maxiter,
                    disp=disp, callback=callback,
                    suppress_warnings=suppress_warnings,
                    out_of_sample_size=out_of_sample_size, scoring=scoring,
                    scoring_args=scoring_args,
                    with_intercept=with_intercept)\
            .fit(x, exogenous=xreg, **fit_params)

    # for non-stationarity errors or singular matrices, return None
    except (LinAlgError, ValueError) as v:
        if error_action == 'warn':
            warnings.warn(_fmt_warning_str(order, seasonal_order),
                          ModelFitWarning)
        elif error_action == 'raise':
            # todo: can we do something more informative in case
            # the error is not on the pmdarima side?
            raise v
        # if it's 'ignore' or 'warn', we just return None
        fit = None

    # do trace
    if trace:
        print('Fit ARIMA: %s; AIC=%.3f, BIC=%.3f, Fit time=%.3f seconds'
              % (_fmt_order_info(order, seasonal_order),
                 fit.aic() if fit is not None else np.nan,
                 fit.bic() if fit is not None else np.nan,
                 time.time() - start if fit is not None else np.nan))

    return fit

There are some statsmodels operations which will internally raise errors (ValueErrors), which we've handled in that block, but I think what you've identified here may be another case we need to handle. If there are Hessian inversion warnings, there should not be a way for us to really recover for that parameter combination... ideally this would be an error inside of statsmodels, but they left it at warning.

I'm going to add a handler for catching said warnings for our v1.2.0 release. Will leave this issue open in the meantime for related comments or discoveries...

MelissaBain commented 5 years ago

Thank you so much for your thorough explanation! There is one thing I'm still confused about: if the Nans were being produced due to a Hessian inversion warning, does that mean the values produced by ARIMA in the stats modeling package shouldn't be thought of as valid predictions? Consider the following code:


from pmdarima.arima import auto_arima
from statsmodels.tsa.arima_model import ARIMA

trainValues = [144.5, 142. , 139. , 137. , 135. , 135. , 134. , 134. , 133. ,
       132. , 133. , 134. , 137. , 135. , 134. , 134. , 137. , 136. ,
       136. , 125. , 120. , 120. , 121. , 120. , 116. , 119. , 125. ,
       134. , 124. , 113. , 110. , 109. , 110. , 111. , 106. , 106. ,
       102. , 101. , 102. , 105. , 105. , 108. , 110. , 109. , 113. ,
       119. , 120. , 118. ]

fitValues = [115., 126., 142., 146., 165., 175., 184., 192., 200., 208., 215.,
       222., 227., 231., 233., 235., 234., 231., 229., 232., 234., 232.,
       223., 212., 207., 200., 191., 191., 194., 195.]

model = auto_arima(trainValues, seasonal=False)

order = model.order
model = ARIMA(fitValues, order=order)
model_fit = model.fit(disp=0)
model_fit.forecast(steps = 2)

This is the same as what I originally posted except instead of using the fit and predict methods build into auto-arima, I extracted the order and used the arima method from the stats modeling package. In this case, I don't get a warning, nor is the output Nans. If the underlying issue is the hessian inversion, wouldn't I expect to see that warning when I use this other arima function on the same data with the same parameters? Should this new non-nan output be considered valid?

aaronreidsmith commented 4 years ago

@tgsmith61591 Tagging you so our new Stale bot doesn't close this

tgsmith61591 commented 4 years ago

On master, I no longer get nans. May have to do with the fact that we recently switched to SARIMAX for everything under the hood (#219, to be released in 1.5.0).

>>> print(fittedModel.predict(n_periods=2))
[182.60568574 173.73338402]