alkaline-ml / pmdarima

A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function.
https://www.alkaline-ml.com/pmdarima
MIT License
1.6k stars 234 forks

Stepwise search with AICc fails in autoarima #468

Open Simply-Adi opened 3 years ago

Simply-Adi commented 3 years ago

Describe the question you have

I am implementing backward feature elimination (BFE) that uses auto_arima to find the optimal parameters for a given set of regressors. While running the BFE, the following error arises:

~\anaconda3\envs\py38_clone\lib\site-packages\pmdarima\arima\auto.py in auto_arima(y, X, start_p, d, start_q, max_p, max_d, max_q, start_P, D, start_Q, max_P, max_D, max_Q, max_order, m, seasonal, stationary, information_criterion, alpha, test, seasonal_test, stepwise, n_jobs, start_params, trend, method, maxiter, offset_test_args, seasonal_test_args, suppress_warnings, error_action, trace, random, random_state, n_fits, return_valid_fits, out_of_sample_size, scoring, scoring_args, with_intercept, sarimax_kwargs, **fit_args)
    715         )
    716 
--> 717     sorted_res = search.solve()
    718     return _return_wrapper(sorted_res, return_valid_fits, start, trace)
    719 

~\anaconda3\envs\py38_clone\lib\site-packages\pmdarima\arima\_auto_solvers.py in solve(self)
    310         # Null model with NO constant (if we haven't tried it yet)
    311         if self.with_intercept:
--> 312             if self._do_fit((0, d, 0), (0, D, 0, m), constant=False):
    313                 p = q = P = Q = 0
    314 

~\anaconda3\envs\py38_clone\lib\site-packages\pmdarima\arima\_auto_solvers.py in _do_fit(self, order, seasonal_order, constant)
    231             self.k += 1
    232 
--> 233             fit, fit_time, new_ic = self._fit_arima(
    234                 order=order,
    235                 seasonal_order=seasonal_order,

~\anaconda3\envs\py38_clone\lib\site-packages\pmdarima\arima\_auto_solvers.py in _fit_candidate_model(y, X, order, seasonal_order, start_params, trend, method, maxiter, fit_params, suppress_warnings, trace, error_action, out_of_sample_size, scoring, scoring_args, with_intercept, information_criterion, **kwargs)
    524     else:
    525         fit_time = time.time() - start
--> 526         ic = getattr(fit, information_criterion)()  # aic, bic, aicc, etc.
    527 
    528         # check the roots of the new model, and set IC to inf if the

~\anaconda3\envs\py38_clone\lib\site-packages\pmdarima\utils\metaestimators.py in <lambda>(*args, **kwargs)
     51 
     52         # lambda, but not partial, allows help() to work with update_wrapper
---> 53         out = (lambda *args, **kwargs: self.fn(obj, *args, **kwargs))
     54         # update the docstring of the returned function
     55         update_wrapper(out, self.fn)

~\anaconda3\envs\py38_clone\lib\site-packages\pmdarima\arima\arima.py in aicc(self)
   1019         # this function to reflect other metric implementations if/when
   1020         # statsmodels incorporates AICc
-> 1021         return _aicc(self.arima_res_,
   1022                      self.nobs_,
   1023                      not self.with_intercept)

~\anaconda3\envs\py38_clone\lib\site-packages\pmdarima\arima\arima.py in _aicc(model_results, nobs, add_constant)
    121     if add_constant:
    122         add_constant += 1  # add one for constant term
--> 123     return aic + 2. * df_model * (nobs / (nobs - df_model - 1.) - 1.)
    124 
    125 

ZeroDivisionError: float division by zero

My call to auto_arima is:

import pmdarima as pm

# y_train and X_train_sub (the current subset of regressors) are defined elsewhere
autoarimax = pm.auto_arima(y_train,
                    X=X_train_sub,
                    test='kpss',       # use the KPSS test to find optimal 'd'
                    start_p=0, # initial guess for p
                    start_q=0, # initial guess for q
                    max_p=3, 
                    max_q=3, # maximum p and q
                    m=12,             
                    d=None,           # let model determine 'd', KPSS test (auto_arima default)
                    seasonal=True,  
                    stationary=False,
                    start_P=0, # initial guess for P
                    start_Q=0, # initial guess for Q
                    max_P=3, # max value of P to test
                    max_Q=3, # max value of Q to test
                    D=None, 
                    seasonal_test='ocsb', # the 'ch' test produces a standardization warning
                    trace=True,
                    information_criterion='aicc',
                    error_action='trace',  
                    suppress_warnings=True, 
                    stepwise=True)

I tried diagnosing the problem. The error appears when I run auto_arima (with AICc as the criterion) on a specific subset of regressors. For the same subset, the error disappears when I use AIC as the criterion.
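The difference between the two criteria can be reproduced without fitting anything. The sketch below copies the correction expression from the `_aicc` line in the traceback; the function name `aicc_from_aic` and the sample numbers are illustrative only and are not part of pmdarima. Whenever `nobs - df_model - 1` is exactly zero, the correction is undefined, while plain AIC has no such term.

```python
def aicc_from_aic(aic, nobs, df_model):
    """Apply the small-sample correction shown in the traceback.

    Mirrors: aic + 2 * df_model * (nobs / (nobs - df_model - 1) - 1).
    The name and arguments are illustrative, not pmdarima API.
    """
    return aic + 2.0 * df_model * (nobs / (nobs - df_model - 1.0) - 1.0)


# Many observations relative to parameters: the correction is small and finite.
print(aicc_from_aic(aic=100.0, nobs=120, df_model=3))

# Made-up case where nobs - df_model - 1 == 0: the division fails.
try:
    aicc_from_aic(aic=100.0, nobs=10, df_model=9)
except ZeroDivisionError as exc:
    print("ZeroDivisionError:", exc)
```

This would explain why only certain regressor subsets trigger the failure: each extra regressor presumably raises `df_model` by one, so only particular combinations of series length and model size hit the zero denominator.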

Please help.

Versions (if necessary)

python version== 3.8.12
pandas==1.3.3
numpy==1.21.3
seaborn==0.11.0
patsy==0.5.2
xgboost==1.5.0
pmdarima==1.8.3
statsmodels.api==0.13.0

devindg commented 3 years ago

What a coincidence. I was just about to post a bug report about this exact issue. I'm also seeing this division by zero error when using AICc as the IC. Attached are the data in question. The first column "y" is the dependent variable, and the second column "x" is an exogenous predictor.

The number of observations in the data is 10, which is less than the seasonal frequency (12) of the data. The code that raises the division by zero error is below.

mod_new = pmd.auto_arima(y, X=x, m=12,
                     seasonal=True, test='kpss', seasonal_test='ch',
                     stepwise=True, with_intercept='auto',
                     information_criterion='aicc', 
                     error_action='ignore')

Even though the number of observations is less than the seasonal frequency, and seasonal=True, I would expect auto_arima() to fall back to a simple model like ARIMA(0,0,0)(0,0,0,0) and, if necessary, revert to AIC in the event of division by zero (due to num_obs - num_parms - 1 = 0).
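A rough sketch of the fallback I have in mind is below. It is not pmdarima code: `safe_aicc` is a hypothetical helper that uses the same correction expression as the traceback and returns plain AIC whenever `nobs - df_model - 1` is zero or negative.

```python
def safe_aicc(aic, nobs, df_model):
    """Return AICc, or plain AIC when the correction is undefined.

    Hypothetical helper for illustration; not part of pmdarima.
    """
    denom = nobs - df_model - 1.0
    if denom <= 0:
        # Too few observations for the small-sample correction:
        # revert to AIC instead of dividing by zero.
        return aic
    return aic + 2.0 * df_model * (nobs / denom - 1.0)


print(safe_aicc(aic=100.0, nobs=10, df_model=9))   # falls back to AIC
print(safe_aicc(aic=100.0, nobs=120, df_model=3))  # corrected value
```

An alternative would be to return `float("inf")` in the guarded branch so the stepwise search simply skips that candidate. Which behavior is preferable is a design decision for the maintainers.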

At least for this data set, I think the division by zero error is caused by the data rather than by the other arguments passed to auto_arima(). For example, if the last observation is dropped and the remaining data are passed to auto_arima() with the same arguments as above, no division by zero error is raised. The code below raises no error.

mod_new = pmd.auto_arima(y[:-1], X=x[:-1], m=12,
                     seasonal=True, test='kpss', seasonal_test='ch',
                     stepwise=True, with_intercept='auto',
                     information_criterion='aicc', 
                     error_action='ignore')
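Until this is handled inside the library, one possible workaround on the calling side is to catch the error and retry with AIC. The sketch below assumes the `ZeroDivisionError` propagates out of `auto_arima()` as in the traceback at the top of this issue. `fit_with_ic_fallback` is a hypothetical wrapper, written against a generic callable so it does not depend on pmdarima itself.

```python
def fit_with_ic_fallback(fit, criteria=("aicc", "aic")):
    """Try each information criterion in order until one fit succeeds.

    `fit` is any callable that takes the criterion name and returns a
    fitted model. Hypothetical helper; not part of pmdarima.
    """
    if not criteria:
        raise ValueError("criteria must contain at least one name")
    last_error = None
    for criterion in criteria:
        try:
            return criterion, fit(criterion)
        except ZeroDivisionError as exc:
            # AICc correction was undefined for this data; try the next one.
            last_error = exc
    raise last_error


# Intended usage with the call from this comment (requires pmdarima and data):
# import pmdarima as pmd
# used_ic, mod_new = fit_with_ic_fallback(
#     lambda ic: pmd.auto_arima(y, X=x, m=12,
#                               seasonal=True, test='kpss', seasonal_test='ch',
#                               stepwise=True, with_intercept='auto',
#                               information_criterion=ic,
#                               error_action='ignore'))
```

The wrapper returns the criterion that was actually used, so downstream code can tell when the model was selected by AIC rather than AICc.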

Versions of software:

statsmodels.api == 0.12.2
python == 3.9.7
pmdarima == 1.8.3
numpy == 1.21.2

data.csv

tgsmith61591 commented 3 years ago

Could you please provide a sample of data that reproduces this issue?

devindg commented 3 years ago

Could you please provide a sample of data that reproduces this issue?

I'm not sure if you were responding to OP or me, but I attached "data.csv" in my post.