alkaline-ml / pmdarima

A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function.
https://www.alkaline-ml.com/pmdarima
MIT License

ValueError: maxlag should be < nobs #179

Closed C0DK closed 5 years ago

C0DK commented 5 years ago

Describe the bug
I get this annoying bug when trying to fit my data with my generated ARIMA model: ValueError: maxlag should be < nobs

I am not entirely sure what it means, but upon googling I found this: "The problem is that you need more observations to estimate the model." From here: https://github.com/statsmodels/statsmodels/issues/4465#issuecomment-380459136

The person also mentions that a specific model needs at least X observations. Couldn't this relation be raised as an exception from your module? The current exception is rather obscure and comes from a low-level module. You could either validate when the initializer is called with bad data, or simply catch the error and rephrase it in language that relates more to the input I supply to your library.

I might just be a noob regarding the math, but the error isn't that useful currently. :/

To Reproduce
I've created a runnable snippet that throws the exception: https://gist.github.com/C0DK/6c21a2990b275c26779a5e157322e424
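For reference, a rough sketch of what the gist does, reconstructed from the discussion below (the sine data and sample sizes are assumptions, not the gist's exact code):

import numpy as np
import pmdarima as pm

# ~500 training samples of a predictable signal (the report uses a sine curve)
data = np.sin(np.linspace(0, 50, 510))
train_data, test_data = data[:500], data[500:]

model = pm.auto_arima(train_data, error_action="ignore")

# re-fitting on only 10 test samples is what trips the statsmodels check
output = model.fit_predict(test_data, n_periods=2)  # ValueError: maxlag should be < nobs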

Stack trace

File "/usr/local/lib/python3.6/dist-packages/pmdarima/base.py", line 46, in fit_predict
    self.fit(y, exogenous, **fit_args)
  File "/usr/local/lib/python3.6/dist-packages/pmdarima/arima/arima.py", line 439, in fit
    self._fit(y, exogenous, **fit_args)
  File "/usr/local/lib/python3.6/dist-packages/pmdarima/arima/arima.py", line 354, in _fit
    fit, self.arima_res_ = _fit_wrapper()
  File "/usr/local/lib/python3.6/dist-packages/pmdarima/arima/arima.py", line 348, in _fit_wrapper
    **fit_args)
  File "/usr/local/lib/python3.6/dist-packages/statsmodels/tsa/statespace/mlemodel.py", line 445, in fit
    start_params = self.start_params
  File "/usr/local/lib/python3.6/dist-packages/statsmodels/tsa/statespace/sarimax.py", line 938, in start_params
    self.polynomial_ma, self.k_trend, trend_data
  File "/usr/local/lib/python3.6/dist-packages/statsmodels/tsa/statespace/sarimax.py", line 863, in _conditional_sum_squares
    X = np.c_[X, lagmat(residuals, k_ma)[r-k:, cols]]
  File "/usr/local/lib/python3.6/dist-packages/statsmodels/tsa/tsatools.py", line 408, in lagmat
    raise ValueError("maxlag should be < nobs")
ValueError: maxlag should be < nobs

Versions
pmdarima 1.2.1
NumPy 1.17.0
SciPy 1.2.2
Scikit-Learn 0.21.3
Statsmodels 0.10.1

Expected behavior
An exception that guides me towards which values are valid.

tgsmith61591 commented 5 years ago

tl;dr

The problem is that you're calling fit_predict on test data when you should be calling predict on your model to get forecasted test values. fit_predict is for fitting and creating forecasts from your training samples (not test). When used as intended, the error is not raised.

Explanation

After looking at this, I don't think this is a bug. I think this is exactly the behavior that's expected... you have too few observations (nobs) for the number of lags (maxlag) your model has specified.

Now, the reason you're hitting that error is this line:

output = model.fit_predict(test_data, n_periods=2)

You just fit a model on 500 samples, and auto_arima picked out that the appropriate order should be (4, 0, 4) (model.order), and now you're throwing away that fit to run fit_predict, with its very large lag terms, on just 10 test samples. If you just want to predict, call model.predict(n_periods=2).

Going back to the linked issue... if you follow Chad's equation, he estimates you need at least 14 samples to do what you're trying:

>>> p, d, q = model.order
>>> P, D, Q, s = model.seasonal_order
>>> d + D*s + max(3*q + 1, 3*Q*s + 1, p, P*s) + 1
14

Now, going back to the error message... If we were to hardcode a check for every possible input constraint that a dependency module sets, the library would become unmaintainable. We can't possibly curate a comprehensive list of every data permutation that will raise errors in lower-level libraries, so we trust their error handling for those situations.
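That said, a caller can guard against this case up front using Chad's estimate from above; a rough sketch (min_obs_required is a hypothetical helper, not part of pmdarima; model and test_data are as in the snippet above):

def min_obs_required(model):
    # Chad's estimate from the linked statsmodels issue
    p, d, q = model.order
    P, D, Q, s = model.seasonal_order
    return d + D * s + max(3 * q + 1, 3 * Q * s + 1, p, P * s) + 1

needed = min_obs_required(model)
if len(test_data) < needed:
    raise ValueError(
        "this order needs at least %d observations; got %d"
        % (needed, len(test_data)))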

Now, if you still disagree... PRs are always welcome

C0DK commented 5 years ago

Makes sense. Thank you :)

C0DK commented 5 years ago

Just to explain: we have already implemented ML algorithms (LSTM, GRU, CNN) for time series data, but ARIMA was wanted as well. All of the former are (as you probably know) trained on a big dataset, which gives a model; that model can then be used to predict the values following whatever input is supplied. This is the logic I am trying to replicate with ARIMA: first run it over a big dataset, then make N predictions (where N is the number of steps in my time series), to validate that the model fits, etc.

The dataset is just a sine curve, as it is predictable.

Is it at all translatable to auto_arima, or am I totally missing the point? Can I at least reuse the p, d, q values and use those for the following predictions?
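As a side note, the discovered hyperparameters can be reused without re-running the search; a rough sketch (assumes each fit still sees enough observations for the chosen order):

import numpy as np
import pmdarima as pm

data = np.sin(np.linspace(0, 20, 530))
search = pm.auto_arima(data[:500], error_action="ignore")

model = pm.ARIMA(order=search.order)   # reuse (p, d, q); skips the expensive search
model.fit(data[:515])                  # re-fit on a longer history
preds = model.predict(n_periods=5)     # then forecast the next 5 points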

tgsmith61591 commented 5 years ago

that model can then be used to predict the values following whatever input is supplied

Yeah, I think you missed what I was getting at... you will never, for any ML model, fit your model on test data (i.e., do not use fit_predict on test data). To produce forecasts, just use

# predict 10 steps in the future
>>> model.predict(n_periods=10)

So your whole pipeline would just be

model = pm.auto_arima(
    data,
    trace=True,
    error_action="ignore",
)

forecasts = model.predict(n_periods=2)

See the function doc or any of the examples in the documentation for more info.
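In other words, a hold-out evaluation with pmdarima looks roughly like this sketch (split sizes are illustrative; scikit-learn is already a pmdarima dependency):

import numpy as np
import pmdarima as pm
from sklearn.metrics import mean_squared_error

data = np.sin(np.linspace(0, 20, 510))
train, test = data[:500], data[500:]

model = pm.auto_arima(train, error_action="ignore")  # fit on training data only
forecasts = model.predict(n_periods=len(test))       # forecast the held-out span
print(mean_squared_error(test, forecasts))           # never re-fit on the test set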

C0DK commented 5 years ago

So you are saying I should generate a new auto_arima model for each step in my time series data?

Because my point is that I have a dataset, in this case 53 points, and I want to make predictions throughout this dataset. I want to feed the model 10 steps to predict the coming 5. I am not using the 10 steps as testing data; I am testing against the following five (which the model doesn't have access to at any given time). This is how it is done in our GRU, LSTM, and CNN implementations to create an RNN.

[image: sliding-window animation] Between the two vertical lines is the input data, and the red line represents the output data. The grey-scaled data following the input "window" is then tested against. The animation shows the two vertical lines moving, representing a different input set, where i in data[i:i+10] increments by one at each step.

However, now I am recreating the auto_arima model at each timestep. I just want to make sure that this is exactly what you consider best practice, because in this case I create 43 different models on a really small dataset; with a larger one it seems infeasible to generate them all... But if that's how ARIMA is supposed to work, then I'll do that. It just surprises me that I cannot reuse the generated values for the next prediction.
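For reference, a sketch of the sliding-window evaluation described here, running the order search once and re-fitting a fixed-order model per window (this assumes each 10-point window has enough observations for the chosen order, which, per the estimate above, it may not):

import numpy as np
import pmdarima as pm

data = np.sin(np.linspace(0, 10, 53))   # the 53-point sine dataset
window, horizon = 10, 5

# run the expensive search once (note: searching on the full series
# leaks information into the evaluation; a training prefix would be safer)
order = pm.auto_arima(data, error_action="ignore").order

for i in range(len(data) - window - horizon + 1):
    fold = pm.ARIMA(order=order)                   # reuse (p, d, q) each step
    fold.fit(data[i:i + window])
    preds = fold.predict(n_periods=horizon)
    actual = data[i + window:i + window + horizon]
    # compare preds against actual here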