alkaline-ml / pmdarima

A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function.
https://www.alkaline-ml.com/pmdarima
MIT License

arima.update without refitting? #256

Closed · ihadanny closed this issue 4 years ago

ihadanny commented 4 years ago

The arima.update code specifically says:

        # Get the model parameters, then we have to "fit" a new one. If you're
        # reading this source code, don't panic! We're not just fitting a new
        # arbitrary model. Statsmodels does not handle patching new samples in
        # very well, so we seed the new model with the existing parameters.

Can you please explain the "Statsmodels does not handle patching new samples in very well" part? If you can share the experience you had and which APIs you tried... It just doesn't make sense to me to waste even one MLE iteration on this, since I want the fitted model without any change - I just want to update the underlying endog and exog arrays.

tgsmith61591 commented 4 years ago

Imagine we have the following:

>>> import numpy as np
>>> import pmdarima as pm
>>> y = pm.datasets.load_wineind()
>>> train, test = y[:150], y[150:]
>>> fit = pm.ARIMA((1, 1, 2), seasonal_order=(0, 1, 1, 12)).fit(train)

Now, let's try to add the test observations to the tail of the model's endog array. Fortunately for us, we can access that via fit.arima_res_.model.endog:

>>> fit.arima_res_.model.endog.shape
(150, 1)

Annoyingly, it adds a dimension, but no biggie. We can append easily enough:

>>> fit.arima_res_.model.endog = np.concatenate([
...     fit.arima_res_.model.endog,
...     test.reshape((-1, 1))])

Let's try to produce some new forecasts now:

>>> fit.predict(10)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda3/lib/python3.7/site-packages/pmdarima/arima/arima.py", line 697, in predict
    alpha=alpha)
  File "/anaconda3/lib/python3.7/site-packages/pmdarima/arima/arima.py", line 83, in _seasonal_prediction_with_confidence
    **kwargs)
  File "/anaconda3/lib/python3.7/site-packages/statsmodels/tsa/statespace/sarimax.py", line 1939, in get_prediction
    **kwargs)
  File "/anaconda3/lib/python3.7/site-packages/statsmodels/tsa/statespace/mlemodel.py", line 2389, in get_prediction
    start, end + out_of_sample + 1, dynamic, **kwargs)
  File "/anaconda3/lib/python3.7/site-packages/statsmodels/tsa/statespace/kalman_filter.py", line 1997, in predict
    raise ValueError(exception % name)
ValueError: Forecasting for models with time-varying state_intercept matrix requires an updated time-varying matrix for the period to be forecasted.

Yikes. If you dig into the statsmodels code for what's happening when you call predict, you're in for a treat. The reality is that statsmodels creates a lot of private instance attributes when you call fit, and we would have to do an unbelievable amount of surgery on private variables simply to add to the endog/exog arrays.

If you don't believe me, check this out:

>>> fit.arima_res_.model.data.endog.shape
(150,)

What? Statsmodels creates another copy of the data under the hood that we'd have to change:

>>> fit.arima_res_.model.data.endog = np.concatenate([
...     fit.arima_res_.model.data.endog,
...     test])

But try predict again and you'll get the same error.
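
To see a bit of why, continue the session above and poke at the cached state. These are statsmodels internals, so the attribute names below may vary across versions, but the point stands: the sample size (and matrices sized to it) lives in several more places than just endog:

>>> fit.arima_res_.nobs                 # still 150
>>> fit.arima_res_.model.ssm.nobs       # the Kalman filter's own copy; also still 150
>>> fit.arima_res_.model.ssm['state_intercept'].shape  # system matrices sized to 150 periods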

If you can share the experience you had and which APIs you tried...

Go down this rabbit trail and you will unfortunately have the same experience, I guarantee it. Believe me, I have had heartburn about the update approach since its inception, but unless you're calling update hundreds or even thousands of times in a loop, I have no idea how it could be so prohibitively slow.

Let's go back to the original model (before we borked it) and time an update:

import time

start = time.time()
fit.update(test, maxiter=10)
print(f"This took {time.time() - start:.3f} seconds")
# This took 0.355 seconds

If you ever figure out how to do this, we would love you to submit a PR, because you're right—it's a bit clunky. But if it's truly proving to be your largest bottleneck, you may wish to revisit how you're calling it.
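
For context, the source comment you quoted describes exactly what update does: re-fit over the concatenated data, but seed the optimizer with the already-estimated parameters so it converges in very few iterations. A rough sketch of the same idea directly in statsmodels (not pmdarima's actual code) looks like this:

import numpy as np
import pmdarima as pm
from statsmodels.tsa.statespace.sarimax import SARIMAX

y = pm.datasets.load_wineind()
train, test = y[:150], y[150:]

# Initial fit on the training window
res = SARIMAX(train, order=(1, 1, 2),
              seasonal_order=(0, 1, 1, 12)).fit(disp=0)

# "Update": rebuild the model over all data seen so far, warm-starting
# the optimizer at the previous estimates and capping the iterations
new_res = SARIMAX(np.concatenate([train, test]),
                  order=(1, 1, 2),
                  seasonal_order=(0, 1, 1, 12)).fit(
    start_params=res.params, maxiter=10, disp=0)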

ihadanny commented 4 years ago

I have no idea how it could be so prohibitively slow

  1. I have two Fourier seasonality components as exogenous features: 8×2 terms with m=24 and 6×2 terms with m=24*7, for a total of 28 Fourier terms (see the sketch below this list) - in your example you had no exogenous data.
  2. My ARIMA order is (p, d, q) = (4, 0, 4) with seasonal order (0, 0, 0, 0) - about the same as in your example.
  3. I'm fitting over a training window of 2016 samples, much more than your 150 samples.
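
Roughly, the setup looks like this sketch (a hypothetical reconstruction - the exact code isn't shown here; FourierFeaturizer is pmdarima's built-in, and the random series is just a placeholder for my real hourly data):

import numpy as np
import pmdarima as pm
from pmdarima.preprocessing import FourierFeaturizer

y = np.random.rand(2016)  # placeholder for the real hourly series

# Two Fourier expansions: daily (m=24, k=8 -> 16 columns) and
# weekly (m=24*7, k=6 -> 12 columns), 28 exogenous columns in total
_, X_daily = FourierFeaturizer(m=24, k=8).fit_transform(y)
_, X_weekly = FourierFeaturizer(m=24 * 7, k=6).fit_transform(y)
X = np.hstack([X_daily, X_weekly])

fit = pm.ARIMA(order=(4, 0, 4)).fit(y, X)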

With these parameters the fit takes ~15 seconds on my machine (with the default maxiter=50), and I'm also getting ConvergenceWarning: Maximum Likelihood optimization failed to converge (perhaps this is the key to what happens next?).

Then I proceed to update the model over a period of 4 weeks (4×7×24 = 672 samples), each time getting a prediction for the next sample. THIS takes ~30 seconds per update - more than the original fit! - and I keep getting the annoying ConvergenceWarning for each and every sample.

Do you have any clue as to what's wrong? Or should I triage it a bit more?

tgsmith61591 commented 4 years ago

I can't help you triage without data or code; otherwise I'm just speculating. I have also noticed that statsmodels is very liberal with convergence warnings. However, this is the part I'm interested in:

each time getting a prediction for the next sample

If I understand you correctly, you're basically looping:

for each_new_sample in test:
    model.update([each_new_sample])
    model.predict(1)

Is that correct?

ihadanny commented 4 years ago

exactly

tgsmith61591 commented 4 years ago

Since I still don't have any code, all I can comment on is the block of pseudocode I wrote above. I'd advise against that... you might instead run update once and then get your predictions. When you run update, you should also probably pass a very small number of iterations. I am 100% convinced the slowness is due to the number of computations happening in that very large loop. Going back to something I said in my first response:

If you ever figure out how to do this, we would love you to submit a PR, because you're right—it's a bit clunky

Perhaps you can figure out how to update the endog/exog arrays without extra iterations.
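
To make the "run update once" suggestion above concrete, here is a minimal sketch continuing the wineind example. predict_in_sample with dynamic=False (the default) produces one-step-ahead in-sample predictions, so a single update call still yields a forecast per test point:

# One update call over all the new samples, capped at a few iterations
fit.update(test, maxiter=10)

# One-step-ahead in-sample predictions over the updated stretch; each
# prediction conditions on the true lagged values, as in the loop above
preds = fit.predict_in_sample(start=len(train),
                              end=len(train) + len(test) - 1)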

tgsmith61591 commented 4 years ago

I understand you're probably trying to come up with incremental error estimates, which necessitates the loop. But I simply don't know how you're going to make that much faster as things stand.
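
For what it's worth, newer statsmodels releases (0.11 and later) added a results-level append method that re-applies the already-fitted parameters to an extended sample with no optimizer iterations at all - essentially the operation being asked for. It bypasses the pmdarima wrapper (the pmdarima ARIMA object itself is not updated), but as a sketch:

import pmdarima as pm

y = pm.datasets.load_wineind()
train, test = y[:150], y[150:]
fit = pm.ARIMA(order=(1, 1, 2), seasonal_order=(0, 1, 1, 12)).fit(train)

# Append the new observations to the underlying statsmodels results;
# refit=False (the default) keeps the parameters fixed: no MLE iterations
res = fit.arima_res_.append(test, refit=False)
fc = res.forecast(10)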

ihadanny commented 4 years ago

What do you mean by incremental error estimates? I'm running a tsCV (time series cross-validation) experiment: I'm fitting a model to an initial training window and then running the loop we talked about (with an additional complication - once every 3 days I refit the model from scratch). This is exactly how I intend to use this model in production. Isn't this the most common use case for forecasting models?
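
For concreteness, the procedure looks roughly like this sketch (synthetic placeholder data, illustrative sizes, and the Fourier exogenous features omitted for brevity):

import numpy as np
import pmdarima as pm

rng = np.random.RandomState(42)
y = rng.rand(2016 + 4 * 7 * 24)       # stand-in for the real hourly series
train, test = y[:2016], y[2016:]

REFIT_EVERY = 3 * 24                  # refit from scratch every 3 days

model = pm.ARIMA(order=(4, 0, 4)).fit(train)
preds = []
for i, obs in enumerate(test):
    if i > 0 and i % REFIT_EVERY == 0:    # periodic full refit
        model = pm.ARIMA(order=(4, 0, 4)).fit(
            np.concatenate([train, test[:i]]))
    preds.append(model.predict(1)[0])     # one-step-ahead forecast
    model.update([obs], maxiter=10)       # fold in the realized value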