Closed. ihadanny closed this issue 4 years ago.
Imagine we have the following:
>>> import numpy as np
>>> import pmdarima as pm
>>> y = pm.datasets.load_wineind()
>>> train, test = y[:150], y[150:]
>>> fit = pm.ARIMA((1, 1, 2), seasonal_order=(0, 1, 1, 12)).fit(train)
Now, let's try to add the test observations to the tail of the model's endog array. Fortunately for us, we can access that via fit.arima_res_.model.endog:
>>> fit.arima_res_.model.endog.shape
(150, 1)
Annoyingly, it adds a dimension, but no biggie. We can append easily enough:
>>> fit.arima_res_.model.endog = np.concatenate([
... fit.arima_res_.model.endog,
... test.reshape((-1, 1))])
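The shape bookkeeping here can be demonstrated with plain NumPy, independent of pmdarima (a minimal sketch; the arrays are stand-ins for the wineind slices):

```python
import numpy as np

train = np.arange(150, dtype=float)      # stand-in for the 150 training samples
test = np.arange(150, 176, dtype=float)  # stand-in for the held-out samples

# statsmodels stores endog as a 2-D column vector, hence the extra dimension;
# the test slice must be reshaped the same way before concatenating:
endog = train.reshape((-1, 1))
appended = np.concatenate([endog, test.reshape((-1, 1))])

print(endog.shape)     # (150, 1)
print(appended.shape)  # (176, 1)
```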
Let's try to produce some new forecasts now:
>>> fit.predict(10)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/anaconda3/lib/python3.7/site-packages/pmdarima/arima/arima.py", line 697, in predict
alpha=alpha)
File "/anaconda3/lib/python3.7/site-packages/pmdarima/arima/arima.py", line 83, in _seasonal_prediction_with_confidence
**kwargs)
File "/anaconda3/lib/python3.7/site-packages/statsmodels/tsa/statespace/sarimax.py", line 1939, in get_prediction
**kwargs)
File "/anaconda3/lib/python3.7/site-packages/statsmodels/tsa/statespace/mlemodel.py", line 2389, in get_prediction
start, end + out_of_sample + 1, dynamic, **kwargs)
File "/anaconda3/lib/python3.7/site-packages/statsmodels/tsa/statespace/kalman_filter.py", line 1997, in predict
raise ValueError(exception % name)
ValueError: Forecasting for models with time-varying state_intercept matrix requires an updated time-varying matrix for the period to be forecasted.
Yikes. If you dig into the statsmodels code for what happens when you call predict, you're in for a treat. The reality is, statsmodels creates a lot of private instance attributes when you call fit, and we would have to do an unbelievable amount of surgery on private variables simply to append to the endog/exog arrays. If you don't believe me, check this out:
>>> fit.arima_res_.model.data.endog.shape
(150,)
What? Statsmodels keeps another copy of the data under the hood that we'd have to change as well (note this copy is 1-D, so no reshape this time):
>>> fit.arima_res_.model.data.endog = np.concatenate([
... fit.arima_res_.model.data.endog,
... test])
But try predict again and you'll get the same error.
> if you can share the experience you had and what APIs did you try...
Go down this rabbit trail and you unfortunately will have the same experience, I guarantee it. Believe me, I have had heartburn about the update approach since its inception, but unless you're calling update hundreds or even thousands of times in a loop, I have no idea how it could be so prohibitively slow.
Let's go back to the original model (before we borked it) and time an update:
import time
start = time.time()
fit.update(test, maxiter=10)
print(f"This took {time.time() - start:.3f} seconds")
# This took 0.355 seconds
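As an aside, a single time.time() delta can be noisy; the stdlib timeit module averages over repetitions (a generic sketch, where expensive_call is a stand-in for something like fit.update(test, maxiter=10)):

```python
import timeit

def expensive_call():
    # Stand-in for the call you are profiling, e.g. fit.update(test, maxiter=10)
    return sum(i * i for i in range(10_000))

# Average over n runs instead of timing a single call:
n = 20
total = timeit.timeit(expensive_call, number=n)
print(f"mean: {total / n:.6f} seconds per call")
```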
If you ever figure out how to do this, we would love you to submit a PR, because you're right—it's a bit clunky. But if it's truly proving to be your largest bottleneck, you may wish to revisit how you're calling it.
> I have no idea how it could be so prohibitively slow
With these parameters the fit takes ~15 seconds on my machine (with the default maxiter=50), and I'm also getting ConvergenceWarning: Maximum Likelihood optimization failed to converge (perhaps this is the key to what happens next?). Then I proceed to update the model for a period of 4 weeks (4x7x24 samples), each time getting a prediction for the next sample. THIS takes ~30 seconds for each update! More than the original fit! And I keep getting the annoying ConvergenceWarning for each and every sample.
Do you have any clue what's wrong? Or should I triage it a bit more?
I can't help you triage without data or code, otherwise I'm just speculating. I have also noticed that statsmodels is very liberal with convergence warnings. However, this is what I'm interested in:
> each time getting a prediction for the next sample
If I understand you correctly, you're basically looping:
for each_new_sample in test:
    model.update([each_new_sample])
    model.predict(1)
Is that correct?
exactly
Since I still don't have any code, all I can comment on is the block of pseudocode I wrote above. I'd encourage against that... you might just run update once and then get your predictions. When you run update, you should probably also be passing a very small number of iterations. I am 100% convinced the slowness is due to the number of computations happening in that very large loop. Going back to something I said in my first response:
> If you ever figure out how to do this, we would love you to submit a PR, because you're right—it's a bit clunky
Perhaps you can figure out how to update the endog/exog arrays without extra iterations.
I understand you're probably trying to come up with incremental error estimates, which necessitates the loop. But I simply don't know how you're going to make that much faster as things stand.
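For reference, the loop pattern under discussion can be made runnable with a toy stand-in, just to pin down the call sequence (NaiveModel is hypothetical, not a pmdarima API):

```python
class NaiveModel:
    """Toy model: update() appends samples, predict() repeats the last value."""

    def __init__(self, history):
        self.history = list(history)

    def update(self, new_samples):
        self.history.extend(new_samples)

    def predict(self, n):
        return [self.history[-1]] * n

model = NaiveModel([1.0, 2.0, 3.0])
test = [4.0, 5.0, 6.0]

# Same update-then-predict loop as the pseudocode above:
preds = []
for each_new_sample in test:
    model.update([each_new_sample])
    preds.append(model.predict(1)[0])

print(preds)  # [4.0, 5.0, 6.0]
```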
What do you mean by incremental error estimates? I'm running a tsCV experiment: I'm fitting a model to an initial training window, and then running the loop we talked about (with an additional complication: once every 3 days I refit the model from scratch). This is exactly how I mean to use this model in production. Isn't this the most common use case for forecasting models?
The arima.update code specifically says:

> Statsmodels does not handle patching new samples in very well

Can you please explain? If you can share the experience you had and what APIs you tried... It just doesn't make sense to me to waste even one MLE iteration on this, since I want the fitted model without any change; I just want to update the underlying endog and exog.
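To put numbers on the walk-forward schedule described above (hourly samples, a full refit every 3 days, a cheap update otherwise), here is a plain-Python sketch that just counts the work involved; the names are illustrative, not pmdarima APIs:

```python
SAMPLES_PER_DAY = 24
REFIT_EVERY = 3 * SAMPLES_PER_DAY  # refit from scratch every 3 days of hourly data

def count_work(n_test_samples):
    """Count expensive refits vs. cheap updates over the test window."""
    refits, updates = 0, 0
    for i in range(n_test_samples):
        if i % REFIT_EVERY == 0:
            refits += 1   # expensive: full fit from scratch
        else:
            updates += 1  # cheap: incremental update + 1-step prediction
    return refits, updates

refits, updates = count_work(4 * 7 * 24)  # 4 weeks of hourly samples
print(refits, updates)  # 10 refits, 662 updates
```

If each update costs ~30 seconds, the updates dominate the refits by a wide margin, which is why keeping maxiter small in the update calls matters so much here.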