alkaline-ml / pmdarima

A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function.
https://www.alkaline-ml.com/pmdarima
MIT License

Difference between R and Python implementation of auto.arima #170

Closed sxooler closed 4 years ago

sxooler commented 5 years ago

Question: The documentation states:

pmdarima is designed to behave as similarly to R’s well-known auto.arima as possible

Could someone answer the question, what are the differences?

  1. I tried to run a simple ARIMA (the order is not important at the moment) on my dataset (1,000 data points), something like this:
import pmdarima as pm
arima = pm.ARIMA(order=(0,1,1))
arima_fit = arima.fit(ts_data)

And it's pretty fast.

  2. When I use the same dataset with the auto_arima function (e.g. pm.auto_arima(ts_data)), it takes a bit more time (measured with timeit):

1.07 s ± 53.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The R implementation of auto.arima is roughly 10 times faster. What's the reason? Is there a way to improve that?

  3. When I take the same dataset and use the R and Python implementations of auto-ARIMA, I get different results (depending on the data). The default parameters seem to be the same. What's the reason for that?

Versions
Windows-10-10.0.17134-SP0
Python 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]
pmdarima 1.2.1
NumPy 1.16.4
SciPy 1.2.1
Scikit-Learn 0.20.2
Statsmodels 0.9.0

tgsmith61591 commented 5 years ago

All great questions, thanks for asking! Would you be able to provide a minimal reproducible sample of data? It will be easier for me to answer if I can put together some examples.

sxooler commented 5 years ago

I'm not sure I can share my dataset, but let's take one from pmdarima, say wineind. If I try it with this Python implementation and with R, as mentioned above, I see the same thing (the Python auto-fit is much slower and the results differ). I was also curious why the AIC and BIC numbers differ between the two implementations for the same order parameters and dataset.

tgsmith61591 commented 5 years ago

Doing a side-by-side on wineind:

Python:

As stated, this does take a while. Timings for each model are shown via trace:

# For exact replicability, you can run this in a docker image:
# $ docker run --rm -it tgsmith61591/pmdarima:1.2.1
import pmdarima as pm

y = pm.datasets.load_wineind()
model = pm.auto_arima(y, seasonal=True, m=12, suppress_warnings=True, trace=True)
# Fit ARIMA: order=(2, 1, 2) seasonal_order=(1, 1, 1, 12); AIC=3069.879, BIC=3094.629, Fit time=1.109 seconds
# Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 1, 0, 12); AIC=3133.376, BIC=3139.564, Fit time=0.046 seconds
# Fit ARIMA: order=(1, 1, 0) seasonal_order=(1, 1, 0, 12); AIC=3099.734, BIC=3112.109, Fit time=0.289 seconds
# Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=3066.930, BIC=3079.305, Fit time=0.262 seconds
# Fit ARIMA: order=(0, 1, 1) seasonal_order=(1, 1, 1, 12); AIC=3067.842, BIC=3083.311, Fit time=0.931 seconds
# Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 0, 12); AIC=3090.512, BIC=3099.793, Fit time=0.051 seconds
# Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 2, 12); AIC=3068.080, BIC=3083.549, Fit time=0.702 seconds
# Fit ARIMA: order=(0, 1, 1) seasonal_order=(1, 1, 2, 12); AIC=3067.847, BIC=3086.410, Fit time=5.332 seconds
# Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=3066.760, BIC=3082.229, Fit time=0.851 seconds
# Fit ARIMA: order=(1, 1, 0) seasonal_order=(0, 1, 1, 12); AIC=3094.571, BIC=3106.946, Fit time=0.225 seconds
# Fit ARIMA: order=(1, 1, 2) seasonal_order=(0, 1, 1, 12); AIC=3066.742, BIC=3085.305, Fit time=0.621 seconds
# Fit ARIMA: order=(2, 1, 3) seasonal_order=(0, 1, 1, 12); AIC=3070.634, BIC=3095.384, Fit time=1.760 seconds
# Fit ARIMA: order=(1, 1, 2) seasonal_order=(1, 1, 1, 12); AIC=3067.487, BIC=3089.143, Fit time=1.790 seconds
# Fit ARIMA: order=(1, 1, 2) seasonal_order=(0, 1, 0, 12); AIC=3090.957, BIC=3106.426, Fit time=0.178 seconds
# Fit ARIMA: order=(1, 1, 2) seasonal_order=(0, 1, 2, 12); AIC=3067.731, BIC=3089.387, Fit time=1.667 seconds
# Fit ARIMA: order=(1, 1, 2) seasonal_order=(1, 1, 2, 12); AIC=3068.627, BIC=3093.377, Fit time=4.410 seconds
# Fit ARIMA: order=(0, 1, 2) seasonal_order=(0, 1, 1, 12); AIC=3066.787, BIC=3082.256, Fit time=0.334 seconds
# Fit ARIMA: order=(2, 1, 2) seasonal_order=(0, 1, 1, 12); AIC=3068.673, BIC=3090.330, Fit time=0.818 seconds
# Fit ARIMA: order=(1, 1, 3) seasonal_order=(0, 1, 1, 12); AIC=3068.810, BIC=3090.467, Fit time=0.876 seconds
# Total fit time: 22.316 seconds

R:

Returns almost immediately:

> library(forecast)
> auto.arima(wineind, seasonal=TRUE)
Series: wineind
ARIMA(1,1,2)(0,1,1)[12]

Coefficients:
         ar1      ma1     ma2     sma1
      0.4299  -1.4673  0.5339  -0.6600
s.e.  0.2984   0.2658  0.2340   0.0799

sigma^2 estimated as 5399312:  log likelihood=-1497.05
AIC=3004.1   AICc=3004.48   BIC=3019.57

Explanation and root cause analysis

First of all, when you cross the language border, there is not always such a thing as a 1-to-1 rewrite, so that is the core challenge here. Under the hood, we are using statsmodels to fit our ARMA, ARIMA and SARIMAX models, and what we're providing is the optimization code, tests of seasonality and stationarity, etc.
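(Editor's note: to illustrate the kind of "difference until stationary" logic such a test supplies, here is a toy heuristic in pure Python. It is NOT pmdarima's implementation; the real ndiffs uses unit-root tests such as KPSS/ADF. This version simply differences until the sample variance stops shrinking.)

```python
import random

def naive_ndiffs(y, max_d=2):
    """Toy estimate of the differencing order d: difference the series
    until its sample variance stops decreasing. Illustration only;
    pmdarima's real ndiffs uses KPSS/ADF-style stationarity tests."""
    def variance(x):
        m = sum(x) / len(x)
        return sum((v - m) ** 2 for v in x) / len(x)

    def diff(x):
        return [b - a for a, b in zip(x, x[1:])]

    best_d, series = 0, list(y)
    best_var = variance(series)
    for d in range(1, max_d + 1):
        series = diff(series)
        v = variance(series)
        if v >= best_var:
            break
        best_d, best_var = d, v
    return best_d

# A seeded random walk needs exactly one difference to become stationary:
random.seed(42)
steps = [random.gauss(0, 1) for _ in range(500)]
walk, total = [], 0.0
for s in steps:
    total += s
    walk.append(total)
print(naive_ndiffs(walk))  # 1
```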

Just to make sure it isn't a bottleneck on our side, I ran an experiment on the slowest of the models shown above:

import time

import pmdarima as pm
from statsmodels import api as sm

y = pm.datasets.load_wineind()

def timed(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        res = func(*args, **kwargs)
        print(f"complete in {time.time() - start:.4f} seconds")
        return res
    return wrapper

@timed
def create_model():
    sm_sarimax = sm.tsa.statespace.SARIMAX(
        endog=y, exog=None, order=(0, 1, 1),
        seasonal_order=(1, 1, 2, 12), trend='c',
        enforce_stationarity=True)
    res_wrapper = sm_sarimax.fit(
        start_params=None, method='lbfgs',
        maxiter=50, disp=0)
    return res_wrapper

create_model()
# complete in 5.2814 seconds
# <statsmodels.tsa.statespace.sarimax.SARIMAXResultsWrapper object at 0x7f33cefd2a50>

When you dig deeper, R's ARIMA code is almost 100% C, while statsmodels' is almost 100% Python. While that accounts for a lot of it, I'm not one to be swayed by that fact alone... my opinion is that something in the SARIMAX class is causing this to drag, and I don't have a perfect explanation for you right now.

RE: AIC, BIC, etc., my guess is that statsmodels computes those values in a slightly different fashion than R does. I wish I had a better answer for you... Now, it could well be that the manner in which we're using the SARIMAX interface (or one of the default options we're passing) is causing a slowdown, and that's something I'm not certain about. But I do think it warrants a deep dive by someone on our side to see if it's anything we have control over.
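(Editor's note: R's reported AIC and AICc are internally consistent with the textbook definitions, so the discrepancy with statsmodels comes down to which constant terms the log-likelihood includes and what counts as a parameter. The k and effective n below are inferred from R's output above, and they reproduce its numbers exactly.)

```python
# Recompute R's AIC/AICc from its reported log-likelihood (output above):
#   AIC  = 2k - 2*logLik
#   AICc = AIC + 2k(k+1) / (n - k - 1)
loglik = -1497.05   # from R's auto.arima output above
k = 5               # ar1, ma1, ma2, sma1, plus sigma^2 (inferred)
n = 176 - 1 - 12    # wineind length minus d=1 and D*m=12 differenced obs (inferred)

aic = 2 * k - 2 * loglik
aicc = aic + 2 * k * (k + 1) / (n - k - 1)
print(round(aic, 2), round(aicc, 2))  # 3004.1 3004.48 -- matches R
```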

Thank you for bringing this up; hopefully we can get a satisfactory explanation, if not solution, to you in a reasonable time.

tgsmith61591 commented 5 years ago

This may actually be related to #146, which could greatly improve this performance...

sxooler commented 5 years ago

Thank you for your answer! I'm aware of the performance difference between C and Python, so I wasn't surprised that R is faster than Python, but I was surprised by how large the difference is. If it's in your hands to do something about it, that would be awesome!

However, I still have one question regarding the different results. In the case above, the parameters estimated by both implementations are the same although the AICs are not equal. My guess is the same as yours (there are probably minor differences in the computation). But sometimes I also get different parameters. Here is another example:

import pmdarima as pm

# the same data without last 2 years
y = pm.datasets.load_wineind()[:-24]
model = pm.auto_arima(y, seasonal=True, m=12, suppress_warnings=True, error_action='ignore')
model.summary()
#                                 Statespace Model Results                                 
#==========================================================================================
# Dep. Variable:                                  y   No. Observations:                  152
# Model:             SARIMAX(3, 1, 0)x(2, 0, 0, 12)   Log Likelihood               -1444.628
# Date:                            Thu, 18 Jul 2019   AIC                           2903.255
# Time:                                    16:38:14   BIC                           2924.376
# Sample:                                         0   HQIC                          2911.836
# ...

But for R it's quite different:

> Series: tswineind 
ARIMA(0,1,1)(0,1,2)[12] 

Coefficients:
          ma1     sma1     sma2
      -0.8971  -0.5291  -0.1059
s.e.   0.0332   0.0882   0.0853

sigma^2 estimated as 5575033:  log likelihood=-1279.02
AIC=2566.03   AICc=2566.33   BIC=2577.77

And when I take results from R and do this:

arima = pm.ARIMA(order=(0, 1, 1), seasonal_order=(0, 1, 2, 12))
arima_fit = arima.fit(y)
arima_fit.summary()
#                                 Statespace Model Results                                 
#==========================================================================================
#Dep. Variable:                                  y   No. Observations:                  152
#Model:             SARIMAX(0, 1, 1)x(0, 1, 2, 12)   Log Likelihood               -1305.513
#Date:                            Thu, 18 Jul 2019   AIC                           2621.026
#Time:                                    16:46:32   BIC                           2635.698
#Sample:                                         0   HQIC                          2626.988
# ...

Obviously, I found a better combination than auto_arima did in Python (this one did not appear in the output when trace=True). It seems to me the problem is not the differing AIC values; the algorithms themselves must differ in something. Do you know more about that?

aaronreidsmith commented 4 years ago

@tgsmith61591 Tagging you so our new Stale bot doesn't close this

tgsmith61591 commented 4 years ago

It seems to me the problem is not the differing AIC values; the algorithms themselves must differ in something. Do you know more about that?

There was a time when R added more rules to its stepwise search algorithm, so the two diverged. The new release, 1.5.0, will reconcile those differences. However, one reason the same algorithm may not even search the same orders is the AIC of the intermediate results. The stepwise search refines itself based on the AIC, and since R and Python compute it in subtly different ways, the same progression of models may not yield the same refinement, if that makes sense.
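(Editor's note: a stepwise search of this flavor is essentially greedy hill-climbing on the information criterion, so even tiny differences in how the criterion is computed can redirect the whole search path. A toy sketch, where score is a hypothetical stand-in for fitting a model and reading off its AIC, not pmdarima's real code:)

```python
def stepwise_search(score, start=(2, 2), max_p=5, max_q=5):
    """Greedy stepwise search over (p, q): repeatedly move to the
    best-scoring neighbor (+/-1 in p or q) until none improves.
    A sketch of the Hyndman-Khandakar stepwise idea, not pmdarima's code."""
    current = start
    best = score(current)
    while True:
        p, q = current
        neighbors = [(p + dp, q + dq)
                     for dp, dq in ((1, 0), (-1, 0), (0, 1), (0, -1))
                     if 0 <= p + dp <= max_p and 0 <= q + dq <= max_q]
        best_n_score, best_n = min((score(n), n) for n in neighbors)
        if best_n_score >= best:
            return current, best
        current, best = best_n, best_n_score

# A made-up AIC surface whose minimum sits at order (1, 2):
toy_aic = lambda pq: (pq[0] - 1) ** 2 + (pq[1] - 2) ** 2 + 3000.0
order, best_aic = stepwise_search(toy_aic)
print(order, best_aic)  # (1, 2) 3000.0
```

A perturbed score function (e.g. a slightly different likelihood constant) can shift which neighbor wins a step, after which the two searches visit entirely different orders.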

Another thing I've noticed recently is that R is faster because it approximates more liberally:

> auto.arima
function (y, d = NA, D = NA, max.p = 5, max.q = 5, max.P = 2,
    max.Q = 2, max.order = 5, max.d = 2, max.D = 1, start.p = 2,
    start.q = 2, start.P = 1, start.Q = 1, stationary = FALSE,
    seasonal = TRUE, ic = c("aicc", "aic", "bic"), stepwise = TRUE,
    trace = FALSE, approximation = (length(x) > 150 | frequency(x) >
        12), truncate = NULL, xreg = NULL, test = c("kpss", "adf",
        "pp"), seasonal.test = c("seas", "ocsb", "hegy", "ch"),
    allowdrift = TRUE, allowmean = TRUE, lambda = NULL, biasadj = FALSE,
    parallel = FALSE, num.cores = 2, x = y, ...)
{
    ...
}

Notice approximation = (length(x) > 150 | frequency(x) > 12). The Python version does not approximate unless you explicitly pass simple_differencing=True, and even then the results still differ slightly. That has to do with the number of iterations run in the optimizer: R's BFGS optimization defaults to 100 iterations (run optim in R to see the function), while pmdarima's ARIMA defaults to 50. If you set simple_differencing=True, you might also crank up the maxiter param to refine it some more.
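(Editor's note: "simple differencing" means transforming the data up front, discarding the first d + D*m observations, instead of modeling the differencing inside the state space representation, which is what makes the approximation cheaper. A minimal sketch of the transform itself:)

```python
def difference(y, lag=1):
    """Apply one round of lag-k differencing: y[t] - y[t - lag]."""
    return [y[i] - y[i - lag] for i in range(lag, len(y))]

y = [float(v) for v in range(1, 25)]  # toy monthly series, length 24

# d=1 non-seasonal difference, then D=1 seasonal difference at m=12,
# as a SARIMA(p,1,q)(P,1,Q,12) fit with simple differencing would do:
dy = difference(y, lag=1)
ddy = difference(dy, lag=12)

print(len(y), len(dy), len(ddy))  # 24 23 11 -- d + D*m = 13 observations lost
```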