alkaline-ml / pmdarima

A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function.
https://www.alkaline-ml.com/pmdarima
MIT License
1.55k stars 228 forks source link

How to make `predict_in_sample()` derived from the first `d` term of original data when differencing order (`d`) isn't zero #533

Open theabc50111 opened 1 year ago

theabc50111 commented 1 year ago

Describe the question you have

I think the the usage of predict_in_sample() in pmdarima(python pkg) is same as fitted() in forecast(R pkg).

But I find out that the output of predict_in_sample() in pmdarima(python pkg) is different with output of fitted() in forecast(R pkg) when difference order isn't zero.

I use the following python codes to generate the output of predict_in_sample() in three different differencing order (d) :

import numpy as np
from pmdarima.arima import ARIMA, auto_arima

for diff_ord in range(1,4):
    model = ARIMA(order=(2,diff_ord,1), out_of_sample_size=0, mle_regression=True, suppress_warnings=True)

    ori_time_series = np.array([0.49958017, 0.15162735, 0.86757565, 0.3093554, 0.20545085, -0.48288408, 0.6880291,
                                0.8461229, 0.8320223, -0.7372907, 0.6048833, 0.40874475, 0.57708055, 0.27590698,
                                -0.21213382, 0.4236031, 0.3324298, -0.076647766, -0.20372462, 0.93162024, 0.5740154])

    model = model.fit(ori_time_series)
    pred_in_sample = np.array(model.predict_in_sample())
    print(f"pred_in_sample: {list(pred_in_sample)}")

I copy the output of above python codes, then paste to a R script to compare the difference between predict_in_sample() in pmdarima and fitted() in forecast(R pkg), R script:

par(mfrow=c(3,2))

resid_py_arr = rbind(c(0.5074603452679165, -0.34007254404457005, 0.5791635465131197, -0.2603422455105438, -0.11715725264909946, -0.9906800274596668, 0.21001659120388472, 0.3475615524863871, 0.7611416545018854, -0.8381194770281715, 0.2623376398458589, -0.23206647709890738, 0.4220053862092831, 0.06219916252952257, -0.4222181747488597, 0.038538708232362495, -0.08313995790478867, -0.26009526647620496, -0.48253573807639416, 0.5137650759426555, 0.34360142378578173),
                     c(0.49261470998182477, -0.6081910499091976, 1.0569346650372444, -0.6076380990855064, -0.2378976168316683, -0.8352597431282236, 1.0357942689078028, 0.631678563908118, 0.2767179073401864, -1.6808983058879592, 0.5653496038522947, 0.09552482041758414, 0.3659123662710821, -0.3491598976266268, -0.6745863302200333, 0.27064401578860775, 0.05909012652824919, -0.3734973292295873, -0.4340483262207756, 0.9099684919779161, 0.09388976669825833),
                     c(0.4940135127419584, -0.8586664744669461, 1.3118624972776918, -2.3436128660405227, -0.11312237780455078, -0.2567555216685367, 1.8351322442185336, 0.16659999993838948, -0.7634839186197411, -2.006759899752263, 1.8390440448281202, 0.3681746393692201, -0.30073609275686164, -0.6107051995729558, -0.5143875708651311, 0.8646807292504737, 0.03706843182642394, -0.7386026479980223, -0.1930232885395262, 1.3558665622526413, -0.579683995736004))

fitted_val_py_arr = rbind(c(-0.007880175267916512, 0.49169989404457004, 0.28841210348688023, 0.5696976455105438, 0.32260810264909945, 0.5077959474596667, 0.47801250879611523, 0.4985613475136129, 0.07088064549811457, 0.1008287770281715, 0.3425456601541411, 0.6408112270989074, 0.15507516379071695, 0.21370781747047746, 0.21008435474885973, 0.3850643917676375, 0.4155697579047887, 0.18344750047620495, 0.27881111807639414, 0.4178551640573445, 0.23041397621421822),
                          c(0.006965460018175224, 0.7598183999091976, -0.18935901503724428, 0.9169934990855064, 0.4433484668316683, 0.35237566312822355, -0.34776516890780296, 0.21444433609188196, 0.5553043926598136, 0.9436076058879592, 0.0395336961477053, 0.31321992958241585, 0.21116818372891794, 0.6250668776266268, 0.4624525102200333, 0.15295908421139226, 0.2733396734717508, 0.2968495632295873, 0.23032370622077558, 0.02165174802208386, 0.48012563330174163),
                          c(0.005566657258041537, 1.0102938244669462, -0.44428684727769174, 2.6529682660405225, 0.3185732278045508, -0.22612855833146328, -1.1471031442185335, 0.6795229000616105, 1.595506218619741, 1.2694691997522631, -1.2341607448281202, 0.040570110630779865, 0.8778166427568617, 0.8866121795729558, 0.30225375086513107, -0.44107762925047367, 0.29536136817357606, 0.6619548819980223, -0.010701331460473806, -0.4242463222526414, 1.153699395736004))

y<-c(0.49958017, 0.15162735, 0.86757565, 0.3093554, 0.20545085, -0.48288408, 0.6880291,
     0.8461229, 0.8320223, -0.7372907, 0.6048833, 0.40874475, 0.57708055, 0.27590698,
     -0.21213382, 0.4236031, 0.3324298, -0.076647766, -0.20372462, 0.93162024, 0.5740154)  # The statistical part of the question is understanding that the in-sample one-step-ahead forecasts of an ARIMA model are actually the fitted values of that model. In R, the method fitted applied on model output object normally returns the fitted values of the model. However, the method is not applicable to the output of function arima. There is a workaround: fitted values equal original values minus residuals. Residuals can be extracted from a fitted object using the method residuals (and that applies to the output of function arima).

for (dif_ord in seq(1:3)) {
  #  Better still, use the forecast package which does have a fitted method for outputs from Arima and auto.arima. – Rob Hyndman Feb 26, 2016 at 9:49
  #install.packages('forecast')
  library(forecast)
  fit.model.2 <- Arima(y, order = c(2, dif_ord, 1))

  resid_r_forecast_arima <- residuals(fit.model.2)
  resid_py <- resid_py_arr[dif_ord,]

  plot.ts(y, xaxp = c(0, 21, 21), ylim = c(-2,2))
  axis(2, at = seq(-1.5, 1.5, 0.5), tck = 1, lty = 2, col = "grey", labels = NA)  # Add horizontal grid 
  axis(1, at = 1:21, tck = 1, lty = 2, col = "grey", labels = NA)  # Add vertical grid
  lines(resid_r_forecast_arima, col=2, lty=2)
  lines(resid_py, col=3, lty=3)
  legend(13, -1, c("origin", "resid_r", "resid_py"), col=1:3, lty=1:3, cex=1, ncol=3, y.intersp=0, x.intersp=0, text.width=0.9)
  mtext(paste("Check residual series trend for diff order:", dif_ord))

  fitted_val_r_forecast_arima <- fitted(fit.model.2)
  fitted_val_py <- fitted_val_py_arr[dif_ord,]
  plot.ts(y, xaxp = c(0, 21, 21), ylim = c(-2,2))
  axis(2, at = seq(-1.5, 1.5, 0.5), tck = 1, lty = 2, col = "grey", labels = NA)  # Add horizontal grid 
  axis(1, at = 1:21, tck = 1, lty = 2, col = "grey", labels = NA)  # Add vertical grid
  lines(fitted_val_r_forecast_arima, col=2, lty=2)
  lines(fitted_val_py, col=3, lty=3)
  legend(13, -1, c("origin", "fitted_r", "fitted_py"), col=1:3, lty=1:3, cex=1, ncol=3, y.intersp=0, x.intersp=0, text.width=1)
  mtext(paste("Check fitted series trend for diff order:", dif_ord))
}

output image of R script: image

My questions are:

Versions (if necessary)

No response