alkaline-ml / pmdarima

A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function.
MIT License

Bad sales predictions on daily data #32

Closed munitech4u closed 6 years ago

munitech4u commented 6 years ago


The predictions from auto_arima on daily data are all almost the same value, close to the series average. Is there anything I am doing wrong?

Steps/Code to Reproduce


import numpy as np
from pyramid.arima import auto_arima

# `train` and `test` are DataFrames of daily sales (not shown)
arima = auto_arima(np.array(train['sales']), start_p=1, start_q=1, d=0, max_p=5, max_q=5,
                   out_of_sample_size=5, suppress_warnings=True,
                   stepwise=True, error_action='ignore', trace=True)

preds = arima.predict(n_periods=test.shape[0], return_conf_int=False)

Expected Results

Predictions that are not all nearly identical.

Actual Results

Nearly identical predictions over an extended period of time. The values are almost all the same, such as 73, 74, 75. No trend is captured. I am not sure if I am doing it correctly.


Windows-10-10.0.15063
('Python', '2.7.15 |Anaconda, Inc.| (default, May 1 2018, 18:37:09) [MSC v.1500 64 bit (AMD64)]')
('Pyramid', '0.7.1')
('NumPy', '1.14.3')
('SciPy', '1.1.0')
('Scikit-Learn', '0.19.1')
('Statsmodels', '0.9.0')

tgsmith61591 commented 6 years ago

I think this is more a case of misunderstanding how ARIMAs work. Here are some (hopefully) helpful examples using Pyramid v0.8.1:

import pyramid as pm
import pandas as pd

# Read the table
X = pd.read_excel('/path/to/item_sales_daily.xlsx')

# Get the sales data as a NumPy array
y = X['sales'].values

We can now look at the autocorrelation:

>>> pm.autocorr_plot(y)


Notice what appears to be an annual seasonal trend. If you see the documentation's section on understanding seasonal periodicity (m), you can probably reason your way into a reasonable m setting. Since it's daily data with an annual trend, you might be looking at an m of 365, but you know your data better than I do, so I'm not going to tell you that's the correct answer.
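As a rough illustration of how a seasonal period shows up in the autocorrelation, here is a minimal pure-Python sketch. The helpers `autocorr` and `dominant_lag` and the synthetic weekly series are assumptions made for this example; they are not part of Pyramid and not the sales data from this issue. It only reads off the lag with the highest sample autocorrelation, so treat it as a sanity check alongside the plot, not a replacement for it.

```python
import math


def autocorr(series, lag):
    """Sample autocorrelation of `series` at a positive integer lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var


def dominant_lag(series, min_lag, max_lag):
    """Lag in [min_lag, max_lag] with the largest sample autocorrelation."""
    return max(range(min_lag, max_lag + 1), key=lambda k: autocorr(series, k))


# Synthetic "daily" series with a weekly cycle (period 7), for illustration only
y_demo = [10 + 3 * math.sin(2 * math.pi * t / 7) for t in range(140)]

print(dominant_lag(y_demo, 2, 20))  # prints 7
```

On real data the peak is rarely this clean, which is why the choice of m still needs some knowledge of the data.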

Number of differences

You set d=0 for some reason. Do you have reason to believe your data is already stationary? Because it's definitely not. Here's how you can estimate the d parameter (again, this is an estimate):

>>> pm.arima.ndiffs(y, test='kpss', max_d=5)

This performs a Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test in the background to estimate the number of differences needed to make your time series stationary. 1 seems to do the trick. The other available tests are test='pp' (Phillips–Perron) and test='adf' (Augmented Dickey-Fuller). See the documentation's section on enforcing stationarity and the API ref for more detailed information on these tests.
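To make "number of differences" concrete: differencing replaces each value with its change from the previous value, and d is how many times that is applied. Below is a minimal pure-Python sketch. The `difference` helper and the toy numbers are hypothetical, and this is not the `ndiffs` implementation; it only shows what a single difference does to a trending series.

```python
def difference(series, d=1):
    """Apply first-order differencing d times."""
    out = list(series)
    for _ in range(d):
        out = [curr - prev for prev, curr in zip(out, out[1:])]
    return out


# Toy series with a steady upward trend plus a small wiggle
trend = [100, 103, 105, 108, 110, 113, 115, 118]

print(difference(trend, d=1))  # [3, 2, 3, 2, 3, 2, 3]
```

The levels keep drifting upward, while the differenced values hover around a constant. That is the kind of series an ARMA model with d=0 can reasonably describe.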

Number of seasonal differences

We can perform a similar estimate of D using a Canova-Hansen test for seasonal differencing. Assuming m=365 (which, again, is my own uneducated guess):

>>> pm.arima.nsdiffs(y, m=365, max_D=5)
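For intuition about what D controls: a seasonal difference subtracts the value from m steps earlier instead of the previous step. The sketch below is pure Python with a hypothetical `seasonal_difference` helper and a made-up series of period 4; it is not how `nsdiffs` works internally.

```python
def seasonal_difference(series, m, D=1):
    """Subtract the value m steps earlier, repeated D times."""
    out = list(series)
    for _ in range(D):
        out = [out[t] - out[t - m] for t in range(m, len(out))]
    return out


# Toy series: a repeating pattern of period 4 that grows by 1 each cycle
pattern = [5, 9, 4, 2]
series = [pattern[t % 4] + t // 4 for t in range(16)]

print(seasonal_difference(series, m=4))  # twelve 1s: the seasonal pattern is gone
```

Each seasonal difference drops m observations, so with m=365 every extra unit of D costs a full year of daily data.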

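So now you've learned several things: the data appears to be seasonal (possibly with m=365), it is not stationary as-is, so d=0 is the wrong setting (the KPSS estimate above points to d=1), and D can be estimated in the same way.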

That's a starting point. Since this is such a data-related question, I can't solve the whole thing for you, but hopefully that gives you a jumping-off point.
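On the original symptom (forecasts flattening out around 73 to 75): with d=0 the model is a stationary ARMA, and its multi-step forecasts decay toward the series mean. Here is a minimal sketch with a hypothetical AR(1) model. The coefficient, mean and last observation are made-up numbers, not values fitted to the sales data.

```python
def ar1_forecast(last_value, mean, phi, n_periods):
    """Multi-step point forecasts for y_t - mean = phi * (y_{t-1} - mean) + noise."""
    forecasts = []
    deviation = last_value - mean
    for _ in range(n_periods):
        # Each step ahead shrinks the remaining deviation from the mean by phi
        deviation *= phi
        forecasts.append(mean + deviation)
    return forecasts


preds = ar1_forecast(last_value=90.0, mean=74.0, phi=0.5, n_periods=10)
print([round(p, 2) for p in preds])
```

The first few forecasts still reflect the last observation, but after a handful of steps they are all within a fraction of a unit of 74. That is what a long forecast horizon looks like when no differencing or seasonality is modelled.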

munitech4u commented 6 years ago

Sorry, my bad. I didn't notice d=0 (I thought it also tried to find d). However, setting d=1 and D=3 results in the error below:

ValueError: if explicitly defined, d & D must be <= max_d & <= max_D, respectively

I really appreciate your effort in putting together such a detailed answer! Cheers!

munitech4u commented 6 years ago

Never mind, the error went away after adding the max_D parameter (by default it is 2).

tgsmith61591 commented 6 years ago

Yeah the ValueError is a bit of silly over-validation on my end. That's been fixed in v0.9.0 (not yet released)