alkaline-ml / pmdarima

A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function.
https://www.alkaline-ml.com/pmdarima
MIT License

Bad sales predictions on daily data #32

Closed by munitech4u 6 years ago

munitech4u commented 6 years ago

Description

The predictions from auto_arima on daily data are almost all the same average value. Is there anything I am doing wrong?

Steps/Code to Reproduce

item_sales_daily.xlsx

import numpy as np
from pyramid.arima import auto_arima

# train/test are splits of the attached daily sales data (not shown)
arima = auto_arima(np.array(train['sales']), start_p=1, start_q=1, d=0,
                   max_p=5, max_q=5, out_of_sample_size=5,
                   suppress_warnings=True, stepwise=True,
                   error_action='ignore', trace=True)

preds = arima.predict(n_periods=test.shape[0], return_conf_int=False)

Expected Results

Predictions that are not all nearly the same value.

Actual Results

Nearly identical predictions over an extended period of time; the values are almost all around 73, 74, 75. No trend is captured. Not sure if I am doing it correctly.

Versions

Windows-10-10.0.15063
Python 2.7.15 |Anaconda, Inc.| (default, May 1 2018, 18:37:09) [MSC v.1500 64 bit (AMD64)]
Pyramid 0.7.1
NumPy 1.14.3
SciPy 1.1.0
Scikit-Learn 0.19.1
Statsmodels 0.9.0

tgsmith61591 commented 6 years ago

I think this is more a case of misunderstanding how ARIMAs work. Here are some (hopefully) helpful examples using Pyramid v0.8.1:

import pyramid as pm
import pandas as pd

# Read the table
X = pd.read_excel('/path/to/item_sales_daily.xlsx')

# Get the sales values as a numpy array
y = X['sales'].values

We can now look at the autocorrelation:

>>> pm.autocorr_plot(y)

[Figure: autocorrelation plot of the daily sales series]

Notice what appears to be an annual seasonal trend. If you see the documentation's section on understanding seasonal periodicity (m), you can probably reason your way into a reasonable m setting. Since it's daily data with an annual trend, you might be looking at an m of 365, but you know your data better than I do, so I'm not going to tell you that's the correct answer.
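As a rough numerical sanity check, here is a sketch that looks at the autocorrelation near lag 365 using statsmodels directly rather than a pyramid API; it assumes the series spans well over a year of daily observations, and the lag window is just an illustrative choice:

from statsmodels.tsa.stattools import acf

# y is the daily sales array loaded above; compute autocorrelations
# out to a bit beyond one year of daily lags
acfs = acf(y, nlags=400, fft=True)

# A pronounced local peak around lag 365 would support m=365
print(acfs[360:371])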

Number of differences

You set d=0 for some reason. Do you have reason to believe your data is already stationary? Because it's definitely not. Here's how you can estimate the d parameter (again, this is an estimate):

>>> pm.arima.ndiffs(y, test='kpss', max_d=5)
1

This performs a Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test in the background to determine the number of differences required to make your time series stationary; 1 seems to do the trick. Also available are test='pp' (Phillips–Perron) and test='adf' (Augmented Dickey–Fuller). See the documentation's section on enforcing stationarity and the API reference for more detail on these tests.
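For completeness, the other tests can be run the same way; outputs are omitted here since they depend on the data and may disagree with the KPSS estimate:

>>> pm.arima.ndiffs(y, test='adf', max_d=5)  # Augmented Dickey-Fuller
>>> pm.arima.ndiffs(y, test='pp', max_d=5)   # Phillips-Perron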

Number of seasonal differences

We can perform a similar estimate of D using a Canova-Hansen test for seasonal differencing. Assuming m=365 (which, again, is my own uneducated guess):

>>> pm.arima.nsdiffs(y, m=365, max_D=5)
3

So now you've learned several things: there appears to be an annual seasonal pattern (so m=365 is a reasonable starting guess), the series is not stationary (the KPSS test estimates d=1), and the Canova-Hansen test estimates D=3 for that seasonal period.

That's a starting point. Since this is such a data-related question, I can't solve the whole thing for you, but hopefully that gives you a jumping-off point.
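Tying it together, here is a rough sketch of how those estimates could be plugged back into auto_arima. This is illustrative only: the values of m, d, and D are just the guesses above, and a seasonal fit with m=365 can be very slow.

import pyramid as pm

# y is the daily sales array from above
fit = pm.arima.auto_arima(y, start_p=1, start_q=1, max_p=5, max_q=5,
                          d=1,           # from ndiffs above (KPSS)
                          D=3, max_D=3,  # from nsdiffs above; D must be <= max_D in pyramid <= 0.8.x (see below)
                          m=365, seasonal=True,
                          stepwise=True, suppress_warnings=True,
                          error_action='ignore', trace=True)

# Forecast, e.g., the next 30 days
preds = fit.predict(n_periods=30)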

munitech4u commented 6 years ago

Sorry, my bad. I didn't notice d=0 (I thought it also tried to find d). However, setting d=1 and D=3 results in the error below:

ValueError: if explicitly defined, d & D must be <= max_d & <= max_D, respectively

Really appreciate your effort in putting together such a detailed answer! Cheers!

munitech4u commented 6 years ago

Never mind, the error was resolved after adding the max_D parameter (it defaults to 2).

tgsmith61591 commented 6 years ago

Yeah, the ValueError is a bit of silly over-validation on my end. That's been fixed in v0.9.0 (not yet released).