alkaline-ml / pmdarima

A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function.
https://www.alkaline-ml.com/pmdarima
MIT License

Why is pmdarima's predict function missing the start and end parameters of the underlying statsmodels ARIMA module? #141

Closed poojithaamin closed 5 years ago

poojithaamin commented 5 years ago

This is what I was talking about: https://www.statsmodels.org/dev/generated/statsmodels.tsa.arima_model.ARIMA.predict.html#statsmodels.tsa.arima_model.ARIMA.predict I want to be able to specify an end date up to which predictions need to be made, instead of a number of periods. Do you plan to include this in future versions?

tgsmith61591 commented 5 years ago

Short answer: pmdarima is not statsmodels.

Longer answer: pmdarima's predict is the equivalent of statsmodels' forecast. Specifying a starting point for forecasts doesn't make much sense; if you want a forecast 5 periods into the future, that's easily achievable in the current state. If you want in-sample predictions, that's also achievable. I don't see what utility a start/end point would give you that the package doesn't already provide. Therefore, I don't intend to add that functionality.

Do you have a good, specific reason why this is necessary? And if so, can you demonstrate it with a repeatable example?

poojithaamin commented 5 years ago

Thank you for your reply. Suppose I have an ARIMA model fit with data through Feb 2018, and I need to predict monthly data through Dec 2018. If I am given only the model, without knowing the last index of the fit (that is, the last month of the training data), I cannot work out how many periods I need in order to forecast through Dec 2018. Having an end_date parameter would be useful here. I hope that explains the case.

tgsmith61591 commented 5 years ago

I understand that need, but date logic is something that semantically probably shouldn't live within a mathematical library, especially with nuances like timezones, daylight saving time, etc. Make sense?

Best advice would be to have a utility function that computes the number of periods between your last observation and the date you need to estimate, and then just forecast that number of periods forward.
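A utility like that could look roughly like the sketch below. It assumes monthly data and that you know the month of the last observation; `periods_until` is a hypothetical helper name, not part of pmdarima.

```python
from datetime import date


def periods_until(last_observed: date, target: date) -> int:
    """Count the monthly steps from the last observed month to the target month.

    Hypothetical helper (not a pmdarima function); assumes monthly frequency.
    """
    n = (target.year - last_observed.year) * 12 + (target.month - last_observed.month)
    if n <= 0:
        raise ValueError("target must fall after the last observed month")
    return n


# Model fit through Feb 2018, forecasts wanted through Dec 2018.
n_periods = periods_until(date(2018, 2, 1), date(2018, 12, 1))

# With a fitted pmdarima model this would then be passed along as:
# forecasts = model.predict(n_periods=n_periods)
```

Other frequencies (daily, weekly, business days) would need their own step arithmetic, which is the kind of domain-specific logic that is easier to keep outside the model.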

Furthermore, we convert all timeseries arrays to numpy arrays as internal representations (because we use a lot of Cython internally) so any date information in an index would be lost.

Finally, we tend to support the philosophy that a project should address its scope very well rather than trying to solve all possible permutations of a problem. Given this type of issue can be so domain specific, we made an early decision not to deal with dates and to handle everything with slicing. We feel, in the long-term, this gives everyone more flexibility since they can pre- and post-process as needed, and aren't at the mercy of a black box.
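As a rough illustration of that kind of post-processing, the sketch below re-attaches month labels to a plain array of forecasts. It assumes monthly data; `month_labels` is a hypothetical helper, and the forecast values are placeholders standing in for the output of `model.predict(n_periods=3)`.

```python
from datetime import date


def month_labels(last_observed: date, n_periods: int) -> list:
    """Return the first day of each of the n_periods months after last_observed.

    Hypothetical helper (not a pmdarima function); assumes monthly frequency.
    """
    labels = []
    year, month = last_observed.year, last_observed.month
    for _ in range(n_periods):
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
        labels.append(date(year, month, 1))
    return labels


# Placeholder numbers only; in practice these come from model.predict(n_periods=3).
forecasts = [101.2, 103.5, 99.8]

# Pair each forecast with the month it refers to.
labelled = list(zip(month_labels(date(2018, 2, 1), len(forecasts)), forecasts))
```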

I'll leave this issue open for a while and if there is enough interest, I may change my stance. But keep in mind, PRs are always welcome.

tgsmith61591 commented 5 years ago

Also keep in mind the update function exists to continually maintain your model. Properly maintained, you should always have a good estimate of when the last observed values occurred.

poojithaamin commented 5 years ago

In that case, I'll look at the possibility of saving the last date of the training data as metadata in our system and use that to calculate n_periods while forecasting. Thanks once again for your time.
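One possible shape for that, as a sketch only: write the last training date to a small JSON file next to the saved model and read it back when forecasting. The file layout, the `last_observed` key, and both function names are assumptions for illustration, and monthly data is assumed.

```python
import json
from datetime import date
from pathlib import Path


def save_fit_metadata(path, last_observed: date) -> None:
    """Store the last training date in a JSON sidecar file (hypothetical layout)."""
    Path(path).write_text(json.dumps({"last_observed": last_observed.isoformat()}))


def n_periods_from_metadata(path, target: date) -> int:
    """Read the stored date back and count the monthly steps up to target."""
    meta = json.loads(Path(path).read_text())
    last = date.fromisoformat(meta["last_observed"])
    n = (target.year - last.year) * 12 + (target.month - last.month)
    if n <= 0:
        raise ValueError("target must fall after the last observed month")
    return n


# At training time:
#   save_fit_metadata("model_meta.json", date(2018, 2, 1))
# At forecast time:
#   n = n_periods_from_metadata("model_meta.json", date(2018, 12, 1))
#   forecasts = model.predict(n_periods=n)
```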

tgsmith61591 commented 5 years ago

~Date may not even be part of the metadata. We accept all forms of 1d arrays (tuples, lists, series, anything expressible as 1d). I take the stance that the responsibility of model documentation falls on the developer.~

(Sorry I misread your comment)

That said, you can access the original endogenous array of a fitted arima:

your_model.arima_res_.model.data.endog

That indirectly solves your problem.

poojithaamin commented 5 years ago

yeah.. like you said, model.arima_res_.model.data.endog does not give the index of the data (the dates), even if I fit the model with a series that has a date index.