alkaline-ml / pmdarima

A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function.
https://www.alkaline-ml.com/pmdarima
MIT License

Question about n_jobs parameter in pm.auto_arima #234

Closed thinhnggia closed 4 years ago

thinhnggia commented 4 years ago

Question

I have a question about the n_jobs parameter in pm.auto_arima. I am running auto_arima to find the configuration for my SARIMA model. According to the documentation, n_jobs has no effect when stepwise = True; it only applies when stepwise = False. But although I have set n_jobs = 1, fitting takes a very long time and, in particular, more than one core of my CPU is busy, when as far as I understand it only one should be. Can you help me explain the reason for this?

Versions (if necessary)

Linux-4.15.0-58-generic-x86_64-with-debian-stretch-sid
Python 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]
pmdarima 1.4.0
Numpy 1.17.3
Scipy 1.3.0
Scikit-learn 0.21.2
Statsmodels 0.10.0

tgsmith61591 commented 4 years ago

Can you share some code so I can better understand what you're running? A data sample would also be helpful, if you can share it.

thinhnggia commented 4 years ago

Thank you for your response. Basically, what I am trying to do is first find the best parameters for the SARIMA model; the code below shows how I set up the search:

```python
import pmdarima as pm

def find_config(train, segment):
    model_fit = pm.auto_arima(
        train,
        start_p=1, start_q=1, max_p=3, max_q=3,
        d=None, D=None, test='adf',
        seasonal=True, m=48,
        start_P=0, start_Q=0, max_P=3, max_Q=3,
        trace=True, n_jobs=1, stepwise=True,
        error_action='ignore', suppress_warnings=True,
    )
    print("Best model: ", model_fit.summary())
    return model_fit

model = find_config(train, '30min')
```

So train is read from a .csv file with two columns, Date | Density:

| Date | Density |
| --- | --- |
| 1/1/2017 0:00:00 | 510 |
| 1/1/2017 0:30:00 | 520 |
| 1/1/2017 1:00:00 | 450 |

The data is the amount of car traffic at a crossroad, sampled every 30 minutes, so there are 48 * 365 observations for one year. You can randomly generate a sample with random densities; unfortunately I can't upload the full data. When I run find_config on the whole series, it takes a lot of running time and uses a lot of CPU cores, which is what I am struggling with.
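Since the real traffic counts are not available, a minimal sketch of a stand-in series with the same shape (one year of 30-minute observations; the density values here are hypothetical random integers, not the actual data):

```python
import numpy as np
import pandas as pd

# One year of 30-minute intervals: 48 observations/day * 365 days
rng = np.random.default_rng(42)
index = pd.date_range("2017-01-01", periods=48 * 365, freq="30min")

# Hypothetical traffic densities in an arbitrary range; the real data
# from the issue is not available, so any values will do for timing tests
density = rng.integers(300, 700, size=len(index))
train = pd.Series(density, index=index, name="Density")

print(len(train))  # 17520
```

A series like this can be passed straight to pm.auto_arima to reproduce the runtime behavior described above.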

tgsmith61591 commented 4 years ago

This is not really a question of n_jobs. I'm guessing your data is fairly large, and m=48 is quite high for a SARIMAX model. Even for a stepwise search, this can take a while. There's an open issue on the statsmodels' page that describes almost the exact symptoms you're facing: https://github.com/statsmodels/statsmodels/issues/5727

I don't have a great answer for you at the moment other than "let it run" (how long does it take, by the way?); since we use statsmodels under the hood, whatever the resolution to that issue is on their part will probably greatly speed things up for pmdarima as well.

thinhnggia commented 4 years ago

So I have been running this for quite a while; some configurations take more than 2 hours to fit. And I don't really understand why, when I use htop to inspect the process, it shows something like 44 subprocesses running.
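A side note on the htop observation: on Linux, those entries are usually threads of a single process rather than separate processes. A quick Linux-specific sketch (not from the thread) for counting the OS threads of the current Python process from inside the script:

```python
import os

def thread_count() -> int:
    """Number of OS threads in the current process (Linux only,
    via the per-thread entries under /proc/self/task)."""
    return len(os.listdir("/proc/self/task"))

print(thread_count())  # grows when a native (e.g. BLAS) thread pool spins up
```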

tgsmith61591 commented 4 years ago

There could be some form of parallelism under the hood with the lbfgs solver in statsmodels, but there is not any from pmdarima when you have stepwise=True. See for yourself:

https://github.com/tgsmith61591/pmdarima/blob/master/pmdarima/arima/_auto_solvers.py#L169
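One way to test whether the extra cores come from native thread pools (BLAS/LAPACK or OpenMP inside the optimizer) rather than from pmdarima is to cap those pools with environment variables. This is a general sketch, not anything pmdarima documents; the variables must be set before numpy/scipy/statsmodels are first imported, because the pools are sized at import time:

```python
import os

# Cap the common native thread pools at one thread each.
# Must run before numpy/scipy/statsmodels are imported.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS",
            "MKL_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np  # noqa: E402  (imported after the env vars on purpose)
# ... then import pmdarima and run auto_arima as usual
```

If the fit still saturates many cores after this, the parallelism is coming from somewhere else.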

thinhnggia commented 4 years ago

Yeah, I did look at the code to find what causes the multiprocessing. So the extra processes actually come from lbfgs, right?

tgsmith61591 commented 4 years ago

I am not saying that definitively. I am saying that the subprocesses are not coming from pmdarima's stepwise search.

thinhnggia commented 4 years ago

Thank you very much for your time. I will try to look further into statsmodels.

tgsmith61591 commented 4 years ago

You could also try method='nm', which seems to be faster. But now we're getting into very granular statsmodels territory.

See statsmodels optimizers for more info https://github.com/statsmodels/statsmodels/blob/master/statsmodels/base/optimizer.py

tgsmith61591 commented 4 years ago

Closing. Feel free to reopen if any further questions reemerge