facebook / prophet

Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.
https://facebook.github.io/prophet
MIT License

kernel keeps dying in version 0.4 #857

Closed zhangw64 closed 5 years ago

zhangw64 commented 5 years ago

I was using fbprophet version 0.4 in a Jupyter notebook. For a certain subset of the data, the fit stalls right after printing "n_changepoints greater than number of observations.Using 7.0", and after an hour the notebook reports 'kernel dead'. However, I was able to run the same forecast using version 0.3.

bletham commented 5 years ago

Interesting, there weren't any changes from 0.3 to 0.4 that I'd expect to impact model fitting. Is there any chance you could post the data so I could reproduce the issue?

Anyway, it does sound like the time series is very short, and historically, very short time series have been able to freeze fitting by stalling the optimization in Stan. See #842 for a recent example of this. In every case I've seen so far, you can get around it by using the Newton algorithm in Stan instead of L-BFGS, like

m.fit(df, algorithm='Newton')

But if you're able to post the time series that produces this that'd be very helpful so I can make sure it is working in the next version.

ericlentz commented 5 years ago

I don't know if this is related or a different issue, but on 0.4 we have many occasions where we get to this point:

Optimization terminated normally: Convergence detected: absolute parameter change was below tolerance
INFO:fbprophet:n_changepoints greater than number of observations.Using 2.0.
Initial log joint probability = -2.28944
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
       3       2.64803   3.14151e-05       27.6364   1.223e-06       0.001       40   LS failed, Hessian reset
       5       2.65119   8.98327e-05       27.6356   3.498e-06       0.001       87   LS failed, Hessian reset
       7        2.6597   8.80542e-05        27.634    3.43e-06       0.001      126   LS failed, Hessian reset
       9       2.66879   0.000182304       27.6321   7.102e-06       0.001      171   LS failed, Hessian reset
      11       2.68725   0.000191528       27.6287   7.464e-06       0.001      213   LS failed, Hessian reset
      13        2.7065   0.000534171       27.6242   2.083e-05       0.001      257   LS failed, Hessian reset

and it sits and consumes CPU but never moves on.

Other times, it looks like this:

Optimization terminated normally: Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -3.08333
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
      99       61.2723   1.08519e-07       83.8559      0.2011           1      125
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes

Again, hangs, and no progress.

We started on 0.4, so I don't know whether this worked in 0.3.

bletham commented 5 years ago

@ericlentz, probably the same. Could you post an example dataset that produces it? For instance, the first one looks like it has only 3 points in the history; it would be really helpful if you could post what they are.

ericlentz commented 5 years ago

I don't think it was just 3 data points. In our dataset, we have a forecasting horizon, which we had set to 57. We had 12 data points for the first and 11 for the second, but then we zero-fill so that, in the end, we actually have 57 data points. It is kind of a complex process, so I tried isolating just the data that caused the problem and found that, in every case, it executed without failing.

So I'm thinking that perhaps it has to do with threading? @zhangw64, did you possibly have multiple threads running? In my case, I could have had another thread running an FB forecast at the same time. I also tried running multiple threads over and over, and it still works for me, but with the nature of multi-threaded issues you have to hit it just right (if that is in fact the problem).

We're running at least 100k forecasts across multiple servers, and we found several servers that sat there consuming CPU on the FB forecast. I'm not sure how to duplicate it, though.

I could get data for you, but you'll find it working, so I'm not sure how much value that would be. Is it possible that multiple threads could be the issue?

bletham commented 5 years ago

hm, the "INFO:fbprophet:n_changepoints greater than number of observations.Using 2.0." that is being printed implies that

np.floor(n * m.changepoint_range) = 3

where n is the number of rows in the history, and m.changepoint_range defaults to 0.8. This implies there are four rows in the history. (https://github.com/facebook/prophet/blob/master/python/fbprophet/forecaster.py#L339)
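To make that arithmetic concrete, here's a small sketch of the check (paraphrasing the logic in forecaster.py, not the actual Prophet code):

```python
import numpy as np

def capped_n_changepoints(n_rows, n_changepoints=25, changepoint_range=0.8):
    # Changepoints are only placed in the first `changepoint_range` fraction
    # of the history, so the usable history size is:
    hist_size = int(np.floor(n_rows * changepoint_range))
    if n_changepoints + 1 > hist_size:
        # Too few observations: shrink n_changepoints and log the info message
        return hist_size - 1
    return n_changepoints

print(capped_n_changepoints(4))    # 4 rows -> hist_size 3 -> "Using 2.0"
print(capped_n_changepoints(100))  # plenty of rows -> default 25 is kept
```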

Rows with NaNs are removed from the history, so maybe those are being introduced when imputing the data? You can reproduce the internal transforms that Prophet makes to the data frame that you pass into fit (df) by running:

history = df[df['y'].notnull()].copy()
history = m.setup_dataframe(history, initialize_scales=True)

If you do this on the dataset that produces the info message above, you should find that history.shape[0] = 4.
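For example, here's a toy sketch (not your pipeline) of how NaN rows silently shrink the history that Prophet actually fits on:

```python
import numpy as np
import pandas as pd

# Toy frame: 7 rows, but 3 of the y values are NaN (e.g. from a bad merge
# or imputation step), so Prophet would only see 4 rows of history.
df = pd.DataFrame({
    'ds': pd.date_range('2019-01-01', periods=7, freq='D'),
    'y': [1.0, np.nan, 2.0, np.nan, 3.0, np.nan, 4.0],
})
history = df[df['y'].notnull()].copy()
print(history.shape[0])  # 4
```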

ericlentz commented 5 years ago

We're not using NaNs. We are literally using zero values.

There are some ways in which we're manipulating the data such that we do in fact end up with 4 data points. You turn out to be right about that.

Anyway, this output does not include the message "fbprophet:n_changepoints greater than number of observations.Using 2.0," yet it stalled as well:

Initial log joint probability = -2.84
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
      99       70.9289    6.8518e-06       95.8768           1           1      123
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     128       70.9297   5.56874e-07       106.496   5.591e-09       0.001      195   LS failed, Hessian reset
     141       70.9297    5.9695e-09       100.329      0.3745      0.3745      211
Optimization terminated normally: Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -3.08333
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
      99       61.2723   1.08519e-07       83.8559      0.2011           1      125
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes

As did this one:

Convergence detected: absolute parameter change was below tolerance
INFO:fbprophet:n_changepoints greater than number of observations.Using 15.0.
Initial log joint probability = -2.75426
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
      37       15.5189   0.000570834       77.4792   7.452e-06       0.001       88   LS failed, Hessian reset
      63       15.5759   6.27413e-05       75.2859   8.007e-07       0.001      170   LS failed, Hessian reset

What I find really interesting is that when I transferred the data to my local development system, I didn't get the same output as when it ran in the production system that produced the aforementioned error. The "Hessian reset" numbers were not the same, for example; much of the output was actually different. I'm not sure what to make of this, since I used the exact same data and code.

bletham commented 5 years ago

The stuff being printed out is from Stan's internal L-BFGS optimizer. We have found that the L-BFGS path can differ across systems. This is probably due to differences in how the underlying linear algebra libraries were compiled, and perhaps differences in numerical precision. In fact, the L-BFGS path can differ on the same system when running the same code twice, which we believe is also due to numerical precision. So seeing slightly different outputs is not necessarily a bad thing. This is especially likely to happen when there are very few datapoints, because the model parameters are then underspecified and the likelihood function being optimized is quite flat near the optimum.
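As a tiny illustration of the kind of precision effect I mean: floating-point addition isn't associative, so merely reordering a sum (which different BLAS builds effectively do) changes the result in the last bits:

```python
a, b, c = 0.1, 0.2, 0.3

# The same three numbers, summed in two different orders.
left = (a + b) + c
right = a + (b + c)

print(left == right)   # False: the two orders disagree in the last bits
print(left - right)
```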

If you have a dataset that reliably produces the freeze, please do post it so I can make sure we get it covered in the future. In the meantime, we have found that switching from Stan's L-BFGS to Stan's Newton optimizer helps. It's slower for big datasets but seems to be more robust in these settings. You can do this with

m.fit(df, algorithm='Newton')

Could you let me know if the problem persists with the Newton optimizer?
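Also, if fits are hanging on production servers and you can't yet reproduce it, one pragmatic guard (not part of Prophet's API, just a sketch) is to run each fit in a child process with a time budget, kill it if the budget is exceeded, and have the caller retry with `m.fit(df, algorithm='Newton')`. Here `do_fit` is a placeholder standing in for the real `m.fit(df)` call:

```python
import multiprocessing as mp
import time

def do_fit(queue, seconds):
    # Placeholder for m.fit(df); sleeping stands in for the real work.
    time.sleep(seconds)
    queue.put('done')

def fit_with_budget(seconds, budget):
    ctx = mp.get_context('fork')  # fork so the child inherits state directly
    queue = ctx.Queue()
    p = ctx.Process(target=do_fit, args=(queue, seconds))
    p.start()
    p.join(budget)
    if p.is_alive():
        # The fit exceeded its budget: assume it is stuck, kill the child,
        # and let the caller retry with algorithm='Newton'.
        p.terminate()
        p.join()
        return None
    return queue.get()
```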

bletham commented 5 years ago

I'm going to consolidate this with #842 which I believe is the same issue.