Logistic Floor/Cap not being respected?

thomasnield commented 6 years ago

Pardon me in advance for being new to Prophet (and forecasting in general). I am trying to fit a curve to approximately 2 years' worth of datetime data posted here, and to forecast a year's worth of 5-minute intervals.

Whether I use a linear or logistic fit, I get some outrageous values that are larger or smaller than any of the input values (linear even produces negative values). I tried a logistic model with a specified floor and cap, but that didn't stop the deviant numbers either.

library(prophet)
library(dplyr)

df <- read.csv('http://bit.ly/2po0xPJ') %>% mutate(y=log(y))

#floor and cap
lower = quantile(df$y, .05)
upper = quantile(df$y, .95)

df <- df %>% mutate(floor = lower, cap = upper)

# modeling
m <- prophet(df, 
             changepoint.prior.scale=0.01, 
             growth = 'logistic')

future <- make_future_dataframe(m, periods = (24*12*365), 
                                freq = 60 * 5, 
                                include_history = FALSE)  %>% mutate(floor = lower, cap = upper)
# forecast every 5 minutes
forecast <- predict(m,future)
#prophet_plot_components(m, forecast)

write.csv(forecast %>% 
            select(ds, yhat_lower, yhat_upper, yhat) %>% 
            mutate(floor = exp(lower), 
                   cap = exp(upper),
                   ln_yhat_lower = yhat_lower,
                   ln_yhat_upper = yhat_upper, 
                   ln_yhat = yhat,
                   ln_floor = lower, 
                   ln_cap = upper,
                   yhat_lower = exp(yhat_lower), 
                   yhat_upper = exp(yhat_upper), 
                   yhat = exp(yhat)
                   ), 'problem_output.csv')

For instance, of the 105120 forecasted records outputted, 25131 are lower than the specified floor. Can someone please tell me what I'm doing wrong? Or if there is an expected behavior I'm not interpreting correctly?

bletham commented 6 years ago

Thanks for the clean repro. This is a case of bad model fit due to the daily seasonality overfitting. If you plot the forecast with plot(m, forecast) you can see that it looks like this:

[Image: prophet_plot]

You can see that the in-sample fit seems pretty reasonable, but then the forecasted values are bad. You can see what is happening if you look at the components plot, with prophet_plot_components(m, forecast):

[Image: prophet_components1]

You can see that the daily seasonality has enormous swings of +/- 2 in the afternoon, which is what is messing up the forecast. This happens because df$ds contains no data with a time later than 12:59:00. With no data in the afternoon, the daily seasonality is fit poorly there. There's some description of this happening with monthly data in the documentation here: https://facebook.github.io/prophet/docs/non-daily_data.html .

I'm wondering if this is an artifact of a bad conversion to 24-hour time. But if there really are only times up to 12:59:00, then there are three things you can do to resolve this issue with the daily seasonality: 1) Only make predictions for the parts of the day for which you have data, i.e. filter any times >12:59:00 out of the future dataframe that you make predictions on. 2) Remove the daily seasonality: m <- prophet(df, changepoint.prior.scale=0.01, growth = 'logistic', daily.seasonality = FALSE). 3) Use add_seasonality to add a daily seasonality with a stronger prior (smaller prior.scale).

I can imagine this issue coming up more frequently with sub-daily data; we should add better documentation of this behavior.
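To make option 1 concrete, here is a minimal Python sketch using only the standard library. It is not Prophet code: the `history` and `future` lists and the `covered_hours` helper are hypothetical stand-ins for the `ds` column of the training data and of the future dataframe. The idea is simply to keep only future timestamps whose hour of day also occurs in the history.

```python
from datetime import datetime, timedelta

# Hypothetical history: 5-minute observations from 00:00 to 12:55 only.
history = [datetime(2018, 1, 1) + timedelta(minutes=5 * i) for i in range(12 * 13)]

# Hypothetical future grid: a full day of 5-minute timestamps.
future = [datetime(2018, 1, 2) + timedelta(minutes=5 * i) for i in range(24 * 12)]


def covered_hours(timestamps):
    """Return the set of hours of day that appear in the timestamps."""
    return {ts.hour for ts in timestamps}


hours_with_data = covered_hours(history)

# Keep only the future timestamps that fall in hours the model has seen.
filtered_future = [ts for ts in future if ts.hour in hours_with_data]
```

With a real dataframe the same filter would be applied to the `ds` column of the future dataframe before calling predict.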

bletham commented 6 years ago

As a side note, the reason the forecast goes outside the upper and lower bounds with this bad seasonality is that the bounds apply only to the trend; seasonal fluctuations can still push the forecast value outside those bounds.
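A small Python sketch of why this happens, under simplifying assumptions: the trend below is a plain logistic curve squeezed between a hypothetical `FLOOR` and `CAP`, and the seasonality is a single sine wave with an amplitude of 2 (roughly the size of the swings mentioned above). Prophet's actual trend and seasonality models are more elaborate; the point is only that an additive seasonal term is not constrained by the bounds.

```python
import math

FLOOR, CAP = 2.0, 6.0  # hypothetical bounds on the trend


def logistic_trend(t_days, k=0.5, m=0.0):
    """Simplified logistic trend that always stays strictly between FLOOR and CAP."""
    return FLOOR + (CAP - FLOOR) / (1.0 + math.exp(-k * (t_days - m)))


def daily_seasonality(hour_of_day, amplitude=2.0):
    """Hypothetical additive daily seasonality with a large amplitude."""
    return amplitude * math.sin(2 * math.pi * hour_of_day / 24)


hours = range(-120, 120)  # ten days of hourly steps centred on t = 0
trend = [logistic_trend(h / 24) for h in hours]
yhat = [tr + daily_seasonality(h % 24) for tr, h in zip(trend, hours)]

# The trend respects the bounds, but trend + seasonality does not.
trend_in_bounds = all(FLOOR < tr < CAP for tr in trend)
n_below_floor = sum(1 for y in yhat if y < FLOOR)
n_above_cap = sum(1 for y in yhat if y > CAP)
```

Here `trend_in_bounds` is true while both counters are positive, which mirrors the situation in the original report.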

thomasnield commented 6 years ago

Thanks for the detailed analysis, I'll mess around with those parameters and see what happens with my sub-daily case.

And yes, I figured maybe there was some volatility hurting seasonality when sub-daily data is introduced.

bletham commented 6 years ago

Just to be really clear, the data in df$ds should be in 24-hour time, and it looks like this might have used 12-hour time and so been converted incorrectly. If that is the case, then I expect fixing that would resolve the issue without having to play around with parameters.

thomasnield commented 6 years ago

When I re-read this just now, that popped out at me. It must be a 24-hour conversion issue on my end. I'll try it today.

thomasnield commented 6 years ago

@bletham you are exactly right, my bad. My Java process to prepare the data was formatting the LocalDateTime on a 12-hour clock, not 24.
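For anyone hitting the same thing, here is a Python illustration of the mix-up (the original bug was in a Java formatter, so this is an analogue, not the actual code). Formatting with the 12-hour hour field and no AM/PM marker, then reading the result back as 24-hour time, folds every afternoon timestamp onto the morning.

```python
from datetime import datetime

afternoon = datetime(2018, 3, 1, 15, 30, 0)

as_24h = afternoon.strftime("%Y-%m-%d %H:%M:%S")  # "2018-03-01 15:30:00"
as_12h = afternoon.strftime("%Y-%m-%d %I:%M:%S")  # "2018-03-01 03:30:00"

# Reading the 12-hour string back as if it were 24-hour time loses the afternoon.
reparsed = datetime.strptime(as_12h, "%Y-%m-%d %H:%M:%S")

# Every hour of the day collapses into the range 1..12 after the round trip.
collapsed_hours = sorted(
    {
        datetime.strptime(datetime(2018, 3, 1, h).strftime("%I"), "%H").hour
        for h in range(24)
    }
)
```

After the round trip no timestamp has an hour above 12, which matches the symptom that df$ds contained no times later than 12:59:00.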

I corrected the input, ran it through Prophet, and my analysis indicates the forecast is believable now. Thank you.

P.S. Would it be value-adding to raise an error if the forecast fails to meet constraints for whatever reason? Like the floor/cap not being met?

bletham commented 6 years ago

Well, like I said we can get fluctuations outside of lower/upper with the seasonality. But it would be nice to have some check for this type of situation where a big part of the seasonality has no data. At the very least there should be documentation about this, so I'm going to leave this task open for that.

denizn commented 6 years ago

Is it a must to apply log transformation to the y parameter? For what cases would you recommend it?

thomasnield commented 6 years ago

@deniznoah You might want to read this https://math.stackexchange.com/questions/2687851/what-does-ln-accomplish-on-a-regression-input/2688642#2688642

And this if you need a better understanding of Euler's number and natural logarithms. https://www.youtube.com/watch?v=m2MIpDrF7Es

denizn commented 6 years ago

Thank you for the docs, they have been helpful. However, I still do not understand how log(y) would be chosen. It could also be e^x depending on the data.

thomasnield commented 6 years ago

@deniznoah But in Python and R, log() is the natural logarithm: it uses base e, so it is the inverse of e^x rather than an alternative to it.
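A quick check in Python, as a sketch with an arbitrary value: `math.log` with one argument is the natural logarithm, and `math.exp` undoes it. This is why a model fit on log(y) is mapped back to the original scale with exp().

```python
import math

y = 250.0                  # arbitrary positive observation
ln_y = math.log(y)         # natural logarithm (base e)
recovered = math.exp(ln_y) # exp() inverts log()

# log base 10 is a different function and has its own name.
log10_y = math.log10(y)
```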

bletham commented 6 years ago

Documentation now describes this issue in https://facebook.github.io/prophet/docs/non-daily_data.html , so I'll go ahead and close.

thomasnield commented 6 years ago

@bletham thank you kindly

shaidams64 commented 6 years ago

Hi, I'm trying to forecast daily transactions and was wondering if there's a way to put a minimum threshold on seasonalities, or if that makes sense at all. I fit the model on two years of data using logistic growth with floor = 0 and am predicting for the next year. I'm getting a positive trend; however, because of negative seasonalities I'm still getting negative forecasts for transactions:

Thanks a lot for your help in advance! P.S. Is it possible/sensible to put a floor on the confidence bands?

denizn commented 6 years ago

Hi,

@shaidams64 I ended up log-transforming my dataset to avoid negative values. As an alternative, you may try switching the seasonality to multiplicative to reduce the negatives caused by seasonality. This feature is available in 0.3.0.

Also, I noticed that your trend is the reason you go negative, not the yearly seasonality. You may try changing the changepoint parameters, such as n_changepoints and changepoint_prior_scale.

Ex: m = Prophet(changepoint_prior_scale=0.001)

Deniz

bletham commented 6 years ago

@shaidams64 The floor/cap is just for the trend, and as you've observed the seasonality can push the forecast outside that.

Like @deniznoah suggested, you could fit the model to the log of your data (with no floor) and then take the exp() of yhat, which would ensure positivity. It does, however, induce a different trend model, and the exp() can sometimes be a bit sensitive to small changes in the history. So it might work or it might not; you'd just have to see.
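A rough Python sketch of both points, using made-up numbers rather than real forecasts: whatever sign the forecast has on the log scale, exp() maps it to a positive value, and a small shift on the log scale becomes a proportional change on the original scale.

```python
import math

# Hypothetical forecasts on the log scale; negative values are allowed here.
yhat_log = [2.3, 0.4, -0.7, -3.1]

# Back-transforming always yields strictly positive values.
yhat = [math.exp(v) for v in yhat_log]

# Sensitivity: shifting the log-scale forecast by 0.1 scales the
# back-transformed forecast by exp(0.1), i.e. roughly a 10.5% change.
shifted = [math.exp(v + 0.1) for v in yhat_log]
ratios = [s / y for s, y in zip(shifted, yhat)]
```

The same relative change applies at every level, so large forecasts move by large absolute amounts, which is one way the sensitivity shows up.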

Multiplicative seasonality would also drive the seasonality to 0 as the trend goes to 0.
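To illustrate with hypothetical numbers: an additive seasonal dip is a fixed amount and can pull a small trend below zero, while a multiplicative dip is a fraction of the trend and shrinks with it. The sketch assumes the form yhat = trend * (1 + seasonal_fraction) for the multiplicative case and ignores every other model component.

```python
# Hypothetical trend values approaching zero.
trend = [10.0, 5.0, 1.0, 0.1]

ADD_SEASONAL = -2.0   # additive dip, in the units of y (assumed value)
MULT_SEASONAL = -0.2  # multiplicative dip, as a fraction of the trend (assumed value)

additive = [t + ADD_SEASONAL for t in trend]
multiplicative = [t * (1 + MULT_SEASONAL) for t in trend]

# Size of the seasonal effect in each case.
additive_effect = [a - t for a, t in zip(additive, trend)]
multiplicative_effect = [m - t for m, t in zip(multiplicative, trend)]
```

The additive forecasts go negative once the trend drops below 2, while the multiplicative ones stay positive as long as the seasonal fraction stays above -1.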

denizn commented 6 years ago

Isn't applying a log transformation with linear growth similar to using no log transformation with logistic growth and floor = 0 (in terms of trend)?

Thanks!!

bletham commented 6 years ago

I think it'd be similar here because it would give similar saturation, but if you were in a growth stage it would be exponential instead of sigmoidal.
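A simplified Python comparison of the two trend shapes, with arbitrary rates and a cap of 1 (not Prophet's fitted trends): with a negative rate both curves decay toward 0, but with a positive rate the exp of a linear trend keeps growing exponentially while the logistic curve levels off at its cap.

```python
import math


def exp_of_linear(t, k, m=0.0):
    """Trend implied by fitting a linear trend to log(y) and applying exp()."""
    return math.exp(k * t + m)


def logistic(t, k, cap=1.0, m=0.0):
    """Logistic trend with floor 0 and the given cap."""
    return cap / (1.0 + math.exp(-k * (t - m)))


ts = [0, 5, 10, 20]

# Declining case: both approach 0, so the trends look similar.
decay_exp = [exp_of_linear(t, k=-0.5) for t in ts]
decay_logistic = [logistic(t, k=-0.5) for t in ts]

# Growth case: exponential without bound versus sigmoidal saturation.
growth_exp = [exp_of_linear(t, k=0.5) for t in ts]
growth_logistic = [logistic(t, k=0.5) for t in ts]
```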

shaidams64 commented 6 years ago

@bletham @deniznoah thanks a lot for your quick responses and suggestions. I actually tried a log transformation on a different case before, but the resulting trend was not very promising and the confidence bands somehow got really wide. I'll try multiplicative seasonality and see if it fixes the negativity problem. Thanks again!

facebook / prophet

Logistic Floor/Cap not being respected? #470