facebook / prophet

Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.
https://facebook.github.io/prophet
MIT License

Question about cross_validation behaviour #460

Closed ziogianni closed 6 years ago

ziogianni commented 6 years ago

Hello guys, I'm quite new to the forecasting world, so I hope you'll forgive any missteps. I decided to try out this really promising library by installing it in R.

Unfortunately my historical dataset is very lean: I'd like to predict weekly data, but I only have history going back to September 2017.

At the moment the forecast doesn't seem bad; at least judging from the plot, the trend looks credible.

[rplot: forecast plot]

Here's my code:

```r
m <- prophet(growth = "linear", changepoints = NULL, n.changepoints = 25,
             yearly.seasonality = "auto", weekly.seasonality = "auto",
             daily.seasonality = "auto", holidays = NULL,
             seasonality.prior.scale = 10, holidays.prior.scale = 10,
             changepoint.prior.scale = 0.05, mcmc.samples = 0,
             interval.width = 0.8, uncertainty.samples = 1000, fit = TRUE)

m <- add_seasonality(m, name = 'weekly', period = 7.139, fourier.order = 5)
m <- fit.prophet(m, df)
future <- make_future_dataframe(m, periods = 8, freq = 'week')
tail(future)
forecast <- predict(m, future)
```

The only issue I'm facing right now is cross_validation, maybe I've missed a piece of the puzzle there.

```r
df.cv <- cross_validation(m, horizon = 1, units = 'weeks')
```

I set the horizon to 1 with units = 'weeks', since we're dealing with weekly data. My understanding is that after running cross_validation, if the model is reliable, the 'y' values should be very close to 'yhat'.

In my case this holds for only a few of the rows in the table below.

|    | ds         | y    | yhat      | yhat_lower | yhat_upper | cutoff              |
|---:|------------|-----:|----------:|-----------:|-----------:|---------------------|
| 1  | 2017-09-15 | 1609 | 1556.9537 | 1556.9537  | 1556.9537  | 2017-09-08 00:00:00 |
| 2  | 2017-09-15 | 1609 | 1556.9537 | 1556.9537  | 1556.9537  | 2017-09-11 12:00:00 |
| 3  | 2017-09-22 | 1567 | 1733.0607 | 1725.2161  | 1738.9306  | 2017-09-15 00:00:00 |
| 4  | 2017-09-22 | 1567 | 1733.0607 | 1723.4771  | 1740.2495  | 2017-09-18 12:00:00 |
| 5  | 2017-09-29 | 1738 | 1458.4614 | 1434.8213  | 1484.7987  | 2017-09-22 00:00:00 |
| 6  | 2017-09-29 | 1738 | 1458.4614 | 1434.2319  | 1484.8990  | 2017-09-25 12:00:00 |
| 7  | 2017-10-06 | 1853 | 1736.7123 | 1663.6508  | 1813.1654  | 2017-09-29 00:00:00 |
| 8  | 2017-10-06 | 1853 | 1736.7123 | 1656.0339  | 1817.3020  | 2017-10-02 12:00:00 |
| 9  | 2017-10-13 | 1757 | 2139.3850 | 2090.8411  | 2185.9545  | 2017-10-06 00:00:00 |
| 10 | 2017-10-13 | 1757 | 2139.3850 | 2092.3138  | 2190.0010  | 2017-10-09 12:00:00 |
| 11 | 2017-10-20 | 1739 | 1799.4963 | 1729.1459  | 1878.6503  | 2017-10-13 00:00:00 |
| 12 | 2017-10-20 | 1739 | 1799.4963 | 1724.7637  | 1879.3809  | 2017-10-16 12:00:00 |
| 13 | 2017-10-27 | 1488 | 1757.0873 | 1675.6114  | 1835.4215  | 2017-10-20 00:00:00 |
| 14 | 2017-10-27 | 1488 | 1757.0873 | 1677.2412  | 1833.0888  | 2017-10-23 12:00:00 |
| 15 | 2017-11-03 | 1088 | 1192.1985 | 1145.0227  | 1240.9929  | 2017-10-27 00:00:00 |
| 16 | 2017-11-03 | 1088 | 1192.1985 | 1142.7818  | 1243.0260  | 2017-10-30 12:00:00 |
| 17 | 2017-11-10 | 966  | 517.0742  | 440.1233   | 594.6470   | 2017-11-03 00:00:00 |
| 18 | 2017-11-10 | 966  | 517.0742  | 438.7388   | 600.0148   | 2017-11-06 12:00:00 |
| 19 | 2017-11-17 | 851  | 764.1270  | 674.3480   | 860.2381   | 2017-11-10 00:00:00 |
| 20 | 2017-11-17 | 851  | 764.1270  | 674.7672   | 855.8064   | 2017-11-13 12:00:00 |
| 21 | 2017-11-24 | 931  | 727.5935  | 631.8289   | 813.3101   | 2017-11-17 00:00:00 |
| 22 | 2017-11-24 | 931  | 727.5935  | 640.8973   | 811.0621   | 2017-11-20 12:00:00 |
| 23 | 2017-12-01 | 836  | 930.6910  | 835.4080   | 1024.3740  | 2017-11-24 00:00:00 |
| 24 | 2017-12-01 | 836  | 930.6910  | 831.9682   | 1021.9264  | 2017-11-27 12:00:00 |
| 25 | 2017-12-08 | 811  | 970.1600  | 884.0843   | 1071.5890  | 2017-12-01 00:00:00 |
| 26 | 2017-12-08 | 811  | 970.1600  | 881.3290   | 1062.8259  | 2017-12-04 12:00:00 |
| 27 | 2017-12-15 | 875  | 889.1682  | 797.6200   | 977.4974   | 2017-12-08 00:00:00 |
| 28 | 2018-01-05 | 892  | 852.8955  | 754.5899   | 955.0654   | 2017-12-29 00:00:00 |
| 29 | 2018-01-05 | 892  | 852.8955  | 754.5030   | 947.5781   | 2018-01-01 12:00:00 |
| 30 | 2018-01-12 | 1053 | 842.6385  | 760.2330   | 929.4788   | 2018-01-05 00:00:00 |
| 31 | 2018-01-12 | 1053 | 842.6385  | 755.1882   | 933.4865   | 2018-01-08 12:00:00 |
| 32 | 2018-01-19 | 1217 | 1016.7662 | 919.4204   | 1109.5356  | 2018-01-12 00:00:00 |
| 33 | 2018-01-19 | 1217 | 1016.7662 | 918.6774   | 1110.8132  | 2018-01-15 12:00:00 |
| 34 | 2018-01-26 | 1253 | 1491.8957 | 1391.2217  | 1595.6083  | 2018-01-19 00:00:00 |
| 35 | 2018-01-26 | 1253 | 1491.8957 | 1394.0870  | 1588.5701  | 2018-01-22 12:00:00 |
| 36 | 2018-02-02 | 1328 | 1410.2508 | 1323.9255  | 1494.1299  | 2018-01-26 00:00:00 |
| 37 | 2018-02-02 | 1328 | 1410.2508 | 1326.7867  | 1497.1228  | 2018-01-29 12:00:00 |
| 38 | 2018-02-09 | 1576 | 1471.8970 | 1390.8751  | 1559.6157  | 2018-02-02 00:00:00 |
| 39 | 2018-02-09 | 1576 | 1471.8970 | 1382.8465  | 1558.1751  | 2018-02-05 12:00:00 |
| 40 | 2018-02-16 | 1631 | 1678.9730 | 1600.0055  | 1768.6881  | 2018-02-09 00:00:00 |

What do you think about that? I would be very glad to get a few hints from your side.

Thanks

bletham commented 6 years ago

I think the issue here is that the default settings for cross_validation don't provide enough training data for the model to do something reasonable.

There are two optional arguments to cross_validation, initial and period:

https://facebook.github.io/prophet/docs/diagnostics.html has an attempt at illustrating these.

In this case the horizon is very short relative to the data frequency, so many of the cutoffs don't have much data before them. The default value for initial is 3 * horizon, which here is only 3 weeks and so not enough time to accumulate much training data before the cutoff. The default value for period is horizon / 2, so here a new cutoff every 3.5 days. This means there is not much change from one cutoff to the next.
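To make the windowing concrete, here is a sketch of what those defaults amount to for this dataset; the explicit initial and period values are my reading of the defaults described above, not something stated in the original call:

```r
# Sketch: spelling out the defaults for horizon = 1 week.
# initial defaults to 3 * horizon (only 3 weeks of training data before
# the first cutoff), and period defaults to horizon / 2 (a new cutoff
# every 3.5 days), so the original call is roughly equivalent to:
df.cv <- cross_validation(m, horizon = 1, units = 'weeks',
                          initial = 3, period = 0.5)
```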

You want your initial to be large enough to capture the features of the timeseries that are important for forecasting, so that you can expect the forecast on that cutoff to be representative of that on the full dataset. That won't really be possible here, since there is a pretty clear yearly seasonality but less than a year of data, so there won't be any way to evaluate out-of-sample how good the yearly seasonality prediction is.

You can set initial to something bigger like 2 months, and period to maybe 1 month. But with so little data there aren't really enough points to both (1) have enough training data for a reasonable model fit, and (2) have enough test data to get a good estimate of prediction error.

ziogianni commented 6 years ago

Hey Ben, thanks for your feedback. So, as I understand it, I'm at a dead end.

My only option is to collect as much data as possible, which could mean waiting a year or two to get a reliable forecast.

BTW, following your suggestion I ran this command:

```r
df.cv <- cross_validation(m, units = 'days', horizon = 7, initial = 60, period = 30)
```

Then after a few warnings I got this:

```
# A tibble: 4 x 6
  ds                      y  yhat yhat_lower yhat_upper cutoff             
* <dttm>              <dbl> <dbl>      <dbl>      <dbl> <dttm>             
1 2017-11-24 00:00:00   931   728        640        820 2017-11-18 00:00:00
2 2017-12-22 00:00:00   876   900        804        998 2017-12-18 00:00:00
3 2018-01-19 00:00:00  1217  1059        978       1147 2018-01-17 00:00:00
4 2018-02-23 00:00:00  1644  1693       1598       1785 2018-02-16 00:00:00
```

Does this shed some light?

bletham commented 6 years ago

Actually, you probably want a shorter period, since this isn't using all of the data for estimating prediction error; a period of 7 would make a prediction for each data point.

ziogianni commented 6 years ago

Ben are you talking about the prediction period? Or do you refer to the cross_validation period?

I'm asking because, since I'm analyzing weekly data, I need weekly predictions. I don't think that, for instance, daily predictions would really help in my case, unless it's a trick to improve accuracy.

bletham commented 6 years ago

The cross_validation period. Setting that to 7 would give you more estimates of the one-week forecast error.

ziogianni commented 6 years ago

Hey Ben, as you suggested, I set the cross_validation period to 7:

```r
cross_validation(m, units = 'days', horizon = 7, initial = 60, period = 7)
```

Here's what I got. There are more estimates now, but in some cases it doesn't hit the bull's eye. What do you think?

|    | ds         | y    | yhat      | yhat_lower | yhat_upper | cutoff     |
|---:|------------|-----:|----------:|-----------:|-----------:|------------|
| 1  | 2017-11-10 | 966  | 517.0742  | 440.7792   | 596.2305   | 2017-11-03 |
| 2  | 2017-11-17 | 851  | 764.1270  | 662.3019   | 855.7232   | 2017-11-10 |
| 3  | 2017-11-24 | 931  | 727.5935  | 631.4496   | 816.6770   | 2017-11-17 |
| 4  | 2017-12-01 | 836  | 930.6910  | 838.3093   | 1023.5463  | 2017-11-24 |
| 5  | 2017-12-08 | 811  | 970.1600  | 876.7312   | 1055.3249  | 2017-12-01 |
| 6  | 2017-12-15 | 875  | 889.1682  | 799.7888   | 980.4534   | 2017-12-08 |
| 7  | 2017-12-22 | 876  | 899.9459  | 801.2550   | 993.9769   | 2017-12-15 |
| 8  | 2017-12-29 | 888  | 839.3363  | 752.7054   | 918.1346   | 2017-12-22 |
| 9  | 2018-01-05 | 892  | 834.4776  | 748.0546   | 922.3051   | 2017-12-29 |
| 10 | 2018-01-12 | 1053 | 836.1376  | 744.2522   | 918.7555   | 2018-01-05 |
| 11 | 2018-01-19 | 1217 | 1059.0269 | 969.8517   | 1149.8541  | 2018-01-12 |
| 12 | 2018-01-26 | 1253 | 1431.7984 | 1337.1510  | 1517.1897  | 2018-01-19 |
| 13 | 2018-02-02 | 1328 | 1495.4484 | 1408.4046  | 1581.5992  | 2018-01-26 |
| 14 | 2018-02-09 | 1576 | 1467.5830 | 1385.6254  | 1551.3577  | 2018-02-02 |
| 15 | 2018-02-16 | 1642 | 1675.1145 | 1590.2996  | 1754.6633  | 2018-02-09 |
| 16 | 2018-02-23 | 1644 | 1693.1984 | 1600.1909  | 1779.1477  | 2018-02-16 |
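To put a number on how far off the forecasts are, the table above can be summarized in base R. This is just a sketch of simple accuracy summaries computed by hand from the cross-validation output (not an official prophet helper), assuming df.cv holds the data frame shown:

```r
# Assuming df.cv is the cross-validation output above, with columns
# y, yhat, yhat_lower and yhat_upper:
mae      <- mean(abs(df.cv$y - df.cv$yhat))            # mean absolute error
mape     <- mean(abs(df.cv$y - df.cv$yhat) / df.cv$y)  # mean absolute % error
coverage <- mean(df.cv$y >= df.cv$yhat_lower &
                 df.cv$y <= df.cv$yhat_upper)          # 80% interval coverage
mae; mape; coverage
```

Comparing coverage against the configured interval.width (0.8 here) gives a rough check of whether the uncertainty intervals are calibrated.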

bletham commented 6 years ago

This seems to me to be about the best estimate that we can get with the data available.