I think there would be an issue here with the default settings for cross_validation not providing enough training data for the model to do something reasonable.
There are two optional arguments to cross_validation:

initial: The amount of time used for the first cutoff.
period: The amount of time between each cutoff.

So, the first cutoff will use initial, the second will use initial + period, then initial + 2 * period, etc. https://facebook.github.io/prophet/docs/diagnostics.html has an attempt at illustrating these.
In this case the horizon is very short relative to the data frequency, which causes many of the cutoffs to not have very much data. The default value for initial is 3 * horizon, which here is only 3 weeks and so not enough time to have much training data before the cutoff. The default value for period is horizon / 2, so here every 3.5 days. This means there is not much change from one cutoff to the next.
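For concreteness, here's what those defaults work out to in R for this case (a sketch; 'm' stands for an already-fitted model, and the explicit arguments just spell out what the defaults would do):

library(prophet)
# With horizon = 7 days, the defaults are:
#   initial = 3 * horizon = 21 days
#   period  = horizon / 2 = 3.5 days
df.cv <- cross_validation(m, units = 'days', horizon = 7)
# Equivalent to writing the defaults out explicitly:
df.cv <- cross_validation(m, units = 'days', horizon = 7,
                          initial = 21, period = 3.5)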
You want your initial to be large enough to capture the features of the timeseries that are important for forecasting, so that you can expect the forecast on that cutoff to be representative of that on the full dataset. That won't really be possible here, since there is a pretty clear yearly seasonality but less than a year of data, so there won't be any way to evaluate out-of-sample how good the yearly seasonality prediction is.
You can set initial to something bigger like 2 months and period to maybe 1 month. But with so little data there aren't really enough points to both (1) have enough training data for a reasonable model fit, and (2) have enough test data to get a good estimate of prediction error.
Hey Ben, thanks for your feedback. So, as per my understanding, I'm at a dead end.
My only option is to collect as much data as possible, which could mean waiting a year or two in order to get a reliable forecast.
BTW, following your suggestion I executed this command:
df.cv <- cross_validation(m, units = 'days', horizon = 7, initial = 60, period = 30)
Then after a few warnings I got this:
# A tibble: 4 x 6
ds y yhat yhat_lower yhat_upper cutoff
* <dttm> <dbl> <dbl> <dbl> <dbl> <dttm>
1 2017-11-24 00:00:00 931 728 640 820 2017-11-18 00:00:00
2 2017-12-22 00:00:00 876 900 804 998 2017-12-18 00:00:00
3 2018-01-19 00:00:00 1217 1059 978 1147 2018-01-17 00:00:00
4 2018-02-23 00:00:00 1644 1693 1598 1785 2018-02-16 00:00:00
Does this help shed some light?
Actually you probably want a shorter period, since this isn't using all of the data for estimating prediction error: a period of 7 would make a prediction for each data point.
Ben, are you talking about the prediction period, or the cross_validation period?
I'm asking because I'm analyzing weekly data, so I need weekly predictions. I don't think, for instance, that daily predictions would really help in my case, unless they're a trick to improve accuracy.
The cross_validation period. Setting that to 7 would give you more estimates of the one-week forecast error.
Hey Ben, as you suggested, I set the cross_validation period to 7.
cross_validation(m, units = 'days', horizon = 7, initial = 60, period = 7)
Here's what I got. There are more estimates, but in some cases it didn't hit the bull's eye. What do you think?
   | ds         | y    | yhat      | yhat_lower | yhat_upper | cutoff
1  | 2017-11-10 | 966  | 517.0742  | 440.7792   | 596.2305   | 2017-11-03
2  | 2017-11-17 | 851  | 764.1270  | 662.3019   | 855.7232   | 2017-11-10
3  | 2017-11-24 | 931  | 727.5935  | 631.4496   | 816.6770   | 2017-11-17
4  | 2017-12-01 | 836  | 930.6910  | 838.3093   | 1023.5463  | 2017-11-24
5  | 2017-12-08 | 811  | 970.1600  | 876.7312   | 1055.3249  | 2017-12-01
6  | 2017-12-15 | 875  | 889.1682  | 799.7888   | 980.4534   | 2017-12-08
7  | 2017-12-22 | 876  | 899.9459  | 801.2550   | 993.9769   | 2017-12-15
8  | 2017-12-29 | 888  | 839.3363  | 752.7054   | 918.1346   | 2017-12-22
9  | 2018-01-05 | 892  | 834.4776  | 748.0546   | 922.3051   | 2017-12-29
10 | 2018-01-12 | 1053 | 836.1376  | 744.2522   | 918.7555   | 2018-01-05
11 | 2018-01-19 | 1217 | 1059.0269 | 969.8517   | 1149.8541  | 2018-01-12
12 | 2018-01-26 | 1253 | 1431.7984 | 1337.1510  | 1517.1897  | 2018-01-19
13 | 2018-02-02 | 1328 | 1495.4484 | 1408.4046  | 1581.5992  | 2018-01-26
14 | 2018-02-09 | 1576 | 1467.5830 | 1385.6254  | 1551.3577  | 2018-02-02
15 | 2018-02-16 | 1642 | 1675.1145 | 1590.2996  | 1754.6633  | 2018-02-09
16 | 2018-02-23 | 1644 | 1693.1984 | 1600.1909  | 1779.1477  | 2018-02-16
This seems to me to be about the best estimate that we can get with the data available.
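If you want to put a number on the error rather than eyeballing the table, the package can summarize the cross-validation output (a minimal sketch, assuming df.cv holds the result above):

# Summarize error metrics (MSE, RMSE, MAE, MAPE, coverage) by horizon
df.p <- performance_metrics(df.cv)
head(df.p)
# Or plot one metric as a function of the horizon:
plot_cross_validation_metric(df.cv, metric = 'mape')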
Hello guys, I'm quite new to the forecasting world, so I hope you'll forgive any missteps. I decided to try out this really promising library by installing it in R.
Unfortunately my historical dataset is very lean: I'd like to predict weekly data, but I only have history going back to September 2017.
At the moment the forecast doesn't seem too bad. At least looking at the plot, the trend looks credible.
Here's my code:
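In short, something like this (a minimal sketch; the file name, column names, and forecast length are placeholders):

library(prophet)
df <- read.csv('weekly_data.csv')  # columns: ds (week start date), y (value)
m <- prophet(df)
future <- make_future_dataframe(m, periods = 12, freq = 'week')
forecast <- predict(m, future)
plot(m, forecast)
df.cv <- cross_validation(m, horizon = 1, units = 'weeks')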
The only issue I'm facing right now is cross_validation; maybe I've missed a piece of the puzzle there.
I set the horizon to 1 with units = 'weeks', because we're dealing with weekly information. My understanding is that after running cross_validation, if the model is reliable, the 'y' values should be very close to 'yhat'.
In my case, this is true for just a few of the values in the list attached below.
What do you think about that? I would be very glad to get a few hints from your side.
Thanks