facebook / prophet

Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.
https://facebook.github.io/prophet
MIT License

Cross validation with other forecasts as regressors #442

Open bletham opened 6 years ago

bletham commented 6 years ago

If we have other time series as regressors that are being forecasted using Prophet, then cross validation should also forecast those regressors. Right now it would use the true future values which would typically underestimate forecast error.
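A minimal sketch of why this matters, using toy data rather than Prophet itself: if the target depends on a regressor that is itself hard to predict, scoring the model with the true future regressor values makes it look far more accurate than an honest forecast that must also predict the regressor (here, naive persistence).

```python
# toy illustration of the leakage; none of this is Prophet's actual code
import random

random.seed(0)
x = [0.0]                                   # regressor: a random walk
for _ in range(200):
    x.append(x[-1] + random.gauss(0, 1))
y = [2.0 * xi + random.gauss(0, 0.1) for xi in x]   # target: y = 2x + noise

cutoff, horizon = 150, range(150, 200)
# "cross validation" with leaked true future regressor values (today's behaviour)
err_leaked = sum(abs(y[t] - 2.0 * x[t]) for t in horizon) / len(horizon)
# honest evaluation: the regressor must be forecast too (persistence from cutoff)
err_honest = sum(abs(y[t] - 2.0 * x[cutoff - 1]) for t in horizon) / len(horizon)

print(err_leaked, err_honest)   # leaked error is much smaller
```

The leaked error only reflects the target's own noise, while the honest error also carries the regressor's forecast uncertainty, which is exactly the gap this issue is about.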

toddbot commented 6 years ago

I was trying to understand this one a bit more. Would an example of another time series as a regressor be the sequence of dates of all Super Bowls going back several years? Then, assuming we don't already have dates for future Super Bowls, should we be forecasting those dates as well and incorporating them into the holidays dataframe?

bletham commented 6 years ago

One might benefit from including a time series as an extra regressor if 1) that time series is likely correlated with the one of interest, and 2) you expect its forecast to be more accurate than the forecast of the series of interest. In that case you can expect including it to increase the accuracy of your main forecast.

A more natural example for the documentation would be some other Wikipedia page that we expect to be correlated with Manning's, and which also has less uncertainty (e.g. more traffic).

Another more natural example might be if we wanted to forecast number of weekly Prophet issues, then we might include as an extra regressor the forecast of the number of weekly downloads - something that is likely correlated, and has less variance.

skannan-maf commented 2 years ago

It need not always be the case that the external regressors must be forecasted.

There is a need to differentiate between two types of regressors.

The first type is one whose future values can always be determined in advance, so the values already in the dataframe can be reused while cross validating, as Prophet does today.

The second type is one whose future values are not known in advance: the future value might be held at the last known value (we use such regressors for what-if analysis), supplied by an external method given the past values, or forecasted from the past values.
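The distinction above could be expressed as a small dispatch helper. This is a hypothetical sketch, not Prophet's API; the function name, the `kind` labels, and `forecast_fn` are all illustrative.

```python
# hypothetical sketch: how future regressor values could be filled in during
# cross validation, depending on the regressor type (names are illustrative)
def extend_regressor(history, horizon, kind, forecast_fn=None):
    """Return `horizon` future values for a regressor.

    kind='known'    -> caller supplies true future values elsewhere (as today)
    kind='carry'    -> repeat the last known value (what-if analysis)
    kind='forecast' -> delegate to an external forecasting function
    """
    if kind == "carry":
        return [history[-1]] * horizon
    if kind == "forecast":
        return forecast_fn(history, horizon)
    raise ValueError("future values must be provided for 'known' regressors")

print(extend_regressor([3.0, 4.0, 5.0], 3, "carry"))  # [5.0, 5.0, 5.0]
```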

I think this issue needs to be prioritized, as it leaks future values for some models and people might be overestimating the accuracy of their predictions.

tcuongd commented 2 years ago

Bumping this as it'll be one of the next priorities for me :) I agree with @skannan-maf -- examples of the first case would be things like price or a marketing spend budget, which we would know in advance for the forecast horizon; examples of the second case would be anything else that we think is predictive of the target and easier to forecast than the target itself.

# adds regressor where we assume future values are always known
m.add_regressor(
    'digital_ad_spend',
    mode='additive',
    standardize=True,
)

One way to integrate the second case into the current API would be to add an additional argument, model, to the add_regressor() method. We could require the end user to fit the external regressor model before passing it via the model argument (a tad clunky but less work in the backend), or add an additional argument to the .fit() call that takes a dictionary of dataframes, e.g. regressor_dfs, with the keys being the names of the regressors.

# adds regressor where future values are uncertain

# option 1, fit the regressor model first
weather_model = Prophet().fit(weather_df)
m.add_regressor(
    'avg_temperature',
    mode='multiplicative',
    standardize=True,
    model=weather_model,
)
m.fit(df)

# option 2, define all models upfront, then fit once
weather_model = Prophet()
m.add_regressor(
    'avg_temperature',
    mode='multiplicative',
    standardize=True,
    model=weather_model,
)
m.fit(df, regressor_dfs={'avg_temperature': weather_df})

Next we'd need to tweak the cross_validation function to feed the yhats from these models into the test periods instead of the actual values. The same should happen when predict() is called for future time values (a point of confusion raised in the past: people don't realize they have to provide regressor values for the future period).
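A rough sketch of what that cross_validation tweak could look like. This is not Prophet's actual code; the function name and the `regressor_models` mapping are assumptions, and the only Prophet-like behaviour relied on is a `.predict(future_df)` that returns a `yhat` column.

```python
# hypothetical sketch: before scoring a test fold, overwrite the observed
# future values of uncertain regressors with each regressor model's own
# forecast, so the error estimate no longer benefits from leaked future data
import pandas as pd

def fill_regressor_forecasts(test_df, regressor_models):
    """Replace regressor columns of `test_df` with model forecasts.

    `regressor_models` maps column name -> fitted model exposing a
    Prophet-style .predict(future_df) returning a 'yhat' column.
    """
    filled = test_df.copy()
    for name, model in regressor_models.items():
        future = filled[["ds"]]
        filled[name] = model.predict(future)["yhat"].to_numpy()
    return filled
```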

Finally, we could also incorporate uncertainty from the regressor model into the main model. In the sample_model() function, we have s_a and s_m as static inputs (i.e. the X values that get multiplied by the fitted beta coefficients), but we could also sample this from the posterior predictive distribution of the regressor model. This is probably important in principle (i.e. to reflect the full uncertainty forecasts that depend on unknown regressors), but I don't think it's a highly requested feature so we can leave this one in the backlog if it's too tricky.