cmu-delphi / epipredict

Tools for building predictive models in epidemiology.
https://cmu-delphi.github.io/epipredict/

Add `forecast()` method and "deprecate" `get_test_data()` #293

Open dajmcdon opened 8 months ago

dajmcdon commented 8 months ago

This is a proposal for an addition to the preprocessing -> fitting -> predicting procedure currently used in the package.

Current behaviour:

tib <- tibble(
  time_value = c(1:5, 1:5), geo_value = rep(letters[1:2], each = 5),
  x = rnorm(10), y = rnorm(10)
) |> as_epi_df()
r <- epi_recipe(tib) |> # stores a data template
  step_*()
  # ... more steps
f <- frosting() |>
  layer_*()
  # ... more layers
ewf <- epi_workflow(r, parsnip::fit_engine(), f) # up to now, no processing or estimation has occurred
ewf <- fit(ewf, tib) # this runs the preprocessing and model fitting on tib, first checking that tib
    # matches the template in `r`; it updates the workflow, returning a new "fitted" workflow
td <- get_test_data(r, tib) # grabs the necessary rows of tib so that we can process it based on r and
    # then produce a single prediction at the latest time value for all keys (geo_value + any others
    # in the metadata). Depending on the steps in `r`, this is likely tib[c(5, 10), ] plus some additional
    # preceding time values
p <- predict(ewf, new_data = td) # produces a forecast because we used the "tail" of the training data 

An alternative, non-forecast use, as currently implemented. Not really used, but it should work:

tib2 <- tib[0, ]
r <- epi_recipe(tib2) |> # stores a data template, no difference in behaviour from the above
  step_*()
  # ... more steps
f <- frosting() |>
  layer_*()
  # ... more layers
ewf <- epi_workflow(r, parsnip::fit_engine(), f) # up to now, no processing or estimation has occurred
ewf <- fit(ewf, tib) # this runs the preprocessing and model fitting on tib, first checking that tib
    # matches the template in `r`; it updates the workflow, returning a new "fitted" workflow
tda <- tib[c(1, 6), ] # first time value in each geo
p1 <- predict(ewf, new_data = tda) # this will work, assuming that we can create the desired 
    # leads/lags (as specified in r) with tda

Proposed adjustment:

r <- epi_recipe(tib) |> # stores all the data, not just the template
  step_*()
  # ... more steps
f <- frosting() |>
  layer_*()
  # ... more layers
ewf <- epi_workflow(r, parsnip::fit_engine(), f) # up to now, no processing or estimation has occurred
p <- forecast(ewf) # automatically treat the stored template as training data, process it, 
    # fit the workflow, then predict only the future. Note that horizons are specified in the recipe.
    # no "test-time" data is needed

Side issue: inheritance from {tidymodels} means that we store template information about the original data frame in the epi_recipe S3 object. {recipes} stores the entire data. An epi_recipe only stores a 0-row tibble with the column names. To get this proposal to work, we would need to change to match the {recipes} behaviour and store the original data. This could potentially be large (the reason I avoided doing this before), though note that it is the original data, not the processed data. As currently implemented, certain test-time preprocessing operations that could benefit from access to the training data (smoothing, rolling averages, etc.) can potentially be buggy because they are applied only to the test-time data (td).
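To illustrate the current difference (assuming the stored data lives in the recipe's `$template` slot, as in {recipes}):

rec  <- recipes::recipe(y ~ x + geo_value + time_value, data = tib)  # {recipes}: keeps the data it was given
erec <- epi_recipe(tib)                                              # {epipredict}: keeps only a 0-row template
nrow(rec$template)   # the full training data (10 rows in the example above)
nrow(erec$template)  # 0 (column names and types only)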

Storing the training data would help here. However, {tidymodels} actually doesn't want to merge train-time and test-time data because it tries to emphasize (pedagogically?) that operations performed on train-time data should save the necessary summary statistics to be reused on test-time data. For example, centering and scaling a predictor should save the mean and sd at train time, and use those to adjust the test-time data (rather than computing the mean and sd of the test data and using those). As with most things, time series makes this complicated, and forecasts can potentially depend on all available data (rather than just "new" data). It's likely worth thinking carefully about this problem (though perhaps that's exactly what we're doing here).
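For concreteness, this is the standard {recipes} behaviour being referenced (plain {recipes}, not epipredict-specific): statistics are estimated at prep() time and reused at bake() time.

library(recipes)
library(tibble)

train <- tibble(x = 1:5, y = rnorm(5))
test  <- tibble(x = c(100, 200), y = rnorm(2))

rec <- recipe(y ~ x, data = train) |>
  step_normalize(x)                     # centering and scaling

prepped <- prep(rec, training = train)  # mean and sd of x are estimated here, from train only
bake(prepped, new_data = test)          # ...and reused here; the test data's own mean/sd are never computed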

forecast() would only need the workflow as an argument, though we could potentially allow an optional additional_data argument. This would be added to the train-time data, with the forecast then produced after the end of the additional_data.
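In usage terms (hypothetical; new_rows is just a placeholder for additional observations):

p  <- forecast(ewf)                              # forecast from the stored training data only
p2 <- forecast(ewf, additional_data = new_rows)  # append new_rows, then forecast after its end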

brookslogan commented 8 months ago

Re side issue: why not

Cons:

> operations performed on train-time data should save the necessary summary statistics to be reused on test-time data

We also saw, in some hardhat/tidymodels reading, another place where they say not to look at test data near train time (I forget how they worded it). I think the caution we should take here is maybe not to save data at epi_workflow() creation time. We handle not cheating via epix_slide().

Random thoughts:

brookslogan commented 7 months ago
Potentially related feature/application to think about below: combining models into a joint model via sampling. We're considering trying out separate backfill-projection & missingness-imputation and forecasting steps. One way to get reasonable uncertainty propagation in this method is to sample many trajectories (FinalizedPast | Provisional) from the backfill projection model, then take each one of those partial trajectories and tack on a sample from FinalizedFuture | FinalizedPast, then finally marginalize the samples into samples for each target (TargetFn(FinalizedFuture, FinalizedPast)) and kernel smooth to get a distribution for each target. This isn't the only way you might combine these two types of models, but it seems like one of the simplest ways. It'd be great if there was something within epipredict to do this sort of composition for you, but if there's not, then having a `fit` + `predict` w/ additional data would be handy; you could `fit` the FinalizedFuture | FinalizedPast model once and then, once per draw from your other model for FinalizedPast | Provisional, `predict` on (FinalizedPastModelDraw, FurtherPastThatWasAlreadyFinalizedOrClose).

Notes:

- This seems to favor a separate `fit` and `predict` so you don't have to repeatedly run quantile regressions.

As for storing processed vs. unprocessed data vs. no data:

- But you don't want `fit` to record the wrong latency info if you're including missingness imputation.
- Would we want to re-prep, re-bake, or neither when predicting on each of the draws?
  - Not re-`prep`; that could change scaling and invalidate the `model_spec`'s fit.
  - Re-`bake`: sounds most robust (e.g., calculate 7dav covariates for the forecaster when the backfill model outputs 1d), but also costly (calculating 7davs repeatedly on the entire past + projection) without some step-specific special logic.
  - Neither: puts more weight on the user to either:
    - Make the backfill model target columns from baked data going into the forecasting model.
    - Make a distinction between truth data and backfill-projected data in the model; backfill-projected data would need to be explicitly treated as a separate column, and if e.g. 7davs are needed, dealt with manually or via some special steps. This would probably look a bit more complicated, but better highlight a bit of a model mismatch (when fitting, we're using real FinalizedPast, and when predicting, we're subbing in draws from an _imperfect_ model for FinalizedPast | ProvisionalData, not the "true" conditional distribution), and any 7daving done here would be more efficient (though more error-prone) since it may naturally avoid repeating 7daving on windows that don't include backfill-imputed data.
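A rough sketch of that composition, where every name (backfill_model, sample_finalized_past, target_fn, n_draws, etc.) is a hypothetical placeholder rather than an epipredict function:

# Pseudocode sketch only; none of these helpers exist in epipredict.
fcast_wf <- fit(fcast_wf, finalized_history)  # fit the FinalizedFuture | FinalizedPast model once

target_samples <- lapply(seq_len(n_draws), function(i) {
  past_draw <- sample_finalized_past(backfill_model, provisional_data)   # FinalizedPast | Provisional
  new_data  <- dplyr::bind_rows(already_finalized_past, past_draw)
  future_draw <- predict(fcast_wf, new_data = new_data)                  # one FinalizedFuture draw
  target_fn(future_draw, past_draw)                                      # TargetFn(FinalizedFuture, FinalizedPast)
})
# then marginalize target_samples and kernel smooth into a distribution per target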
dajmcdon commented 7 months ago

Looking into the details:

  1. If the workflow has been fit, then I can access the prepped/baked objects; these are already stored, so no reprocessing would be needed (see the extractor sketch after this list). I'm not immediately clear on why I stored them. I think I imagined they could be necessary for some types of post-processing, but I don't recall if they're used. So they are potentially useful here, though we might want to investigate workarounds. Given this, implementing the forecast method may be exceedingly simple.
  2. If the workflow has NOT been fit, one could fit it if an engine is in the workflow, and proceed. So again, no reprocessing. Just 1-time processing.
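For reference, the already-stored objects mentioned in point 1 can be pulled out with the standard {workflows} extractors (shown on a fitted workflow ewf):

est_rec <- workflows::extract_recipe(ewf)  # the prepped (estimated) recipe from the fitted workflow
mold    <- workflows::extract_mold(ewf)    # the preprocessed training data used at fit time
head(mold$predictors)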

If additional data were passed, I suspect you would want to concatenate and then re-prep/bake. We could probably add flags in various places that we inspect to determine whether this is necessary, but that might get us back to the "need to adjust the new-data handling specifically for every possible step" problem that we face now with get_test_data(). We could likely back-burner this for now and move such functionality to a separate issue.

Aside: if we are super concerned with space (not clear we are at the moment, seems a "nice to have" rather than a "mandatory for adding new features"), we may want to investigate ways to use {butcher} for existing workflows.

brookslogan commented 7 months ago

If we throw out space considerations and there are no recalculation issues, there are still interface issues regarding storing/using the template data.

After realizing that Option 2 and Option 3's main con is fairly easily detectable after the fact (I think you get predictions for the wrong times, not cheating predictions for the right times), the only no-go seems to be Option 4. This would change if predict took new_data instead of additional_data; Option 3 would also look bad then.

dshemetov commented 7 months ago

I'm still catching up to the discussion here, so apologies if my comments are missing something obvious. Something that's not quite clear to me is how storing all the data with the original recipe helps us remove the difficult logic that's in get_test_data() at the moment. I think I'm missing something, because it seems to me that the new forecast() function will need to do much of the same "check step flag and do window length arithmetic" work, no?

dajmcdon commented 7 months ago

On the options for interface:

Meta:

The fit(object, data, ...) method for {workflows} requires data, so ours currently does as well (we inherit from their class). The predict(object, new_data, ...) method for {workflows} has a required new_data argument, so we do as well; this is the "test-time data". So we can't really alter those. In {generics}, forecast(object, ...) would allow us to add an additional_data argument, though this conflicts with the signature in {fable}, which is either forecast(object, new_data, ...) or forecast(object, new_data = NULL). But we don't do anything else from {fable}, so maybe we don't care.

To me, the main interface question is "when do you call forecast()?" I really see two options:

  1. You can forecast() with an unfit workflow. This requires (a) that the workflow has a fit module present; and (b) that there is training data somewhere. (a) is easy to validate. (b) could be solved by either storing the template data in the recipe or by requiring the full training data (plus any additional data) as forecast(object, data). I don't like requiring data at forecast time. So I'd prefer storing the data somewhere in this case, to at least maintain something similar to the {fable} version: forecast(object, additional_data = NULL). Internal implementation: forecast() would call fit() first, then do data munging to call predict().
  2. You cannot forecast() an unfit workflow. Now the fitting has been done, but we have to store either the un-processed data to reprocess, or the processed data so that we can get the correct stuff to call predict(). Implementation wise, it doesn't matter which. We still use forecast(object, additional_data = NULL).

Either of these is maybe closest to Option 4? I think my option 2 is a bit more flexible, because you can always train with any data you want and then forecast from it. But my option 1 is more user-friendly because you only pass in training data once. The option 2 flexibility is still "available" to you, because that's how things currently work: you have to call fit(), then create your test data, then call predict().

@dshemetov This is potentially correct, but hidden from the user. It's also important. If there is stored data, then when calling forecast() you would actually predict all time values, then slim down to only those that are in the future. The current implementation requires reprocessing at predict time, so it's computationally better to predict only with the data you need rather than with all time values. If the template data is stored and the workflow is unfit, then we process once, fit, then forecast. If the processed data (prepped/baked) is stored (as it is currently...), then if additional_data = NULL, we reuse the processed data and forecast. If additional_data is not NULL, we would HAVE to reprocess for some recipe steps (anything that operates globally). This is easy if the template is stored, but difficult (perhaps impossible?) if the processed data is stored.

So, now having thought through @dshemetov question and writing it down, the choice of which to store and when impacts the available options. I think that forecasting a fit workflow should not allow additional_data. I suggest we store the template data (allowing forecast(object, additional_data = new_epi_df)) until it gets fit. Then, we store the processed data, throw out the template, and require forecast(object, additional_data = NULL). This allows forecast to be independent of whether the workflow has been fit, but always requires storing some data.

To deal with the fact that (by default) we would be storing data in the workflow, we could also include some help. First, we add an argument to epi_recipe() that allows the user to turn off storing the template. (Aside: outside of {epipredict}, {tidymodels} users are, unbeknownst to them, storing the template data. The way to avoid this is by creating a recipe from only the first row of your training data; this is not documented that I know of, and would require an expert user.) Additionally, we add a function (or functions) called axe_template() or similar that removes the template from the workflow object, and possibly another that removes the processed data from a fit workflow.
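A rough sketch of what such a helper could look like (hypothetical; it follows the axe_*() naming convention from {butcher} mentioned earlier and assumes the template lives in the recipe's `$template` slot):

# Hypothetical helper; not part of epipredict or {butcher}.
axe_template <- function(x, ...) UseMethod("axe_template")

axe_template.recipe <- function(x, ...) {
  x$template <- x$template[0, ]  # keep column names/types, drop the stored rows
  x
}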

brookslogan commented 7 months ago

So above, I think I got mixed up and felt part of any change here would also include:

That's why I kept talking about predict() in the options above; sorry for the confusion. But I don't think it'd be that bad to add a default for new_data (or even an optional additional_data argument) to predict.epi_workflow(), so that one could do

ewf <- fit(ewf, tib)
p <- predict(ewf)

or, with the changes I was imagining above, to do

ewf <- fit(ewf, tib)
p <- predict(ewf, new_data = tib)

basically just turning predict() into forecast()-on-fit-workflows.

But for the forecasters I can think of, we'd normally want to just forecast() directly on unfit workflows (@dajmcdon's Option 1), unless we needed to debug fit coefficients. But I do think there are some other design questions (@dajmcdon's last 3 paragraphs). For example, even assuming we don't allow additional_data:

@dajmcdon I think you're proposing "Maybe in template, maybe in forecast()" + things in line with this for fit() and forecast()-on-fit-workflows. Is that right?

brookslogan commented 3 months ago

One more thought here. While I still think we can probably get away without time-window logic for version-unaware forecasters, it would likely be useful, for efficiency, when preparing version-aware training sets, if we really buy into recipes for preprocessing. For version-aware forecasting, we very commonly want to line up test-instance data with "analogous" versioned training data, particularly lags, or lags of 7-day averages. We could

In all of these cases, there's the question of how to do things efficiently. In the first two cases, we can use something like epix_slide(before = max_lag_relative_to_forecast_date + max_before_value_used_for_averaging) to try to be more efficient. But for generic epi_recipes, we don't know what window to ask for. If there were a way to ask for a time window needed to bake(?) an epi_recipe, with Inf as a fallback if some unknown steps are involved, then we might be able to do things more efficiently.
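One possible shape for that query (a hypothetical generic, not an existing API; it assumes, for illustration, that step_epi_lag keeps its lag values in a `lag` field):

# Hypothetical sketch: ask each step how much history it needs to bake,
# falling back to Inf for steps we don't know how to introspect.
min_history_needed <- function(step) UseMethod("min_history_needed")
min_history_needed.default      <- function(step) Inf
min_history_needed.step_epi_lag <- function(step) max(step$lag)

recipe_window <- function(r) {
  # the time window a generic epi_recipe needs before the forecast date;
  # Inf means "ask for everything" as a fallback
  max(c(0, vapply(r$steps, min_history_needed, numeric(1))))
}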

There's yet another context where we might want to know the time window needed for a computation, and that's archive -> archive slides (like we want here). Though I think Dmitry/David pointed out we could try something time-window-unaware first and see if it actually is slow. Plus this could be even more complicated, because it's probably about prep + bake, not just bake.