cmu-delphi / epipredict

Tools for building predictive models in epidemiology.
https://cmu-delphi.github.io/epipredict/

Add `forecast()` method and "deprecate" `get_test_data()` #293

Open dajmcdon opened 8 months ago

dajmcdon commented 8 months ago

This is a proposal for an addition to the preprocessing -> fitting -> predicting procedure currently used in the package.

Current behaviour:

tib <- tibble(
  time_value = c(1:5, 1:5), geo_value = rep(letters[1:2], each = 5),
  x = rnorm(10), y = rnorm(10)
) |> as_epi_df()
r <- epi_recipe(tib) |> # stores a data template
  step_*()
  # ... more steps
f <- frosting() |>
  layer_*()
  # ... more layers
ewf <- epi_workflow(r, parsnip::fit_engine(), f) # up to now, no processing or estimation has occurred
ewf <- fit(ewf, tib) # this runs the preprocessing and model fitting on tib, first checking that tib
    # matches the template in `r`; it updates the workflow, returning a new "fitted" workflow
td <- get_test_data(r, tib) # grabs the necessary rows of tib so that we can process it based on r and
    # then produce a single prediction at the latest time value for all keys (geo_value + any others
    # in the metadata). Depending on the steps in `r`, this is likely tib[c(5, 10), ] plus some additional
    # preceding time values
p <- predict(ewf, new_data = td) # produces a forecast because we used the "tail" of the training data 

An alternative, non-forecast use, as currently implemented. Not really used, but it should work:

tib2 <- tib[0, ]
r <- epi_recipe(tib2) |> # stores a data template, no difference in behaviour from the above
  step_*()
  # ... more steps
f <- frosting() |>
  layer_*()
  # ... more layers
ewf <- epi_workflow(r, parsnip::fit_engine(), f) # up to now, no processing or estimation has occurred
ewf <- fit(ewf, tib) # this runs the preprocessing and model fitting on tib, first checking that tib
    # matches the template in `r`; it updates the workflow, returning a new "fitted" workflow
tda <- tib[c(1, 6), ] # first time value in each geo
p1 <- predict(ewf, new_data = tda) # this will work, assuming that we can create the desired 
    # leads/lags (as specified in r) with tda

Proposed adjustment:

r <- epi_recipe(tib) |> # stores all the data, not just the template
  step_*()
  # ... more steps
f <- frosting() |>
  layer_*()
  # ... more layers
ewf <- epi_workflow(r, parsnip::fit_engine(), f) # up to now, no processing or estimation has occurred
p <- forecast(ewf) # automatically treat the stored template as training data, process it, 
    # fit the workflow, then predict only the future. Note that horizons are specified in the recipe.
    # no "test-time" data is needed

Side issue: inheritance from {tidymodels} means that we store template information about the original data frame in the epi_recipe S3 object. {recipes} stores the entire data. An epi_recipe only stores a 0-row tibble with the column names. To get this proposal to work, we would need to change to match the {recipes} behaviour and store the original data. This could potentially be large (the reason I avoided doing this before), though note that it is the original data, not the processed data. As currently implemented, certain test-time preprocessing operations that could benefit from access to the training data (smoothing, rolling averages, etc.) can potentially be buggy because they are applied only to the test-time data (td).
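To illustrate the current difference (assuming the stored data lives in the recipe's `$template` slot, as in {recipes}):

rec  <- recipes::recipe(y ~ x + geo_value + time_value, data = tib)  # {recipes}: keeps the data it was given
erec <- epi_recipe(tib)                                              # {epipredict}: keeps only a 0-row template
nrow(rec$template)   # the full training data (10 rows in the example above)
nrow(erec$template)  # 0 (column names and types only)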

Storing the training data would help here. However, {tidymodels} actually doesn't want to merge train-time and test-time data because it tries to emphasize (pedagogically?) that operations performed on train-time data should save the necessary summary statistics to be reused on test-time data. For example, centering and scaling a predictor should save the mean and sd at train time, and use those to adjust the test-time data (rather than computing the mean and sd of the test data and using those). As with most things, time series makes this complicated, and forecasts can potentially depend on all available data (rather than just "new" data). It's likely worth thinking carefully about this problem (though perhaps that's exactly what we're doing here).
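For concreteness, this is the standard {recipes} behaviour being referenced (plain {recipes}, not epipredict-specific): statistics are estimated at prep() time and reused at bake() time.

library(recipes)
library(tibble)

train <- tibble(x = 1:5, y = rnorm(5))
test  <- tibble(x = c(100, 200), y = rnorm(2))

rec <- recipe(y ~ x, data = train) |>
  step_normalize(x)                     # centering and scaling

prepped <- prep(rec, training = train)  # mean and sd of x are estimated here, from train only
bake(prepped, new_data = test)          # ...and reused here; the test data's own mean/sd are never computed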

forecast() would only need the workflow as an argument, though we could potentially allow an optional additional_data argument. This would be added to the train-time data, with the forecast then produced after the end of the additional_data.
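In usage terms (hypothetical; new_rows is just a placeholder for additional observations):

p  <- forecast(ewf)                              # forecast from the stored training data only
p2 <- forecast(ewf, additional_data = new_rows)  # append new_rows, then forecast after its end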

brookslogan commented 8 months ago

Re side issue: why not

Cons:

> operations performed on train-time data should save the necessary summary statistics to be reused on test-time data

We also saw, in some hardhat/tidymodels reading, another place where they say not to look at test data near train time (I forget how they worded it). I think the caution we should take here is maybe not to save data at epi_workflow() creation time. We handle not cheating via epix_slide().

Random thoughts:

brookslogan commented 7 months ago
Potentially related feature/application to think about below: combining models into a joint model via sampling. We're considering trying out separate backfill-projection & missingness-imputation and forecasting steps. One way to get reasonable uncertainty propagation in this method is to sample many trajectories (FinalizedPast | Provisional) from the backfill projection model, then take each one of those partial trajectories and tack on a sample from FinalizedFuture | FinalizedPast, then finally marginalize the samples into samples for each target (TargetFn(FinalizedFuture, FinalizedPast)) and kernel smooth to get a distribution for each target. This isn't the only way you might combine these two types of models, but it seems like one of the simplest ways. It'd be great if there was something within epipredict to do this sort of composition for you, but if there's not, then having a `fit` + `predict` w/ additional data would be handy; you could `fit` the FinalizedFuture | FinalizedPast model once and then, once per draw from your other model for FinalizedPast | Provisional, `predict` on (FinalizedPastModelDraw, FurtherPastThatWasAlreadyFinalizedOrClose).

Notes:

- This seems to favor a separate `fit` and `predict` so you don't have to repeatedly run quantile regressions.

As for storing processed vs. unprocessed data vs. no data:

- But you don't want `fit` to record the wrong latency info if you're including missingness imputation.
- Would we want to re-prep, re-bake, or neither when predicting on each of the draws?
  - Not re-`prep`; that could change scaling and invalidate the `model_spec`'s fit.
  - Re-`bake`: sounds most robust (e.g., calculate 7dav covariates for the forecaster when the backfill model outputs 1d), but also costly (calculating 7davs repeatedly on the entire past + projection) without some step-specific special logic.
  - Neither: puts more weight on the user to either:
    - Make the backfill model target columns from baked data going into the forecasting model.
    - Make a distinction between truth data and backfill-projected data in the model; backfill-projected data would need to be explicitly treated as a separate column, and if e.g. 7davs are needed, dealt with manually or via some special steps. This would probably look a bit more complicated, but better highlight a bit of a model mismatch (when fitting, we're using real FinalizedPast, and when predicting, we're subbing in draws from an _imperfect_ model for FinalizedPast | ProvisionalData, not the "true" conditional distribution), and any 7daving done here would be more efficient (though more error-prone) since it may naturally avoid repeating 7daving on windows that don't include backfill-imputed data.
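A rough sketch of that composition, where every name (backfill_model, sample_finalized_past, target_fn, n_draws, etc.) is a hypothetical placeholder rather than an epipredict function:

# Pseudocode sketch only; none of these helpers exist in epipredict.
fcast_wf <- fit(fcast_wf, finalized_history)  # fit the FinalizedFuture | FinalizedPast model once

target_samples <- lapply(seq_len(n_draws), function(i) {
  past_draw <- sample_finalized_past(backfill_model, provisional_data)   # FinalizedPast | Provisional
  new_data  <- dplyr::bind_rows(already_finalized_past, past_draw)
  future_draw <- predict(fcast_wf, new_data = new_data)                  # one FinalizedFuture draw
  target_fn(future_draw, past_draw)                                      # TargetFn(FinalizedFuture, FinalizedPast)
})
# then marginalize target_samples and kernel smooth into a distribution per target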
dajmcdon commented 7 months ago

Looking into the details:

  1. If the workflow has been fit, then I can access the prepped/baked objects; these are already stored, so no reprocessing would be needed (see the extractor sketch after this list). I'm not immediately clear on why I stored them. I think I imagined they could be necessary for some types of post-processing, but I don't recall if they're used. So they are potentially useful here, though we might want to investigate workarounds. Given this, implementing the forecast method may be exceedingly simple.
  2. If the workflow has NOT been fit, one could fit it if an engine is in the workflow, and proceed. So again, no reprocessing. Just 1-time processing.
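For reference, the already-stored objects mentioned in point 1 can be pulled out with the standard {workflows} extractors (shown on a fitted workflow ewf):

est_rec <- workflows::extract_recipe(ewf)  # the prepped (estimated) recipe from the fitted workflow
mold    <- workflows::extract_mold(ewf)    # the preprocessed training data used at fit time
head(mold$predictors)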

If additional data were passed, I suspect you would want to concatenate and then re-prep/bake. We could probably add flags in various places that we inspect to determine whether this is necessary, but that might get us back to the "need to adjust the new-data handling specifically for every possible step" problem that we face now with get_test_data(). We could likely back-burner this for now and move such functionality to a separate issue.

Aside: if we are super concerned with space (not clear we are at the moment, seems a "nice to have" rather than a "mandatory for adding new features"), we may want to investigate ways to use {butcher} for existing workflows.

brookslogan commented 7 months ago

If we throw out space considerations and there are no recalculation issues, there are still interface issues regarding storing/using the template data.

After realizing that Option 2 and Option 3's main con is fairly easily detectable after the fact (I think you get predictions for the wrong times, not cheating predictions for the right times), the only no-go seems to be Option 4. This would change if predict took new_data instead of additional_data; Option 3 would also look bad then.

dshemetov commented 7 months ago

I'm still catching up to the discussion here, so apologies if my comments are missing something obvious. Something that's not quite clear to me is how storing all the data with the original recipe helps us remove the difficult logic that's in get_test_data() at the moment. I think I'm missing something, because it seems to me that the new forecast() function will need to do much of the same "check step flag and do window length arithmetic" work, no?

dajmcdon commented 7 months ago

On the options for interface:

Meta:

The fit(object, data, ...) method for {workflows} requires data, so ours currently does as well (we inherit from their class). The predict(object, new_data, ...) method for {workflows} has a required new_data argument, so we do as well; this is the "test-time data". So we can't really alter those. In {generics}, forecast(object, ...) would allow us to add an additional_data argument, though this conflicts with the signature in {fable}, which is either forecast(object, new_data, ...) or forecast(object, new_data = NULL). But we don't do anything else from {fable}, so maybe we don't care.

To me, the main interface question is "when do you call forecast()?" I really see two options:

  1. You can forecast() with an unfit workflow. This requires (a) that the workflow has a fit module present; and (b) that there is training data somewhere. (a) is easy to validate. (b) could be solved by either storing the template data in the recipe or by requiring the full training data (plus any additional data) as forecast(object, data). I don't like requiring data at forecast time. So I'd prefer storing the data somewhere in this case, to at least maintain something similar to the {fable} version: forecast(object, additional_data = NULL). Internal implementation: forecast() would call fit() first, then do data munging to call predict().
  2. You cannot forecast() an unfit workflow. Now the fitting has been done, but we have to store either the un-processed data to reprocess, or the processed data so that we can get the correct stuff to call predict(). Implementation wise, it doesn't matter which. We still use forecast(object, additional_data = NULL).

Either of these is maybe closest to Option 4? I think my option 2 is a bit more flexible, because you can always train with any data you want and then forecast from it. But my option 1 is more user-friendly because you only pass in training data once. The option 2 flexibility is still "available" to you, because that's how things currently work: you have to call fit(), then create your test data, then call predict().

@dshemetov This is potentially correct, but hidden from the user. It's also important. If there is stored data, then when calling forecast() you would actually predict all time values, then slim down to only those that are in the future. The current implementation requires reprocessing at predict time, so it's computationally better to predict only with the data you need rather than with all time values. If the template data is stored and the workflow is unfit, then we process once, fit, then forecast. If the processed data (prepped/baked) is stored (as it is currently...), then if additional_data = NULL, we reuse the processed data and forecast. If additional_data is not NULL, we would HAVE to reprocess for some recipe steps (anything that operates globally). This is easy if the template is stored, but difficult (perhaps impossible?) if the processed data is stored.

So, now having thought through @dshemetov question and writing it down, the choice of which to store and when impacts the available options. I think that forecasting a fit workflow should not allow additional_data. I suggest we store the template data (allowing forecast(object, additional_data = new_epi_df)) until it gets fit. Then, we store the processed data, throw out the template, and require forecast(object, additional_data = NULL). This allows forecast to be independent of whether the workflow has been fit, but always requires storing some data.

To deal with the fact that (by default) we would be storing data in the workflow, we could also include some help. First, we add an argument to epi_recipe() that allows the user to turn off storing the template. (Aside: outside of {epipredict}, {tidymodels} users are, unbeknownst to them, storing the template data. The way to avoid this is by creating a recipe from only the first row of your training data; this is not documented that I know of, and would require an expert user.) Additionally, we add a function (or functions) called axe_template() or similar that removes the template from the workflow object, and possibly another that removes the processed data from a fit workflow.
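A rough sketch of what such a helper could look like (hypothetical; it follows the axe_*() naming convention from {butcher} mentioned earlier and assumes the template lives in the recipe's `$template` slot):

# Hypothetical helper; not part of epipredict or {butcher}.
axe_template <- function(x, ...) UseMethod("axe_template")

axe_template.recipe <- function(x, ...) {
  x$template <- x$template[0, ]  # keep column names/types, drop the stored rows
  x
}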

brookslogan commented 7 months ago

So above, I think I got mixed up and felt part of any change here would also include:

That's why I kept talking about predict() in the options above; sorry for the confusion. But I don't think it'd be that bad to add a default for new_data (or even an optional additional_data argument) to predict.epi_workflow(), so that one could do

ewf <- fit(ewf, tib)
p <- predict(ewf)

or, with the changes I was imagining above, to do

ewf <- fit(ewf, tib)
p <- predict(ewf, new_data = tib)

basically just turning predict() into forecast()-on-fit-workflows.

But for the forecasters I can think of, we'd normally want to just forecast() directly on unfit workflows (@dajmcdon's Option 1), unless we needed to debug fit coefficients. But I do think there are some other design questions (@dajmcdon's last 3 paragraphs). For example, even assuming we don't allow additional_data:

@dajmcdon I think you're proposing "Maybe in template, maybe in forecast()" + things in line with this for fit() and forecast()-on-fit-workflows. Is that right?

brookslogan commented 3 months ago

One more thought here. While I still think we can probably get away without time-window logic for version-unaware forecasters, it would likely be useful, for efficiency, when preparing version-aware training sets, if we really buy into recipes for preprocessing. For version-aware forecasting, we very commonly want to line up test-instance data with "analogous" versioned training data, particularly lags, or lags of 7-day averages. We could

In all of these cases, there's the question of how to do things efficiently. In the first two cases, we can use something like epix_slide(before = max_lag_relative_to_forecast_date + max_before_value_used_for_averaging) to try to be more efficient. But for generic epi_recipes, we don't know what window to ask for. If there were a way to ask for a time window needed to bake(?) an epi_recipe, with Inf as a fallback if some unknown steps are involved, then we might be able to do things more efficiently.
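One possible shape for that query (a hypothetical generic, not an existing API; it assumes, for illustration, that step_epi_lag keeps its lag values in a `lag` field):

# Hypothetical sketch: ask each step how much history it needs to bake,
# falling back to Inf for steps we don't know how to introspect.
min_history_needed <- function(step) UseMethod("min_history_needed")
min_history_needed.default      <- function(step) Inf
min_history_needed.step_epi_lag <- function(step) max(step$lag)

recipe_window <- function(r) {
  # the time window a generic epi_recipe needs before the forecast date;
  # Inf means "ask for everything" as a fallback
  max(c(0, vapply(r$steps, min_history_needed, numeric(1))))
}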

There's yet another context where we might want to know the time window needed for a computation, and that's archive -> archive slides (like we want here). Though I think Dmitry/David pointed out we could try something time-window-unaware first and see if it actually is slow. Plus this could be even more complicated, because it's probably about prep + bake, not just bake.