cmu-delphi / epipredict

Tools for building predictive models in epidemiology.
https://cmu-delphi.github.io/epipredict/

Handling insufficient training data. #272

Closed dajmcdon closed 8 months ago

dajmcdon commented 10 months ago

@dsweber2 @dshemetov

Brainstorming about how epipredict should handle insufficient training data.

A workflow is a container with (1) preprocessor; (2) engine; (3) postprocessor. By itself, it hasn't seen any data (training or test). When fit() is called on a workflow + training data, the preprocessor processes that training data.
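For concreteness, here's a minimal sketch of that container, assuming the `case_death_rate_subset` example data and the `epi_recipe()`/`frosting()` pattern from the vignettes:

```r
library(epipredict)

jhu <- case_death_rate_subset  # example epi_df shipped with epipredict

# (1) preprocessor: the lag/lead steps live here
r <- epi_recipe(jhu) %>%
  step_epi_lag(case_rate, lag = c(0, 7, 14)) %>%
  step_epi_ahead(death_rate, ahead = 7) %>%
  step_epi_naomit()

# (3) postprocessor
f <- frosting() %>%
  layer_predict() %>%
  layer_naomit(.pred)

# (2) engine, bundled with the other two; nothing has seen data yet
wf <- epi_workflow(r, parsnip::linear_reg()) %>%
  add_frosting(f)

# only here does the preprocessor touch the training data
wf_fit <- fit(wf, jhu)
```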

Typically, a check_*() in the recipe (when placed after any lagging/leading/growth-rate calculations) would error with a message like "insufficient training data". We could make this a warning instead (or give the check an argument that determines whether it warns or errors). The most obvious result in that case would be an unfit workflow. The downstream impact is that you can't predict() an unfit workflow (this produces an error through its workflow class). That makes sense typically, but perhaps not in the retrospective-evaluation task. Would we actually prefer that predict(unfit_workflow, new_data) produce another warning? Silently/verbosely return an empty tibble? Silently/verbosely return a tibble with all the keys in new_data but NA predictions?
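A rough sketch of what that could look like; the check name and its `action` argument are placeholders for whatever we settle on, not existing API:

```r
# hypothetical check placed after the lag/ahead calculations; `action` would
# switch between warning and erroring when training data is insufficient
r_checked <- epi_recipe(jhu) %>%
  step_epi_lag(case_rate, lag = c(0, 7, 14)) %>%
  step_epi_ahead(death_rate, ahead = 7) %>%
  check_enough_train_data(all_predictors(), action = "warn") %>%  # placeholder name/arg
  step_epi_naomit()

wf_checked <- epi_workflow(r_checked, parsnip::linear_reg())

# with only a week of data, the check would warn, and the proposal is that
# fit() then hands back an unfit workflow...
jhu_short <- dplyr::filter(jhu, time_value > max(time_value) - 7)
wf_maybe_unfit <- fit(wf_checked, jhu_short)

# ...and, as workflows currently behaves, predicting from an unfit workflow errors
# predict(wf_maybe_unfit, new_data = get_test_data(r_checked, jhu_short))
```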

It's also possible that the new_data has insufficient information to create the necessary lags, so even from a fit workflow we would not be able to produce a prediction. get_test_data() tries to ensure that there is enough (and only enough), but you could pass in your own new_data (say, if you don't wish to refit the model but want to predict using the most recent available data with the old model). So we could add a check_*() here as well. Again: should predict(fit_workflow, bad_new_data) produce a warning? Silently/verbosely return an empty tibble? Silently/verbosely return a tibble with all the keys in new_data but NA predictions?
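To make the two prediction paths concrete, a sketch against the fitted workflow from above (get_test_data() is the existing helper; the too-short new_data is just for illustration):

```r
# get_test_data() assembles just enough recent rows to build the recipe's lags
latest <- get_test_data(r, jhu)
preds <- predict(wf_fit, new_data = latest)

# but a user can pass their own new_data, e.g. to reuse an old fit on the most
# recent data; if it's too short to form the lags, we hit the same question:
# warn? empty tibble? keys from new_data with NA predictions?
too_short <- dplyr::filter(jhu, time_value == max(time_value))
# predict(wf_fit, new_data = too_short)
```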

Finally, we have arx_forecaster() (or any of the other canned forecasters). It's not just a workflow: it also sees the training data (and the testing data) at workflow build time, so we have more flexibility. We could do any combination of the above options, or we could check at the beginning, outside the workflow, and return an appropriate object. For your current task you're only using $predictions, so the rest doesn't matter (it also returns the fit workflow and a list of metadata). So we could just skip the postprocessor and return an empty tibble or NA predictions (even if we do something different in general).
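One way the "check at the beginning, outside the workflow" option could look; a sketch only, where the wrapper name, the min_train_days heuristic, and the NA-prediction shape are all assumptions rather than epipredict API:

```r
safe_arx_forecaster <- function(epi_data, outcome, predictors, ...,
                                min_train_days = 21) {
  # bail out before building the workflow if there isn't enough history
  if (dplyr::n_distinct(epi_data$time_value) < min_train_days) {
    preds <- epi_data %>%
      dplyr::distinct(geo_value) %>%
      dplyr::mutate(.pred = NA_real_)
    # mirror the usual return shape, but skip fitting and postprocessing
    return(list(predictions = preds, epi_workflow = NULL, metadata = list()))
  }
  arx_forecaster(epi_data, outcome, predictors, ...)
}

# out <- safe_arx_forecaster(jhu_short, "death_rate", c("case_rate", "death_rate"))
# out$predictions  # all-NA predictions instead of an error
```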

I suspect there are some other paths in the tree I'm not thinking of at the moment. What do you guys think?

dsweber2 commented 10 months ago

Modifying arx_forecaster behavior is definitely the least involved, and catches the problem for new users (since it comes up any time you combine epi*_slide and epipredict, I suspect this is a fairly common problem). It does have the downside of making it even harder to move off of arx_forecaster to anything handmade, which is already a pretty steep climb.

Would we actually prefer that predict(unfit_workflow, new_data) produce another warning? Silently/verbosely return an empty tibble? Silently/verbosely return a tibble with all the keys in new_data but NA predictions?

I'm mostly a fan of this: a silent result with NA predictions does tell the user that nothing was actually predicted on that day, and it fits with the other epi_slide results that @XuedaShen mentioned. Modifying the way the workflow is handled does seem a bit risky, though. A check would be a less invasive way, but it then requires the most fiddling on the part of the end user to make sure they're always passing in enough data. If the check could communicate exactly how many days of data they need, that would be less of a problem. I guess this is per-engine, though, and somewhat hard to know a priori. In practice what I've had to do is just keep rerunning with guesses for an allowable amount of data, which is definitely not ideal.
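For the "communicate exactly how many days" idea, a back-of-the-envelope version (assuming an AR-style recipe; a given engine could need more than this):

```r
# roughly: you need max(lags) + ahead days to form a single complete training
# row, plus one more day per additional training example you want
min_days_needed <- function(lags = c(0, 7, 14), ahead = 7, n_train_rows = 1) {
  max(lags) + ahead + n_train_rows
}
min_days_needed()                    # 22 days for a single usable row
min_days_needed(n_train_rows = 30)   # 51 days for ~a month of training rows
```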

It's possible that the new_data could have insufficient information to create the necessary lags.

For the use cases we've run into so far, what we do here will mostly be downstream of what we choose above. The case you describe doesn't seem unreasonable though, and having it produce the same kind of behavior as the cases above is probably good from a consistency perspective.

dajmcdon commented 10 months ago

Perhaps better to discuss in person? Or someone could attempt to implement the check and we can iterate from there.