dajmcdon opened this issue 8 months ago
Re side issue: why not `forecast()`?

```r
p <- forecast(ewf, tib)
```
Pros:
- `forecast()`, and that's it.

Cons:
- `tib` gets passed in twice, or dummy data + `tib`. [Might look more natural if `epi_recipe()` required spelling out something like `template =` or `ptype =` (if hardhat & recipes actually are doing something ptype-equivalent). But that requires breaking from `recipe()` constructor mirroring...]
- No obvious `additional_data` setup.

But you could potentially augment it by allowing users to `fit()` & attach the training data, then `predict()` on an expanded time series (not `get_test_data()` output, just all rows, to be filtered down to the latest row per epikey in one of the final steps, after any lagging, 7-day-averaging, etc.), or `forecast()` with `additional_data`. There may be a lot of tricky stuff trying to make these things work, though. For example, if we try to do lag selection based on what's available at test time, we'd actually be doing it at `fit()` time & choosing a different lag set than we would working directly with the expanded time series. And we have some layers setting the `forecast_date` labels that would have similar issues. And this entire paradigm might still be restrictive... we might see if it could accommodate online recalibration, Kalman filtering/smoothing, etc.

> operations performed on train-time data should save the necessary summary statistics to be reused on test-time data
We also saw in some hardhat/tidymodels reading another place where they say not to look at test data first / near train time; I forget how they worded it. I think the caution we should take here is to maybe not save data at `epi_workflow` creation time. We handle not cheating via `epix_slide()`.
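To illustrate the "predict on an expanded time series, then filter down to the latest row per epikey" idea, a rough sketch (the column names and the final filtering step are illustrative, not current {epipredict} behavior):

```r
library(dplyr)

ewf <- fit(ewf, tib)                 # tib: the full epi_df training data

# predict on *all* rows, not just get_test_data() output...
p_all <- predict(ewf, new_data = tib)

# ...then keep only the latest available row per epikey
p_latest <- p_all |>
  group_by(geo_value) |>
  filter(time_value == max(time_value)) |>
  ungroup()
```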
Random thoughts:
- Not cheating is handled via `epix_slide()`, not by conforming to the fit/predict paradigm.

Looking into the details:
If additional data were passed, I suspect you would want to concatenate and then re-prep/bake. We could probably add flags in various places that we inspect to determine whether this is necessary, but that might get us back to the "need to adjust the new-data handling specifically for every possible step" problem that we face now with `get_test_data()`. We could likely back-burner this for now, and move such functionality to a separate issue.
Aside: if we are super concerned with space (not clear we are at the moment; it seems a "nice to have" rather than "mandatory for adding new features"), we may want to investigate ways to use {butcher} for existing workflows.
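For reference, a minimal sketch of what trimming a fit workflow with {butcher} could look like (whether these particular axes free enough space, and whether the result still `predict()`s correctly, would need checking):

```r
library(butcher)

# assumes `ewf` is a fit epi_workflow
weigh(ewf)                            # report which components take the most memory

ewf_small <- axe_call(axe_env(ewf))   # drop the stored call + environments
ewf_tiny  <- butcher(ewf)             # or just take butcher()'s defaults
```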
If we throw out space considerations and there are no recalculation issues, there are still interface issues regarding storing/using the template data:

1. Require data to be passed to `fit()` and `forecast()`. Maybe allow `additional_data` to `predict()`. Pro: can use dummy template data without confusing error messages. Con: a simple production forecaster requires feeding the same data in twice; once into `epi_recipe()`, once into `forecast()`.
2. Allow `forecast()`ing without `fit()`ing first. Require `new_data` to be fed into `fit()`, and only allow `additional_data` with `predict()`. (Con: might forget to pass data to `forecast()` when sliding & accidentally make production forecasts instead of exploration forecasts. Fairly easily noticed due to duplication. Con?: more than one way to `forecast()` might invite confusion.)
3. Have `fit()` and `forecast()` default to using the stored template, but allow the user to pass a data/`new_data` arg; maybe allow `additional_data` to `predict()`. (Same Con as Option 2. Might also happen with `fit()`. Still seems fairly easily noticed. Same Con? as well, but also for `fit()`.)
4. Have `fit()` and `forecast()` use the stored template and not allow a data/`new_data` arg; maybe allow `additional_data` to `forecast()` and `predict()`. Con: have to construct the `epi_recipe()` after having the desired data available. The user would have to make their own abstraction for a forecaster if they want to define it earlier.

After realizing that Option 2 and Option 3's main con is fairly easily detectable after the fact (I think you get predictions for the wrong times, not cheating predictions for the right times), the only no-go seems to be Option 4. This would change if `predict()` took `new_data` instead of `additional_data`; Option 3 would also look bad then.
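To make the Option 2/3 con concrete: the failure mode is forgetting to pass the snapshot data when sliding, so every iteration falls back to the stored template and you get the same production forecast repeated per reference date, rather than a silent leak. A rough sketch (the `forecast()` interface here is hypothetical; `archive` and `ewf` are assumed to exist):

```r
library(epiprocess)

archive |>
  epix_slide(
    ~ forecast(ewf),        # oops: meant forecast(ewf, .x) with the snapshot
    before = 120
  )
# -> identical "production" forecasts duplicated once per ref_time_value,
#    which is easy to notice in the output
```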
I'm still catching up to the discussion here, so apologies if my comments are missing something obvious. Something that's not quite clear to me is how storing all the data with the original recipe helps us remove the difficult logic that's in `get_test_data()` at the moment. I think I'm missing something, because it seems to me that the new `forecast()` function will need to do much of the same "check step flag and do window-length arithmetic" work, no?
On the options for interface:

Meta:
The `fit(object, data, ...)` method for {workflows} requires `data`, so ours does as well currently (we inherit from their class). The `predict(object, new_data, ...)` method for {workflows} has a required `new_data` arg, so we do as well. This is the "test-time data". So we can't really alter those. In {generics}, `forecast(object, ...)` would allow us to add an `additional_data` argument, though this conflicts with the signature in {fable}, which is either `forecast(object, new_data, ...)` or `forecast(object, new_data = NULL)`. But we don't do anything else from {fable}, so maybe we don't care.
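Collecting those signatures in one place for comparison (the last one is the proposed addition, not an existing export):

```r
# signatures in play (comments only; for reference):
# {workflows}:  fit(object, data, ...)             -- data is required
# {workflows}:  predict(object, new_data, ...)     -- new_data is required
# {fable}:      forecast(object, new_data = NULL, ...)
# proposed:     forecast(object, additional_data = NULL, ...)
```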
To me, the main interface question is "when do you call `forecast()`?". I really see 2 options:

1. `forecast()` with an unfit workflow. This requires (a) the workflow has a fit module present; and (b) there is training data somewhere. (a) is easy to validate. (b) could be solved either by storing the template data in the recipe or by requiring the full training data (plus any additional data) as `forecast(object, data)`. I don't like requiring data at forecast time. So I'd prefer storing the data somewhere in this case, to at least maintain something similar to the {fable} version: `forecast(object, additional_data = NULL)`. Internal implementation: `forecast()` would call `fit()` first, then do data munging to call `predict()`.
2. `forecast()` only on a fit workflow. Now the fitting has been done, but we have to store either the un-processed data to reprocess, or the processed data, so that we can get the correct stuff to call `predict()`. Implementation-wise, it doesn't matter which. We still use `forecast(object, additional_data = NULL)`.

Either of these is maybe closest to Option 4? I think my option 2 is a bit more flexible, because you can always train with any data you want and then forecast from it. But my option 1 is more user-friendly because you only pass in training data once. The option 2 flexibility is still "available" to you, because that's how things currently work: you have to call `fit()`, then create your test data, then call `predict()`.
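A minimal sketch of option 1's internal implementation (the method body is hypothetical, and it assumes the full training data is stored as the recipe's template):

```r
# hypothetical sketch, not the actual {epipredict} implementation
forecast.epi_workflow <- function(object, additional_data = NULL, ...) {
  r <- workflows::extract_preprocessor(object)   # the (unprepped) epi_recipe
  train_data <- r$template                       # assumed: full training data
  if (!is.null(additional_data)) {
    train_data <- dplyr::bind_rows(train_data, additional_data)
  }
  object <- fit(object, data = train_data)       # fit first...
  test_data <- get_test_data(r, train_data)      # ...munge test-time data...
  predict(object, new_data = test_data)          # ...then predict
}
```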
@dshemetov This is potentially correct, but hidden from the user. Also important. If there is stored data, when calling `forecast()`, you would actually predict all time-values, then slim down to only those that are in the future. The current implementation requires reprocessing at predict time, so it's better computationally to only predict with the data you need rather than all time-values. If the template data is stored, and the workflow is unfit, then we process once, fit, then forecast. If the processed data (prepped/baked) is stored (as it is currently...), then if `additional_data = NULL`, we reuse the processed data and forecast. If `additional_data` is not NULL, we would have to reprocess for some recipe steps (anything that operates globally). This is easy if the template is stored, but difficult (perhaps impossible?) if the processed data is stored.
So, now having thought through @dshemetov's question and written it down, the choice of which to store and when impacts the available options. I think that forecasting a fit workflow should not allow `additional_data`. I suggest we store the template data (allowing `forecast(object, additional_data = new_epi_df)`) until it gets fit. Then, we store the processed data, throw out the template, and require `forecast(object, additional_data = NULL)`. This allows `forecast()` to be independent of whether the workflow has been fit, but always requires storing some data.
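In pseudo-R, that storage rule would branch roughly like this (the accessor/helper names are made up for illustration):

```r
# rough decision logic only; not real {epipredict} code
forecast.epi_workflow <- function(object, additional_data = NULL, ...) {
  if (!workflows::is_trained_workflow(object)) {
    # unfit: the raw template data is still stored, so extra rows are allowed
    train <- stored_template(object)                      # hypothetical accessor
    if (!is.null(additional_data)) {
      train <- dplyr::bind_rows(train, additional_data)
    }
    object <- fit(object, data = train)
  } else if (!is.null(additional_data)) {
    # fit: only the prepped/baked data is kept, so we can't safely reprocess
    stop("`additional_data` is only allowed when forecasting an unfit workflow")
  }
  predict_future_rows(object)                             # hypothetical helper
}
```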
To deal with the fact that (by default) we would be storing data in the workflow, we could also include some help. First, we add an argument to `epi_recipe()` that allows the user to turn off storing the template. (Aside: outside of {epipredict}, {tidymodels} users are, unbeknownst to them, storing the template data. The way to avoid this is by creating a recipe from only the first row of your training data; this is not documented that I know of, and would require an expert user.) Additionally, we add a function (or functions) called `axe_template()` or similar that removes the template from the workflow object, and possibly another that removes the processed data from a fit workflow.
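An `axe_template()` along these lines could mirror {butcher}'s `axe_*()` family; a hypothetical sketch (no such function exists at the time of this discussion):

```r
# hypothetical helper in the spirit of {butcher}'s axe_*() generics
axe_template <- function(x, ...) {
  r <- workflows::extract_preprocessor(x)   # the (unprepped) epi_recipe
  r$template <- r$template[0, ]             # keep a 0-row tibble for column info
  workflows::update_recipe(x, r)            # swap the slimmed recipe back in
}
```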
So above, I think I got mixed up and felt part of any change here would also include:

- Allowing `new_data` to be the full time series, not the output of `get_test_data()`. [This might also require some special invention of a required-but-unused predictor observed up until the max `time_value` across used(?) predictors(?), so that `lags = 14, ahead = 14` won't get test data in the same way as `lags = 0, ahead = 28`.]
- Filtering down to the latest `time_value`s in `predict.epi_workflow()`, in a new `forge.epi_df()` method, or (probably a bad idea or impossible) a `step_filter_to_latest_time_values`.

Thus why I kept talking about `predict()` in the options above. Sorry to confuse. But I don't think it'd be that bad to add a default for `new_data` (or even an optional `additional_data` argument) to `predict.epi_workflow()`, so that you could do
```r
ewf <- fit(ewf, tib)
p <- predict(ewf)
```
or, with the changes I was imagining above, to do
```r
ewf <- fit(ewf, tib)
p <- predict(ewf, new_data = tib)
```
basically just turning `predict()` into `forecast()`-on-fit-workflows.
But for the forecasters I can think of, we'd normally want to just `forecast()` directly on unfit workflows (@dajmcdon's Option 1), unless we needed to debug fit coefficients. But I do think there are some other design questions (@dajmcdon's last 3 paragraphs). For example, even assuming we don't allow `additional_data`:
- Do you require data at `forecast()` time, or do you make it optional and try to grab it from the template, or do you disallow it and always grab it from the template?
  - "Never template": the user must either provide dummy data (type signature or a row or...) as template and real data to `forecast()`, or must pass the real data to both. (Package can axe the template.) This is most in line with tidymodels.
  - "Maybe in template, maybe in `forecast()`": the user has the same options as "Never template", plus passing real data as template & no data to `forecast()`. More convenient for production, but more error-prone for exploration. (But the possible errors seem relatively easy to detect: production forecasts duplicated N times rather than N exploration forecasts, not silent subtle cheating.) (Package may need to detect the user error of passing dummy data as template + no data to `forecast()`.)
  - "Always template": the user must provide real data in the template, and no data in `forecast()`. Simpler (less flexible) interface, but probably annoying for exploration since you need the real data on hand at workflow construction time.
- (Similar questions for `fit()`, and for `forecast()`/`predict()` on fit workflows.)

@dajmcdon I think you're proposing "Maybe in template, maybe in `forecast()`" + things in line with this for `fit()` and `forecast()`-on-fit-workflows. Is that right?
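For concreteness, "Maybe in template, maybe in `forecast()`" would support both of these call patterns (interface hypothetical; objects and column names illustrative):

```r
# production: real data as the template, nothing passed at forecast time
r   <- epi_recipe(latest_edf) |> step_epi_lag(case_rate, lag = c(0, 7, 14))
ewf <- epi_workflow(r, linear_reg())
p   <- forecast(ewf)

# exploration/backtesting: dummy template, each snapshot passed explicitly
r   <- epi_recipe(dummy_edf) |> step_epi_lag(case_rate, lag = c(0, 7, 14))
ewf <- epi_workflow(r, linear_reg())
p   <- forecast(ewf, snapshot_edf)   # e.g. within an epix_slide() computation
```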
One more thought here. While I still think we can probably get away without time-window logic for version-unaware forecasters, it would likely be useful for efficiency purposes when preparing version-aware training sets, if we really buy into recipes for preprocessing. For version-aware forecasting, we very commonly want to line up test-instance data with "analogous" versioned training data, particularly lags, or lags of 7-day-averages. We could … (the `prep` part.) In all of these cases, there's the question of how to do things efficiently. In the first two cases, we can use something like `epix_slide(before = max_lag_relative_to_forecast_date + max_before_value_used_for_averaging)` to try to be more efficient. But for generic epi_recipes, we don't know what window to ask for. If there were a way to ask for the time window needed to `bake`(?) an `epi_recipe`, with Inf as a fallback if some unknown steps are involved, then we might be able to do things more efficiently.
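One hedged sketch of what "ask the recipe for its time window" could look like (nothing like this exists; the step classes inspected and their fields are assumptions):

```r
# hypothetical: how many days of history would bake()-ing this epi_recipe need?
min_history_needed <- function(recipe) {
  per_step <- vapply(recipe$steps, function(s) {
    if (inherits(s, "step_epi_lag")) return(max(s$lag))   # assumed `lag` field
    if (inherits(s, "step_epi_ahead")) return(0)
    Inf                              # unknown step: fall back to "everything"
  }, numeric(1))
  max(per_step, 0)
}

# then, roughly: epix_slide(..., before = min_history_needed(r) + <averaging window>)
```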
There's yet another context where we might want to know the time window needed for a computation, and that's archive -> archive slides (like we want here). Though I think Dmitry/David pointed out we could try something time-window-unaware first and see if it actually is slow. Plus this could be even more complicated, because it's probably about `prep` + `bake`, not just `bake`.
This is a proposal for an addition to the procedure for preprocessing -> fitting -> predicting, currently used in the package.
Current behaviour:
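Roughly, the current pattern looks like this (a sketch; the column name and the specific steps are illustrative):

```r
library(epipredict)
library(parsnip)

r <- epi_recipe(tib) |>                       # tib: an epi_df of training data
  step_epi_lag(case_rate, lag = c(0, 7, 14)) |>
  step_epi_ahead(case_rate, ahead = 7) |>
  step_epi_naomit()

ewf <- epi_workflow(r, linear_reg()) |> fit(tib)

td <- get_test_data(r, tib)                   # step-aware window arithmetic here
p  <- predict(ewf, new_data = td)
```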
Alternative, non-forecast as currently implemented. Not used, really, but should work:
Proposed adjustment:
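And the proposed shape, per the discussion below (the `forecast()` signature is the proposal, not an existing export at this point):

```r
r <- epi_recipe(tib) |>                       # template would now store the full data
  step_epi_lag(case_rate, lag = c(0, 7, 14)) |>
  step_epi_ahead(case_rate, ahead = 7) |>
  step_epi_naomit()

ewf <- epi_workflow(r, linear_reg())

p <- forecast(ewf)                            # no hand-built test data needed
# optionally, with rows observed after the template data:
p <- forecast(ewf, additional_data = newer_tib)
```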
Side issue: inheritance from {tidymodels} means that we store template information about the original data frame in the `epi_recipe` S3 object. {recipes} stores the entire data; an `epi_recipe` only stores a 0-row tibble with the column names. To get this proposal to work, we would need to change to match the {recipes} behaviour and store the original data. This could potentially be large (the reason I avoided doing this before), though note that it is the original data, not the processed data. As currently implemented, certain test-time preprocessing operations that could benefit from access to the training data (smoothing, rolling averages, etc.) can potentially be buggy because they are applied only to the test-time data (`td`).

Storing the training data would help here. However, {tidymodels} actually doesn't want to merge train-time and test-time data, because it tries to emphasize (pedagogically?) that operations performed on train-time data should save the necessary summary statistics to be reused on test-time data. For example, centering and scaling a predictor should save the mean and sd at train time, and use those to adjust the test-time data (rather than computing the mean and sd of the test data and using those). As with most things, time series makes this complicated, and forecasts can potentially depend on all available data (rather than just "new" data). It's likely worth thinking carefully about this problem (though perhaps that's exactly what we're doing here).

`forecast()` would only need the workflow as an argument, though we could potentially allow an optional `additional_data` argument. That data would be added to the train-time data, with the forecast now produced after the end of the `additional_data`.
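The centering/scaling behaviour described above is exactly how {recipes} works today; a small self-contained illustration:

```r
library(recipes)

train <- data.frame(x = c(1, 2, 3, 4), y = c(0, 1, 0, 1))
test  <- data.frame(x = c(100, 200),   y = c(0, 1))

rec <- recipe(y ~ x, data = train) |>
  step_normalize(x) |>
  prep(training = train)        # mean/sd of x computed from train only

bake(rec, new_data = test)      # test x rescaled with the *training* mean/sd
```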