Closed dsweber2 closed 3 days ago
Based on feedback, I'm making this a separate step and making sure full-pipeline tests work. It will stay a draft PR until both of those are done.
I'm still putting the tests into a format that isn't a pile of spaghetti, but I wanted to get this out so we could maybe talk about it.
To implement a separate step, I couldn't find a way to get at previous steps, so I basically had to communicate everything relevant through the headers of `new_data` in the bake step. Hopefully modifying those doesn't cause problems.
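To illustrate the pattern (this is a sketch, not the actual implementation; `latency_info` and the field names are hypothetical): information gets stashed on `new_data` itself so a later step's `bake()` can see it, since steps can't reach each other directly at bake time.

```r
# Sketch: passing information between steps via attributes on new_data,
# because a step's bake() method only receives the data, not earlier steps.
bake.step_adjust_latency <- function(object, new_data, ...) {
  # stash whatever downstream steps/layers need on new_data itself
  attr(new_data, "latency_info") <- object$latency_table  # hypothetical field
  new_data
}

# a downstream bake() method can then recover it:
# latency <- attr(new_data, "latency_info")
```

The risk, as noted above, is that other packages' steps may strip or copy over these attributes.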
While I was looking around I noticed that `step_epi_lag` and `step_epi_ahead` were essentially the same; I made them call a shared function to highlight the actual differences. I couldn't quite reuse that for `step_adjust_latency`, unfortunately, though it has a similar situation where `adjust_ahead` and `adjust_lag` basically do the same thing.
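The shared-backend observation can be sketched like this (a rough illustration, not the actual `add_shifted_columns()`): both steps reduce to a single signed shift.

```r
# Sketch: lags and aheads are one signed shift.
# sign = -1 for lags (look backward), +1 for aheads (look forward)
shift_col <- function(x, col, shift, sign) {
  n <- sign * shift
  prefix <- if (sign > 0) "ahead_" else "lag_"
  x[[paste0(prefix, shift, "_", col)]] <-
    if (n >= 0) dplyr::lead(x[[col]], n) else dplyr::lag(x[[col]], -n)
  x
}
```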
There are still some downstream implications for how `layer_forecast_date` works, specifically in the default case. We may have to add it as an option here too for it to work for the case of someone setting it manually using that step, which seems a bit unfortunate but kind of unavoidable.
Getting a peculiar error where the local snapshot test for population scaling has 4 lines like this:

```
Code
  p <- predict(wf, latest)
Message
  Joining with `by = join_by(geo_value)`
  Joining with `by = join_by(geo_value)`
  Joining with `by = join_by(geo_value)`
  Joining with `by = join_by(geo_value)`
```

whereas the CI is apparently only generating 2. And this apparently changed within the past 2 weeks.
This was a change in {dplyr} at some point: it generates these messages if you don't specify `by =`. I purposely tried to get rid of them, but maybe I missed some? Or your new code is creating more?
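For reference, the message comes from joins that infer their keys; supplying `by =` explicitly silences it (a minimal sketch, not the actual population-scaling code):

```r
library(dplyr)

x   <- tibble(geo_value = c("ca", "ny"), value = 1:2)
pop <- tibble(geo_value = c("ca", "ny"), population = c(39e6, 19.7e6))

left_join(x, pop)                           # emits: Joining with `by = join_by(geo_value)`
left_join(x, pop, by = join_by(geo_value))  # silent
```

(`join_by()` was introduced in dplyr 1.1.0, which is also when the message format changed.)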
A few other notes (we can discuss synchronously):

- A `step_*()` function has access to the recipe when it is added. At that point, it can "see" all previous steps. So inside of `step_adjust_latency()`, you can immediately check whether there are leads/lags in the current set of recipe instructions. In principle, you could then adjust them if desired. I'm not sure this handles everything, but I suspect it handles some of it.
- `map(\(x) x + 3)`-style syntax was added in R 4.1. We've avoided forcing this dependency on users to this point. I'm open to changing this (it's almost 3 years old), but we should discuss. This is also the reason for sticking with the {magrittr} pipe `%>%` rather than the native pipe `|>`.
- The {zoo} dependency. There are situations where removal is always the right thing (creating NAs on future `time_value`s is probably one of these), and our `step_epi_lead()` does this currently. However, my previous practice has been to pass any NAs created by steps along. This way, additional steps can potentially be added to handle them (say, any of these, or similar time-series fill versions). Simplest is the `step_naomit()` functions. Is there a reason this can't be done without the dependency, or handled downstream?

Overview of new changes:
- Removed `stringr` and `stringi`; added `purrr`.
- `zoo` is still present, waiting for us to decide whether we want to trim it or not.
- `recipes_eval_select`. If the template is different from the training data this could cause problems, but I guess that's supposed to be true anyway.
- Wishlist item based on some dogfooding: if the minimum reporting latency across signals & geos is really large, say a month or greater, that's worth a warning. E.g., FluSurv-NET has gaps in reporting between seasons; I tried to forecast a new season before the first report was actually published using ahead adjustment, and it happily tried to forecast the beginning of that season from the end of the preceding season, likely giving poor predictions.
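If we do add that wishlist warning, a minimal sketch could look like the following (hypothetical helper and column names; the 30-day threshold is purely illustrative):

```r
# Sketch only: warn when even the freshest signal/geo is very stale.
check_excessive_latency <- function(latency_table, forecast_date, threshold = 30) {
  min_latency <- min(latency_table$latency)  # hypothetical column
  if (min_latency >= threshold) {
    cli::cli_warn(
      "The minimum latency across signals and geos is {min_latency} days
       relative to {forecast_date}; forecasts may effectively extrapolate
       from a much earlier period (e.g. a previous season)."
    )
  }
}
```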
@dsweber2 I'm going to make what may seem like an annoying request:
This processing seems sufficiently complicated that I think it should be on its own, not in `arx_forecaster()`. I don't think that `arx_forecaster()` should accommodate every specialized step/layer that we end up creating. So maybe this needs to just be a new canned forecaster? What do you think?
I'm not sure it'd be different enough from `arx_forecaster` to have as a completely separate version (comparing `arx_forecaster`, `arx_classifier`, and... `arx_latency_adjusted_forecaster`?). What part of the complication pushes you to not include it? Is your concern having too many options for `arx_forecaster`? Too much complexity in understanding the options? Maintaining the additional feature?
I'd be ok relegating it to just an advanced feature if necessary, though I do think that without it, the `forecast_date` and `target_date` arguments are kind of half-baked.
My 2 cents:

[…`forecast_date`s]. My preference/request would be that there be a canned forecaster that includes this feature. Could be `arx_forecaster()`, could be an `arx_forecaster_from_forecast_date()` (or better-named) alternative.

Side note: I also share some misgivings about the `forecast_date` and `target_date` settings... they always felt like stop-gaps until we had access to this latency/ahead-adjustment feature. I'm not sure they'd be needed anymore --- in either the max-`time_value`-relative or the true-`forecast_date`-relative approaches --- once the adjustment approach is available; presumably, you'd only use the non-adjusted version when you actually want things to be relative to the max `time_value`. [Also, `target_date` might make it awkward to support multiple aheads at once, which sounds very convenient + would make a nice uniform interface with smooth quantile_reg approaches.]
[…the] `forecast_date` parameter does, which actually makes `forecast_date` impact the model for the adjustment approach. That sounds better, though I'm still not sure if people will read it as an `epix_as_of()` + forecast, which it isn't. It might also be a convenient way to change the meaning of this label when doing max-`time_value`-relative stuff... maybe the larger problem there is actually the existence of a default `forecast_date` with a value that might surprise half of readers/users.]

From our discussion yesterday, I realized I overlooked an important possibility: someone newer forecasting without a Hub setup. Then there is no Hub validator to yell at you for forecasting a changing set of (forecast-date-relative) aheads when target latency changes / you pull data too early in the day, but you may also be able to say different aheads are fine. I think the new forecasting examples Richard and Ryan N. are working on may still be in that regime and may give some insight; I'm not sure whether we also have some experience with epipredict precursors from before running things through a Hub validator. Ryan N. already ran into some confusion with (not easily being able to use epi_slide for forecasting and) the epix_slide time_value / forecast_date / target_date labeling, though the obvious issue there was (typing and recycling rules and) poor epiprocess labeling; I'm not sure if there is also an epipredict thing to worry about there.
Still probably don't want to streamline misrepresenting what the real target dates are in this case. (And some surprises might come if/when:
(a) [epipredict] setting the forecast date label to be the max time value and that not being the weekday expected [based on a natural reading of the word],
(b) [epipredict] using the true forecast date and target dates as labels and seeing the aheads changing when trying to retrospectively evaluate or plot, plus potential uneven spacing or duplicate target date values (per epikey-ahead-forecaster),
(c) not using target date and assuming the aheads inputted are relative to today in decision-making or plotting code and being a bit off, and/or
(d) running forecasts too early in the day (stale forecast if AR; not sure what all can happen with ARX; it's also an issue for the adjusted version but with different behavior).
[(a), (b), and (d)-AR might cause them to ask for support and learn. (c) and (d)-ARX might not be immediately detected and could degrade forecast/decision quality.])
Ok, pretty major revision done, and I think this is ready, modulo fixing some vignettes.
- Added `smoothqr` to the description, since apparently that was necessary, as well as `grf_quantiles` to the pkgdown.yml.
- `rlang::abort` -> `cli_abort` conversions, as well as some cases where `c` is used instead of `paste` when the text is meant to continue across the line rather than form a new bullet point.
- Changed `step_epi_ahead` and `step_epi_lag`'s backend to use the function `add_shifted_columns`, since they're effectively the same function modulo a sign.
- Removed `epi_shift`, since it was completely unused (and the tests going with it).
- `get_test_data` and `forecast`, since that kind of `locf` is expressed in a method of `step_adjust_latency`.
- `forecast_date` and `target_date` for `arx_forecaster` are now used in the actual forecast, so if you set them to something that induces latency but don't select a method, it will warn about the difference. It will also error if `forecast_date`, `target_date`, and `ahead` are all set and inconsistent.
- `arx_classifier` uses `across(starts_with("ahead"...`, since we don't necessarily know the ahead a priori if they're using `adjust_latency`. Behavior is the same though.
- `layer_add_forecast_date`, in the absence of a specific date, inherits the date from `step_adjust_latency` (or, if it's missing, uses `max_time_value`); likewise for `layer_add_target_date`.
- The `purrr` dependency, because there was some bug in `map` that the patch was causing problems with.

`step_adjust_latency` modifies the `epi_df` so that it has some extra metadata following it around:
- `latency_table`, a table of how much each column present at the time of calling `step_adjust_latency` is latent relative to the `forecast_date`
- `metadata`, which is a cached version of the metadata present at prep time. This is only tacked on if it's not actually present, since non-epipredict steps typically strip this.
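For concreteness, the latency table might look something like this (a hypothetical illustration; the actual column names and contents may differ):

```r
# Hypothetical shape of the latency_table carried along as metadata.
latency_table <- tibble::tibble(
  col     = c("case_rate", "death_rate"),
  latency = c(4L, 10L)  # days each column lags behind the forecast_date
)
```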
These are then used in both `step_epi_ahead` and `step_epi_lag`, if the signs are right, to adjust the shift value.

Other changes:

- `forecast_date`, `ahead`, `target_date` inconsistency
- `forecast_date` is ahead of the max time value
- `layer_add_forecast_date`: if `step_adjust_latency` is used previously, it adopts the date from that (specifically the `as_of` version)
- `layer_add_target_date`: if `step_adjust_latency` is used previously, it adopts the date from that (specifically the `as_of` version)
- `adjust_latency` / `step_adjust_latency`
- `NA` removal early

@dajmcdon or @brookslogan anything you would want to see here?
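A rough sketch of the inconsistency check described above (hypothetical helper name, not the actual implementation): if `forecast_date`, `target_date`, and `ahead` are all user-set, they must agree.

```r
# Sketch only: error when all three date arguments are set but disagree.
check_date_consistency <- function(forecast_date, target_date, ahead) {
  if (!is.null(forecast_date) && !is.null(target_date) && !is.null(ahead) &&
      (forecast_date + ahead) != target_date) {
    cli::cli_abort(
      "`forecast_date` + `ahead` must equal `target_date` when all three are set."
    )
  }
}

check_date_consistency(as.Date("2024-09-20"), as.Date("2024-09-27"), 7)  # consistent, no error
```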
I've realized that there are some fixes I need to do for the vignettes before we can merge this, but that shouldn't take too long.
Ok, vignette fixed; there was a straggling `fill_locf`, and a previous problem where some versioned data wasn't available for a given `ref_time_value`.
Ok, a minor amount of changes was necessary to that PR to get it working without modifying the way `prep.epi_recipe` works. Are we good to merge?
It's still pretty big, so give me a bit of time to take another pass.
Checklist
Please:

- `DESCRIPTION` and `NEWS.md`: always increment the patch version number (the third number), unless you are making a release PR from dev to main, in which case increment the minor version number (the second number).

Change explanations for reviewer
Implementation of latency adjustments as an argument to `step_epi_ahead`. The rough idea is to make `ahead` relative to the `as_of` date, rather than the last day of data. It throws a warning if this is too far in the future. Only `extend_ahead` and `None` are currently implemented. Biggest downside is it only works for the `epi_df` that the pipeline was created on.

Still working on:
- a different `epi_df` and the end-to-end application working, rather than just the specialized function (`extend_ahead`)
- `step_adjust_ahead`
- `locf` and associated tests
- `extend_lags`: we should talk about this one; it's a new-ish @lcbrooks idea. Sort of the opposite of `extend_aheads`. Setting the forecast date to be `as_of`, per predictor, adjust the lags so that they're relative to the last day of data. E.g., if predictor `A` has 3 days of latency and the lags are given as `(0, 7, 14)`, set the lags to `(3, 10, 17)`, while if `B` has 5 days of latency, set the lags for `B` to `(5, 12, 19)`. Feel free to move discussion to the issue.

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch
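The arithmetic behind the two adjustments discussed above, worked out as a small illustration (dates are made up; `adjust_lags` is just for demonstration):

```r
# extend_ahead: make the ahead relative to as_of instead of the last data day.
as_of          <- as.Date("2024-09-20")
max_time_value <- as.Date("2024-09-16")
latency <- as.integer(as_of - max_time_value)  # 4 days of latency
ahead <- 7
effective_ahead <- ahead + latency             # 11: shift from the data's end

# extend_lags: shift each predictor's lags by that predictor's own latency.
adjust_lags <- function(lags, latency) lags + latency
adjust_lags(c(0, 7, 14), latency = 3)  # predictor A: 3 10 17
adjust_lags(c(0, 7, 14), latency = 5)  # predictor B: 5 12 19
```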