Add the 7dav we talked about along with the std

dsweber2 commented 8 months ago

Closes #74. This adds a new forecaster that creates a 7 day average column and a 28 day moving standard deviation column, and uses those as predictors. @brookslogan would like to hear your feedback about implementation if you've got time.

dsweber2 commented 8 months ago

first, thanks for the detailed commentary!

Implement a step_rolling_mean (or some sort of step-like thing) to add grouped rolling mean columns & assign a requested role, defaulting to "predictor". (Might include a convenience feature to lag these as well.) Implement a step_rolling_sd to add grouped rolling-sds-around-rolling-mean columns & assign a requested role, defaulting to "predictor".

I do think this is an eventual goal. As with the latency adjustment the turnaround time on adding things to epipredict is longer than writing them like this. If they perform well we should do as you're saying.

cache_metadata can probably be avoided through use of dplyr_reconstruct or epi_slide.

Only reason I defined this function was because select wasn't actually supported for epi_df's and epi_archives. Glad to see there are some built-in alternatives.

dsweber2 commented 8 months ago

re-implemented with epislide and added the stuff I marked as resolved.

Now I'm wondering; for days that are missing/NA, there are 3 main options I can see.

what I was doing initially, which is ignore them completely and take 7 points regardless of date
what epi_slide does by default, which is use them to determine the points used in the average
in addition, impute the values for missing days based on the 7dav around them. This may be NA for particularly bad days.

brookslogan commented 8 months ago

For compatibility with dev10, I'd think the choice would be to complete with NAs or check for an nrow of 7 else output NA. Latter approach wouldn't impact the number of input&output rows and would automatically produce NAs for at least the first 6 times, similar to mutate_scattered_ts, so that might be nice. NA seems like the safe-ish-est easy option, and gaps probably aren't that common(?) so this might be preferred?
- If latency varies by geo, then an imputation might let us do something less bad, but maybe this is the response of a latency-handling step/approach rather than a 7dav-er.
- (Perhaps we're also missing a step_gap_fill or something like that if someone wants e.g. 0s?)
Would it be any easier/quicker to develop the steps if they don't have to immediately be put into epipredict? Maybe gradually port over there after testing here? [Re-reading above --- testing the model performance first also sounds like a plan.]

[Seems like a lot of this we may want to proceed & test before trying to refactor/tweak. I still need to finish reading some files & try to actually sanity check more concrete things.]

dsweber2 commented 8 months ago

For compatibility with dev10...

Ok, so I made the imputation thing a standalone issue in the icebox. For now, I'm just using default epi_slide behavior, which fills with NA and then removes the NA fills in the returned result.

Would it be any easier/quicker to develop the steps if they don't have to immediately be put into epipredict? Maybe gradually port over there after testing here? [Re-reading above --- testing the model performance first also sounds like a plan.]

Definitely. This is basically ready to be tested on real data, modulo anything you catch that isn't currently enumerated in the tests, whereas I don't think we'd have this operation for another month in epipredict, between the coordination problems and dealing with the complications that come from extra framework requirements.

brookslogan commented 8 months ago

For now, I'm just using default epi_slide behavior, which fills with NA and then removes the NA fills in the returned result.

Warning: we want that to be the epi_slide default, but it's not yet.

library(dplyr)
library(epiprocess)
tibble(geo_value = 1, time_value = c(1, 4,5,6), value = time_value) %>%
  as_epi_df() %>%
  epi_slide(mv = mean(value), before = 1)
#> An `epi_df` object, 4 x 4 with metadata:
#> * geo_type  = hhs
#> * time_type = custom
#> * as_of     = 2023-12-22 07:58:58
#> 
#> # A tibble: 4 × 4
#>   geo_value time_value value    mv
#>       <dbl>      <dbl> <dbl> <dbl>
#> 1         1          1     1   1  
#> 2         1          4     4   4  
#> 3         1          5     5   4.5
#> 4         1          6     6   5.5

cmu-delphi / exploration-tooling

Add the 7dav we talked about along with the std #76