`get_test_data()` does not consider lagged differences of lagged differences or lags of lags

brookslogan commented 1 month ago

@rnayebi21 was trying to calculate second differences with step_lag_difference() + another step_lag_difference() on the generated output. But get_test_data()'s horizon processing assumes that these are both calculated from "original" signals (and maybe also that for each epikey, at the latest time value available for this epikey, that these signals are both nonmissing). The result is too short a time window for predictions. E.g., lagged differencing with horizon = 7 followed by another horizon = 7 will make get_test_data() filter to around 8 days, but we actually need around 15 days. Additionally, the eventual output error message appears to be deeply nested and unhelpful, from stopifnot(length(values) == length(quantile_levels)).

Potential resolutions:

Modify step_epi_shift(), step_lag_difference(), and step_growth_rate() to do some additional tagging of outputs based on the shift range [given #362, maybe actually a shift set] they depend on + check their inputs for such tags and consider it in that logic, and extract that info in get_test_data(). And make sure to appropriately label step_lag_difference()s operation as a lag, not a horizon (the current horizon naming seems like a hack to make lag_difference + lag work). Maybe also export some related utilities to let developers manipulate these tags if they want to create new steps.
- Example A:
- Lag-difference X by 7 to get Delta_X --- tag Delta_X with a shift range of [-7, 0] [shift set of {-7, 0}]
- Lag-difference Delta_X to get DeltaSquared_X --- tag DeltaSquared_X with a shift range of [-14, 0] [shift set of {-14, -7, 0}]
- Example B:
- Lag X by 7 to get X_lag_7 --- tag X_lag_7 with a shift range of [-7, -7] [shift set of {-7}]
- Lag-difference X_lag_7 to get Delta_X_lag_7 --- tag Delta_X_lag_7 with a shift range of [-14, -7] [shift set of {-14, -7}]
Try to avoid get_test_data() altogether e.g. with this sort of approach.
Hybrid approach:
- Tag all original inputs with the shift range [0, 0] [shift set of {0}]. (That probably simplifies the first resolution's logic anyway.)
- Use the first resolution.
- In/around get_test_data(), if not all relevant variables have shift ranges, then assume a shift range of (-Infty, Infty) [or a special value representing this / branch handling this case if using shift range approach]. Bake the data, getting extra rows. But then filter those baked results to just the latest time value (or latest time_value with nonmissing predictors?? not sure) per epikey.
- Alternatively, ALWAYS apply this filter step even if we think it's supposed to be a no-op.
- We could also check and issue better error messages if get_test_data() + baking causes epikeys to drop out of the data set, especially if it causes all epikeys to drop out of the data set. (By an epi_key dropping out, I mean something like drop_na(baked_test_data, all_predictors()) yielding 0 rows with that epikey when that epikey was "present" in the "original" data.... for some definitions of "present" and "original"....)

brookslogan commented 1 month ago

Potential workaround in some cases: add a step_epi_lag on 0 variables with a lag value that corrects the range. E.g. in Example A above, lag = 14.

brookslogan commented 1 month ago

Updated this to describe an approach tracking shift sets rather than shift ranges. This could be useful in the context of:

Addressing #362. The precise call to complete() we want requires knowing/guessing the period and maybe some tricky behaviors for certain time_value classes. Though epiprocess#472 is moving us closer to a simpler world there, where there are fewer classes allowed and guess_time_type() and guess_period() would hopefully correspond.
Potentially facilitating more computationally-efficient approaches if we wanted to allow epi_recipes as f arguments in epix_slide().

brookslogan commented 1 month ago

@rnayebi21 also encountered an issue when manually calculating lags 7 and 14 of a signal, then using step_lag_difference to prepare lag 7 - lag 14; I'm guessing maybe the issue might be related to a sort of implicit assumption in get_test_data() expecting that we'd also be asking for lag 14 - lag 21.

rnayebi21 commented 1 month ago

Potential workaround in some cases: add a step_epi_lag on 0 variables with a lag value that corrects the range. E.g. in Example A above, lag = 14.

This workaround didn't end up working. I tried lagging on 0 variables and got "object 'value' not found"

rnayebi21 commented 1 month ago

@rnayebi21 also encountered an issue when manually calculating lags 7 and 14 of a signal, then using step_lag_difference to prepare lag 7 - lag 14; I'm guessing maybe the issue might be related to a sort of implicit assumption in get_test_data() expecting that we'd also be asking for lag 14 - lag 21.

Current workaround that has worked for me is to use the following instead: step_mutate(covar_7_14 = lag_7_value - lag_14_value, role = "predictor")

Also for context, the error occurs when I'm using an ahead larger than 24, but only occurs in the step_lag_difference approach and not the step_mutate approach. This connects to @brookslogan's theory on the implicit use of lag 21 in step_lag_difference, because when my ahead is larger than 24, the use of the 21st lag alone causes a singular design matrix error.

dajmcdon commented 1 month ago

In terms of the workaround, the the following should work:

(updated to make slightly more similar to your case)

library(epipredict)
jhu <- case_death_rate_subset %>%
  filter(time_value >= "2021-01-01", geo_value %in% c("ca", "ny"))

r <- epi_recipe(jhu) %>%
  step_epi_lag(case_rate, lag = 7L) %>%
  step_lag_difference(lag_7_case_rate, horizon = 7, prefix = "one") %>%
  step_lag_difference(starts_with("one"), horizon = 7, prefix = "two") %>%
  step_epi_lag(case_rate, lag = 21, prefix = "rm") %>%
  step_epi_ahead(death_rate, ahead = 7) %>%
  recipes::step_rm(starts_with("rm"))

frost <- frosting() %>% layer_naomit(.pred)

wf <- epi_workflow(r, linear_reg(), frost)
fitted <- fit(wf, jhu)

forecast(fitted) %>% filter(time_value == max(jhu$time_value))
#> An `epi_df` object, 2 x 3 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2022-05-31 15:08:25.791826
#> 
#> # A tibble: 2 × 3
#>   geo_value time_value .pred
#> * <chr>     <date>     <dbl>
#> 1 ca        2021-12-31 0.117
#> 2 ny        2021-12-31 0.192

brookslogan commented 1 month ago

For the original problem, I think we were looking at a role-selection or names-based workaround similar to the above. The roles-based thing failed due to the tidy selector not working, and the names-based I'm guessing ran into the lack of filter issue. [I guess I'm totally off here if there are singular matrices being created.]

[mentioned above already] But @rnayebi21 is trying out another alternative: use step_epi_lag + step_mutate(<nm> = lag_7_value - lag_14_value, role = "predictor"), which I think should avoid the need for filter.

cmu-delphi / epipredict

`get_test_data()` does not consider lagged differences of lagged differences or lags of lags #359