cmu-delphi / epipredict

Tools for building predictive models in epidemiology.
https://cmu-delphi.github.io/epipredict/

`get_test_data()` does not consider lagged differences of lagged differences or lags of lags #359

Open brookslogan opened 1 month ago

brookslogan commented 1 month ago

@rnayebi21 was trying to calculate second differences by applying step_lag_difference() and then a second step_lag_difference() to the generated output. But get_test_data()'s horizon processing assumes that both are calculated from "original" signals (and maybe also that, for each epikey, both signals are non-missing at the latest time value available for that epikey). The result is a time window that is too short for predictions. E.g., lagged differencing with horizon = 7 followed by another with horizon = 7 makes get_test_data() filter to around 8 days, but we actually need around 15 days. Additionally, the eventual error message appears deeply nested and unhelpful; it comes from stopifnot(length(values) == length(quantile_levels)).
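The 8-versus-15-day arithmetic can be checked with a small base-R sketch. It does not call epipredict; `lag_diff()` and `second_diff()` are hypothetical helpers written only for this illustration, and the signal is made up.

```r
# Hypothetical helper, not an epipredict function:
# returns x[t] - x[t - horizon], with NA where the earlier value is unavailable.
lag_diff <- function(x, horizon) {
  n <- length(x)
  stopifnot(horizon >= 1, horizon < n)
  c(rep(NA_real_, horizon), x[(horizon + 1):n] - x[1:(n - horizon)])
}

# Second difference: a 7-day lagged difference of a 7-day lagged difference.
second_diff <- function(v) lag_diff(lag_diff(v, 7), 7)

x <- (1:30)^2  # 30 days of a made-up daily signal

# Index of the first day with a usable second difference.
first_ok <- which(!is.na(second_diff(x)))[1]

# Second difference on the latest day when only the last 8 days are kept ...
short_result <- tail(second_diff(tail(x, 8)), 1)
# ... versus when the last 15 days are kept, and when all data are kept.
long_result <- tail(second_diff(tail(x, 15)), 1)
full_result <- tail(second_diff(x), 1)
```

Here `first_ok` is 15 and `short_result` is `NA`, while `long_result` matches `full_result`: the value on the latest day reaches back 14 days, so 15 days of data are needed.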

Potential resolutions:

brookslogan commented 1 month ago

Potential workaround in some cases: add a step_epi_lag on 0 variables with a lag value that corrects the range. E.g. in Example A above, lag = 14.

brookslogan commented 1 month ago

Updated this to describe an approach tracking shift sets rather than shift ranges. This could be useful in the context of:

brookslogan commented 1 month ago

@rnayebi21 also encountered an issue when manually calculating lags 7 and 14 of a signal, then using step_lag_difference to prepare lag 7 - lag 14; I'm guessing maybe the issue might be related to a sort of implicit assumption in get_test_data() expecting that we'd also be asking for lag 14 - lag 21.

rnayebi21 commented 1 month ago

> Potential workaround in some cases: add a step_epi_lag on 0 variables with a lag value that corrects the range. E.g. in Example A above, lag = 14.

This workaround didn't end up working. I tried lagging on 0 variables and got the error "object 'value' not found".

rnayebi21 commented 1 month ago

> @rnayebi21 also encountered an issue when manually calculating lags 7 and 14 of a signal, then using step_lag_difference to prepare lag 7 - lag 14; I'm guessing maybe the issue might be related to a sort of implicit assumption in get_test_data() expecting that we'd also be asking for lag 14 - lag 21.

The workaround that has worked for me so far is to use the following instead: step_mutate(covar_7_14 = lag_7_value - lag_14_value, role = "predictor")

Also, for context: the error occurs only when I use an ahead larger than 24, and only with the step_lag_difference approach, not the step_mutate approach. This fits @brookslogan's theory about the implicit use of lag 21 in step_lag_difference, because when my ahead is larger than 24, the use of the 21st lag alone causes a singular design matrix error.
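One way to picture the lag 21 theory is to track, for each derived column, the set of backward shifts of the original signal it depends on. The base-R sketch below is only hypothetical bookkeeping (`shift_lag()` and `shift_lag_diff()` are invented for illustration and are not epipredict internals). It assumes a 7-day lagged difference ends up applied to both manually lagged columns, which is the guess made above rather than confirmed behavior.

```r
# Hypothetical bookkeeping, not epipredict internals: each column is described
# by the set of backward shifts (in days) of the original signal that it uses.
shift_lag <- function(shifts, lag) sort(unique(shifts + lag))
shift_lag_diff <- function(shifts, horizon) sort(unique(c(shifts, shifts + horizon)))

value <- 0L                       # the original signal, unshifted
lag_7 <- shift_lag(value, 7L)     # manual lag 7
lag_14 <- shift_lag(value, 14L)   # manual lag 14

# Manual difference (the step_mutate approach): lag 7 minus lag 14.
mutate_shifts <- sort(unique(c(lag_7, lag_14)))

# Assumed scenario: a 7-day lagged difference applied to both lag columns,
# giving lag 7 - lag 14 and also lag 14 - lag 21.
diff_shifts <- sort(unique(c(
  shift_lag_diff(lag_7, 7L),
  shift_lag_diff(lag_14, 7L)
)))

# Second differences of the original signal, for comparison.
second_diff_shifts <- shift_lag_diff(shift_lag_diff(value, 7L), 7L)

# Days of history needed = largest shift + 1.
days_mutate <- max(mutate_shifts) + 1
days_diff <- max(diff_shifts) + 1
days_second <- max(second_diff_shifts) + 1
```

Under these assumptions the step_mutate route touches shifts 7 and 14 (15 days of history), while the lagged-difference route also touches shift 21 (22 days). Tracking the full set rather than just its range also records which individual lags actually enter the model.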

dajmcdon commented 1 month ago

In terms of the workaround, the following should work:

(updated to make slightly more similar to your case)

```r
library(epipredict)
library(dplyr) # for filter()

# Two states, 2021 onward
jhu <- case_death_rate_subset %>%
  filter(time_value >= "2021-01-01", geo_value %in% c("ca", "ny"))

r <- epi_recipe(jhu) %>%
  step_epi_lag(case_rate, lag = 7L) %>%
  # first difference of the lagged signal, then a difference of that difference
  step_lag_difference(lag_7_case_rate, horizon = 7, prefix = "one") %>%
  step_lag_difference(starts_with("one"), horizon = 7, prefix = "two") %>%
  # extra lag of 21 (7 + 7 + 7), added only to widen the window; removed below
  step_epi_lag(case_rate, lag = 21, prefix = "rm") %>%
  step_epi_ahead(death_rate, ahead = 7) %>%
  recipes::step_rm(starts_with("rm"))

frost <- frosting() %>% layer_naomit(.pred)

wf <- epi_workflow(r, linear_reg(), frost)
fitted <- fit(wf, jhu)

forecast(fitted) %>% filter(time_value == max(jhu$time_value))
#> An `epi_df` object, 2 x 3 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2022-05-31 15:08:25.791826
#> 
#> # A tibble: 2 × 3
#>   geo_value time_value .pred
#> * <chr>     <date>     <dbl>
#> 1 ca        2021-12-31 0.117
#> 2 ny        2021-12-31 0.192
```

brookslogan commented 1 month ago

For the original problem, I think we were looking at a role-selection or names-based workaround similar to the above. The roles-based approach failed because the tidy selector did not work, and I'm guessing the names-based one ran into the lack-of-filter issue. [I guess I'm totally off here if there are singular matrices being created.]

[mentioned above already] But @rnayebi21 is trying out another alternative: use step_epi_lag + step_mutate(<nm> = lag_7_value - lag_14_value, role = "predictor"), which I think should avoid the need for filter.