cmu-delphi / epipredict

Tools for building predictive models in epidemiology.
https://cmu-delphi.github.io/epipredict/

Determine the source of `group_by` dummy variable creation #335

Closed dsweber2 closed 1 month ago

dsweber2 commented 1 month ago

If you create an epi_workflow using an epi_df that has been grouped by geo_value (or any epi_key), instead of generating a model per geo, you get one model with the geo_value treated as a dummy variable (so one indicator per unique value other than the reference level). This is equivalent to adding a step_dummy(geo_value) as a step.

If we want to keep and/or support this, we should make sure to document it, along with how to get a model per-geo.

Example:

  library(dplyr)      # tibble(), group_by(), %>%
  library(epiprocess) # as_epi_df()
  library(epipredict) # epi_recipe(), epi_workflow(), frosting(), ...
  library(parsnip)    # linear_reg()

  # Two geos, 100 days each, with synthetic case and death rates
  x <- tibble(
    geo_value = rep(c("ca", "pa"), 100),
    time_value = as.Date("2021-01-01") + floor(seq(0, 100, by = .5))[1:200],
    case_rate = sqrt(1:200) + atan(0.1 * 1:200) + sin(5 * 1:200) + 1,
    death_rate = atan(0.1 * 1:200) + cos(5 * 1:200) + 1
  ) %>%
    as_epi_df(as_of = as.POSIXct("2024-05-17")) %>%
    group_by(geo_value)

  r <- epi_recipe(x) %>%
    step_epi_lag(case_rate, lag = c(0, 2, 4)) %>%
    step_epi_ahead(case_rate, ahead = 7, skip = TRUE) %>%
    update_role(case_rate, new_role = "predictor") %>%
    # gives the epi keys (including geo_value) the predictor role
    add_role(all_of(epi_keys(x)), new_role = "predictor")
  latest <- get_test_data(epi_recipe(x), x)

  f <- frosting() %>%
    layer_predict() %>%
    layer_residual_quantiles() %>%
    layer_add_forecast_date() %>%
    layer_add_target_date() %>%
    layer_threshold(starts_with(".pred"))

  eng <- linear_reg()
  wf <- epi_workflow(r, eng, f) %>% fit(x)
  wf$fit$fit # print the underlying fitted model

returns

parsnip model object
Call:
stats::lm(formula = ..y ~ ., data = data)
Coefficients:
    (Intercept)       time_value      geo_valuepa        case_rate  
     -490.44576          0.02651          0.01119          1.43981  
lag_0_case_rate  lag_2_case_rate  lag_4_case_rate  
             NA         -1.22106          0.45517  

dsweber2 commented 1 month ago

After discussion with @lcbrooks, this turned out to be mostly confusion on my part. Dropping add_role(all_of(epi_keys(x)), new_role = "predictor") from the recipe means the fitted model no longer includes a dummy predictor for geo_value.
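
For anyone landing here later, the mechanism can be reproduced without epipredict. The following is a minimal base-R sketch on made-up data (the `toy` data frame and its columns `x` and `y` are assumptions for illustration, not part of the example above). It shows that a character column used as a predictor in `lm()` is expanded into indicator columns, that leaving it out of the predictors removes the dummy, and one plain way to get a separate fit per geo. It does not claim to be how epipredict itself would support per-geo models.

```r
# Base R only; toy data, not the epipredict pipeline from the example above.
set.seed(1)
toy <- data.frame(
  geo_value = rep(c("ca", "pa"), each = 50),
  x = rnorm(100)
)
toy$y <- 1 + 2 * toy$x +
  ifelse(toy$geo_value == "pa", 0.5, 0) +
  rnorm(100, sd = 0.1)

# A character column used as a predictor is expanded into indicator columns,
# one per level other than the reference level ("ca" here).
pooled_with_geo <- lm(y ~ x + geo_value, data = toy)
names(coef(pooled_with_geo))

# Leaving geo_value out of the predictors gives a pooled model with no dummy.
pooled <- lm(y ~ x, data = toy)
names(coef(pooled))

# One model per geo: split the data by geo_value and fit each piece separately.
per_geo <- lapply(split(toy, toy$geo_value), function(d) lm(y ~ x, data = d))
names(per_geo)
```

The first model carries a `geo_valuepa` coefficient, matching the pattern in the output above. The second has only the intercept and `x`. The third is a named list with one `lm` fit per geo.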