cmu-delphi / epipredict

Tools for building predictive models in epidemiology.
https://cmu-delphi.github.io/epipredict/
Other
8 stars 8 forks source link

Provide better interface or documentation for per-geo modeling #336

Open brookslogan opened 4 weeks ago

brookslogan commented 4 weeks ago

Context

@dsweber2 was just noting that

epi_df %>%
  epi_slide(
    ~ .x %>%
      group_by(geo_value) %>%
      arx_forecaster(.....)
    )

doesn't fit per-geo models; it actually just ignores the grouping altogether.

We suggested "transposing" the operations, but @rnayebi21 found that

epi_df %>%
  group_by(geo_value) %>%
  epi_slide(~ arx_forecaster(.x, .....), .....)

doesn't work either; .x doesn't have the geo_value column thus lacks epi_dfness. I believe these problems also apply to when you are trying to do version-faithful backtesting with epix_slide().

Workarounds seem a little bit of a pain, either

The first workaround seems more modular (you can have a list of forecasters that can all rely on ungrouped slides, rather than having to do a different type of slide call for each one).

Proposal

  1. Make arx_forecaster() check specifically if there's a missing geo_value and hint that if they were doing a grouped epix_slide() or epi_slide() with geo_value in the group variables, that won't work, and to do <some workaround / feature> instead.
  2. Check if input to arx_forecaster() etc. is grouped; if so, either
    • warn
    • abort
    • fit & forecast one model per group
  3. [Also check for groupedness and warn/abort in the epi workflow internals.]

Musings

We can also probably make things easier epiprocess-side, by adding a .keep parameter if we're not already able to forward to group_modify() via dots. But I'm not sure we actually want to... this makes it easier to use epi_slide() for forecasting when it shouldn't actually be (epix_slide() should be favored and maybe renamed to make this clear).

[@dshemetov points out we should document this geo-grouped epi_slide gotcha in epiprocess. And actually fixing what's going wrong is part of a much larger project, epiprocess#223.]

dajmcdon commented 2 weeks ago

Maybe this was described above, but I'm not quite clear. It sounds like the major issue is that calling group_by() on an epi_df has a strange effect on the geo_value (and whatever you are grouping on) that causes it to behave poorly with the epi_workflow processing. Is that right?