cmu-delphi / epiprocess

Tools for basic signal processing in epidemiology
https://cmu-delphi.github.io/epiprocess/
Other
13 stars 8 forks source link

Future redesign of `epi_slide` and `epix_slide` #458

Open dshemetov opened 5 months ago

dshemetov commented 5 months ago

Related issue: https://github.com/cmu-delphi/epiprocess/issues/234

Logan and I did some brainstorming about what use cases epi_slide is trying to cover. We found that it doesn't behave gracefully when the slide computation function f returns data.frame/tibble outputs (which is something may occur if you're trying to, say, make a forecast per ref_time_value). This led us to thinking that perhaps we shouldn't even recommend using epi_slide for forecasting use cases, perhaps epix_slide should be the goto function for that (where an epi_df can be quickly converted to a fake archive with version = time_value). This in turn even led us to suspect that maybe epi_slide computation functions f should be limited to functions that return atomic vectors, like those covered by epi_slide_opt.

TODO

Side-note: formats for epix_slide to use when the slide function f returns data.frame/tibble outputs

Three possible output formats

# Deep
geo | time | slide_value (list)
----------------------------
a        1        list(m = , w = , p = )
a        2        list(m = , w = , p = )
a        3        list(m = , w = , p = )

# Wide
geo | time | metadata (list) | workflow (list) | preds (list)
-------------------------------------------------------
a        1        ...
a        2        ...
a        3        ...

# Long
geo | time | name              |  value (list)
-------------------------------------------------------
a        1        "metadata"        < >
a        2        "workflow"
a        3        "preds"

The deep format is simple from the epi_slide writer's POV, but maybe difficult for the user to index into. The wide format will have issues with name collisions (e.g. if the output has geo_value and time_value as well).

brookslogan commented 5 months ago

On formats for epix_slide to use: I've constructed some dummy examples below, plus >=1 way per format to extract predictions and get in submission-ish format (primary use case), or to extract coefficients in a long format, as if we were planning to debug by ggplotting over forecast dates what the coefficients were (maybe faceted by ahead?).

library(tidyverse)
library(epiprocess)
#> 
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(epipredict)
#> Loading required package: parsnip
#> 
#> Attaching package: 'epipredict'
#> The following object is masked from 'package:ggplot2':
#> 
#>     layer

# NOTE from ?arx_forecaster
jhu <- case_death_rate_subset %>%
  dplyr::filter(time_value >= as.Date("2021-12-01"))
out <- arx_forecaster(
  jhu, "death_rate",
  c("case_rate", "death_rate")
)

# Here are potential formats we could get out of `epix_slide()`:

deep <- tibble(
  time_value = 1:3, # XXX actually, version...
  slide_value = rep(list(out), 3)
)

long <- deep %>%
  rowwise() %>%
  reframe(time_value, enframe(unclass(slide_value)))

wide <- long %>%
  pivot_wider(id_cols = time_value)

wide_packed <- wide %>%
  pack(slide_value = c(predictions, epi_workflow, metadata))

deep
#> # A tibble: 3 × 2
#>   time_value slide_value
#>        <int> <list>     
#> 1          1 <arx_fcst> 
#> 2          2 <arx_fcst> 
#> 3          3 <arx_fcst>
long
#> # A tibble: 9 × 3
#>   time_value name         value            
#>        <int> <chr>        <list>           
#> 1          1 predictions  <tibble [56 × 5]>
#> 2          1 epi_workflow <ep_wrkfl>       
#> 3          1 metadata     <named list [2]> 
#> 4          2 predictions  <tibble [56 × 5]>
#> 5          2 epi_workflow <ep_wrkfl>       
#> 6          2 metadata     <named list [2]> 
#> 7          3 predictions  <tibble [56 × 5]>
#> 8          3 epi_workflow <ep_wrkfl>       
#> 9          3 metadata     <named list [2]>
# NOTE there could also be an alternative long format using a single named list column, but I have no idea how to work with that.
wide
#> # A tibble: 3 × 4
#>   time_value predictions       epi_workflow metadata        
#>        <int> <list>            <list>       <list>          
#> 1          1 <tibble [56 × 5]> <ep_wrkfl>   <named list [2]>
#> 2          2 <tibble [56 × 5]> <ep_wrkfl>   <named list [2]>
#> 3          3 <tibble [56 × 5]> <ep_wrkfl>   <named list [2]>
wide_packed
#> # A tibble: 3 × 2
#>   time_value slide_value$predictions $epi_workflow $metadata       
#>        <int> <list>                  <list>        <list>          
#> 1          1 <tibble [56 × 5]>       <ep_wrkfl>    <named list [2]>
#> 2          2 <tibble [56 × 5]>       <ep_wrkfl>    <named list [2]>
#> 3          3 <tibble [56 × 5]>       <ep_wrkfl>    <named list [2]>
# Here's one way for each that we could use to get to submission-ish format:

preds <-
  deep %>%
  mutate(predictions = map(slide_value, "predictions"),
         slide_value = NULL) %>%
  unnest(predictions)
# (This one (`deep`) took a bit of time to think of. I think I knew the solution
# above was possible but ugly, and was hoping to think of a cleaner solution but
# failed and finally fell back on the above.)

# or

# deep %>%
#   mutate(slide_value = map(slide_value, unclass)) %>%
#   hoist(slide_value, "predictions") %>%
#   select(-slide_value) %>%
#   unnest(predictions)

# or

# deep %>%
#   mutate(predictions = map(slide_value, "predictions"),
#          .keep = "unused") %>%
#   unnest(predictions)

preds <-
  long %>%
  filter(name == "predictions") %>%
  select(-name) %>%
  unnest(value)

preds <-
  wide %>%
  select(time_value, predictions) %>%
  unnest(predictions)

preds <-
  wide_packed %>%
  mutate(predictions = slide_value$predictions,
         slide_value = NULL) %>%
  unnest(predictions)

# or

# preds <-
#   wide_packed %>%
#   rowwise() %>%
#   reframe(time_value, slide_value$predictions[[1L]])

# or ...
# Here's how we might extract coefficients in a long format:

get_coefs <- function(fit_ewf) {
  fit_ewf %>% workflows::extract_fit_engine() %>% coef()
}

coefs <-
  deep %>%
  rowwise() %>%
  reframe(time_value, slide_value$epi_workflow %>% get_coefs() %>% enframe())

coefs <-
  long %>%
  filter(name == "epi_workflow") %>%
  mutate(coefs = map(value, ~ get_coefs(.x) %>% enframe())) %>%
  unnest(coefs, names_sep = "_")

# or, in a better output format:

# coefs <-
#   long %>%
#   filter(name == "epi_workflow") %>%
#   mutate(name = NULL,
#          coefs = map(value, ~ get_coefs(.x) %>% enframe()),
#          value = NULL) %>%
#   unnest(coefs)

coefs <-
  wide %>%
  rowwise() %>%
  reframe(time_value, epi_workflow %>% get_coefs() %>% enframe())

coefs <-
  wide_packed %>%
  rowwise() %>%
  reframe(time_value, slide_value$epi_workflow[[1]] %>% get_coefs() %>% enframe())
# (This one (`wide_packed`) took a bit of time to think of.)

# or

# coefs <-
#   wide_packed %>%
#   transmute(time_value, coefs = map(slide_value$epi_workflow, ~ get_coefs(.x) %>% enframe())) %>%
#   unnest(coefs)

# or ....

Created on 2024-06-06 with reprex v2.0.2

Comment N+1: Non-long [and long] formats may cause people to tend to duplicate the original slide_value column/columns many times when unnesting, if they aren't careful to select out those columns before unnesting. This could potentially lead to confusion at all the duplicated complex objects, and, depending on how serialization & deserialization works, bloated disk or RAM usage when saving & loading these objects. (RAM size before saving shouldn't be an issue; these duplicates should be aliasing memory properly.)

Comment N+2: a lot of these felt awkward, trading off between rowwise() requiring manually attaching the time_value (actually, the version/ref_time_value) vs. group_by(time_value) requiring slide_value[[1]]. I don't like the former because it's arcane, and don't like the latter because it's ugly and also would silently drop data if I forgot an ID column (and there seems to be no simple unwrap() function in R for unwrapping length-1 lists & erroring on other lengths).

Comment N+3: we could also change the output of arx_forecaster() in other ways. I think we mentioned just returning predictions with the other stuff as attributes, which didn't seem robust to rbinding etc. But we could also potentially have it be more of a tibble-like thing with its own convenience functions for extracting a submission-ish format or extracting coefficients, etc. One drawback would be that if you're not doing a slide you want more scalar-like behavior without having to insert [[1]]'s everywhere... preserving that is probably substantial trouble / boilerplate.

brookslogan commented 5 months ago

Unfortunately, accepting named lists on their own will likely not fully resolve the issue Ryan and Richard were running into: we want for slide computations to be able to just output an arx_forecaster() result. I mean, it will probably, but then if you try to pull out $predictions (in unnested/unpacked format), you'll get naming conflicts with any grouping columns and time_value columns, since epix_slide() also directly outputs these. The time_value part is easy to resolve: we call it version instead and are more accurate and convenient. The grouping columns part is gnarlier, especially since we don't control unnest.tibble / unnest.data.frame or whatever will be called [if it were directly an issue with the group computation outputs, we can do some special logic to make sure these columns agree and toss the duplicates]. [Workaround is removing/renaming before and/or after unnesting.]

brookslogan commented 5 months ago

Long format will also upset epi_slide() size checks unless we dance around it, though it's not recommended for forecasting anyway.

brookslogan commented 1 month ago

Status: slide improvements pass has specialized epi_slide, reworked how data frame outputs are processed, and allows list output.

Remaining: