Open dshemetov opened 5 months ago
On formats for epix_slide
to use: I've constructed some dummy examples below, plus >=1 way per format to extract predictions and get in submission-ish format (primary use case), or to extract coefficients in a long format, as if we were planning to debug by ggplotting over forecast dates what the coefficients were (maybe faceted by ahead?).
library(tidyverse)
library(epiprocess)
#>
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#>
#> filter
library(epipredict)
#> Loading required package: parsnip
#>
#> Attaching package: 'epipredict'
#> The following object is masked from 'package:ggplot2':
#>
#> layer
# NOTE from ?arx_forecaster
jhu <- case_death_rate_subset %>%
dplyr::filter(time_value >= as.Date("2021-12-01"))
out <- arx_forecaster(
jhu, "death_rate",
c("case_rate", "death_rate")
)
# Here are potential formats we could get out of `epix_slide()`:
deep <- tibble(
time_value = 1:3, # XXX actually, version...
slide_value = rep(list(out), 3)
)
long <- deep %>%
rowwise() %>%
reframe(time_value, enframe(unclass(slide_value)))
wide <- long %>%
pivot_wider(id_cols = time_value)
wide_packed <- wide %>%
pack(slide_value = c(predictions, epi_workflow, metadata))
deep
#> # A tibble: 3 × 2
#> time_value slide_value
#> <int> <list>
#> 1 1 <arx_fcst>
#> 2 2 <arx_fcst>
#> 3 3 <arx_fcst>
long
#> # A tibble: 9 × 3
#> time_value name value
#> <int> <chr> <list>
#> 1 1 predictions <tibble [56 × 5]>
#> 2 1 epi_workflow <ep_wrkfl>
#> 3 1 metadata <named list [2]>
#> 4 2 predictions <tibble [56 × 5]>
#> 5 2 epi_workflow <ep_wrkfl>
#> 6 2 metadata <named list [2]>
#> 7 3 predictions <tibble [56 × 5]>
#> 8 3 epi_workflow <ep_wrkfl>
#> 9 3 metadata <named list [2]>
# NOTE there could also be an alternative long format using a single named list column, but I have no idea how to work with that.
wide
#> # A tibble: 3 × 4
#> time_value predictions epi_workflow metadata
#> <int> <list> <list> <list>
#> 1 1 <tibble [56 × 5]> <ep_wrkfl> <named list [2]>
#> 2 2 <tibble [56 × 5]> <ep_wrkfl> <named list [2]>
#> 3 3 <tibble [56 × 5]> <ep_wrkfl> <named list [2]>
wide_packed
#> # A tibble: 3 × 2
#> time_value slide_value$predictions $epi_workflow $metadata
#> <int> <list> <list> <list>
#> 1 1 <tibble [56 × 5]> <ep_wrkfl> <named list [2]>
#> 2 2 <tibble [56 × 5]> <ep_wrkfl> <named list [2]>
#> 3 3 <tibble [56 × 5]> <ep_wrkfl> <named list [2]>
# Here's one way for each that we could use to get to submission-ish format:
preds <-
deep %>%
mutate(predictions = map(slide_value, "predictions"),
slide_value = NULL) %>%
unnest(predictions)
# (This one (`deep`) took a bit of time to think of. I think I knew the solution
# above was possible but ugly, and was hoping to think of a cleaner solution but
# failed and finally fell back on the above.)
# or
# deep %>%
# mutate(slide_value = map(slide_value, unclass)) %>%
# hoist(slide_value, "predictions") %>%
# select(-slide_value) %>%
# unnest(predictions)
# or
# deep %>%
# mutate(predictions = map(slide_value, "predictions"),
# .keep = "unused") %>%
# unnest(predictions)
preds <-
long %>%
filter(name == "predictions") %>%
select(-name) %>%
unnest(value)
preds <-
wide %>%
select(time_value, predictions) %>%
unnest(predictions)
preds <-
wide_packed %>%
mutate(predictions = slide_value$predictions,
slide_value = NULL) %>%
unnest(predictions)
# or
# preds <-
# wide_packed %>%
# rowwise() %>%
# reframe(time_value, slide_value$predictions[[1L]])
# or ...
# Here's how we might extract coefficients in a long format:
get_coefs <- function(fit_ewf) {
fit_ewf %>% workflows::extract_fit_engine() %>% coef()
}
coefs <-
deep %>%
rowwise() %>%
reframe(time_value, slide_value$epi_workflow %>% get_coefs() %>% enframe())
coefs <-
long %>%
filter(name == "epi_workflow") %>%
mutate(coefs = map(value, ~ get_coefs(.x) %>% enframe())) %>%
unnest(coefs, names_sep = "_")
# or, in a better output format:
# coefs <-
# long %>%
# filter(name == "epi_workflow") %>%
# mutate(name = NULL,
# coefs = map(value, ~ get_coefs(.x) %>% enframe()),
# value = NULL) %>%
# unnest(coefs)
coefs <-
wide %>%
rowwise() %>%
reframe(time_value, epi_workflow %>% get_coefs() %>% enframe())
coefs <-
wide_packed %>%
rowwise() %>%
reframe(time_value, slide_value$epi_workflow[[1]] %>% get_coefs() %>% enframe())
# (This one (`wide_packed`) took a bit of time to think of.)
# or
# coefs <-
# wide_packed %>%
# transmute(time_value, coefs = map(slide_value$epi_workflow, ~ get_coefs(.x) %>% enframe())) %>%
# unnest(coefs)
# or ....
Created on 2024-06-06 with reprex v2.0.2
Comment N+1: Non-long
[and long
] formats may cause people to tend to duplicate the original slide_value
column/columns many times when unnesting, if they aren't careful to select out those columns before unnesting. This could potentially lead to confusion at all the duplicated complex objects, and, depending on how serialization & deserialization works, bloated disk or RAM usage when saving & loading these objects. (RAM size before saving shouldn't be an issue; these duplicates should be aliasing memory properly.)
long
format on the coefs, or maybe it was on purpose imagining what level of patience users might have.]Comment N+2: a lot of these felt awkward, trading off between rowwise()
requiring manually attaching the time_value
(actually, the version
/ref_time_value
) vs. group_by(time_value)
requiring slide_value[[1]]
. I don't like the former because it's arcane, and don't like the latter because it's ugly and also would silently drop data if I forgot an ID column (and there seems to be no simple unwrap()
function in R for unwrapping length-1 lists & erroring on other lengths).
Comment N+3: we could also change the output of arx_forecaster()
in other ways. I think we mentioned just returning predictions with the other stuff as attributes, which didn't seem robust to rbinding etc. But we could also potentially have it be more of a tibble-like thing with its own convenience functions for extracting a submission-ish format or extracting coefficients, etc. One drawback would be that if you're not doing a slide you want more scalar-like behavior without having to insert [[1]]'s everywhere... preserving that is probably substantial trouble / boilerplate.
Unfortunately, accepting named lists on their own will likely not fully resolve the issue Ryan and Richard were running into: we want for slide computations to be able to just output an arx_forecaster()
result. I mean, it will probably, but then if you try to pull out $predictions
(in unnested/unpacked format), you'll get naming conflicts with any grouping columns and time_value
columns, since epix_slide()
also directly outputs these. The time_value
part is easy to resolve: we call it version
instead and are more accurate and convenient. The grouping columns part is gnarlier, especially since we don't control unnest.tibble
/ unnest.data.frame
or whatever will be called [if it were directly an issue with the group computation outputs, we can do some special logic to make sure these columns agree and toss the duplicates]. [Workaround is removing/renaming before and/or after unnesting.]
Long format will also upset epi_slide()
size checks unless we dance around it, though it's not recommended for forecasting anyway.
Status: slide improvements pass has specialized epi_slide
, reworked how data frame outputs are processed, and allows list output.
epi_slide
specialization makes epi_slide_reframe
less likely to serve a purpose; it would have been used for simultaneous geo aggregation for example, but we currently forbid having multiple time series in the same group.Remaining:
return(list(scalar))
.
Related issue: https://github.com/cmu-delphi/epiprocess/issues/234
Logan and I did some brainstorming about what use cases
epi_slide
is trying to cover. We found that it doesn't behave gracefully when the slide computation functionf
returns data.frame/tibble outputs (which is something may occur if you're trying to, say, make a forecast per ref_time_value). This led us to thinking that perhaps we shouldn't even recommend usingepi_slide
for forecasting use cases, perhapsepix_slide
should be the goto function for that (where an epi_df can be quickly converted to a fake archive withversion = time_value
). This in turn even led us to suspect that maybeepi_slide
computation functionsf
should be limited to functions that return atomic vectors, like those covered byepi_slide_opt
.TODO
epi_slide_reframe
orepi_slide_mutate
, to imply and constrain the outputs off
to the userSide-note: formats for epix_slide to use when the slide function
f
returns data.frame/tibble outputsThree possible output formats
The deep format is simple from the epi_slide writer's POV, but maybe difficult for the user to index into. The wide format will have issues with name collisions (e.g. if the output has geo_value and time_value as well).