cmu-delphi / epiprocess

Tools for basic signal processing in epidemiology
https://cmu-delphi.github.io/epiprocess/
Other
13 stars 9 forks source link

Carefully consider the metadata for the `epi_df` resulting from sliding a forecaster over an `epi_archive` #213

Open rachlobay opened 2 years ago

rachlobay commented 2 years ago

As originally stated in Issue #208, after making the change to allow epix_slide to epi_df so that we can slide a forecaster using arx_forecaster() over an epi_archive, it is good to consider whether the resulting metadata (namely, as_of) is what we want.

In the below example, as_of is related to the first fc_time_values 2020-08-01… but do we really want that? Does it make the most sense to use here?

library(epipredict)
library(epiprocess)
library(covidcast)
library(data.table)
library(dplyr)
library(tidyr)
library(ggplot2)

y <- covidcast_signals(
  c("doctor-visits", "jhu-csse"),
  c("smoothed_adj_cli", "confirmed_7dav_incidence_prop"),
  start_day = "2020-06-01",
  end_day = "2021-12-01",
  issues = c("2020-06-01", "2021-12-01"),
  geo_type = "state",
  geo_values = c("ca", "fl"))

z <- y[[1]] %>%
  select(geo_value, time_value, version = issue, percent_cli = value) %>%
  as_epi_archive()

z <- epix_merge(
  z, y[[2]] %>%
    select(geo_value, time_value, version = issue, case_rate = value) %>%
    as_epi_archive(), sync = "locf")

fc_time_values <- seq(as.Date("2020-08-01"), as.Date("2021-12-01"),
                      by = "1 month")
ahead = 7

# Slide arx_forecaster over archive z
z %>% 
  epix_slide(function(x, ...) 
    arx_forecaster(x, outcome = "case_rate", 
                   predictors = "case_rate",
                   args_list = arx_args_list(ahead = ahead))$predictions %>% 
      select(-c(geo_value, time_value)),
    n = 120, ref_time_values = fc_time_values, new_col_name = "fc") 
brookslogan commented 2 years ago

as_of [of the final result, if that's what we are talking about here] should probably be set to z$versions_end [or we shouldn't output an epi_df at all! "right" class is probably a generalized archive object that doesn't require a geo_value or time_value column]. (This is the last version we "observed", and basically acts like the archive's "version". In most cases this is equivalent to max(z$DT$version); when it's different, it will be greater than this max value and express that we observed additional versions that contained no updates.)

[~On the other hand, maybe it should be max(ref_time_values), which would be automatic if we were to make rbind, bind_rows, etc. on epi_dfs use the max as_of rather than the first one. Current ?dplyr::dplyr_extending may not make this easy.~ While we do want this fix as well, the output for this function should probably be based on versions_end, because if, say versions_end overwrote/clobbered some version data from max(ref_time_values), it'd be more appropriate to label the output as_of using the former rather than the latter. --- Counterpoint: user may use a single ref_time_value at a time, and expect as_of of the output to be that ref_time_value, not the versions_end. And default slide range should go through versions_end, so there's no conflict between the two for the default, and the user might understand that things will change if they change the ref_time_values. But we also need to consider all_versions = TRUE, where things might change. These questions hint that an epi_df might not be the right output class for a multi-ref_time_value epix_slide.]

brookslogan commented 2 years ago

If we are talking about the input to the group computations, then yes, it should be what we already get out of $as_of; the corresponding ref time value / fc time value / version. [This is in line with simulating production forecasting at previous dates, assuming that on forecast time value fctv, we were able to observe version fctv for all data sources involved, and then made a forecast based on the set of targets defined for forecast time value fctv. There could be some more special situations, but thinking about how to handle them seems like a separate issue.]

brookslogan commented 1 year ago

This is related to the discussion of what the output class and column names of epix_slide operations should be (#163, #49, #259, ...). Before we consider those larger changes, I plan to make an incremental improvement / temporary decision by assigning the output (if it's an epi_df) an as_of equal to the input archive's versions_end.

brookslogan commented 1 year ago

(We might also consider a special value of as_of indicating that it mixes data from multiple versions. Again, relates to #242.)

brookslogan commented 5 months ago

--- CURRENT STATE: ---