Raise warning/error on bad ahead/lagsets (on training data)

When working with weekly data, it's easy to mess up the specification of lagsets and aheadsets, but all we get is a confusing/imprecise error message about 0 non-NA cases, and it seems quite challenging to debug errors through steps and layers, especially with S3 involved. (https://github.com/cmu-delphi/epiprocess/issues/342 is also relevant here, although since the error happens right off the bat, the current approach of just letting errors pass through allows more debugging tools such as recover(), though that's not really helpful in this case.)

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(purrr)
library(ggplot2)
library(epidatr)
#> ! epidatr cache is being used (set env var EPIDATR_USE_CACHE=FALSE if not
#>   intended).
#> ℹ The cache directory is ~/.cache/R/epidatr.
#> ℹ The cache will be cleared after 14 days and will be pruned if it exceeds 4096
#>   MB.
#> ℹ The log of cache transactions is stored at ~/.cache/R/epidatr/logfile.txt.
library(epiprocess)
#> 
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(epipredict)
#> Loading required package: parsnip
#> 
#> Attaching package: 'epipredict'
#> The following object is masked from 'package:ggplot2':
#> 
#>     layer

flusurv_analysis_issue <- as.Date("2019-08-01") %>%
  MMWRweek::MMWRweek() %>%
  {.$MMWRyear * 100L + .$MMWRweek}

flusurv_issue_data <-
  pub_flusurv(
    locations = "network_all",
    issues = epirange(123401, flusurv_analysis_issue)
  )
#> Warning: Loading from the cache at /home/fullname/.cache/R/epidatr; see
#> ~/.cache/R/epidatr/logfile.txt for more details.
#> This warning is displayed once every 8 hours.

flusurv_archive <- flusurv_issue_data %>%
  select(geo_value = location,
         time_value = epiweek,
         version = release_date,
         starts_with("rate_")) %>%
  as_epi_archive(compactify = TRUE)

archive <- flusurv_archive

forecast_dates <- seq(min(archive$DT$version) + 120L, archive$versions_end,
                      by = "6 weeks")

horizons <- 1 + c(0, 7, 14, 21, 28) # relative to forecast_date

example_forecaster <- function(snapshot_edf, forecast_date) {
  # shared_reporting_latency <- as.integer(forecast_date - max(snapshot_edf$time_value))
  horizons %>%
    map(function(horizon) {
      snapshot_edf %>%
        arx_forecaster(
          outcome = "rate_overall",
          predictors = "rate_overall",
          args_list = arx_args_list(
            # (this is incomplete; latency often varies significantly by covariate and can't be ignored, so we also need lag adjustment.)
            ahead = horizon, # <-- oops, forgot latency adjustment
            quantile_levels = c(0.1, 0.5, 0.9),
            forecast_date = forecast_date,
            target_date = forecast_date + horizon
          )) %>%
        .$predictions
    }) %>%
    bind_rows()
  ## list()
}

pseudoprospective_forecasts <-
  archive %>%
  epix_slide(
    ref_time_values = forecast_dates,
    before = 365000L, # 1000-year time window --> don't filter out any `time_value`s
    ~ example_forecaster(.x, .ref_time_value),
    names_sep = NULL
  ) %>%
  select(-time_value)
#> Warning in max.default(structure(numeric(0), class = "Date"), na.rm = FALSE):
#> no non-missing arguments to max; returning -Inf
#> Error in `map()` at rlang/R/dots.R:91:3:
#> ℹ In index: 1.
#> Caused by error in `lm.fit()`:
#> ! 0 (non-NA) cases
#> Backtrace:
#>      ▆
#>   1. ├─... %>% select(-time_value)
#>   2. ├─dplyr::select(., -time_value)
#>   3. ├─epiprocess::epix_slide(...) at dplyr/R/select.R:54:3
#>   4. │ └─x$slide(...)
#>   5. │   ├─... %>% ungroup()
#>   6. │   └─self$group_by()$slide(...) at dplyr/R/group-by.R:153:3
#>   7. │     └─base::lapply(...)
#>   8. │       └─epiprocess (local) FUN(X[[i]], ...)
#>   9. │         ├─dplyr::group_modify(...)
#>  10. │         ├─epiprocess:::group_modify.epi_df(...) at dplyr/R/group-map.R:156:3
#>  11. │         │ └─dplyr::dplyr_reconstruct(NextMethod(), .data)
#>  12. │         │   └─dplyr:::dplyr_new_data_frame(data) at dplyr/R/generics.R:196:3
#>  13. │         │     ├─row.names %||% .row_names_info(x, type = 0L) at dplyr/R/utils.R:18:3
#>  14. │         │     └─base::.row_names_info(x, type = 0L) at dplyr/R/utils.R:18:3
#>  15. │         ├─base::NextMethod()
#>  16. │         └─dplyr:::group_modify.data.frame(...)
#>  17. │           └─epiprocess (local) .f(.data, group_keys(.data), ...) at dplyr/R/group-map.R:166:3
#>  18. │             └─f(.data_group, .group_key, ref_time_value, ...)
#>  19. │               └─global example_forecaster(.x, .ref_time_value)
#>  20. │                 └─... %>% bind_rows()
#>  21. ├─dplyr::ungroup(.)
#>  22. ├─dplyr::bind_rows(.)
#>  23. │ └─rlang::list2(...) at dplyr/R/bind-rows.R:31:3
#>  24. ├─purrr::map(...) at rlang/R/dots.R:91:3
#>  25. │ └─purrr:::map_("list", .x, .f, ..., .progress = .progress) at purrr/R/map.R:129:3
#>  26. │   ├─purrr:::with_indexed_errors(...) at purrr/R/map.R:174:3
#>  27. │   │ └─base::withCallingHandlers(...) at purrr/R/map.R:201:3
#>  28. │   ├─purrr:::call_with_cleanup(...) at purrr/R/map.R:174:3
#>  29. │   └─.f(.x[[i]], ...)
#>  30. │     └─... %>% .$predictions
#>  31. ├─epipredict::arx_forecaster(...)
#>  32. │ ├─generics::fit(wf, epi_data)
#>  33. │ ├─epipredict:::fit.epi_workflow(wf, epi_data)
#>  34. │ ├─base::NextMethod()
#>  35. │ └─workflows:::fit.workflow(wf, epi_data)
#>  36. │   └─workflows::.fit_model(workflow, control)
#>  37. │     ├─generics::fit(action_model, workflow = workflow, control = control)
#>  38. │     └─workflows:::fit.action_model(...)
#>  39. │       └─workflows:::fit_from_xy(spec, mold, case_weights, control_parsnip)
#>  40. │         ├─generics::fit_xy(...)
#>  41. │         └─parsnip::fit_xy.model_spec(...)
#>  42. │           └─parsnip:::xy_form(...)
#>  43. │             └─parsnip:::form_form(...)
#>  44. │               └─parsnip:::eval_mod(...)
#>  45. │                 └─rlang::eval_tidy(e, env = envir, ...)
#>  46. ├─stats::lm(formula = ..y ~ ., data = data) at rlang/R/eval-tidy.R:121:3
#>  47. │ └─stats::lm.fit(...)
#>  48. │   └─base::stop("0 (non-NA) cases")
#>  49. └─base::.handleSimpleError(...)
#>  50.   └─purrr (local) h(simpleError(msg, call))
#>  51.     └─cli::cli_abort(...) at purrr/R/map.R:215:9
#>  52.       └─rlang::abort(...) at cli/R/rlang.R:45:3

^{Created on 2024-05-14 with reprex v2.0.2}

The logic needed is probably a bit more complicated than one might think: since the aheadset and lagset don't necessarily include 0, you can't just do some pre-lag-calculation check of all(is.na( <something> )) or not(all( <x> %in% <y> )). I suspect that, e.g., for step_epi_lag, this requires two checks. (A) None of the incoming predictors, potentially including some that aren't being shifted, is already all NA; this prevents confusing error messages from B. (B) The output lagged signals, other (unshifted) predictors, (maybe other things with roles?) and, when training, the outcomes have at least some overlapping non-NA rows. Messaging about it helpfully is probably even harder. Maybe we could output a section of the output df merged with the original df (to include the original versions of the shifted signals) that includes at least one non-NA for each shifted output, if there are any (I think A plus our lagging method should ensure this?), so users can see where the signals don't line up.
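A rough base-R sketch of check B, under simplifying assumptions: a single geo_value, one predictor, and integer-day shifts. The helpers shift_by_days() and count_complete_rows() are hypothetical names for illustration, not epipredict functions. The idea is to shift each signal's time values, keep only the non-NA entries, and count the time values that every shifted signal shares.

```r
# Hypothetical helpers for illustration only; not part of epipredict.
shift_by_days <- function(time_value, value, shift) {
  # A value observed at time t is relabelled to time t + shift
  # (shift > 0 acts as a lag, shift < 0 as an ahead).
  data.frame(time_value = as.integer(time_value) + shift, value = value)
}

count_complete_rows <- function(time_value, predictor, outcome, lags, ahead) {
  shifted <- c(
    lapply(lags, function(l) shift_by_days(time_value, predictor, l)),
    list(shift_by_days(time_value, outcome, -ahead))
  )
  # Time values (as day numbers) at which each shifted signal is non-NA.
  usable_times <- lapply(shifted, function(d) d$time_value[!is.na(d$value)])
  # Rows usable for training are those present in every shifted signal.
  length(Reduce(intersect, usable_times))
}

# Ten weekly observations (dates are illustrative).
weekly_times <- as.Date("2018-10-06") + 7L * 0:9
y <- as.numeric(seq_along(weekly_times))

n_ok <- count_complete_rows(weekly_times, y, y, lags = c(0L, 7L), ahead = 7L)
n_bad <- count_complete_rows(weekly_times, y, y, lags = c(0L, 7L), ahead = 3L)
n_ok   # 8
n_bad  # 0
```

A count of zero is the situation that currently surfaces as "0 (non-NA) cases" from lm.fit(); a check along these lines could fail earlier with a message that names the shifts involved.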

Whether this should be a warning or an error probably depends on whether this is recoverable via other steps.

Can you make a simple example without the slide?

Still need to make a simple example. @rnayebi21 ran into this in a different form, with the "no non-missing arguments to max; returning -Inf" warning but a "Singular design matrix" error rather than "0 (non-NA) cases". [I changed my mind and thought this belonged in #333 for some reason, but since it's dealing with training, it must belong here, or in a separate issue. Trying to get a MWE.]

And my example from #333 seems to belong here....

I've boiled it down a bit and removed the slide. We can probably just focus on forecast_date2.

library(magrittr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
#> 
#> Attaching package: 'tidyr'
#> The following object is masked from 'package:magrittr':
#> 
#>     extract
library(purrr)
#> 
#> Attaching package: 'purrr'
#> The following object is masked from 'package:magrittr':
#> 
#>     set_names
library(ggplot2)
library(epidatr)
#> ! epidatr cache is being used (set env var EPIDATR_USE_CACHE=FALSE if not
#>   intended).
#> ℹ The cache directory is ~/.cache/R/epidatr.
#> ℹ The cache will be cleared after 14 days and will be pruned if it exceeds 4096
#>   MB.
#> ℹ The log of cache transactions is stored at ~/.cache/R/epidatr/logfile.txt.
library(epiprocess)
#> Registered S3 method overwritten by 'tsibble':
#>   method               from 
#>   as_tibble.grouped_df dplyr
#> 
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(workflows)
library(epipredict)
#> Loading required package: parsnip
#> Registered S3 method overwritten by 'epipredict':
#>   method            from   
#>   print.step_naomit recipes
#> 
#> Attaching package: 'epipredict'
#> The following objects are masked from 'package:workflows':
#> 
#>     add_model, remove_model, update_model
#> The following object is masked from 'package:ggplot2':
#> 
#>     layer

analysis_issue <- as.Date("2019-08-01") %>%
  MMWRweek::MMWRweek() %>%
  {.$MMWRyear * 100L + .$MMWRweek}

issue_data <-
  pub_flusurv(
    locations = "network_all",
    issues = epirange(123401, analysis_issue)
  )
#> Warning: Loading from the cache at /home/fullname/.cache/R/epidatr; see
#> ~/.cache/R/epidatr/logfile.txt for more details.
#> This warning is displayed once every 8 hours.

archive <- issue_data %>%
  transmute(geo_value = location,
            time_value = epiweek + 6L,
            version = release_date,
            pick(starts_with("rate_"))) %>%
  as_epi_archive(compactify = TRUE)

ahead_forecaster <- function(edf, ahead) {
  edf %>%
    arx_forecaster(
      outcome = "rate_overall",
      predictors = "rate_overall",
      args_list = arx_args_list(
        ahead = ahead
      )
    )
}

horizon_from_forecast_date_forecaster <- function(edf, horizon_from_forecast_date) {
  forecast_date <- attr(edf, "metadata")$as_of
  shared_reporting_latency <- as.integer(forecast_date - max(edf$time_value))
  ahead <- horizon_from_forecast_date + shared_reporting_latency
  ahead_forecaster(edf, ahead)
}

horizon_from_reference_date_forecaster <- function(edf, horizon_from_reference_date) {
  forecast_date <- attr(edf, "metadata")$as_of
  reference_date <- forecast_date - as.POSIXlt(forecast_date)$wday + 6L # Sat
  target_date <- reference_date + horizon_from_reference_date
  max_time_value <- max(edf$time_value)
  ahead <- as.integer(target_date - max_time_value)
  ahead_forecaster(edf, ahead)
}

forecast_date1 <- as.Date("2018-11-05")
edf1 <- archive %>% epix_as_of(forecast_date1)

# forecast_date1 is challenging (and must not have been part of the forecast
# comparison); flusurv still hadn't yet reached the threshold to start releasing
# reporting for the 2018/2019 season:
max(edf1$time_value)
#> [1] "2018-04-28"
forecast_date1 - max(edf1$time_value)
#> Time difference of 191 days
format(forecast_date1, "%a")
#> [1] "Mon"

ahead_forecaster(edf1, 7L)$predictions$target_date # -> irrelevant target dates
#> [1] "2018-05-05"
ahead_forecaster(edf1, 3L)$predictions$target_date # invalid ahead
#> Warning in max.default(structure(numeric(0), class = "Date"), na.rm = FALSE):
#> no non-missing arguments to max; returning -Inf
#> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases
horizon_from_forecast_date_forecaster(edf1, 0L)$predictions$target_date # invalid ahead, another way
#> Warning in max.default(structure(numeric(0), class = "Date"), na.rm = FALSE):
#> no non-missing arguments to max; returning -Inf
#> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases
horizon_from_reference_date_forecaster(edf1, -7L)$predictions$target_date # (-> likely-low-quality forecasts due to high ahead)
#> [1] "2018-11-03"

forecast_date2 <- as.Date("2018-12-17")
edf2 <- archive %>% epix_as_of(forecast_date2)

# forecast_date2 is more realistic, but the hard errors don't change:
max(edf2$time_value)
#> [1] "2018-12-08"
forecast_date2 - max(edf2$time_value)
#> Time difference of 9 days
format(forecast_date2, "%a")
#> [1] "Mon"

fc2a <- ahead_forecaster(edf2, 7L)
ahead_forecaster(edf2, 3L) # invalid ahead
#> Warning in max.default(structure(numeric(0), class = "Date"), na.rm = FALSE):
#> no non-missing arguments to max; returning -Inf
#> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases
horizon_from_forecast_date_forecaster(edf2, 0L) # invalid ahead, another way
#> Warning in max.default(structure(numeric(0), class = "Date"), na.rm = FALSE):
#> no non-missing arguments to max; returning -Inf
#> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases
fc2d <- horizon_from_reference_date_forecaster(edf2, -7L)

waldo::compare(fc2a$predictions, fc2d$predictions)
#> ✔ No differences

^{Created on 2024-07-24 with reprex v2.1.1}

Focusing on forecast_date2, I'm not sure what you're expecting to get here. The as_of is 9 days after the maximum time value, but the data is weekly. So it can't possibly create the correct leads/lags and the correct forecast_date/target_date pair simultaneously automatically. And there's not going to be a valid combination of things that are 3 days ahead with lags that are 7 days behind.

The problem is that your data has reporting latency, not that you have a bad method.

You get exactly the same error with:

edf3 <- edf2[0, 3:8]
lm(rate_overall ~ ., data = edf3)
#> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases

Shouldn't you just adjust the ahead?

library(dplyr)
library(tidyr)
library(purrr)
library(epidatr)
library(epiprocess)
library(epipredict)

analysis_issue <- as.Date("2019-08-01") %>%
  MMWRweek::MMWRweek() %>%
  {.$MMWRyear * 100L + .$MMWRweek}

issue_data <-
  pub_flusurv(
    locations = "network_all",
    issues = epirange(123401, analysis_issue)
  )

archive <- issue_data %>%
  transmute(geo_value = location,
            time_value = epiweek + 6L,
            version = release_date,
            pick(starts_with("rate_"))) %>%
  as_epi_archive(compactify = TRUE)

forecast_date2 <- as.Date("2018-12-17")
edf2 <- archive %>% epix_as_of(forecast_date2)

out <- arx_forecaster(
  edf2, "rate_overall", "rate_overall", 
  args_list = arx_args_list(ahead = 14L)
)
out$predictions
#> # A tibble: 1 × 5
#>   geo_value   .pred        .pred_distn forecast_date target_date
#>   <chr>       <dbl>             <dist> <date>        <date>     
#> 1 network_all 0.600 quantiles(0.91)[2] 2018-12-08    2018-12-22

^{Created on 2024-07-24 with reprex v2.1.1}
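The ahead = 14L used above can be recovered with plain date arithmetic. The sketch below assumes weekly data (a 7-day period) and uses only the dates printed earlier in this thread; the variable names are illustrative.

```r
forecast_date <- as.Date("2018-12-17")   # the as_of of edf2
max_time_value <- as.Date("2018-12-08")  # max(edf2$time_value)
period_days <- 7L                        # assumption: weekly data

# Reporting latency in days.
latency <- as.integer(forecast_date - max_time_value)
latency  # 9

# Only aheads that are multiples of the period land on the weekly grid.
candidate_aheads <- period_days * 1:4
target_dates <- max_time_value + candidate_aheads

# Smallest on-grid ahead whose target date is not before the forecast date.
ahead <- min(candidate_aheads[target_dates >= forecast_date])
ahead  # 14
target_date <- max_time_value + ahead
format(target_date)  # "2018-12-22"
```

An ahead of 3 (or a latency-adjusted 0 + 9) falls off the weekly grid, which is why no training rows survive in those calls.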

Note:

There is potentially a bug here in the get_test_data() logic which uses epiprocess:::guess_period(). Because the data is weekly, but the forecast date doesn't fit, it tries to turn it into daily data before padding with NA's. See pad_to_end().

Focusing on forecast_date2, I'm not sure what you're expecting to get here. The as_of is 9 days after the maximum time value, but the data is weekly. So it can't possibly create the correct leads/lags and the correct forecast_date/target_date pair simultaneously automatically. And there's not going to be a valid combination of things that are 3 days ahead with lags that are 7 days behind.

I'm hoping to get an error message better than #> ! 0 (non-NA) cases. Something like your explanation above as an error message (in bake.step_epi_shift()?) would be amazing. [Or, alternatively, maybe make lags and aheads be in terms of number of periods (see below).]

There is potentially a bug here in the get_test_data() logic which uses epiprocess:::guess_period().

@dshemetov has done some updates to how time types work; we may want to avoid using guess_period() and rely on the period implied by the time_type / guess_time_type(). I'm also considering modifying or removing the other usage of it in epix_slide_ref_time_values_default(), since it can similarly infer daily data in situations I wouldn't really consider user error.

Here are additional errors / lack thereof that @rnayebi21 ran into. (The simplified example below makes the problem look easy to spot even without a clear error message, but it's not so simple when backtesting, especially when part of the data set is unexpectedly missing in some snapshots, as was the case when he ran into it, or probably when you're relying on default lags and aheads.)

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(epiprocess)
#> Registered S3 method overwritten by 'tsibble':
#>   method               from 
#>   as_tibble.grouped_df dplyr
#> 
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(epipredict)
#> Loading required package: parsnip
#> Registered S3 method overwritten by 'epipredict':
#>   method            from   
#>   print.step_naomit recipes

edf <- tibble(
  geo_value = "ut",
  time_value = as.Date("2020-1-01") + 1:15 - 1L,
  x = seq_along(time_value) + rnorm(length(time_value)),
  y = seq_along(time_value) + rnorm(length(time_value)),
  ) %>%
  as_epi_df()

edf %>% arx_forecaster("y")
#> Warning in max.default(structure(numeric(0), class = "Date"), na.rm = FALSE):
#> no non-missing arguments to max; returning -Inf
#> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases

erec <- epi_recipe(edf) %>%
  step_epi_lag(y, lag = c(0L, 7L)) %>%
  ## step_lag_difference(x, horizon = 7L) %>% # actually not needed for error
  step_epi_ahead(y, ahead = 7L)

f <- frosting()

ewf1 <- epi_workflow(erec, linear_reg(), f)
ewf2 <- epi_workflow(erec, quantile_reg(), f)

# Probably should be a hard error (and not sure why there are so many time values):
ewf1 %>% fit(edf) %>% forecast()
#> Warning in predict.lm(object = object$fit, newdata = new_data, type =
#> "response", : prediction from rank-deficient fit; consider predict(.,
#> rankdeficient="NA")
#> An `epi_df` object, 22 x 3 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2024-07-25 15:51:31.237749
#> 
#> # A tibble: 22 × 3
#>    geo_value time_value .pred
#>  * <chr>     <date>     <dbl>
#>  1 ut        2020-01-01  15.0
#>  2 ut        2020-01-02  15.0
#>  3 ut        2020-01-03  15.0
#>  4 ut        2020-01-04  15.0
#>  5 ut        2020-01-05  15.0
#>  6 ut        2020-01-06  15.0
#>  7 ut        2020-01-07  15.0
#>  8 ut        2020-01-08  15.0
#>  9 ut        2020-01-09  15.0
#> 10 ut        2020-01-10  15.0
#> # ℹ 12 more rows
# We want this to have a more informative error:
ewf2 %>% fit(edf) %>% forecast()
#> Error in rq.fit.br(x, y, tau = tau, ...): Singular design matrix
# (though above wouldn't work with more data for another reason, at `forecast`
# time: Assertion on 'quantile_levels' failed: Must be of type 'numeric', not
# 'NULL'.)

^{Created on 2024-07-25 with reprex v2.1.1}

[fixed the example... somehow reprex generated different results than I saw in-session initially? It looks like it was the same amount of data, not sure what was different. Above error with singular matrix is the one we want to improve]

There are 2 issues here (and a partial fix for the second):

First issue

The forecast_date, ahead, lag mismatch. This is actually very complicated and intersects with epiprocess::guess_period(). I think the right thing to do is to guess the period of the training data immediately, then check that aheads and lags conform in recipe creation. But the guess_period() difftime result doesn’t allow for math, and wouldn’t help in this case. (A 1 week period becomes a 1 so checking it against lead/lag = 7 would always fail.)

Possible fixes (maybe others):

There’s an old issue about allowing step_epi_ahead/lag to accept things like "1 week" or similar {lubridate} periods. Not sure that’s a fix in this case, though, because the training data is in "days" even though the period is weekly; at best, this would be a breaking change in which using 7 to mean "7 days or 1 week" would fail.
Add functionality or change the output of guess_period() to detect equivalences and support math: we want to know if something like ahead = 7 is compatible with period = 1 week.
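A minimal sketch of the second option, assuming the period is available as a base-R difftime (this is a stand-in and not necessarily what guess_period() returns). The helpers period_in_days() and shifts_conform() are hypothetical names, not existing epiprocess or epipredict functions.

```r
# Hypothetical helpers for illustration only.
period_in_days <- function(period) {
  # Convert to days explicitly instead of dropping the units.
  as.integer(as.numeric(period, units = "days"))
}

shifts_conform <- function(shifts, period) {
  # A shift given in days is compatible if it is a whole number of periods.
  shifts %% period_in_days(period) == 0L
}

weekly <- as.difftime(1, units = "weeks")

as.numeric(weekly)       # 1: the units are lost, so comparing with 7 fails
period_in_days(weekly)   # 7

conforms <- shifts_conform(c(3L, 7L, 14L), weekly)
conforms  # FALSE TRUE TRUE
```

With something like this available at recipe creation, ahead = 3 on weekly data could be rejected with a message that mentions the period, before any model fitting happens.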

Second issue

You have too little training data and your trainer doesn't like collinearity. I personally don't believe that we should be trying to catch trainer errors, but I'm happy to review a PR if you do.

For the training data, there is a check for this. Would you prefer it on by default? If so, perhaps someone could fix in a PR. Here's your most recent example with the check turned on:

library(dplyr)
library(epiprocess)
library(epipredict)

edf <- tibble(
  geo_value = "ut",
  time_value = as.Date("2020-1-01") + 1:15 - 1L,
  x = seq_along(time_value) + rnorm(length(time_value)),
  y = seq_along(time_value) + rnorm(length(time_value)),
) %>%
  as_epi_df()

edf %>% arx_forecaster("y", args_list = arx_args_list(check_enough_data_n = 21L))
#> Error in `prep()` at epipredict/R/epi_recipe.R:489:7:
#> ! The following columns don't have enough data to predict: lag_0_y,
#>   lag_7_y, lag_14_y, and y.
erec <- epi_recipe(edf) %>%
  step_epi_lag(y, lag = c(0L, 7L)) %>%
  step_epi_ahead(y, ahead = 7L) %>%
  check_enough_train_data(recipes::all_predictors(), n = 21L) # 7 ahead and 14 lags

f <- frosting()

ewf1 <- epi_workflow(erec, linear_reg(), f)
ewf2 <- epi_workflow(erec, quantile_reg(), f)

ewf1 %>% fit(edf) %>% forecast()
#> Error in `prep()` at epipredict/R/epi_recipe.R:489:7:
#> ! The following columns don't have enough data to predict: lag_0_y and
#>   lag_7_y.
ewf2 %>% fit(edf) %>% forecast()
#> Error in `prep()` at epipredict/R/epi_recipe.R:489:7:
#> ! The following columns don't have enough data to predict: lag_0_y and
#>   lag_7_y.

^{Created on 2024-07-29 with reprex v2.1.1}

I think something that might help this (running into 0 (non-NA) cases in a completely different context) is having bake.step_epi_ahead and/or bake.step_epi_lag check that there's always at least one non-NA row, and if there isn't, output the tibble and steps applied so far. This wouldn't handle every insufficient training data situation, but it would cover quite a few.
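One possible shape for such a check, as a base-R sketch on a plain data frame. assert_some_complete_rows() is a hypothetical name, and how it would hook into the bake methods is left open; the column names in the example follow the lag_0_y / ahead_7_y pattern seen in the error messages above.

```r
# Hypothetical check for illustration only; not an epipredict function.
assert_some_complete_rows <- function(df, cols, time_col = "time_value") {
  if (any(complete.cases(df[cols]))) {
    return(invisible(df))
  }
  # For each column, report the range of time values where it is non-NA,
  # so the user can see where the shifted signals fail to line up.
  ranges <- vapply(cols, function(col) {
    times <- df[[time_col]][!is.na(df[[col]])]
    if (length(times) == 0L) {
      "all NA"
    } else {
      paste(format(min(times)), "to", format(max(times)))
    }
  }, character(1))
  stop(
    "No rows are non-NA across all of: ", paste(cols, collapse = ", "), ".\n",
    paste0("  ", cols, ": ", ranges, collapse = "\n"),
    call. = FALSE
  )
}

# Toy baked output in which the two shifted columns never overlap.
baked <- data.frame(
  time_value = as.Date("2020-01-01") + 0:3,
  lag_0_y = c(1, 2, NA, NA),
  ahead_7_y = c(NA, NA, 3, 4)
)

msg <- tryCatch(
  assert_some_complete_rows(baked, c("lag_0_y", "ahead_7_y")),
  error = function(e) conditionMessage(e)
)
cat(msg, "\n")
```

Reporting per-column non-NA ranges is only one option; printing the offending rows of the tibble, as suggested above, would serve the same purpose.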

cmu-delphi / epipredict

Raise warning/error on bad ahead/lagsets (on training data) #332