Open brookslogan opened 6 months ago
Can you make a simple example without the slide?
Still need to make a simple example. @rnayebi21 ran into this in a different form, with the "no non-missing arguments to max; returning -Inf" warning but a singular design matrix error rather than 0 non-NA cases. [I briefly thought this belonged in #333, but since it's dealing with training, it must belong here or in a separate issue. Trying to get an MWE.]
And my example from #333 seems to belong here....
I've boiled it down a bit and removed the slide. We can probably just focus on forecast_date2.
library(magrittr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
#>
#> Attaching package: 'tidyr'
#> The following object is masked from 'package:magrittr':
#>
#> extract
library(purrr)
#>
#> Attaching package: 'purrr'
#> The following object is masked from 'package:magrittr':
#>
#> set_names
library(ggplot2)
library(epidatr)
#> ! epidatr cache is being used (set env var EPIDATR_USE_CACHE=FALSE if not
#> intended).
#> ℹ The cache directory is ~/.cache/R/epidatr.
#> ℹ The cache will be cleared after 14 days and will be pruned if it exceeds 4096
#> MB.
#> ℹ The log of cache transactions is stored at ~/.cache/R/epidatr/logfile.txt.
library(epiprocess)
#> Registered S3 method overwritten by 'tsibble':
#> method from
#> as_tibble.grouped_df dplyr
#>
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#>
#> filter
library(workflows)
library(epipredict)
#> Loading required package: parsnip
#> Registered S3 method overwritten by 'epipredict':
#> method from
#> print.step_naomit recipes
#>
#> Attaching package: 'epipredict'
#> The following objects are masked from 'package:workflows':
#>
#> add_model, remove_model, update_model
#> The following object is masked from 'package:ggplot2':
#>
#> layer
analysis_issue <- as.Date("2019-08-01") %>%
MMWRweek::MMWRweek() %>%
{.$MMWRyear * 100L + .$MMWRweek}
issue_data <-
pub_flusurv(
locations = "network_all",
issues = epirange(123401, analysis_issue)
)
#> Warning: Loading from the cache at /home/fullname/.cache/R/epidatr; see
#> ~/.cache/R/epidatr/logfile.txt for more details.
#> This warning is displayed once every 8 hours.
archive <- issue_data %>%
transmute(geo_value = location,
time_value = epiweek + 6L,
version = release_date,
pick(starts_with("rate_"))) %>%
as_epi_archive(compactify = TRUE)
ahead_forecaster <- function(edf, ahead) {
edf %>%
arx_forecaster(
outcome = "rate_overall",
predictors = "rate_overall",
args_list = arx_args_list(
ahead = ahead
)
)
}
horizon_from_forecast_date_forecaster <- function(edf, horizon_from_forecast_date) {
forecast_date <- attr(edf, "metadata")$as_of
shared_reporting_latency <- as.integer(forecast_date - max(edf$time_value))
ahead <- horizon_from_forecast_date + shared_reporting_latency
ahead_forecaster(edf, ahead)
}
horizon_from_reference_date_forecaster <- function(edf, horizon_from_reference_date) {
forecast_date <- attr(edf, "metadata")$as_of
reference_date <- forecast_date - as.POSIXlt(forecast_date)$wday + 6L # Sat
target_date <- reference_date + horizon_from_reference_date
max_time_value <- max(edf$time_value)
ahead <- as.integer(target_date - max_time_value)
ahead_forecaster(edf, ahead)
}
forecast_date1 <- as.Date("2018-11-05")
edf1 <- archive %>% epix_as_of(forecast_date1)
# forecast_date1 is challenging (and must not have been part of the forecast
# comparison); flusurv still hadn't yet reached the threshold to start releasing
# reporting for the 2018/2019 season:
max(edf1$time_value)
#> [1] "2018-04-28"
forecast_date1 - max(edf1$time_value)
#> Time difference of 191 days
format(forecast_date1, "%a")
#> [1] "Mon"
ahead_forecaster(edf1, 7L)$predictions$target_date # -> irrelevant target dates
#> [1] "2018-05-05"
ahead_forecaster(edf1, 3L)$predictions$target_date # invalid ahead
#> Warning in max.default(structure(numeric(0), class = "Date"), na.rm = FALSE):
#> no non-missing arguments to max; returning -Inf
#> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases
horizon_from_forecast_date_forecaster(edf1, 0L)$predictions$target_date # invalid ahead, another way
#> Warning in max.default(structure(numeric(0), class = "Date"), na.rm = FALSE):
#> no non-missing arguments to max; returning -Inf
#> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases
horizon_from_reference_date_forecaster(edf1, -7L)$predictions$target_date # (-> likely-low-quality forecasts due to high ahead)
#> [1] "2018-11-03"
forecast_date2 <- as.Date("2018-12-17")
edf2 <- archive %>% epix_as_of(forecast_date2)
# forecast_date2 is more realistic, but the hard errors don't change:
max(edf2$time_value)
#> [1] "2018-12-08"
forecast_date2 - max(edf2$time_value)
#> Time difference of 9 days
format(forecast_date2, "%a")
#> [1] "Mon"
fc2a <- ahead_forecaster(edf2, 7L)
ahead_forecaster(edf2, 3L) # invalid ahead
#> Warning in max.default(structure(numeric(0), class = "Date"), na.rm = FALSE):
#> no non-missing arguments to max; returning -Inf
#> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases
horizon_from_forecast_date_forecaster(edf2, 0L) # invalid ahead, another way
#> Warning in max.default(structure(numeric(0), class = "Date"), na.rm = FALSE):
#> no non-missing arguments to max; returning -Inf
#> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases
fc2d <- horizon_from_reference_date_forecaster(edf2, -7L)
waldo::compare(fc2a$predictions, fc2d$predictions)
#> ✔ No differences
Created on 2024-07-24 with reprex v2.1.1
Focusing on forecast_date2, I'm not sure what you're expecting to get here. The as_of is 9 days after the maximum time value, but the data is weekly. So it can't possibly create the correct leads/lags and the correct forecast_date/target_date pair simultaneously automatically. And there's not going to be a valid combination of things that are 3 days ahead with lags that are 7 days behind.
The problem is that you have latent data, not that you have a bad method.
You get exactly the same error with:
edf3 <- edf2[0, 3:8]
lm(rate_overall ~ ., data = edf3)
#> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases
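The mismatch described above can be checked with plain date arithmetic. The following is only a base-R sketch, not how epipredict builds its design matrix: `n_complete_rows()` is a hypothetical helper, and the weekly dates are made up to end on the same Saturday as edf2. It counts the reference dates for which every lagged predictor and the ahead-shifted outcome all land on an observed time value.

```r
# Hypothetical helper (not part of epipredict): count reference dates t such
# that t - lag is observed for every lag and t + ahead is observed too.
n_complete_rows <- function(time_value, lags, ahead) {
  candidates <- seq(min(time_value) - ahead, max(time_value) + max(lags), by = "day")
  complete <- (candidates + ahead) %in% time_value
  for (l in lags) {
    complete <- complete & ((candidates - l) %in% time_value)
  }
  sum(complete)
}

# Ten weekly (Saturday) observations ending 2018-12-08, as in edf2.
time_value <- as.Date("2018-10-06") + 7 * 0:9
lags <- c(0, 7, 14)

n_complete_rows(time_value, lags, ahead = 7)   # 7 usable training rows
n_complete_rows(time_value, lags, ahead = 14)  # 6 usable training rows
n_complete_rows(time_value, lags, ahead = 3)   # 0: nothing for lm() to fit
```

With ahead = 3, the outcome is never observed on a date where the lag-0 predictor is observed, which matches the 0 (non-NA) cases error.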
Shouldn't you just adjust the ahead?
library(dplyr)
library(tidyr)
library(purrr)
library(epidatr)
library(epiprocess)
library(epipredict)
analysis_issue <- as.Date("2019-08-01") %>%
MMWRweek::MMWRweek() %>%
{.$MMWRyear * 100L + .$MMWRweek}
issue_data <-
pub_flusurv(
locations = "network_all",
issues = epirange(123401, analysis_issue)
)
archive <- issue_data %>%
transmute(geo_value = location,
time_value = epiweek + 6L,
version = release_date,
pick(starts_with("rate_"))) %>%
as_epi_archive(compactify = TRUE)
forecast_date2 <- as.Date("2018-12-17")
edf2 <- archive %>% epix_as_of(forecast_date2)
out <- arx_forecaster(
edf2, "rate_overall", "rate_overall",
args_list = arx_args_list(ahead = 14L)
)
out$predictions
#> # A tibble: 1 × 5
#> geo_value .pred .pred_distn forecast_date target_date
#> <chr> <dbl> <dist> <date> <date>
#> 1 network_all 0.600 quantiles(0.91)[2] 2018-12-08 2018-12-22
Created on 2024-07-24 with reprex v2.1.1
Note:
There is potentially a bug here in the get_test_data() logic which uses epiprocess:::guess_period(). Because the data is weekly, but the forecast date doesn't fit, it tries to turn it into daily data before padding with NA's. See pad_to_end().
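To illustrate how a period guess can collapse from weekly to daily, here is a base-R sketch. It assumes the period is inferred as the greatest common divisor of the gaps between dates; that is an assumption for illustration only, and `guess_period_days()` is a hypothetical stand-in, not the actual epiprocess:::guess_period() implementation.

```r
# Euclid's algorithm for two non-negative integers.
gcd2 <- function(a, b) if (b == 0) a else gcd2(b, a %% b)

# Hypothetical stand-in: period = GCD of the gaps (in days) between dates.
guess_period_days <- function(dates) {
  gaps <- as.integer(diff(sort(unique(dates))))
  Reduce(gcd2, gaps)
}

weekly <- as.Date("2018-11-17") + 7 * 0:3   # Saturdays, ending 2018-12-08
forecast_date <- as.Date("2018-12-17")      # Monday, 9 days after the last value

guess_period_days(weekly)                    # 7
guess_period_days(c(weekly, forecast_date))  # 1: looks like daily data
```

Once an off-grid date such as the forecast date is included, the only period consistent with every gap is one day, so padding proceeds as if the data were daily.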
Focusing on forecast_date2, I'm not sure what you're expecting to get here. The as_of is 9 days after the maximum time value, but the data is weekly. So it can't possibly create the correct leads/lags and the correct forecast_date/target_date pair simultaneously automatically. And there's not going to be a valid combination of things that are 3 days ahead with lags that are 7 days behind.
I'm hoping to get an error message better than "#> ! 0 (non-NA) cases". Something like your explanation above as an error message (in bake.step_epi_shift()?) would be amazing. [Or, alternatively, maybe make lags and aheads be in terms of number of periods (see below).]
There is potentially a bug here in the get_test_data() logic which uses epiprocess:::guess_period().
@dshemetov has done some updates to how time types work; we may want to avoid using guess_period() and rely on the period implied by the time_type / guess_time_type(). I'm also considering modifying or removing the other usage of it in epix_slide_ref_time_values_default() as well, since it can have similar inferred-daily things happen in situations I wouldn't consider really user error.
Here are additional errors / lack thereof that @rnayebi21 ran into. (The simplified example below makes it look very easy to realize without a clear error message, but it's not so simple when doing backtesting, especially where there's unexpectedly part of the data set missing in some snapshots, as was the case when he ran into it. Or if you're relying on default lags and aheads probably.)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(epiprocess)
#> Registered S3 method overwritten by 'tsibble':
#> method from
#> as_tibble.grouped_df dplyr
#>
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#>
#> filter
library(epipredict)
#> Loading required package: parsnip
#> Registered S3 method overwritten by 'epipredict':
#> method from
#> print.step_naomit recipes
edf <- tibble(
geo_value = "ut",
time_value = as.Date("2020-1-01") + 1:15 - 1L,
x = seq_along(time_value) + rnorm(length(time_value)),
y = seq_along(time_value) + rnorm(length(time_value)),
) %>%
as_epi_df()
edf %>% arx_forecaster("y")
#> Warning in max.default(structure(numeric(0), class = "Date"), na.rm = FALSE):
#> no non-missing arguments to max; returning -Inf
#> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases
erec <- epi_recipe(edf) %>%
step_epi_lag(y, lag = c(0L, 7L)) %>%
## step_lag_difference(x, horizon = 7L) %>% # actually not needed for error
step_epi_ahead(y, ahead = 7L)
f <- frosting()
ewf1 <- epi_workflow(erec, linear_reg(), f)
ewf2 <- epi_workflow(erec, quantile_reg(), f)
# Probably should be a hard error (and not sure why there are so many time values):
ewf1 %>% fit(edf) %>% forecast()
#> Warning in predict.lm(object = object$fit, newdata = new_data, type =
#> "response", : prediction from rank-deficient fit; consider predict(.,
#> rankdeficient="NA")
#> An `epi_df` object, 22 x 3 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2024-07-25 15:51:31.237749
#>
#> # A tibble: 22 × 3
#> geo_value time_value .pred
#> * <chr> <date> <dbl>
#> 1 ut 2020-01-01 15.0
#> 2 ut 2020-01-02 15.0
#> 3 ut 2020-01-03 15.0
#> 4 ut 2020-01-04 15.0
#> 5 ut 2020-01-05 15.0
#> 6 ut 2020-01-06 15.0
#> 7 ut 2020-01-07 15.0
#> 8 ut 2020-01-08 15.0
#> 9 ut 2020-01-09 15.0
#> 10 ut 2020-01-10 15.0
#> # ℹ 12 more rows
# We want this to have a more informative error:
ewf2 %>% fit(edf) %>% forecast()
#> Error in rq.fit.br(x, y, tau = tau, ...): Singular design matrix
# (though above wouldn't work with more data for another reason, at `forecast`
# time: Assertion on 'quantile_levels' failed: Must be of type 'numeric', not
# 'NULL'.)
Created on 2024-07-25 with reprex v2.1.1
[fixed the example... somehow reprex generated different results than I saw in-session initially? It looks like it was the same amount of data, not sure what was different. Above error with singular matrix is the one we want to improve]
There are 2 issues here (and a partial fix for the second):
First issue
The forecast_date, ahead, lag mismatch. This is actually very complicated and intersects with epiprocess::guess_period(). I think the right thing to do is to guess the period of the training data immediately, then check that aheads and lags conform in recipe creation. But the guess_period() difftime result doesn't allow for math, and wouldn't help in this case. (A 1 week period becomes a 1 so checking it against lead/lag = 7 would always fail.)
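To make the unit problem concrete, here is a base-R sketch of how a weekly difftime compares against shifts given in days. `shift_conforms()` is a hypothetical helper, not an existing epipredict or epiprocess function, and it assumes shifts are always specified in days.

```r
period <- as.difftime(1, units = "weeks")

# The bare number loses the unit, so a naive comparison against 7 fails.
as.numeric(period)                  # 1
as.numeric(period) == 7             # FALSE

# Converting to days first makes the comparison meaningful.
as.numeric(period, units = "days")  # 7

# Hypothetical check: does a shift in days fall on the data's time grid?
shift_conforms <- function(shift_days, period) {
  period_days <- as.numeric(period, units = "days")
  shift_days %% period_days == 0
}

shift_conforms(7L, period)   # TRUE
shift_conforms(14L, period)  # TRUE
shift_conforms(3L, period)   # FALSE
```

A check along these lines at recipe creation could reject ahead = 3 on weekly data before any model fitting happens.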
Possible fixes (maybe others):
- step_epi_ahead/lag to accept things like "1 week" or similar {lubridate} periods. Not sure that's a fix in this case though, because the training data is in "days" even though the period is weekly. So, best case, this would be a breaking change such that using 7 to mean "7 days or 1 week" would fail.
- guess_period() to detect equivalences and support math: we want to know if something like ahead = 7 is compatible with period = 1 week.
Second issue
You have too little training data and your trainer doesn't like collinearity. I personally don't believe that we should be trying to catch trainer errors, but I'm happy to review a PR if you do.
For the training data, there is a check for this. Would you prefer it on by default? If so, perhaps someone could fix in a PR. Here's your most recent example with the check turned on:
library(dplyr)
library(epiprocess)
library(epipredict)
edf <- tibble(
geo_value = "ut",
time_value = as.Date("2020-1-01") + 1:15 - 1L,
x = seq_along(time_value) + rnorm(length(time_value)),
y = seq_along(time_value) + rnorm(length(time_value)),
) %>%
as_epi_df()
edf %>% arx_forecaster("y", args_list = arx_args_list(check_enough_data_n = 21L))
#> Error in `prep()` at epipredict/R/epi_recipe.R:489:7:
#> ! The following columns don't have enough data to predict: lag_0_y,
#> lag_7_y, lag_14_y, and y.
erec <- epi_recipe(edf) %>%
step_epi_lag(y, lag = c(0L, 7L)) %>%
step_epi_ahead(y, ahead = 7L) %>%
check_enough_train_data(recipes::all_predictors(), n = 21L) # 7 ahead and 14 lags
f <- frosting()
ewf1 <- epi_workflow(erec, linear_reg(), f)
ewf2 <- epi_workflow(erec, quantile_reg(), f)
ewf1 %>% fit(edf) %>% forecast()
#> Error in `prep()` at epipredict/R/epi_recipe.R:489:7:
#> ! The following columns don't have enough data to predict: lag_0_y and
#> lag_7_y.
ewf2 %>% fit(edf) %>% forecast()
#> Error in `prep()` at epipredict/R/epi_recipe.R:489:7:
#> ! The following columns don't have enough data to predict: lag_0_y and
#> lag_7_y.
Created on 2024-07-29 with reprex v2.1.1
I think something that might help this (running into 0 (non-NA) cases in a completely different context) is having bake.step_epi_ahead and/or bake.step_epi_lag check that there's always at least one non-NA row, and if there isn't, output the tibble and steps applied so far. This wouldn't handle every insufficient training data situation, but it would cover quite a few.
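A minimal sketch of such a check, in base R. `check_some_complete_rows()` and the `baked` data frame are hypothetical and only illustrate the idea; this is not the existing bake.step_epi_lag or bake.step_epi_ahead code.

```r
# Hypothetical check: error informatively when no row is non-NA in all `cols`.
check_some_complete_rows <- function(df, cols) {
  n_complete <- sum(stats::complete.cases(df[, cols, drop = FALSE]))
  if (n_complete == 0L) {
    n_non_na <- vapply(cols, function(col) sum(!is.na(df[[col]])), integer(1))
    stop(
      "No rows have non-NA values in all of: ", paste(cols, collapse = ", "),
      ". Non-NA counts per column: ",
      paste0(cols, " = ", n_non_na, collapse = ", "),
      ". Check that lags and aheads are multiples of the data's period.",
      call. = FALSE
    )
  }
  invisible(df)
}

# Made-up shifted output: weekly y observed on 2018-12-01 and 2018-12-08,
# with a lag of 0 days and an ahead of 3 days, laid out on a daily grid.
baked <- data.frame(
  time_value = as.Date("2018-12-01") + 0:10,
  lag_0_y    = c(1.2, rep(NA, 6), 1.5, rep(NA, 3)),
  ahead_3_y  = c(rep(NA, 4), 1.5, rep(NA, 6))
)

msg <- tryCatch(
  check_some_complete_rows(baked, c("lag_0_y", "ahead_3_y")),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Reporting per-column non-NA counts separates the case where a column is entirely NA from the case where columns have data but never on the same row.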
When working with weekly data, it's easy to mess up the specification of lagsets and aheadsets, but we only get a confusing/imprecise error message about 0 non-NA cases instead, and it seems quite challenging to debug errors through steps and layers, especially with S3 involved. (https://github.com/cmu-delphi/epiprocess/issues/342 is also relevant here, although since the error happens right off the bat, the current way of just letting errors pass through allows more debugging tools such as recover(), though it's not really helpful in this case.)
Created on 2024-05-14 with reprex v2.0.2
The logic needed is probably a bit more complicated than one might think, since the aheadset and lagset don't necessarily include 0; you can't just do some pre-lag-calculation check of all(is.na( <something> )) or not(all( <x> %in% <y> )). I suspect, e.g., for step_epi_lag that this requires checking (A) that none of the predictors coming in, potentially including some not being shifted, was already all NAs, to prevent confusing error messages from B, and (B) that the output lagged signals, other (unshifted) predictors, (maybe other things with roles?,) and, when training, the outcomes, have at least some overlapping non-NA rows. And messaging about it helpfully is probably even harder; maybe there could be some output of a section of the output df merged with the original df (to include original versions of the shifted signals) including at least one non-NA for each shifted output, if there are any (I think A + our lagging method should ensure this?), so they can see where they don't line up?
Whether this should be a warning or an error probably depends on whether this is recoverable via other steps.