fix: more bugs in #290 - Githubissues

dshemetov commented 3 months ago

Checklist

Please:

[ ] ~~Make sure this PR is against "dev", not "main".~~
[x] Request a review from one of the current epipredict main reviewers: dajmcdon.
[ ] ~~Makes sure to bump the version number in DESCRIPTION and NEWS.md. Always increment the patch version number (the third number), unless you are making a release PR from dev to main, in which case increment the minor version number (the second number).~~
[x] Describe changes made in NEWS.md, making sure breaking changes (backwards-incompatible changes to the documented interface) are noted. Collect the changes under the next release number (e.g. if you are on 0.7.2, then write your changes under the 0.8 heading).

Change explanations for reviewer

Addresses some additional bugs in #290, merge into #300

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

dshemetov commented 3 months ago

Given today's meeting discussion, I'm going to abandon this argument validation track. I'm going to revert those changes and instead add some messaging to the canned forecasters that make its arguments clearer.

UPDATE: Done, force-pushed.

dshemetov commented 3 months ago

So my addition to layer_residual_quantiles successfully errors when n_training < ahead in flatline_forecaster, but I got stuck when trying to make a non-canned unit test, not sure why residuals keep coming out non-empty in the case below. Can't tell if I'm just misunderstanding epipredict here or my fix in this PR isn't a tight-enough check for n_training < ahead. Thoughts @dajmcdon ?

test_that("error when residuals NA", {
  r <- epi_recipe(jhu) %>%
    step_epi_lag(death_rate, lag = c(0, 7, 14)) %>%
    step_epi_ahead(death_rate, ahead = 7) %>%
    step_epi_naomit() %>%
    step_training_window(n_recent = 6)
  f <- frosting() %>%
    layer_residual_quantiles(quantile_levels = c(0.5, 0.95))

  latest <- get_test_data(recipe = r, x = jhu)
  wf <- epi_workflow(r, parsnip::linear_reg(), f)
  wf <- fit(wf, jhu)
  # Does not throw error
  expect_error(predict(wf, latest))
  # Because wf$fit$fit$fit$residuals is not NA
  wf$fit$fit$fit$residuals
})

dajmcdon commented 3 months ago

It's due to the ordering of the steps. In this case, the NA's are omitted first, rather than after limiting the training window. I suspect this is actually "what you want", (but not what you thought you were getting).

jhu <- case_death_rate_subset
ds_r <- epi_recipe(jhu) %>%
  step_epi_lag(death_rate, lag = c(0, 7, 14)) %>%
  step_epi_ahead(death_rate, ahead = 7) %>%
  step_epi_naomit() %>%
  step_training_window(n_recent = 6)

other_r <- epi_recipe(jhu) %>%
  step_epi_lag(death_rate, lag = c(0, 7, 14)) %>%
  step_epi_ahead(death_rate, ahead = 7) %>%
  step_training_window(n_recent = 6) %>%
  step_epi_naomit()

# Show the training data
bake(prep(ds_r, jhu), new_data = NULL)
#> An `epi_df` object, 336 x 8 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2022-05-31 12:08:25.791826
#> 
#> # A tibble: 336 × 8
#>    time_value geo_value case_rate death_rate lag_0_death_rate lag_7_death_rate
#>  * <date>     <chr>         <dbl>      <dbl>            <dbl>            <dbl>
#>  1 2021-12-19 ak             23.1      1.19             1.19            0.0593
#>  2 2021-12-20 ak             23.2      1.17             1.17            0.0791
#>  3 2021-12-21 ak             23.2      1.17             1.17            0.0791
#>  4 2021-12-22 ak             20.3      1.72             1.72            0.0593
#>  5 2021-12-23 ak             20.3      1.72             1.72            0.0593
#>  6 2021-12-24 ak             12.6      0.593            0.593           1.19  
#>  7 2021-12-19 al             16.2      0.255            0.255           0.232 
#>  8 2021-12-20 al             16.9      0.255            0.255           0.232 
#>  9 2021-12-21 al             15.7      0.241            0.241           0.250 
#> 10 2021-12-22 al             18.1      0.221            0.221           0.293 
#> # ℹ 326 more rows
#> # ℹ 2 more variables: lag_14_death_rate <dbl>, ahead_7_death_rate <dbl>
bake(prep(other_r, jhu), new_data = NULL) # an empty tibble
#> Warning in max.default(structure(numeric(0), class = "Date"), na.rm = FALSE):
#> no non-missing arguments to max; returning -Inf
#> An `epi_df` object, 0 x 8 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2022-05-31 12:08:25.791826
#> 
#> # A tibble: 0 × 8
#> # ℹ 8 variables: time_value <date>, geo_value <chr>, case_rate <dbl>,
#> #   death_rate <dbl>, lag_0_death_rate <dbl>, lag_7_death_rate <dbl>,
#> #   lag_14_death_rate <dbl>, ahead_7_death_rate <dbl>

There's another option too, and this is what I was describing: where we limit the training data FIRST:

another_r <- epi_recipe(jhu) %>%
  step_training_window(n_recent = 6) %>%
  step_epi_lag(death_rate, lag = c(0, 7, 14)) %>%
  step_epi_ahead(death_rate, ahead = 7)

bake(prep(another_r, jhu), new_data = NULL) # an empty tibble
#> An `epi_df` object, 1,344 x 8 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2022-05-31 12:08:25.791826
#> 
#> # A tibble: 1,344 × 8
#>    time_value geo_value case_rate death_rate lag_0_death_rate lag_7_death_rate
#>  * <date>     <chr>         <dbl>      <dbl>            <dbl>            <dbl>
#>  1 2021-12-19 ak               NA         NA               NA               NA
#>  2 2021-12-19 al               NA         NA               NA               NA
#>  3 2021-12-19 ar               NA         NA               NA               NA
#>  4 2021-12-19 as               NA         NA               NA               NA
#>  5 2021-12-19 az               NA         NA               NA               NA
#>  6 2021-12-19 ca               NA         NA               NA               NA
#>  7 2021-12-19 co               NA         NA               NA               NA
#>  8 2021-12-19 ct               NA         NA               NA               NA
#>  9 2021-12-19 dc               NA         NA               NA               NA
#> 10 2021-12-19 de               NA         NA               NA               NA
#> # ℹ 1,334 more rows
#> # ℹ 2 more variables: lag_14_death_rate <dbl>, ahead_7_death_rate <dbl>

dshemetov commented 3 months ago

Got it, thank you. So in your case, the train data is actually either empty or full of NA rows, so I'm getting an lm error before we even get to layer_quantile_residuals, so there's no unit test to do there. Any other tests to add here or can I just merge this into your PR?

cmu-delphi / epipredict

fix: more bugs in #290 #304

Checklist

Change explanations for reviewer

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch