CDCgov / ww-inference-model

An in-development R package and a Bayesian hierarchical model jointly fitting multiple "local" wastewater data streams and "global" case count data to produce nowcasts and forecasts of both observations
https://cdcgov.github.io/ww-inference-model/
Apache License 2.0
16 stars 2 forks source link

Fix check for required wastewater columns #127

Closed kaitejohnson closed 1 month ago

kaitejohnson commented 1 month ago

This is aimed at addressing #98. It does the following:

Unfortunately, I wasn't able to recreate the problem originally described by @akeyel

Here's a reprex of what currently happens if I exclude site:

library(wwinference)
 x <- tibble::tibble(
      date = lubridate::ymd("2024-01-01"),
      lab = 2,
      log_genome_copies_per_ml = 4,
      lod = 6,
      site_pop = 100,
      loc = "example"
    )
    conc_col_name <- "log_genome_copies_per_ml"
    lod_col_name <- "lod"
     assert_req_ww_cols_present(
        x,
        conc_col_name,
        lod_col_name
      )

Error:
! Required columns are missing from the wastewater data.
Names must include the elements {'log_genome_copies_per_ml','lod','date','site','lab','site_pop'}, but is
missing elements site}

@akeyel Would you be able to share a reprex to generate the error you were seeing when using this function?

akeyel commented 1 month ago

@kaitejohnson, your reprex runs and provides the expected error on my computer as well.

I'm not sure what is causing the unexpected error, so haven't figured out how to make a reprex. Below is an example error I get, which goes away when the site and lab columns are added.

I tried your reprex with a data frame instead of a tibble as an input, but that was not the problem. I tried it with additional junk columns with NA values, but that also was not the problem. I tried adding the column names from my data set to your reprex, with NA values, but that also did not generate the error. I also tried adding a second row to the reprex, to see if it was a dimensionality thing, but that ran as expected also. So I'm not sure what in the larger data set is triggering it.

image

kaitejohnson commented 1 month ago

Thanks for sharing this! This is super helpful @akeyel

I suspect that the error is coming from the way that we are pasting together the error messaging, though I am very confused as to why it would differ from one dataset to another...

We use the checkmate package to check that required column names (passed to the check_names() function as a vector of strings) are in the dataset. This either returns TRUE, or an error message with {} that indicates the columns that are missing. Which is a really helpful error message, but the brackets are problematic!

When passing that to cli::cli_abort() it uses glue under the hood, and glue treats brackets as special characters (it expects them to be replaced by variables). So we wrote this utility function, autoescape_brackets() that replaces the { with {{ which I believe should allow it to be rendered properly...

To see if this is the problem, I am going to try removing the checkmate error message in this PR. Would you mind reinstalling the package on this branch and then testing your dataset? Or you could send over a fake dataset and I can test myself! Not sure why the reprex isn't reproducing it!

kaitejohnson commented 1 month ago

Ok after seeing @dylanhmorris's suggestion in #98 I think the fix is probably just to get rid of the brackets.

I rewrote the autoescape_brackets() function to replace brackets with an empty string. When I take the ww_data and remove the site column:

ww_data <- wwinference::ww_data
ww_data_test <- ww_data |>dplyr::select(-site)
assert_req_ww_cols_present(ww_data_test, conc_col_name = "log_genome_copies_per_ml", lod_col_name = "log_lod")
Error: ! Required columns are missing from the wastewater data.
Names must include the elements 'log_genome_copies_per_ml','log_lod','date','site','lab','site_pop', but is missing elements 'site'