Input hospital admissions and wastewater data checks

kaitejohnson commented 4 months ago

Here are some validation checks that the user-provided input data (wastewater and hospital admissions) are formatted correctly. The overall goal is to provide user-friendly error messages that let the user know when something they have passed in will be incompatible or incoherent with downstream functionality. The current implementation of the model needs a very specific structure (e.g. one hospital admissions dataset from a global population, one or more wastewater concentration datasets with lab and site identifiers), so there are pretty strict requirements for how the input data must be formatted. The goal here is to fail early and loudly, and to tell the user what aspect of their data is causing the failure.

What this does so far:

For the wastewater data:

- checks that we have the required column names
- checks that the date column contains dates
- checks that the values are non-negative
- checks that the site and lab identifiers are integers or characters
- checks that the site population sizes are integers
- checks that there are no missing values (for the ww data, we don't want NA padding)

For the hospital admissions data:

- checks that we have the required column names
- checks that the date column contains dates
- checks that there is only a single population size for the data
- checks that there is only one observation per day
- checks that counts are positive integers
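As a rough illustration of the hospital admissions checks listed above, here is a minimal sketch using checkmate and base R. The function name `validate_hosp_data` and the column names `date`, `count`, and `pop` are assumptions made for the example and are not taken from this PR.

```r
library(checkmate)

# Hypothetical validator; the column names are assumed for illustration only.
validate_hosp_data <- function(hosp_data,
                               required_cols = c("date", "count", "pop")) {
  # Required column names are present.
  checkmate::assert_names(names(hosp_data), must.include = required_cols)

  # The date column is of class Date with no missing values.
  checkmate::assert_date(hosp_data$date, any.missing = FALSE,
                         .var.name = "date")

  # Counts are whole numbers. Zero is assumed to be allowed here; use
  # lower = 1 if counts must be strictly positive.
  checkmate::assert_integerish(hosp_data$count, lower = 0,
                               .var.name = "count")

  # Only a single population size for the whole dataset.
  if (length(unique(hosp_data$pop)) != 1) {
    stop("`pop` must contain a single population size", call. = FALSE)
  }

  # Only one observation per day.
  dup <- which(duplicated(hosp_data$date))
  if (length(dup) > 0) {
    stop(sprintf("`date` has repeated days at row(s): %s",
                 paste(dup, collapse = ", ")), call. = FALSE)
  }

  invisible(hosp_data)
}
```

Calling `validate_hosp_data()` on a well-formed data frame returns it invisibly; any failed check stops with a message naming the offending column.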

We will want to go through and add more of these, in particular checks that look at both input datasets together and, for example, give a warning if the sum of the site populations is greater than the global population.
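A cross-dataset check like the one described above could look roughly like this. It is only a sketch in base R: the function name and the columns `site`, `site_pop`, and `pop` are hypothetical, and it assumes the hospital admissions data carry a single population size.

```r
# Hypothetical check; column names are assumed for illustration only.
check_site_pops_vs_global <- function(ww_data, hosp_data) {
  # Count each site once, even though it appears on many rows.
  site_pops <- unique(ww_data[, c("site", "site_pop")])
  total_site_pop <- sum(site_pops$site_pop)

  # Assumes a single global population size in the admissions data.
  global_pop <- unique(hosp_data$pop)

  if (total_site_pop > global_pop) {
    warning(sprintf(
      "Sum of site populations (%s) is greater than the global population (%s)",
      format(total_site_pop), format(global_pop)
    ), call. = FALSE)
  }

  # TRUE when the site populations fit within the global population.
  invisible(total_site_pop <= global_pop)
}
```

A warning rather than an error seems to match the intent here, since the model could still run on such data.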

I relied heavily on @zsusswein's gam package, and then added a number of checks that are very specific to this model/project.

@seabbs @dylanhmorris Tagging you for any thoughts you might have. This is complete but I think closes #4

kaitejohnson commented 4 months ago

Ok @dylanhmorris I made the suggested edits in the validate.R files, using checkmate::check_x() and assertr::fxn() where present. My one question: would it be terribly redundant to still wrap these in the hand-written check_element() functions? That way the type checkers would tell you which argument failed rather than just that something failed, and for the checks that are very specific to our model we could give more verbose error messages.

For example, assertr::not_na() will error if I pass in a vector containing NAs, but the check as originally written errors with this message:

c("{.arg {arg}} has missing values",
        "i" = "Missing values are not supported in {.arg {arg}}",
        "!" = "Missing element(s) index: {.val {which(is_missing)}}"
      ),

Which maybe is unnecessary, but it is a bit more of an explanation. So for the wastewater concentration column, we would be able to tell the user that they can't have missing values in their wastewater concentration data and which element is missing, which seems useful?
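For context, a hand-written check that produces the message quoted above might look something like the sketch below. The function name `check_no_missingness` is hypothetical, and the sketch assumes the cli and rlang packages are installed; only the message vector itself comes from this thread.

```r
library(cli)

# Hypothetical helper built around the message format quoted above.
check_no_missingness <- function(x, arg = "x", call = rlang::caller_env()) {
  is_missing <- is.na(x)
  if (any(is_missing)) {
    # cli interpolates {arg} and {which(is_missing)} from this function's
    # environment, so the message names the argument and the bad indices.
    cli::cli_abort(
      c("{.arg {arg}} has missing values",
        "i" = "Missing values are not supported in {.arg {arg}}",
        "!" = "Missing element(s) index: {.val {which(is_missing)}}"
      ),
      call = call
    )
  }
  invisible()
}
```

Calling `check_no_missingness(c(1.2, NA, 0.4), arg = "conc")` would then stop with an error that names `conc` and points at element 2.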

Maybe am overthinking here!

dylanhmorris commented 4 months ago

For assertr I would suggest trying to configure things via this functionality: https://docs.ropensci.org/assertr/articles/assertr.html#success-error-and-defect-functions

kaitejohnson commented 4 months ago

> For assertr I would suggest trying to configure things via this functionality: https://docs.ropensci.org/assertr/articles/assertr.html#success-error-and-defect-functions

Hmmm I tried a bunch of variations of success_logical() and error_logical() but I don't think I'm understanding what these do other than always return TRUE or FALSE. Did you mean some other functions?

kaitejohnson commented 3 months ago

@dylanhmorris Let me know what you think of the changes I just made.

I removed the dependency on assertr because, as is, I don't think we are using it to DRY-ify any code (e.g. we aren't using it to string together assertions or, more generally, to apply checks to an entire dataframe). Instead, we have "check" functions using checkmate and base R/dplyr that do the specific checking on a specified vector, and then we wrap these together in the validate functions. I like the current functionality because passing the arg returns a helpful error message that indicates which column in the dataframe is erroring and even which index it is at.
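The pattern described above (vector-level check functions that take an `arg`, wrapped together in a validate function) could be sketched as follows. All function and column names here are hypothetical and are not the ones in validate.R; the only library calls used are `checkmate::check_integerish()` and base R.

```r
library(checkmate)

# Hypothetical helper: checkmate's check_*() functions return TRUE or a
# message string, so a failure can be re-raised with the column name attached.
stop_if_failed <- function(result, arg) {
  if (!isTRUE(result)) {
    stop(sprintf("Column `%s`: %s", arg, result), call. = FALSE)
  }
  invisible(TRUE)
}

# Identifiers may be characters or integers.
check_identifier <- function(x, arg) {
  result <- if (is.character(x)) {
    TRUE
  } else {
    checkmate::check_integerish(x, any.missing = FALSE)
  }
  stop_if_failed(result, arg)
}

# Site population sizes must be whole numbers of at least 1.
check_site_pop <- function(x, arg) {
  stop_if_failed(
    checkmate::check_integerish(x, lower = 1, any.missing = FALSE),
    arg
  )
}

# Validate function that wraps the vector-level checks together.
validate_ww_ids <- function(ww_data) {
  check_identifier(ww_data$site, arg = "site")
  check_identifier(ww_data$lab, arg = "lab")
  check_site_pop(ww_data$site_pop, arg = "site_pop")
  invisible(ww_data)
}
```

checkmate's `assert_*()` functions also accept a `.var.name` argument that labels the failing input, which may cover the simpler cases without a custom wrapper.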

CDCgov / ww-inference-model

Input hospital admissions and wastewater data checks #37