CDCgov / ww-inference-model

An in-development R package and a Bayesian hierarchical model jointly fitting multiple "local" wastewater data streams and "global" case count data to produce nowcasts and forecasts of both observations
https://cdcgov.github.io/ww-inference-model/
Apache License 2.0

Input hospital admissions and wastewater data checks #37

Closed kaitejohnson closed 2 months ago

kaitejohnson commented 2 months ago

This adds validation checks to confirm that user-provided input data (wastewater and hospital admissions) are formatted correctly. The overall goal is to provide user-friendly error messages that tell the user when something they have passed in will be incompatible or incoherent with downstream functionality. The current implementation of the model needs a very specific structure (e.g. one hospital admissions dataset from a global population, and one or more wastewater concentration datasets with lab and site identifiers), so the requirements for how the input data are formatted are fairly strict. The goal is to fail early and loudly, and to tell the user which aspect of their data is causing the failure.

What this does so far: For the wastewater data:

We will want to go through and add more of these, in particular checks that look at both input datasets together and, for example, give a warning if the sum of the site populations is greater than the global population.
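A minimal sketch of what such a cross-dataset check could look like, written in base R. The function name `check_site_pops_vs_global()`, its arguments, and the toy data are all hypothetical and not taken from the package; it also assumes one population value per site.

```r
# Hypothetical helper (not from the package): warn when the site populations
# add up to more than the global population.
check_site_pops_vs_global <- function(site_pops, global_pop) {
  stopifnot(
    is.numeric(site_pops),
    is.numeric(global_pop),
    length(global_pop) == 1
  )
  total_site_pop <- sum(site_pops)
  exceeds <- total_site_pop > global_pop
  if (exceeds) {
    warning(
      "Sum of site populations (",
      format(total_site_pop, big.mark = ",", scientific = FALSE),
      ") is greater than the global population (",
      format(global_pop, big.mark = ",", scientific = FALSE),
      ").",
      call. = FALSE
    )
  }
  # Return TRUE invisibly when the populations are coherent, FALSE otherwise
  invisible(!exceeds)
}

# Toy data with one row per site (column names are assumptions)
sites <- data.frame(site = c("A", "B", "C"), site_pop = c(4e5, 3e5, 5e5))
pops_ok <- check_site_pops_vs_global(sites$site_pop, global_pop = 1e6)
```

A warning rather than an error is used here because the text above asks for a warning; whether the real check should stop instead is a design decision for the package.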

I relied heavily on @zsusswein's gam package, and then also created a number of checks that are very specific to this model/project.

@seabbs @dylanhmorris Tagging you for any thoughts you might have. This is complete but I think closes #4

kaitejohnson commented 2 months ago

Ok @dylanhmorris, I made the suggested edits in the validate.R files, using checkmate::check_x() and assertr::fxn() where present. My one question: would it be terribly redundant to still wrap these in the hand-written check_element() functions? That way, the type checkers would tell you which argument failed rather than just that something failed, and for the checks that are very specific to our model, we could give more verbose error messages.

For example, assertr::not_na() will error if I pass in a vector containing NAs, but with the check as originally written, I get an error with this message:

c("{.arg {arg}} has missing values",
        "i" = "Missing values are not supported in {.arg {arg}}",
        "!" = "Missing element(s) index: {.val {which(is_missing)}}"
      ),

Which is maybe unnecessary, but it is a bit more of an explanation. So for the wastewater concentration column, we could tell the user that they can't have missing values in their wastewater concentration data and which elements are missing, which seems useful?
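To make the idea concrete, here is a rough sketch of a hand-written missingness check that names the offending argument and the indices of the missing elements. It swaps the cli-style message above for base R `stop()` so that it runs without extra packages; the function name `check_no_missingness()` and the column name `ww_conc` are assumptions for illustration only.

```r
# Hypothetical check (base R stand-in for the cli-style message above):
# error if `x` has missing values, naming the argument and the indices.
check_no_missingness <- function(x, arg = "x") {
  is_missing <- is.na(x)
  if (any(is_missing)) {
    stop(
      arg, " has missing values. ",
      "Missing values are not supported in ", arg, ". ",
      "Missing element(s) index: ",
      paste(which(is_missing), collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Toy wastewater concentration vector with two missing entries
ww_conc <- c(1.2, NA, 3.4, NA)
err_msg <- tryCatch(
  check_no_missingness(ww_conc, arg = "ww_conc"),
  error = function(e) conditionMessage(e)
)
print(err_msg)
```

Passing `arg` explicitly is what lets the message point at a specific column rather than only reporting that some check failed.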

Maybe I am overthinking this!

dylanhmorris commented 2 months ago

For assertr I would suggest trying to configure things via this functionality: https://docs.ropensci.org/assertr/articles/assertr.html#success-error-and-defect-functions

kaitejohnson commented 2 months ago

> For assertr I would suggest trying to configure things via this functionality: https://docs.ropensci.org/assertr/articles/assertr.html#success-error-and-defect-functions

Hmmm, I tried a bunch of variations of success_logical() and error_logical(), but I don't think I understand what these do other than always return TRUE or FALSE. Did you mean some other functions?

kaitejohnson commented 2 months ago

@dylanhmorris Let me know what you think of the changes I just made.

I removed the dependency on assertr because, as is, I don't think we are using it to DRY-ify any code (e.g. we aren't using it to string together assertions or, more generally, to apply checks to an entire dataframe). Instead, we have "check" functions using checkmate and base R/dplyr that do the specific checking on a specified vector, and then we wrap these together in the validate functions. I like the current functionality because passing the arg returns a helpful error message that indicates which column in the dataframe is erroring and even which index the problem is at.
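The pattern described here could be sketched roughly as follows, in base R only. All names (`check_numeric()`, `check_no_missing()`, `validate_ww_data()`, and the `conc` column) are hypothetical and are not the package's actual functions; the real checks use checkmate and dplyr, which this sketch leaves out to stay self-contained.

```r
# Hypothetical vector-level checks; each takes the vector plus the name to
# report in the error message.
check_numeric <- function(x, arg) {
  if (!is.numeric(x)) {
    stop(arg, " must be numeric, but has class ", class(x)[1], call. = FALSE)
  }
  invisible(TRUE)
}

check_no_missing <- function(x, arg) {
  missing_idx <- which(is.na(x))
  if (length(missing_idx) > 0) {
    stop(
      arg, " has missing values at index: ",
      paste(missing_idx, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Hypothetical validate function: applies the checks to one column and
# passes the column name through as `arg`.
validate_ww_data <- function(ww_data, conc_col = "conc") {
  if (!is.data.frame(ww_data)) {
    stop("ww_data must be a data frame", call. = FALSE)
  }
  if (!conc_col %in% names(ww_data)) {
    stop("Column ", conc_col, " not found in ww_data", call. = FALSE)
  }
  check_numeric(ww_data[[conc_col]], arg = conc_col)
  check_no_missing(ww_data[[conc_col]], arg = conc_col)
  invisible(ww_data)
}

# Toy input with a missing concentration in the second row
ww_data <- data.frame(
  site = c("A", "A", "B"),
  lab = c(1, 1, 2),
  conc = c(2.1, NA, 3.3)
)
msg <- tryCatch(
  validate_ww_data(ww_data),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Keeping the checks vector-level and letting the validate function supply the column name means each check stays reusable across columns while the error still points at a specific column and index.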