CDCgov / ww-inference-model

An in-development R package and a Bayesian hierarchical model jointly fitting multiple "local" wastewater data streams and "global" case count data to produce nowcasts and forecasts of both observations
https://cdcgov.github.io/ww-inference-model/
Apache License 2.0
16 stars 2 forks source link

Reconsider output structure of function `get_draws_df()` #147

Closed kaitejohnson closed 1 month ago

kaitejohnson commented 1 month ago

Goal

We currently use the get_draws_df() function to get out a tidy long form dataframe that contains all of the output types presumably of interest to the user: hospital admissions, site-lab level wastewater concentrations, site-level R(t) estimates, global R(t) estimates.

This requires binding together a bunch of the outputs from tidy_bayes::spread_draws() joined to the relevant data. However, I think there are two issues that make this not ideal:

  1. It might not be clear to the user what all the outputs are, and they might assume it is a table of just predicted hospital admissions. Thanks @akeyel for pointing this out
  2. It requires carrying around a bunch of extra metadata columns (for things like site, lab, subpopulation) on global variables that are unnecessary. Currently these are set to NA when this is the case, but it makes for something a bit messy.

Draft solution [2024-09-09]

  1. The output from get_draws_df() will now be a list of class wwinference_fit_draws (S3 class).
  2. The function get_draws_df() will be replaced by get_draws().
  3. The function will have an optional argument called what that will receive the strings "all", and whatever other elements can be extracted via tidy_bayes::spread_draws().
  4. (optional) get_draws_df() will still work but will be an alias to get_draws() using the .Deprecated() function call; this will throw a warning message to the users indicating the function get_draws_df() will be removed in the next version of the model.
  5. The proposed methods for the new class will be plot() and print(), with the plot() method having a what= argument that will specify which bits of the list to plot using the existing plotting functions.
  6. wwinference_fit_draws will also have a method as.data.frame() to turn the list into a data frame.
  7. A reference to this new feature will be added to the vignette.
  8. A test will be added to ensure the function performs as expected when calling what="all".

Proposed solution

@dylanhmorris suggested, and I think I now agree, that it would probably make more sense to split all these tidy draws dataframes up into a list of dataframes as the default output. We could provide an argument to the function that allows the user to specify which output type they want if they don't want to pull the extra dataframes into memory. Downstream functions (currently this is just plotting functions) would work similarly as now but would need to call the correct dataframe in the list.

Another alternative would be to keep the long dataframe format, but again have the get_draws_df() argument allow the user to specify the output types they want returned.

Curious to hear others thoughts on this @dylanhmorris @gvegayon @seabbs @akeyel

gvegayon commented 1 month ago

I like @dylanhmorris' suggestion. Like we are doing with the inference object, we could have additional S3 classes (e.g., wwinference_fit_draws) with a print method (so it is made evident what the object has) and a plot method. That said, maybe the _df suffix may not be appropriate anymore?

kaitejohnson commented 1 month ago

That makes sense to me, @gvegayon. I think we could have the plot method plot the 4 outputs we already created plotting functions for (hospital admissions, wastewater concentrations, subpopulation R(t) estimates, and global R(t) estimates).

And yes, agree that we would want to rename to something like extract_posterior_draws()? I was thinking it would be best to generate all of the output dataframes as a list by default, and and in the vignette include these variables so it is clear to the user that they don't need to get these all in return if they don't want them.

gvegayon commented 1 month ago

Hey @dylanhmorris and @kaitejohnson, I just updated the first comment of the issue with a draft solution. LMK what are your thoughts before I jump into coding!

gvegayon commented 1 month ago

Also, the Deprecated function is more critical in projects with many active users. We are not there yet, but I think (a) it is a good practice in general, and (b) it is a good opportunity to learn how to use it. The "R packages" book has a nice note on it here.

dylanhmorris commented 1 month ago

Your spec looks good to me @gvegayon. I think a full plot method might not be needed.

The function will have an optional argument called what that will receive the strings "all", and whatever other elements can be extracted via tidy_bayes::spread_draws().

To confirm, you're proposing to take in (A) the names of parameters for which to get draws, not (B) actual expressions that can be passed as the expression argument to tidybayes::spread_draws(), yes?

kaitejohnson commented 1 month ago

@gvegayon This looks great to me. My question is the same as @dylanhmorris's but I think I have a strong opinion that it should expect the human readable language that we document somewhere e.g. "predicted hospital admissions", "subpopulation R(t)", etc rather than the stan latent variables.... @dylanhmorris what do you think? Just think this will be slightly cleaner from a user side experience.

I agree that the plot method isn't top priority, but you could just show in the vignette how you would pass each df to the corresponding plot function.

gvegayon commented 1 month ago

I was thinking to pass a human-readable expression, e.g., pass what="subpopulation R(t)" instead of a tidybayes expression. I can leave the plot method as an optional feat, but I don't think it will take too much effort to get it done. If that sounds good to you, I can start working on it

kaitejohnson commented 1 month ago

This sounds great to me @gvegayon thanks so much!