CDCgov / cfa-gam-rt

R package for real-time Rt estimation with penalized splines
https://cdcgov.github.io/cfa-gam-rt/
Apache License 2.0
8 stars 0 forks source link

07 data schema #13

Closed zsusswein closed 1 month ago

zsusswein commented 1 month ago

This PR takes the stub codebase and replaces placeholders some initial structure. It implements

The errors are classed and pass the calling environment to make the error messages more user-friendly. This follows the guidance in rlang.

The eventual user flow will use RtGam::RtGam() as a user-facing S3 class constructor with internal helpers and validators. There will be an additional dataset constructor that does input value checking (eg., no repeated dates within groups). I originally intended to put that constructor in this PR, but I decided it made more sense to pause for feedback here in a bite-sized chunk.

In particular, feedback on (1) whether there's any required data input missing and (2) the format of the data input (strict on what's required, vector per item) will be helpful.

Closes #7

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 100.00%. Comparing base (ff136bf) to head (a890538).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #13 +/- ## ============================================ + Coverage 0.00% 100.00% +100.00% ============================================ Files 1 3 +2 Lines 1 87 +86 ============================================ + Hits 0 87 +87 + Misses 1 0 -1 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

zsusswein commented 1 month ago

Alright I've spent the last week going in circles on the feedback (1) around dates and (2) on formatting inputs as a dataframe vs independent vectors. Here's where I landed:

  1. Dates: Requiring an associated datestamp for each incident case count adds an additional overhead on the user but also allows for greater flexibility. The user has to provide the additional vector of dates (or timestamps) and make sure that these line up with the case vector. It would be more convenient to only provide cases and groups. But I think this flexibility doesn't work well in the multi-group case, which is a priority of this package. If dates/timestamps aren't provided, then each group's case vector will need to be the same length, with NA padding to ensure that everything is in the right place. I think this NA padding is tricky and actually adds more overhead over simply providing dates. I want to make sure multiple groups (hierarchical modeling) is supported as a first-class workflow and the preprocessing to do so is kept to a minimum.
    • For the dates vs. timestamps question, I think it's better to handle the conversion from a date to a timestamp internally rather than pushing it off to the user. As I mentioned above, doing the date to timestamp conversion can be tricky to do well and I want to handle it for the user.
  2. Dataframe vs. vector: The expected preprocessing flow is that a user prepares a dataframe with cases, dates, and groups. This flow will be detailed in a vignette in #3. However, I still think it's worth it to have the user supply the individual columns of this dataframe as vectors (e.g., RtGam(df[["cases"]], df[["dates"]]) rather than the dataframe. It's convenient to avoid requiring specific column names, but I think the real benefit is the explictness and clarity for the users and in the code. By passing vectors, it's clear what's required, where it's supplied, and we avoid the temptation to have mutable objects or in-place changes to dataframes. I think {EpiEstim} benefits from this clarity -- it takes a simple incidence vector and the use is very clear. Likewise in python, statsmodels takes the outcome as a distinct vector from the predictors (but does allow predictors as a matrix). I like this clarity in the API and would like to keep it for now unless things become quite unwieldy.

tl;dr I've gone back and forth, but I want keep things as-is for now. Thanks to @kaitejohnson for an outside perspective on these choices.

Thoughts @seabbs? Are you good merging as-is?

seabbs commented 1 month ago

I'm happy to see this merged and happy that a lot of thought has gone into this.

supply the individual columns of this dataframe as vectors (e.g., RtGam(df[["cases"]], df[["dates"]]) rather than the dataframe

However, I don't really agree with this design decision and find the arguments in favour of it quite weak. In particular, calling out EpiEstim as having clarity in its design language seems problematic to me as people regularly complain about this to me. My particular concern is that it locks you in to having a specific set of vectors always which means if you did want to later support a formula interface etc it would be quite hard.

zsusswein commented 1 month ago

In interest of keeping things moving towards a 0.1 release, I'm going to keep things as-is for now and move this conversation to an issue. I'll assign the issue to the 0.2 release.