Replicate Model 2 from cdcgov/wastewater-informed-covid-forecasting

ghost commented 7 months ago

Goal

Have a working Python library that implements the most basic version of the wastewater model. By most basic, we mean (a) hospitalizations only, and (b) single geographical unit (no pooled model).

Context

See #32.

Required features

The model (implemented as either a function or a class) should be able to:
Read hospitalization data simulated from the WW model.
Fit the model using the No-U-Turn Sampler (NUTS).
Return the estimates.
Make predictions.

Specifications

[x] Create (or bring in from WW) data for testing.
[x] Write a Python script under model/src implementing the model (like this)
[x] Document the model.
[x] Write a Jupyter notebook walking through the process of read/fit/predict (vignette).
[ ] Have a test that compares the estimates of this Python module with the WW R package.
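
For the last unchecked item, one possible shape for such a comparison is sketched below. This is only an illustration under stated assumptions: the function name `posterior_means_agree`, the `n_sigma` threshold, and the draws are all hypothetical, and neither the Python module nor the WW R package is called here. Because both implementations produce stochastic MCMC output, the sketch compares posterior means up to Monte Carlo error rather than requiring exact equality.

```python
import math
import statistics


def posterior_means_agree(draws_a, draws_b, n_sigma=4.0):
    """Return True if two sets of posterior draws have compatible means.

    The means may differ by up to `n_sigma` combined standard errors.
    Draws are treated as independent, which is optimistic for
    autocorrelated MCMC output (a real test might use effective sample size).
    """
    mean_a = statistics.fmean(draws_a)
    mean_b = statistics.fmean(draws_b)
    se_a = statistics.stdev(draws_a) / math.sqrt(len(draws_a))
    se_b = statistics.stdev(draws_b) / math.sqrt(len(draws_b))
    return abs(mean_a - mean_b) <= n_sigma * math.hypot(se_a, se_b)


# Hypothetical draws for a single parameter from each implementation.
python_draws = [1.02, 0.98, 1.05, 0.97, 1.01, 0.99, 1.03, 1.00]
r_draws = [1.00, 1.04, 0.96, 1.02, 0.99, 1.01, 0.98, 1.03]
# A deliberately shifted copy, to show the check can fail.
shifted_draws = [x + 0.5 for x in r_draws]

print(posterior_means_agree(python_draws, r_draws))
print(posterior_means_agree(python_draws, shifted_draws))
```

A tolerance expressed in standard errors keeps the test from failing on ordinary sampling noise while still catching a systematic difference between the two implementations.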

Out of scope

Features beyond the basic model.

Related documents

See #18 for math equations.
See the Stan implementation in the WW R package here

Ref: https://github.com/cdcent/cfa-multisignal-renewal/issues/44 author: @gvegayoncdc

ghost commented 7 months ago

@afmagee42 and @kaitejohnson, could you point me to the data + estimates you use for CI in the WW model 👼?

afmagee42 commented 7 months ago

For our CI on the model, our testing inputs and expected outputs are stored in an .rda file. These are produced in internal_data.R, using package-public data. The internal testing data are Stan data objects and single Stan output iterations from short runs.

The public example data is generated in example_data.R and lives in the package's data folder. These are simulations of semi-processed data (post-pull, pre-Stan preprocessing).

(Yes, this is a bit messy; blame me and my insistence that users should not have access to purely testing data in the package, because that's why it's this way.)

Pipeline testing/CI is handled at the repo level and currently just checks that the pipeline runs without errors.

afmagee42 commented 7 months ago

We specifically don't use any real data or estimates in order to make sure there's no potential data leak, as this lives in the public repo.

Our first pass at CI data for checking that the Stan output was consistent was taken from a posterior-predictive dataset from a real model run.

The testing would have also included (or perhaps just been) the computation of the joint posterior density at some arbitrary parameter values, but cmdstanr won't expose the requisite functions in WSL so we had to evaluate based on MCMC output instead. (I don't like that this compounds both changes in the model and any changes Stan makes to the algorithm in one place, but it's what we've got.)
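
To illustrate the density-based check described above, here is a minimal sketch. It uses a toy model (a standard normal prior on a log rate with Poisson counts), not the wastewater model, and the function name `joint_log_density`, the counts, and the parameter value are all assumptions made for the example. The point is only that evaluating the joint log density at fixed parameter values is deterministic, so it can be compared against a stored reference without involving the sampler.

```python
import math


def joint_log_density(log_rate, counts):
    """Log joint density of a toy model, NOT the wastewater model.

    Prior: log_rate ~ Normal(0, 1).
    Likelihood: counts[i] ~ Poisson(exp(log_rate)).
    """
    rate = math.exp(log_rate)
    log_prior = -0.5 * log_rate**2 - 0.5 * math.log(2.0 * math.pi)
    log_lik = sum(k * log_rate - rate - math.lgamma(k + 1) for k in counts)
    return log_prior + log_lik


# Fixed inputs and a hand-derived reference: with log_rate = 0 the rate is 1,
# so the three Poisson terms are -1, -1 and -1 - log(2).
counts = [0, 1, 2]
reference = -3.0 - math.log(2.0) - 0.5 * math.log(2.0 * math.pi)

value = joint_log_density(0.0, counts)
assert math.isclose(value, reference, rel_tol=1e-9)
```

A check like this changes only when the model's density changes, which separates model changes from changes in the sampling algorithm.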

gvegayon commented 6 months ago

PR #55 is completing the testing part.

CDCgov / PyRenew