CDCgov / wastewater-informed-covid-forecasting

Wastewater-informed COVID-19 forecasting models submitted to the COVID-19 Forecast Hub
https://cdcgov.github.io/wastewater-informed-covid-forecasting/
Apache License 2.0

Set up performance benchmarking #186

Closed · kaitejohnson closed this 1 month ago

kaitejohnson commented 1 month ago

Goal

We want to be sure that any changes we make to the model, now via the package, either keep forecast performance the same or improve it. This will help us make evidence-based decisions on model changes, including both structural model changes and changes to priors, etc.

Requirements

My current proposal for the first version of this is simply to track these values in git, similar to what we had set up when the entire pipeline was being run on the VAP. We could create a folder in output called benchmarking and set this up with a table tagged with the wwinference commit hash and version, this repo's commit hash, and the scores described above. Any model change would require running the benchmark subset, and in general every time we run the full eval pipeline we would want to keep track of how performance is changing. It could either be a table that keeps growing or one that we replace each time in a PR (I don't have a strong preference here).
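To make this concrete, below is a minimal sketch of what appending a row to a tracked benchmarking table could look like. The file path, column names, and score values (e.g., CRPS and WIS) are illustrative assumptions rather than existing conventions, and the `RemoteSha` field is only populated when wwinference was installed from GitHub via remotes/pak.

```r
# Sketch: append one row of benchmark scores to a CSV tracked in git.
# All column names and the file path are illustrative, not existing conventions.
append_benchmark_row <- function(scores,
                                 path = "output/benchmarking/scores.csv") {
  desc <- utils::packageDescription("wwinference")
  new_row <- data.frame(
    date = as.character(Sys.Date()),
    wwinference_version = as.character(utils::packageVersion("wwinference")),
    # RemoteSha is set when the package was installed from GitHub
    # (remotes/pak); otherwise record NA.
    wwinference_commit = if (!is.null(desc$RemoteSha)) desc$RemoteSha else NA_character_,
    # Commit hash of this repo at the time the eval pipeline ran.
    repo_commit = system("git rev-parse HEAD", intern = TRUE),
    crps = scores$crps, # hypothetical summary scores from the eval pipeline
    wis = scores$wis,
    stringsAsFactors = FALSE
  )
  if (file.exists(path)) {
    new_row <- rbind(utils::read.csv(path, stringsAsFactors = FALSE), new_row)
  }
  utils::write.csv(new_row, path, row.names = FALSE)
}

# Usage with made-up scores:
# append_benchmark_row(list(crps = 12.3, wis = 10.1))
```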

Curious about others' thoughts on this @dylanhmorris @seabbs @damonbayer @gvegayon

I would propose we do this before tweaking priors on the current model. We should also consider merging into old-prod (aka yesterday's prod) to see how our performance shifted before we started using wwinference (and whether we should seriously reconsider restructuring back to the original implementation).

damonbayer commented 1 month ago

We can hack something decent together here in the short term, but we probably need to look at more general solutions to this type of problem. It is important for all of our models (and other teams' models), and we don't want to have to reinvent things from scratch every time.

I think the problem is basically captured by the idea of "model versioning and experimentation." There are tools for this that we can explore. I have started investigating here: https://github.com/CDCgov/pyrenew-hew/issues/46

gvegayon commented 1 month ago

For reference, we can set up a job alongside R CMD check in CDCgov/ww-inference-model that runs some analyses as a way to check (a) compatibility with the existing codebase and (b) the evaluation itself. I have an example in my R package epiworldR: the workflow checks both the current CRAN version of epiworldRShiny (a package that lists epiworldR as a dependency) and the GitHub version here.
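As a rough illustration of what such a check could do, the sketch below assumes a script invoked from a CI job (e.g., via `Rscript`) that re-runs the benchmark subset and fails the job if the new scores regress past a tolerance relative to the tracked baseline. `run_benchmark_subset()` is a hypothetical helper, and the file path, column names, and tolerance are assumptions, not part of the existing setup.

```r
# Sketch of a CI benchmark check, assuming the scores table proposed above.
# `run_benchmark_subset()` is a hypothetical helper that fits the model on the
# benchmark subset and returns summary scores (e.g., computed with scoringutils).
baseline_path <- "output/benchmarking/scores.csv"
tolerance <- 0.05 # allow up to a 5% WIS regression before failing (illustrative)

baseline <- utils::read.csv(baseline_path, stringsAsFactors = FALSE)
baseline_wis <- utils::tail(baseline$wis, 1)

current <- run_benchmark_subset() # hypothetical; returns list(crps = ..., wis = ...)

if (current$wis > baseline_wis * (1 + tolerance)) {
  stop(sprintf(
    "Benchmark regression: WIS %.3f vs. baseline %.3f",
    current$wis, baseline_wis
  ))
}
message("Benchmark check passed: WIS ", round(current$wis, 3))
```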