CDCgov / wastewater-informed-covid-forecasting

Wastewater-informed COVID-19 forecasting models submitted to the COVID-19 Forecast Hub
https://cdcgov.github.io/wastewater-informed-covid-forecasting/
Apache License 2.0

Set up performance benchmarking #186

Closed · kaitejohnson closed this 1 month ago

kaitejohnson commented 1 month ago

Goal

We want to be sure that any changes we make to the model, now via the package, either keep forecast performance the same or improve it. This will help us make evidence-based decisions on model changes, including both structural model changes and changes to priors, etc.

Requirements

My current proposal for the first version of this is simply to track these values in git, similar to what we had set up when the entire pipeline was being run on the VAP. We could create a folder in output called benchmarking and set this up with a table tagged with the wwinference commit hash and version, this repo's commit hash, and the scores described above. Any model change would require running the benchmark subset, and in general every time we run the full eval pipeline we would want to keep track of how performance is changing. It could either be a table that keeps growing or one that we replace each time in a PR (I don't have a strong preference here).
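To make this concrete, below is a minimal sketch of what appending a row to a tracked benchmarking table could look like. The file path, column names, and score values (e.g., CRPS and WIS) are illustrative assumptions rather than existing conventions, and the `RemoteSha` field is only populated when wwinference was installed from GitHub via remotes/pak.

```r
# Sketch: append one row of benchmark scores to a CSV tracked in git.
# All column names and the file path are illustrative, not existing conventions.
append_benchmark_row <- function(scores,
                                 path = "output/benchmarking/scores.csv") {
  desc <- utils::packageDescription("wwinference")
  new_row <- data.frame(
    date = as.character(Sys.Date()),
    wwinference_version = as.character(utils::packageVersion("wwinference")),
    # RemoteSha is set when the package was installed from GitHub
    # (remotes/pak); otherwise record NA.
    wwinference_commit = if (!is.null(desc$RemoteSha)) desc$RemoteSha else NA_character_,
    # Commit hash of this repo at the time the eval pipeline ran.
    repo_commit = system("git rev-parse HEAD", intern = TRUE),
    crps = scores$crps, # hypothetical summary scores from the eval pipeline
    wis = scores$wis,
    stringsAsFactors = FALSE
  )
  if (file.exists(path)) {
    new_row <- rbind(utils::read.csv(path, stringsAsFactors = FALSE), new_row)
  }
  utils::write.csv(new_row, path, row.names = FALSE)
}

# Usage with made-up scores:
# append_benchmark_row(list(crps = 12.3, wis = 10.1))
```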

Curious about others' thoughts on this @dylanhmorris @seabbs @damonbayer @gvegayon

I would propose we do this before tweaking priors on the current model. We should also consider merging into old-prod (aka yesterday's prod) to see how our performance shifted before we started using wwinference (and whether we should seriously reconsider restructuring back to the original implementation).

damonbayer commented 1 month ago

We can hack something decent together here in the short term, but we probably need to look at more general solutions to this type of problem. It is important for all of our models (and other teams' models), and we don't want to have to reinvent things from scratch every time.

I think the problem is basically captured by the idea of "model versioning and experimentation." There are tools for this that we can explore. I have started investigating here: https://github.com/CDCgov/pyrenew-hew/issues/46

gvegayon commented 1 month ago

For reference, we can set up a job alongside R CMD check in CDCgov/ww-inference-model that runs some analyses as a way to check (a) compatibility with the existing codebase and (b) the evaluation itself. I have an example in my R package epiworldR: the workflow checks both the current CRAN version of epiworldRShiny (a package that lists epiworldR as a dependency) and the GitHub version here.
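As a rough illustration of what such a check could do, the sketch below assumes a script invoked from a CI job (e.g., via `Rscript`) that re-runs the benchmark subset and fails the job if the new scores regress past a tolerance relative to the tracked baseline. `run_benchmark_subset()` is a hypothetical helper, and the file path, column names, and tolerance are assumptions, not part of the existing setup.

```r
# Sketch of a CI benchmark check, assuming the scores table proposed above.
# `run_benchmark_subset()` is a hypothetical helper that fits the model on the
# benchmark subset and returns summary scores (e.g., computed with scoringutils).
baseline_path <- "output/benchmarking/scores.csv"
tolerance <- 0.05 # allow up to a 5% WIS regression before failing (illustrative)

baseline <- utils::read.csv(baseline_path, stringsAsFactors = FALSE)
baseline_wis <- utils::tail(baseline$wis, 1)

current <- run_benchmark_subset() # hypothetical; returns list(crps = ..., wis = ...)

if (current$wis > baseline_wis * (1 + tolerance)) {
  stop(sprintf(
    "Benchmark regression: WIS %.3f vs. baseline %.3f",
    current$wis, baseline_wis
  ))
}
message("Benchmark check passed: WIS ", round(current$wis, 3))
```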