We can hack something decent together here in the short term, but we probably need to look at more general solutions to this type of problem. It is important for all of our models (and other teams' models), and we don't want to have to invent things from scratch all the time.
I think the problem is basically captured by the idea of "model versioning and experimentation." There are tools for this that we can explore. I have started investigating here: https://github.com/CDCgov/pyrenew-hew/issues/46
For reference, we can set up a job in the R CMD check workflow of CDCgov/ww-inference-model that runs some analyses to (a) check compatibility with the existing codebase and (b) run the evaluation. I have an example of this in my R package epiworldR: the job checks both the current (CRAN) version of epiworldRShiny (a package that lists epiworldR as a dependency) and the GitHub version here.
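For concreteness, a CI step along these lines could install both versions side by side and run the same small analysis against each in separate R processes. This is just a sketch; the repo slug, library paths, and the "analysis" are placeholders, not the actual setup:

```r
# Sketch of an R script a CI job could run: install the CRAN release and the
# GitHub development version into separate libraries, then exercise each one
# in a fresh R process so the two versions never collide in a single session.
library(remotes)
library(callr)

lib_cran <- file.path(tempdir(), "lib-cran")
lib_dev  <- file.path(tempdir(), "lib-dev")
dir.create(lib_cran)
dir.create(lib_dev)

install.packages("epiworldRShiny", lib = lib_cran)
remotes::install_github("UofUEpiBio/epiworldRShiny", lib = lib_dev)  # assumed repo slug

run_with <- function(lib) {
  # Placeholder "analysis": report the package version loaded from `lib`.
  # A real job would source the benchmark/evaluation script here instead.
  callr::r(
    function() as.character(utils::packageVersion("epiworldRShiny")),
    libpath = c(lib, .libPaths())
  )
}

run_with(lib_cran)  # released version
run_with(lib_dev)   # development version
```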
Goal
We want to be sure that any changes we make to the model, now via the package, either keep forecast performance the same or improve it. This will help us make evidence-based decisions on model changes, including both structural model changes and changes to priors etc.
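As a rough illustration of the kind of comparison this would support (the column names, score metric, and regression tolerance below are made up for the sketch), the eval step could summarise how scores changed between a baseline and a candidate version:

```r
# Sketch: compare per-forecast scores (e.g. WIS) between two model versions.
# Assumes each version's evaluation produces a table with one row per
# forecast date and location; all column names are placeholders.
compare_versions <- function(scores_baseline, scores_candidate, tolerance = 0) {
  merged <- merge(
    scores_baseline, scores_candidate,
    by = c("forecast_date", "location"),
    suffixes = c("_baseline", "_candidate")
  )
  # Relative change in mean score; negative means the candidate improved.
  rel_change <- (mean(merged$wis_candidate) - mean(merged$wis_baseline)) /
    mean(merged$wis_baseline)
  list(relative_change = rel_change, regression = rel_change > tolerance)
}
```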
Requirements
My current proposal for the first version of this is simply that we track these values in git, similar to what we had set up when the entire pipeline was being run on the VAP. We could create a folder in `output` called `benchmarking` and set this up there, with a table tagged with the `wwinference` commit hash and version, this repo's commit hash, and the scores as described above. Any model change would require the subset to be run, and in general every time we run the full eval pipeline we would want to keep track of how performance is changing. It could either be a table that keeps growing or one that we replace each time in a PR (I don't have a strong preference here). Curious about others' thoughts on this @dylanhmorris @seabbs @damonbayer @gvegayon
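A minimal sketch of what recording a run in that table could look like, assuming the pipeline repo plus a sibling `ww-inference-model` checkout (the file path, column names, and score summary are placeholders):

```r
# Sketch: append one row per evaluation run to a benchmarking table kept in
# git. How the commit hashes and scores are obtained is a placeholder for
# whatever the pipeline actually records.
record_benchmark <- function(mean_wis,
                             benchmark_file = "output/benchmarking/scores.csv") {
  new_row <- data.frame(
    wwinference_commit  = system("git -C ../ww-inference-model rev-parse HEAD", intern = TRUE),
    wwinference_version = as.character(utils::packageVersion("wwinference")),
    pipeline_commit     = system("git rev-parse HEAD", intern = TRUE),
    run_date            = as.character(Sys.Date()),
    mean_wis            = mean_wis  # placeholder summary score
  )
  if (file.exists(benchmark_file)) {
    new_row <- rbind(read.csv(benchmark_file), new_row)
  }
  dir.create(dirname(benchmark_file), recursive = TRUE, showWarnings = FALSE)
  write.csv(new_row, benchmark_file, row.names = FALSE)
}
```

The same sketch would cover the replace-each-time option by simply dropping the `rbind` with the existing file.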
I would propose we do this before tweaking priors on the current model, and also consider merging into `old-prod` (aka yesterday's prod) to see how our performance shifted before using `wwinference` (and whether we should seriously reconsider restructuring back to the original implementation).