CDCgov / cfa-viral-lineage-model


Count-based evaluation #50

Closed afmagee42 closed 1 day ago

afmagee42 commented 3 months ago

This PR adds the ability to evaluate forecasts based on the predictive distribution of sequence counts, complementing the existing infrastructure for evaluation based on the posterior distribution of frequencies/proportions.

This is accomplished in three parts.

  1. New functions (linmod.eval.generate_eval_counts(), linmod.models.predict_counts(), and linmod.utils.expand_phi()) have been added to enable the generation of posterior predictive count distributions; see the sketch after this list.
  2. The existing evaluation functions have been refactored such that we can feed in either posterior samples of proportions or posterior predictive samples of counts. As they are no longer proportion-specific, I also removed proportions_ from function names as appropriate.
  3. The pipeline script retrospective-forecasting/main.py has been changed accordingly.
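
As a minimal sketch of the posterior predictive step in (1): given posterior draws of the proportions and the observed sequence totals, draw multinomial counts. The array shapes and variable names below are illustrative stand-ins, not the actual signatures of predict_counts() or expand_phi():

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes only -- not the actual linmod interfaces:
#   phi:    (n_samples, n_division_days, n_lineages) posterior draws of proportions
#   totals: (n_division_days,) observed number of sequences per division-day
n_samples, n_division_days, n_lineages = 4, 3, 5
phi = rng.dirichlet(np.ones(n_lineages), size=(n_samples, n_division_days))
totals = np.array([120, 80, 200])

# One posterior predictive count vector per (posterior sample, division-day):
# counts[s, i, :] ~ Multinomial(totals[i], phi[s, i, :])
counts = np.empty(phi.shape, dtype=int)
for s in range(n_samples):
    for i in range(n_division_days):
        counts[s, i] = rng.multinomial(totals[i], phi[s, i])
```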

To minimize code duplication moving forward, I also refactored how the per-division-day scoring gets aggregated to an overall score. In particular, I removed the middle-man functions proportions_mean_norm() and proportions_energy_score() in favor of linmod.eval.score() which takes in a per-division-day scoring function and does the summation across division-days. Alternative aggregation schemes can now be implemented once in this function, instead of once per scoring function.
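To make the shape of that refactor concrete, here is a minimal sketch; the data structures are stand-ins, and the real linmod.eval.score() presumably operates on the project's own data structures rather than plain dicts:

```python
from typing import Callable, Dict, Tuple

import numpy as np

DivisionDay = Tuple[str, str]  # e.g. ("California", "2024-06-01")

def score(
    samples: Dict[DivisionDay, np.ndarray],   # predictive samples per division-day
    observed: Dict[DivisionDay, np.ndarray],  # observed values per division-day
    score_one: Callable[[np.ndarray, np.ndarray], float],
) -> float:
    """Apply a per-division-day scoring function, then sum across division-days.

    Because aggregation lives here, an alternative scheme (a mean, or a
    count-weighted sum) can be implemented once, instead of once per
    scoring function.
    """
    return sum(score_one(samples[key], observed[key]) for key in observed)
```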

The refactoring in (2) passes the existing pytest tests (after updating the tests to use the renamed functions).

afmagee42 commented 3 months ago

@thanasibakis one question for you, since you have been doing a good job of reducing config bloat: do we actually want the num_post_pred_samps option in the config? In hindsight, this option may have been more useful for debugging (sampling 2000 times from the predictive distribution takes time) than it will be in practice, and we could just assume we want one draw per posterior sample.

swo commented 2 months ago

My big question is: what's the difference between the count-based scoring here and a count-based scoring that takes a weighted average of the proportion-based scores for each unit of analysis (geography & date), with the weights being the counts?

Is it mathematically equivalent (modulo some constant)? Even if this weighting isn't exactly equivalent, is it good enough? It's easier to reason about, at least, than the approach where we draw counts from the model.
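
To be concrete, the weighting I have in mind is something like this, with names and numbers purely hypothetical:

```python
import numpy as np

# Hypothetical per-unit quantities (unit = geography & date), for illustration:
proportion_scores = np.array([0.12, 0.30, 0.08])  # proportion-based score per unit
counts = np.array([120, 80, 200])                 # observed sequence count per unit

# Count-weighted average of the proportion-based scores:
weighted_score = np.sum(counts * proportion_scores) / np.sum(counts)
```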

(My understanding is that the current approach is to draw counts from a multinomial, with multinomial category proportions drawn from the posterior of $\phi$. If I've gotten that wrong, I might be misunderstanding other things.)
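
In symbols, with $N_{d,t}$ the observed number of sequences for division $d$ on day $t$ and $\phi^{(s)}_{d,t}$ the $s$-th posterior draw of the lineage proportions, I'm reading the predictive draws as $Y^{(s)}_{d,t} \sim \operatorname{Multinomial}(N_{d,t}, \phi^{(s)}_{d,t})$.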

I think there are some ways that this idea could be generalized, with some different architecture, but I'll save those for #51 and #54. For example, if we did want to keep scoring truly on counts, I would ask the model to generate the counts itself, rather than asking for proportions and then generating counts from those proportions under some out-of-"model" distribution for how the counts are produced.

I can also suggest some line-level improvements that I think will help clarity, but I don't think those are useful in light of the above.