epiverse-trace / epichains

Methods for simulating and analysing the sizes and lengths of infectious disease transmission chains from branching process models
https://epiverse-trace.github.io/epichains/
Other
6 stars 2 forks source link

Fix disparities in meaning of `stat_max` in various functions #193

Closed jamesmbaazam closed 5 months ago

jamesmbaazam commented 9 months ago

The definition of stat_max in simulate_chains() is not entirely correct and in particular, the part that says "Results above stat_max are set to stat_max." I think this is a remnant from the original definition in bpmodels::chain_sim().

stat_max in simulate_chains() is now the upper bound of the tracked statistic above which chain simulations are halted. The default of Inf means that no upper bound is used. We could add further explanation and guidance to say that it is used internally to ensure that outbreaks with R0 > 1 do not continue forever, so when a user sets an offspring mean above 1 coupled with an infinite population, they need to choose a finite value. Furthermore, stat_max in simulate_chains() should probably be renamed to something like chain_sim_cutoff to remove confusion between it and stat_max in simulate_summary() and likelihood(), which is the value above which stat is set to Inf in the former and set to stat_max in the latter. Renaming it in simulate_chains() will resolve #163.

On the above point, we need to consider using a single definition & operation that can be inherited between simulate_summary() and likelihood() as these varying definitions might be a pointer to design considerations that need to be improved.

@sbfnk Methodologically, what would be wrong with doing stat_track[stat_track >= stat_max] <- stat_max? in simulate_summry() below?

https://github.com/epiverse-trace/epichains/blob/0941451525900125a3dade95b1f8c317d71a4043/R/simulate.r#L414

to align with what's done in likelihood()?

https://github.com/epiverse-trace/epichains/blob/0941451525900125a3dade95b1f8c317d71a4043/R/likelihood.R#L97

sbfnk commented 8 months ago

In likelihood, it's part of a censoring limit, and a way to include the possibility of R>1 in the inference. See Blumberg & Lloyd-Smith, 2013 section 2.4:

To create a general model that incorporates the possibility of non-extinction when R0 > 1, a more formal approach would be to include an arbitrarily large censoring limit in the likelihood calculation for the maximum observable chain size.". Setting the chains with >=stat_max to stat_max is done internally convenience to later allow identification of those chains, but the user never sees that this happens.

In simulate, its meaning is the same (so I think it makes sense to have the same argument) but now the user is exposed to the results - they're set to Inf to mean "this has been censored" but of course the same could perhaps be achieved by setting them to stat_max, which the user has to set after all.

One way or the other, the Issue highlights that the documentation is not clear, which we should try to address. Perhaps it would make sense to leave the code as is but say in the doc that any outbreaks of size greater than stat_max (or a renamed argument) will be treated as if it were of infinite size, and that this represents a censoring limit when doing a likelihood calculation?