Clarify `stat_max` - Githubissues

epiverse-trace / epichains

[Under active development] Methods for simulating and analysing the sizes and lengths of infectious disease transmission chains from branching process models

https://epiverse-trace.github.io/epichains/

Other

5 stars 2 forks source link

Clarify `stat_max` #237

Closed jamesmbaazam closed 3 months ago

jamesmbaazam commented 5 months ago

This PR closes #193 by renaming stat_max to stat_threshold and improves the documentation to help clarify some nuances between stat_threshold in simulate_chains() and simulate_summary(). It also renames the infinite argument (old name of stat_max) in rborel() to censor_at and improves its documentation. In particular,

stat_threshold in simulate_chains() is a stopping criterion for chains. When the cumulative statistic of chains reaches or surpasses it, they end
stat_threshold in simulate_summary() is both a stopping criterion for chains and a censoring limit. When the cumulative statistics of chains reach or surpass it, they end. Additionally, chain statistic values that are >= stat_threshold are set to Inf.
censor_at in the borel functions is passed to stat_threshold in simulate_summary() as a stopping criterion and for censoring chain sizes but does not refer to chain sizes as used in simulate_summary(), hence the name change and dedicated documentation.

jamesmbaazam commented 3 months ago

The need to clarify this argument came up today in the following tutorial. A participant asked, "Why do some chains produce more cumulative cases than what stat_max is?" My answer was roughly, "Chains may still produce more cases than stat_max near the stopping criterion but will not continue to produce more cases after the stopping criterion is met. For example, if the stopping criterion is 10 cases, a chain may produce 12 cases near the stopping criterion." This raises the need to clarify stat_max further as it suggests that chains cannot produce outcomes above it.

@joshwlambert and @avallecam Do you have thoughts on how to clarify this further? I'm tagging you because we had a brief discussion about it after the tutorial session.

joshwlambert commented 3 months ago

Would an alternative be to ensure it doesn't, i.e. select cases in the final generation such that the stat is stat_max?

My preference would be to either 1) keep the name stat_max and make it a hard limit for the simulation so the user cannot exceed this number, or 2) change the argument name to replace max with another word that doesn't imply a hard limit and then improve documentation to how this works.

This is where I think the confusion stemmed from in the tutorial because it is a max that does not behave as a max.

joshwlambert commented 3 months ago

FYI we have the same functionality dilemma in {simulist} and have chosen to have a soft limit so the "maximum" can be exceeded and the function returns a warning to let the user know the "maximum" (either default 1e4 or user-specified) has been exceeded and states the number of cases returned.

https://github.com/epiverse-trace/simulist/blob/main/R/sim_linelist.R#L70

This discussion also makes it clear that the documentation in sim_linelist() could also be improved.

jamesmbaazam commented 3 months ago

Would an alternative be to ensure it doesn't, i.e. select cases in the final generation such that the stat is stat_max?

I think this might be the desirable behaviour. We could treat the last generation of offspring as potential offspring and sample the number needed to reach stat_max by finding the difference between the cumulative number before the last generation and stat_max.

jamesmbaazam commented 3 months ago

Would an alternative be to ensure it doesn't, i.e. select cases in the final generation such that the stat is stat_max?

My preference would be to either 1) keep the name stat_max and make it a hard limit for the simulation so the user cannot exceed this number, or 2) change the argument name to replace max with another word that doesn't imply a hard limit and then improve documentation to how this works.

This is where I think the confusion stemmed from in the tutorial because it is a max that does not behave as a max.

Thanks, Josh.

I think option 1 is better because it can be used to simulate desirable behavior. Allowing simulations to go beyond the limit set will make the results difficult to explain.

jamesmbaazam commented 3 months ago

FYI we have the same functionality dilemma in {simulist} and have chosen to have a soft limit so the "maximum" can be exceeded and the function returns a warning to let the user know the "maximum" (either default 1e4 or user-specified) has been exceeded and states the number of cases returned.

epiverse-trace/simulist@main/R/sim_linelist.R#L70

This discussion also makes it clear that the documentation in sim_linelist() could also be improved.

I may be misinterpreting your definition of "soft" but from the code & output, it does seem like you don't return values above the max, so I would still interpret it as a "hard" limit.

joshwlambert commented 3 months ago

I think option 1 is better because it can be used to simulate desirable behavior. Allowing simulations to go beyond the limit set will make the results difficult to explain.

Sounds good to me.

I may be misinterpreting your definition of "soft" but from the code & output, it does seem like you don't return values above the max, so I would still interpret it as a "hard" limit.

Here's a reprex to show what I mean. The outbreak_size argument takes a vector of two elements a hard lower limit, and a soft upper limit for the outbreak size.

set.seed(2)
library(simulist)
linelist <- sim_linelist(
  contact_distribution = function(x) dpois(x = x, lambda = 2),
  infect_period = function(x) dgamma(x = x, shape = 3, scale = 3),
  prob_infect = 0.55,
  onset_to_hosp = function(x) dgamma(x = x, shape = 2, scale = 2),
  onset_to_death = function(x) dgamma(x = x, shape = 2, scale = 2),
  outbreak_size = c(5, 20)
)
#> Warning: Number of cases exceeds maximum outbreak size. 
#> Returning data early with 29 cases and 44 total contacts (including cases).

^{Created on 2024-05-03 with reprex v2.1.0}

jamesmbaazam commented 3 months ago

I think option 1 is better because it can be used to simulate desirable behavior. Allowing simulations to go beyond the limit set will make the results difficult to explain.

Sounds good to me.

I may be misinterpreting your definition of "soft" but from the code & output, it does seem like you don't return values above the max, so I would still interpret it as a "hard" limit.

Here's a reprex to show what I mean. The outbreak_size argument takes a vector of two elements a hard lower limit, and a soft upper limit for the outbreak size.
set.seed(2)
library(simulist)
linelist <- sim_linelist(
  contact_distribution = function(x) dpois(x = x, lambda = 2),
  infect_period = function(x) dgamma(x = x, shape = 3, scale = 3),
  prob_infect = 0.55,
  onset_to_hosp = function(x) dgamma(x = x, shape = 2, scale = 2),
  onset_to_death = function(x) dgamma(x = x, shape = 2, scale = 2),
  outbreak_size = c(5, 20)
)
#> Warning: Number of cases exceeds maximum outbreak size. 
#> Returning data early with 29 cases and 44 total contacts (including cases).
Created on 2024-05-03 with reprex v2.1.0

Ah! Thanks for this. That makes sense. I don't think we should have a hard lower limit in {epichains} as it is by definition the seeing cases, if the outbreak doesn't take off. The soft upper limit is what is already implemented as stat_max, so it's currently a matter of documentation which can be solved with a name change and a proper wording.

My concern is more about whether it is desirable behaviour. For example, if a user sets stat_max to 500 cases and gets 800 cases, that's quite undesirable. I think the solution would be what I suggested in https://github.com/epiverse-trace/epichains/pull/237#issuecomment-2092761867, where we sample the remainder from the excess cases and return exactly stat_max.

jamesmbaazam commented 3 months ago

Upon further discussion with Seb, we've decided it might be better to rename stat_max to stat_threshold and improve the wording to indicate that it is not a hard limit and can be exceeded due to stochastic effects. The discussion about making it a hard limit has been logged in #243 for further brainstorming.