epiverse-trace / epichains

[Under active development] Methods for simulating and analysing the sizes and lengths of infectious disease transmission chains from branching process models
https://epiverse-trace.github.io/epichains/
Other
5 stars 2 forks source link

Improve `simulate_*()` functions: Rename columns + improve object names + comments #242

Closed jamesmbaazam closed 3 months ago

jamesmbaazam commented 3 months ago

This PR closes #238 and #175.

Function enhancements + Documentation improvements

Currently, the <epichains> object returns columns with names infectee_id, sim_id, infector_id, and generation, and optionally, susc_pop and time (if pop and generation_time are specified respectively). However, these columns are confusing, swapped in interpretation, and not straightforward to explain as noted in the linked issues. The sim_id column is also not unique across the dataset, making it hard to interpret.

User-facing changes

Non-user-facing changes

Additionally, some of the objects in the code have been renamed and comments have been improved to make the code (hopefully) easier to read.

NA (package unpublished).

jamesmbaazam commented 3 months ago

This PR also creates an opportunity to plot the network contained the returned <epichains> object using {epicontacts}. Here's a reprex.

reprex::reprex({
  library(epicontacts)
  # Install this PR and load the package
  pak::pkg_install("epiverse-trace/epichains#242")
  library(epichains)
  # Simulate an outbreak
  set.seed(32)
  outbreak <- simulate_chains(
    index_cases = 5,
    statistic = "size",
    offspring_dist = rpois,
    generation_time = function(n) rlnorm(n, meanlog = 0.58, sdlog = 1.58),
    lambda = 1.5,
    stat_max = 30
  )
 # Create an epicontacts object
  plot_df <- make_epicontacts(
    linelist = outbreak,
    contacts = outbreak,
    id = "infectee",
    from = "infector",
    to = "infectee",
    directed = TRUE
  )
  # Plot the epicontacts object
  plot(plot_df)
})

Screenshot 2024-05-04 at 15 23 22

The code used above can be converted to a simple S3 plot.epichains() method that takes an <epichains> object and extracts the right columns, passes them to epicontacts to do the plotting. This might be useful to the user.

sbfnk commented 3 months ago

Before I go into detailed review can I just ask if one change I spotted was intentional, as it affects interpretation of the results and what is the "correct" way of accounting.

Before (e.g. in bpmodels) each index case (/simulation) generated their own infectee ids, i.e. we could have

sim_id   infectee_id
     1             1
     1             2
     1             3
     2             1
     2             2

etc.

Now the infectee_id is shared across all index cases (/simulations)

index_case   infectee_id
         1             1
         1             2
         1             3
         2             4
         2             5

I think the way we want this relates to how we interpret the simulations (this also affects the sim_id vs index_case naming):

It only occurred to me when reviewing this PR that we're conflating these two concepts (also noting that the first argument is called index_cases). We could support one of the two options (where option 1 requires least effort as it doesn't require any updating of the indexing) or perhaps both, but we should delineate clearly between them.

jamesmbaazam commented 3 months ago

Now the infectee_id is shared across all index cases (/simulations)

Yes, this change was in response to your suggestions in the linked issues and my summary of how I understood it here https://github.com/epiverse-trace/epichains/issues/238#issuecomment-2088396531.

For now, I'll revert to option 1, i.e., using the sim_id column name (so as not to delay the version release). We can revisit the second option in the future.

jamesmbaazam commented 3 months ago

@sbfnk Here is what a reprex of what your comments here + in #238 (i.e., rename sim_id to infectee) would look like

library(epichains)
  set.seed(123)
  epc_out <- simulate_chains(
    index_cases = 5,
    statistic = "length",
    offspring_dist = rpois,
    stat_max = 100,
    lambda = 0.5
  )
  epc_out
#>    index_case infector infectee generation
#> 1           1       NA        1          1
#> 2           2       NA        1          1
#> 3           3       NA        1          1
#> 4           4       NA        1          1
#> 5           5       NA        1          1
#> 6           2        1        2          2
#> 7           4        1        2          2
#> 8           5        1        2          2
#> 9           5        1        3          2
#> 10          5        2        4          3
Created on 2024-05-09 with [reprex v2.1.0](https://reprex.tidyverse.org/)

The index_case column refers to the seeding index cases that remain active through the generations, and the previous sim_id column is renamed to infectee (in response to #238) with unique IDs within simulations but not shared across.

Does this reflect what you were suggesting above?

sbfnk commented 3 months ago

Does this reflect what you were suggesting above?

Yes except that I'd revert index_case to be called sim_id (and probably also rename the corresponding argument to nsims or similar - so they're not index cases within the same epidemic, but different epidemics. Does that make sense?

jamesmbaazam commented 3 months ago

Does this reflect what you were suggesting above?

Yes except that I'd revert index_case to be called sim_id (and probably also rename the corresponding argument to nsims or similar - so they're not index cases within the same epidemic, but different epidemics. Does that make sense?

Yes, it does. I've made the changes now.

jamesmbaazam commented 3 months ago

Great! Do you think it would be perhaps clearer if we called the column simulation instead of sim_id? Would then link with the message below where we say "number of simulations". We could also think about renaming nsims to simulations.

I don't feel strongly about either of them and am happy to leave the decision up to you.

Thanks, Seb. Just leaving a note here that after discussing this today, we agreed to change the argument to nchains, the column name to chain, and the message under the print message to "Number of chains: ". I'll make the changes.