Changes to `generate_sim_data`

ben18785 commented 6 months ago

Sorry @ntorresd -- I'm revisiting this one as I've genuinely found it hard to use this function.

To give an example, I am trying to replace the reliance on the simdata_large_epi file in a vignette by replacing it with a simulated dataset as per #160. I tried:

df <- generate_sim_data( sim_data=data.frame(age_mean_f=c(2, 7, 12, 17, 22, 27, 32, 37, 42, 47), tsur=2050), foi=rep(1.5, 47), sample_size_by_age = c(20, 25, 25, 25, 25, 25, 25, 25, 25, 30) )

which I thought should work since it allows 47 FOIs (which seems sensible...from 1 up until a max age of 47). From reading the function documentation, it's still not clear to me how to get this to work, and I think we should make various changes.

generate_sim_data is clunky and vague; simulate_serosurvey uses full words and is immediately understandable, and I propose this change of name.

In addition, I think we should consider changes to its arguments; current arguments for this include:

sim_data which is a vague name and I would propose changing it to serosurvey_characteristics. This currently has a column called age_mean_f which specifies "Age group markers" -- I am not sure what this means. I'd suggest we change this column to be named ages_surveyed; I would suggest changing tsur to year_survey. sample_size_by_age would actually work better as a column in this data frame since then it's guaranteed to be of the right length, and I would suggest renaming it n_sample
foi works fine as it is for either time- or age-varying FOIs, but it won't generalise to time- and age-varying FOIs. To handle this, I propose a change to it: we require that users supply a data frame with columns: age and foi for age-varying FOIs; year and foi for time-varying FOIs; and age, year and foi for age- and time-varying FOIs. This has the added benefit that we can check the users are supplying the right inputs for whichever type of model they have.

We also need to check the inputs to the function and give the user better error handling when they have supplied inappropriate arguments:

We should check whether FOI is of right size (much easier to do when we have a data frame input)
sample sizes by age (which I think can be handled much more nicely as described above as an input to the old sim_data argument)

I also propose changes to the output of the function:

I don't understand why we return a data frame that has: age_mean_f, age_min and age_max since, to the best of my knowledge, we don't have a way of simulating data for heterogeneous age cohorts (e.g. a group of individuals with ages between 10 and 20); we probably should have this
tsur could be year_survey
counts could be n_seropositive
total should be n_sample to mirror the function inputs
survey can just be removed (and I'd suggest removing it from the inputs)

ntorresd commented 6 months ago

It's fine @ben18785, I also think we can improve these functions a lot to make them easier and more intuitive to use.

Just to clarify some points. As stated in the simdata_large_epi documentation (which is very incomplete, my bad), this serosurvey is intended to emulate a large epidemic ocurring during 3 years at some point in time. The right FoI to use in this case is thus something like:

foi_sim_large_epi <- c(rep(0, 32), rep(1.5, 3), rep(0, 15))

This is the red line in the corresponding example on the vignettes.

Another important point to note is the age structure of the serosurvey:

 $ total   : num  20 25 25 25 25 25 25 25 25 30
 $ counts  : int  0 3 2 12 6 11 10 10 18 17
 $ age_min : num  2 7 12 17 22 27 32 37 42 47
 $ age_max : num  2 7 12 17 22 27 32 37 42 47
 $ tsur    : num  2050 2050 2050 2050 2050 2050 2050 2050 2050 2050

At the time I generated the simulated datasets I didn't realized that the ages were not saved correctly; it should read:

 $ total   : num  20 25 25 25 25 25 25 25 25 30
 $ counts  : int  0 3 2 12 6 11 10 10 18 17
 $ age_min : num  1 6 11 16 21 26 31 36 41 46
 $ age_max : num  5 10 15 20 25 30 35 40 45 50
 $ tsur    : num  2050 2050 2050 2050 2050 2050 2050 2050 2050 2050

This doesn't affect the rest of the calculation because when computing the age group markers age_mean_f (which stands for the mean floor between age_min and age_max) in prepare_serodata it yields a valid result anyway.

On the other hand, note that this serosurvey is grouped by age. As you pointed out, we don't have a direct way to simulate heterogeneous age cohorts. The way we bypass this is simulating data for each age between min(age_min) and max(age_max) and then grouping the data. This looks like:

serodata_sim <- generate_sim_data(
  sim_data = data.frame(
    age=seq(1,50),
    tsur=2050),
  foi=foi_sim_large_epi,
  sample_size_by_age = c(rep(4,5), rep(5,40), rep(6,5))
)

'data.frame':   50 obs. of  7 variables:
 $ age    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ tsur   : num  2050 2050 2050 2050 2050 2050 2050 2050 2050 2050 ...
 $ age_min: int  1 2 3 4 5 6 7 8 9 10 ...
 $ age_max: int  1 2 3 4 5 6 7 8 9 10 ...
 $ counts : int  0 0 0 0 0 0 0 0 0 0 ...
 $ total  : num  4 4 4 4 4 5 5 5 5 5 ...
 $ survey : chr  "sim_data" "sim_data" "sim_data" "sim_data" ...

And then:

serodata_sim <- group_sim_data(serodata_sim, step=5)

which returns a prepared serosurvey with the characteristics we wanted (in retrospective I think we shouldn't use prepare_serodata by default here, but return a simpler output):

    age_group total counts tsur country   survey age_min age_max age_mean_f sample_size birth_year prev_obs prev_obs_lower prev_obs_upper
X       (0,5]    20      0 2050    None sim_data       1       5          3         250       2047     0.00      0.0000000      0.1684335
X.1    (5,10]    25      0 2050    None sim_data       6      10          8         250       2042     0.00      0.0000000      0.1371852
X.2   (10,15]    25      0 2050    None sim_data      11      15         13         250       2037     0.00      0.0000000      0.1371852
X.3   (15,20]    25     23 2050    None sim_data      16      20         18         250       2032     0.92      0.7396942      0.9901604
X.4   (20,25]    25     25 2050    None sim_data      21      25         23         250       2027     1.00      0.8628148      1.0000000
X.5   (25,30]    25     25 2050    None sim_data      26      30         28         250       2022     1.00      0.8628148      1.0000000
X.6   (30,35]    25     25 2050    None sim_data      31      35         33         250       2017     1.00      0.8628148      1.0000000
X.7   (35,40]    25     24 2050    None sim_data      36      40         38         250       2012     0.96      0.7964831      0.9989878
X.8   (40,45]    25     25 2050    None sim_data      41      45         43         250       2007     1.00      0.8628148      1.0000000
X.9   (45,50]    30     30 2050    None sim_data      46      50         48         250       2002     1.00      0.8842967      1.0000000

Baring this in mind:

generate_sim_data is clunky and vague; simulate_serosurvey uses full words and is immediately understandable, and I propose this change of name.

In addition, I think we should consider changes to its arguments; current arguments for this include:

sim_data which is a vague name and I would propose changing it to serosurvey_characteristics. This currently has a column called age_mean_f which specifies "Age group markers" -- I am not sure what this means. I'd suggest we change this column to be named ages_surveyed; I would suggest changing tsur to year_survey. sample_size_by_age would actually work better as a column in this data frame since then it's guaranteed to be of the right length, and I would suggest renaming it n_sample

foi works fine as it is for either time- or age-varying FOIs, but it won't generalise to time- and age-varying FOIs. To handle this, I propose a change to it: we require that users supply a data frame with columns: age and foi for age-varying FOIs; year and foi for time-varying FOIs; and age, year and foi for age- and time-varying FOIs. This has the added benefit that we can check the users are supplying the right inputs for whichever type of model they have.

I agree on changing the names of the functions and variables as you suggest. Some of them may be more troublesome than others, since we have to change them as well in the preloaded datasets in order for the R-CMD checks to pass (just as we need to do for #112). I suggest we open a separate issue and PR to address these changes of names.

We also need to check the inputs to the function and give the user better error handling when they have supplied inappropriate arguments:

We should check whether FOI is of right size (much easier to do when we have a data frame input)

sample sizes by age (which I think can be handled much more nicely as described above as an input to the old sim_data argument)

This was recently addressed by @jpavlich in #168. I merged it today, so please rebase your branch before continuing.

I also propose changes to the output of the function:

I don't understand why we return a data frame that has: age_mean_f, age_min and age_max since, to the best of my knowledge, we don't have a way of simulating data for heterogeneous age cohorts (e.g. a group of individuals with ages between 10 and 20); we probably should have this

tsur could be year_survey

counts could be n_seropositive

total should be n_sample to mirror the function inputs

survey can just be removed (and I'd suggest removing it from the inputs)

I think we should keep at least age_min and age_max since we can use them to generate grouped surveys as I explained above. I agree with you on the others, we should simplify the output as much as possible.

ben18785 commented 4 months ago

As discussed in in-person meeting, we will change this to be of the form:

feature_df <- data.frame(
    age_min=c(1, 6, 11), 
    age_max=c(5, 10, 20),
    sample_size=c(10, 15, 20),
    year_survey=c(2010, 2010, 2010)
)

serosurvey_time_example <- simulate_serosurvey(
  model="time",
  foi = data.frame(year=c(1990,1991,...,2009), foi=c(0.1, 0.2, ..., 0.3)),
  seroreversion=0,
  survey_features=feature_df
)

serosurvey_age_example <- simulate_serosurvey(
  model="age",
  foi = data.frame(age=c(1,2,...,20), foi=c(0.1, 0.2, ..., 0.3)),
  seroreversion=0.2,
  survey_features=feature_df
)

ages <- seq(1, 10, 1)
years <- seq(1990, 2000, 1)
foi_age_and_time <- expand_grid(year=years, age=ages) %>% 
  mutate(foi=0.1)

serosurvey_age_and_time_example <- simulate_serosurvey(
  model="age-time",
  foi = foi_age_and_time,
  seroreversion=0.2,
  survey_features=feature_df
)

But internally & externally we will have e.g. simulate_serosurvey_age_model which will be called by the helper function simulate_serosurvey.

epiverse-trace / serofoi

Changes to `generate_sim_data` #169