Closed ben18785 closed 1 month ago
It's fine @ben18785, I also think we can improve these functions a lot to make them easier and more intuitive to use.
Just to clarify some points. As stated in the simdata_large_epi documentation (which is very incomplete, my bad), this serosurvey is intended to emulate a large epidemic occurring over 3 years at some point in time. The right FoI to use in this case is thus something like:
foi_sim_large_epi <- c(rep(0, 32), rep(1.5, 3), rep(0, 15))
This is the red line in the corresponding example in the vignettes.
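To make the shape of this FoI concrete, here is a minimal sketch of the seroprevalence it implies at each age. It assumes a simple catalytic model without seroreversion, and that someone aged a at the survey was exposed to the last a entries of the vector; both are assumptions for illustration, not a description of what the package does internally.

```r
# FoI vector from the comment above: one value per year, oldest year first.
foi_sim_large_epi <- c(rep(0, 32), rep(1.5, 3), rep(0, 15))

# Assumption: someone aged `a` at the survey was exposed to the last `a`
# entries of the vector, so the cumulative FoI by age is a cumsum over the
# reversed vector.
cumulative_foi <- cumsum(rev(foi_sim_large_epi))

# Assumption: catalytic model without seroreversion.
expected_prev <- 1 - exp(-cumulative_foi)

# Expected seroprevalence for a few ages around the epidemic.
round(expected_prev[c(15, 16, 17, 18, 50)], 3)
```

Under these assumptions, ages 1 to 15 have zero expected seroprevalence and it rises sharply from age 16 onwards, which is consistent with the grouped output shown later in this comment.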
Another important point to note is the age structure of the serosurvey:
$ total : num 20 25 25 25 25 25 25 25 25 30
$ counts : int 0 3 2 12 6 11 10 10 18 17
$ age_min : num 2 7 12 17 22 27 32 37 42 47
$ age_max : num 2 7 12 17 22 27 32 37 42 47
$ tsur : num 2050 2050 2050 2050 2050 2050 2050 2050 2050 2050
At the time I generated the simulated datasets I didn't realize that the ages were not saved correctly; it should read:
$ total : num 20 25 25 25 25 25 25 25 25 30
$ counts : int 0 3 2 12 6 11 10 10 18 17
$ age_min : num 1 6 11 16 21 26 31 36 41 46
$ age_max : num 5 10 15 20 25 30 35 40 45 50
$ tsur : num 2050 2050 2050 2050 2050 2050 2050 2050 2050 2050
This doesn't affect the rest of the calculation, because prepare_serodata still yields a valid result when it computes the age group markers age_mean_f (the floor of the mean of age_min and age_max).
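As a quick check of that statement, the sketch below computes the markers by hand for both versions of the age columns. The formula is my reading of "mean floor" and is written out here for illustration; it is not copied from prepare_serodata.

```r
# Age limits as they were saved (age_min equal to age_max) ...
saved_age_min <- c(2, 7, 12, 17, 22, 27, 32, 37, 42, 47)
saved_age_max <- saved_age_min

# ... and as they were intended.
intended_age_min <- c(1, 6, 11, 16, 21, 26, 31, 36, 41, 46)
intended_age_max <- c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50)

# Assumed definition of the marker: floor of the mean of the two limits.
age_mean_f_saved <- floor((saved_age_min + saved_age_max) / 2)
age_mean_f_intended <- floor((intended_age_min + intended_age_max) / 2)

# Birth year implied by the intended markers for a survey in 2050.
birth_year <- 2050 - age_mean_f_intended
```

Both versions give one valid integer marker per age group; they differ by one year (2 versus 3 for the first group, and so on).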
On the other hand, note that this serosurvey is grouped by age. As you pointed out, we don't have a direct way to simulate heterogeneous age cohorts. We work around this by simulating data for each age between min(age_min) and max(age_max) and then grouping the data. This looks like:
serodata_sim <- generate_sim_data(
  sim_data = data.frame(
    age = seq(1, 50),
    tsur = 2050
  ),
  foi = foi_sim_large_epi,
  # 4 samples for ages 1-5, 5 for ages 6-45, 6 for ages 46-50
  sample_size_by_age = c(rep(4, 5), rep(5, 40), rep(6, 5))
)
'data.frame': 50 obs. of 7 variables:
$ age : int 1 2 3 4 5 6 7 8 9 10 ...
$ tsur : num 2050 2050 2050 2050 2050 2050 2050 2050 2050 2050 ...
$ age_min: int 1 2 3 4 5 6 7 8 9 10 ...
$ age_max: int 1 2 3 4 5 6 7 8 9 10 ...
$ counts : int 0 0 0 0 0 0 0 0 0 0 ...
$ total : num 4 4 4 4 4 5 5 5 5 5 ...
$ survey : chr "sim_data" "sim_data" "sim_data" "sim_data" ...
And then:
serodata_sim <- group_sim_data(serodata_sim, step=5)
which returns a prepared serosurvey with the characteristics we wanted (in retrospect I think we shouldn't use prepare_serodata by default here, but return a simpler output):
age_group total counts tsur country survey age_min age_max age_mean_f sample_size birth_year prev_obs prev_obs_lower prev_obs_upper
X (0,5] 20 0 2050 None sim_data 1 5 3 250 2047 0.00 0.0000000 0.1684335
X.1 (5,10] 25 0 2050 None sim_data 6 10 8 250 2042 0.00 0.0000000 0.1371852
X.2 (10,15] 25 0 2050 None sim_data 11 15 13 250 2037 0.00 0.0000000 0.1371852
X.3 (15,20] 25 23 2050 None sim_data 16 20 18 250 2032 0.92 0.7396942 0.9901604
X.4 (20,25] 25 25 2050 None sim_data 21 25 23 250 2027 1.00 0.8628148 1.0000000
X.5 (25,30] 25 25 2050 None sim_data 26 30 28 250 2022 1.00 0.8628148 1.0000000
X.6 (30,35] 25 25 2050 None sim_data 31 35 33 250 2017 1.00 0.8628148 1.0000000
X.7 (35,40] 25 24 2050 None sim_data 36 40 38 250 2012 0.96 0.7964831 0.9989878
X.8 (40,45] 25 25 2050 None sim_data 41 45 43 250 2007 1.00 0.8628148 1.0000000
X.9 (45,50] 30 30 2050 None sim_data 46 50 48 250 2002 1.00 0.8842967 1.0000000
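For readers who want to see what the grouping step amounts to, here is a minimal base-R sketch. It is not the group_sim_data implementation; the per_age data frame and its counts are made up for illustration, and only the totals mirror the example above.

```r
# Per-age data shaped like the generate_sim_data() output above.
# The counts are invented: everyone aged 16 or older is seropositive.
per_age <- data.frame(
  age = 1:50,
  total = c(rep(4, 5), rep(5, 40), rep(6, 5))
)
per_age$counts <- ifelse(per_age$age >= 16, per_age$total, 0)

# Bin ages into (0,5], (5,10], ... and sum totals and counts within each bin.
step <- 5
per_age$age_group <- cut(per_age$age, breaks = seq(0, max(per_age$age), by = step))
grouped <- aggregate(cbind(total, counts) ~ age_group, data = per_age, FUN = sum)

# Recover the age limits of each bin from the ages it contains.
grouped$age_min <- as.vector(tapply(per_age$age, per_age$age_group, min))
grouped$age_max <- as.vector(tapply(per_age$age, per_age$age_group, max))

grouped
```

The per-group totals (20, then 25 eight times, then 30) match the total column in the table above, which is why the per-age sample sizes were chosen as 4, 5 and 6.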
Bearing this in mind:
generate_sim_data is clunky and vague; simulate_serosurvey uses full words and is immediately understandable, and I propose this change of name.
In addition, I think we should consider changes to its arguments; current arguments for this include:
- sim_data which is a vague name and I would propose changing it to serosurvey_characteristics. This currently has a column called age_mean_f which specifies "Age group markers" -- I am not sure what this means. I'd suggest we change this column to be named ages_surveyed; I would suggest changing tsur to year_survey.
- sample_size_by_age would actually work better as a column in this data frame since then it's guaranteed to be of the right length, and I would suggest renaming it n_sample
- foi works fine as it is for either time- or age-varying FOIs, but it won't generalise to time- and age-varying FOIs. To handle this, I propose a change to it: we require that users supply a data frame with columns: age and foi for age-varying FOIs; year and foi for time-varying FOIs; and age, year and foi for age- and time-varying FOIs. This has the added benefit that we can check the users are supplying the right inputs for whichever type of model they have.
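The input check that a data frame makes possible could look roughly like the sketch below. The function name validate_foi_df and the model labels "age", "time" and "age-time" are hypothetical choices for illustration; nothing here is existing package code.

```r
# Hypothetical helper (not part of the package): checks that `foi` has
# exactly the columns expected for the chosen model type.
validate_foi_df <- function(foi, model = c("age", "time", "age-time")) {
  model <- match.arg(model)
  required <- switch(model,
    "age" = c("age", "foi"),
    "time" = c("year", "foi"),
    "age-time" = c("age", "year", "foi")
  )
  if (!is.data.frame(foi)) {
    stop("`foi` must be a data frame.")
  }
  if (!setequal(names(foi), required)) {
    stop(
      "For the '", model, "' model, `foi` must have exactly the columns: ",
      paste(required, collapse = ", "), "."
    )
  }
  if (anyNA(foi$foi) || any(foi$foi < 0)) {
    stop("`foi$foi` must contain non-negative, non-missing values.")
  }
  invisible(foi)
}

# A time-varying FOI passes for the "time" model ...
ok <- validate_foi_df(data.frame(year = 1990:1992, foi = 0.1), model = "time")

# ... while an age-varying FOI supplied to the "time" model is rejected.
bad <- try(
  validate_foi_df(data.frame(age = 1:3, foi = 0.1), model = "time"),
  silent = TRUE
)
```

Checking column names up front means a mismatch between the model type and the FOI input fails with a readable message rather than somewhere deep in the simulation.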
I agree on changing the names of the functions and variables as you suggest. Some of them may be more troublesome than others, since we have to change them as well in the preloaded datasets in order for the R-CMD checks to pass (just as we need to do for #112). I suggest we open a separate issue and PR to address these changes of names.
We also need to check the inputs to the function and give the user better error handling when they have supplied inappropriate arguments:
- We should check whether FOI is of right size (much easier to do when we have a data frame input)
- sample sizes by age (which I think can be handled much more nicely as described above as an input to the old sim_data argument)
This was recently addressed by @jpavlich in #168. I merged it today, so please rebase your branch before continuing.
I also propose changes to the output of the function:
- I don't understand why we return a data frame that has: age_mean_f, age_min and age_max since, to the best of my knowledge, we don't have a way of simulating data for heterogeneous age cohorts (e.g. a group of individuals with ages between 10 and 20); we probably should have this
- tsur could be year_survey
- counts could be n_seropositive
- total should be n_sample to mirror the function inputs
- survey can just be removed (and I'd suggest removing it from the inputs)
I think we should keep at least age_min and age_max since we can use them to generate grouped surveys as I explained above. I agree with you on the others, we should simplify the output as much as possible.
As discussed in the in-person meeting, we will change this to be of the form:
# expand_grid() is from tidyr; mutate() and %>% are from dplyr
library(tidyr)
library(dplyr)

# Survey design: one row per age group
feature_df <- data.frame(
  age_min = c(1, 6, 11),
  age_max = c(5, 10, 20),
  sample_size = c(10, 15, 20),
  year_survey = c(2010, 2010, 2010)
)

# Time-varying FOI: one row per year
serosurvey_time_example <- simulate_serosurvey(
  model = "time",
  foi = data.frame(year = c(1990, 1991, ..., 2009), foi = c(0.1, 0.2, ..., 0.3)),
  seroreversion = 0,
  survey_features = feature_df
)

# Age-varying FOI: one row per age
serosurvey_age_example <- simulate_serosurvey(
  model = "age",
  foi = data.frame(age = c(1, 2, ..., 20), foi = c(0.1, 0.2, ..., 0.3)),
  seroreversion = 0.2,
  survey_features = feature_df
)

# Age- and time-varying FOI: one row per (year, age) combination
ages <- seq(1, 10, 1)
years <- seq(1990, 2000, 1)
foi_age_and_time <- expand_grid(year = years, age = ages) %>%
  mutate(foi = 0.1)
serosurvey_age_and_time_example <- simulate_serosurvey(
  model = "age-time",
  foi = foi_age_and_time,
  seroreversion = 0.2,
  survey_features = feature_df
)
But internally & externally we will have e.g. simulate_serosurvey_age_model, which will be called by the helper function simulate_serosurvey.
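To illustrate that layout, here is a rough sketch of a dispatcher with model-specific functions behind it. Only the names simulate_serosurvey and simulate_serosurvey_age_model come from the discussion; the other function names, the helpers prob_seropositive and simulate_groups, and the n_seropositive output column are assumptions. The sketch also assumes that a cohort aged a in year_survey was exposed in years year_survey - a to year_survey - 1 (inferred from the 1990 to 2009 range in the example above), that FOI is constant within a year, and that every age within a group is equally represented. It is not the package implementation.

```r
# Probability of being seropositive at the survey, given the FOI experienced
# in each year of life (oldest first), with optional seroreversion.
prob_seropositive <- function(lambdas, seroreversion = 0) {
  stopifnot(!anyNA(lambdas))  # fails if the FOI does not cover a needed year or age
  p <- 0
  for (lambda in lambdas) {
    rate <- lambda + seroreversion
    if (rate > 0) {
      p_eq <- lambda / rate                # equilibrium seroprevalence
      p <- p_eq + (p - p_eq) * exp(-rate)  # exact one-year update
    }
  }
  p
}

# Shared engine: `lambdas_for(age, year_survey)` returns the FOIs a cohort has
# experienced; one binomial draw is made per age group.
simulate_groups <- function(lambdas_for, survey_features, seroreversion) {
  n_seropositive <- vapply(seq_len(nrow(survey_features)), function(i) {
    row <- survey_features[i, ]
    probs <- vapply(seq(row$age_min, row$age_max), function(a) {
      prob_seropositive(lambdas_for(a, row$year_survey), seroreversion)
    }, numeric(1))
    rbinom(1, size = row$sample_size, prob = mean(probs))
  }, numeric(1))
  cbind(survey_features, n_seropositive = n_seropositive)
}

simulate_serosurvey_time_model <- function(foi, survey_features, seroreversion = 0) {
  simulate_groups(function(age, year_survey) {
    foi$foi[match(seq(year_survey - age, year_survey - 1), foi$year)]
  }, survey_features, seroreversion)
}

simulate_serosurvey_age_model <- function(foi, survey_features, seroreversion = 0) {
  simulate_groups(function(age, year_survey) {
    foi$foi[match(seq_len(age), foi$age)]
  }, survey_features, seroreversion)
}

simulate_serosurvey_age_time_model <- function(foi, survey_features, seroreversion = 0) {
  key <- paste(foi$age, foi$year)
  simulate_groups(function(age, year_survey) {
    years <- seq(year_survey - age, year_survey - 1)
    foi$foi[match(paste(seq_len(age), years), key)]
  }, survey_features, seroreversion)
}

# Thin wrapper that picks the model-specific function.
simulate_serosurvey <- function(model, foi, survey_features, seroreversion = 0) {
  simulator <- switch(model,
    "time" = simulate_serosurvey_time_model,
    "age" = simulate_serosurvey_age_model,
    "age-time" = simulate_serosurvey_age_time_model,
    stop("`model` must be one of 'time', 'age' or 'age-time'.")
  )
  simulator(foi, survey_features, seroreversion)
}

# Usage with a constant, made-up FOI covering 1990 to 2009.
set.seed(1)
feature_df <- data.frame(
  age_min = c(1, 6, 11),
  age_max = c(5, 10, 20),
  sample_size = c(10, 15, 20),
  year_survey = c(2010, 2010, 2010)
)
result <- simulate_serosurvey(
  model = "time",
  foi = data.frame(year = 1990:2009, foi = 0.1),
  survey_features = feature_df
)
```

Keeping the model-specific functions exported as well as the wrapper lets users call the one they need directly, while the wrapper gives a single entry point with one place to validate the model argument.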
Sorry @ntorresd -- I'm revisiting this one as I've genuinely found it hard to use this function.
To give an example, I am trying to replace the reliance on the simdata_large_epi file in a vignette by replacing it with a simulated dataset as per #160. I tried:
df <- generate_sim_data(
  sim_data = data.frame(age_mean_f = c(2, 7, 12, 17, 22, 27, 32, 37, 42, 47), tsur = 2050),
  foi = rep(1.5, 47),
  sample_size_by_age = c(20, 25, 25, 25, 25, 25, 25, 25, 25, 30)
)
which I thought should work since it allows 47 FOIs (which seems sensible...from 1 up until a max age of 47). From reading the function documentation, it's still not clear to me how to get this to work, and I think we should make various changes.