epiverse-trace / serofoi

Estimates the Force-of-Infection of a given pathogen from population based sero-prevalence studies
https://epiverse-trace.github.io/serofoi/
Other
17 stars 4 forks source link

Full refactorization of modelling and visualisation functions #200

Closed ntorresd closed 2 months ago

ntorresd commented 3 months ago

This PR replace the old modelling and visualisation functions for refactored versions agreed on by the development team of the package (@zmcucunuba, @ben18785, @sumalibajaj, @jpavlich, @ekamau) earlier this year.

We designed this new functions for the selection and the specification of the Bayesian models to be more flexible than before, as well as including additional constant/time/age varying models with the possibility to estimate the seroreversion rate $\mu$. In particular, the usage of fit_seromodel() went from (e.g.):

seromodel <- fit_seromodel(
  serodata = serodata,
  foi_model = "tv_normal",
  foi_location = 0,
  foi_scale = 1,
  ...
)

to

seromodel <- fit_seromodel(
  serosurvey = serosurvey,
  model = "time",
  foi_prior = sf_normal(0,1),
  ...
)

Note that we refer to the serological survey now as serosurvey rather than serodata, and we restrict it to have the following data:

The function sf_normal is one of a set of auxiliary functions designed to specify the prior distributions of the parameters to be estimated (FOI or seroreversion rate), as suggested in #193 . Currently available priors are sf_normal() and sf_uniform(), corresponding to Gaussian and uniform priors respectively.

To illustrate the current pipeline, consider a disease with constant FOI $\lambda = 0.02$ and seroreversion rate $\mu=0.01$:

foi <- data.frame(
  year = 1951:2000,
  foi = rep(0.01, 50)
)
seroreversion_rate <- 0.01

and a serological survey with the following features:

survey_features <- data.frame(
  age_min = 1:50,
  age_max = 1:50,
  sample_size = runif(0,100)
)

$ age_min     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,…
$ age_max     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,…
$ sample_size <dbl> 51, 35, 14, 86, 70, 34, 50, 98, 51, 93, 78, 25, 36, 40, 15, 43, 79, 96, 96, 64, 56, 31, 84, 45…

We can use simulate_serosurvey(), introduced in #199, to simulate statistically consistent seropositive counts n_seropositive for each age group, according to our set of serological models. In this case, I use the time-varying model:

serosurvey <- simulate_serosurvey(
  model = "time",
  foi = foi,
  survey_features = survey_features,
  seroreversion_rate = seroreversion_rate
) %>% mutate(
  tsur = max(foi$year)
)

$ age_min        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, …
$ age_max        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, …
$ sample_size    <dbl> 51, 35, 14, 86, 70, 34, 50, 98, 51, 93, 78, 25, 36, 40, 15, 43, 79, 96, 96, 64, 56, 31, 84,…
$ n_seropositive <int> 4, 1, 2, 12, 14, 2, 9, 19, 12, 27, 18, 6, 14, 12, 5, 19, 28, 31, 45, 26, 25, 13, 37, 30, 42…
$ tsur           <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2…

which we can visualise using plot_serosurvey:

plot_serosurvey(serosurvey = serosurvey)

image

Implementing the constant model with and without seroreversion by means of fit_seromodel():

# initialization function for sampling
init <- function() {
  list(foi_vector = rep(0.1, nrow(survey_features)))
}

# constant model without seroreversion
seromodel_constant_no_serorev <- fit_seromodel(
  serosurvey = serosurvey,
  model = "constant",
  foi_prior = sf_uniform(0, 10),
  init = init
)

# constant model with seroreversion
seromodel_constant_serorev <- fit_seromodel(
  serosurvey = serosurvey,
  model = "constant",
  foi_prior = sf_uniform(0, 10),
  is_seroreversion = TRUE,
  seroreversion_prior = sf_uniform(0, 1),
  init = init
)

Visualizing the results:

plot_no_serorev <- plot_seromodel(
  seromodel = seromodel_constant_no_serorev,
  serosurvey = serosurvey
)

plot_serorev <- plot_seromodel(
  seromodel = seromodel_constant_serorev,
  serosurvey = serosurvey
)

cowplot::plot_grid(plot_no_serorev, plot_serorev)

image

Note that the model estimating the seroreversion rate (right panel in the image), no only is a better fit for this serological survey according to the elpd value, but it also accurately estimates both the FOI and seroreversion rate we used to simulate the serosurvey.

Another difference with previous versions lies in how we visualize the estimated parameters. Since the constant model is estimating a single FOI value, we only show the estimated value with its corresponding credible interval (we use to plot it instead). However, for the time and age varying models we estimate several values of the FOI in the time/age span of the survey, which we visualize graphically. Take for instance the time varying model with seroreversion:

# implementing the time-varying model with seroreversion
seromodel_time_serorev <- fit_seromodel(
  serosurvey = serosurvey,
  model = "time",
  foi_prior = sf_uniform(0, 10),
  is_seroreversion = TRUE,
  seroreversion_prior = sf_uniform(0, 1),
  init = init,
  iter = 4000
)
# plotting
plot_time_serorev <- plot_seromodel(
  seromodel = seromodel_time_serorev,
  serosurvey = serosurvey
)

image

Here we plot the estimated values of the FOI as a function of time (blue line and shadow) and the R-hat estimates for each estimated value (black dots). Note that, even though we ran the model for a large number of iterations - 4000, the model did not converged (since there are R-hat values over 1.01). We can improve the convergence of the model by estimating less values of the FOI. To specify the time intervals for which FOI values will be estimated, we index them by means of get_foi_index():

foi_index <- get_foi_index(
  serosurvey = serosurvey,
  group_size = 5
  )

> foi_index
 [1]  1  1  1  1  1  2  2  2  2  2  3  3  3  3  3  4  4  4  4  4  5  5  5  5  5  6  6  6  6  6  7  7  7  7  7  8  8
[38]  8  8  8  9  9  9  9  9 10 10 10 10 10

This function returns a list of indexes that labels each year (or age, for age-varying models). The idea is that a single FOI value will be estimated for each index, meaning that 10 FOI values will be estimated in this particular case:

init <- function() {
  list(foi_vector = rep(0.1, max(foi_index)))
}

seromodel_time_serorev <- fit_seromodel(
  serosurvey = serosurvey,
  model = "time",
  foi_prior = sf_uniform(0, 10),
  is_seroreversion = TRUE,
  seroreversion_prior = sf_uniform(0, 1),
  foi_index = foi_index,
  init = init,
  iter = 4000
)

plot_seromodel(
  seromodel_time_serorev,
  serosurvey = serosurvey
)

image

Now the model converges. As long as foi_index length covers the whole time/age span of the serological survey, less regular indexations foi_index can be specified.

github-actions[bot] commented 2 months ago

This pull request:

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

github-actions[bot] commented 2 months ago

This pull request:

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

github-actions[bot] commented 2 months ago

This pull request:

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

github-actions[bot] commented 2 months ago

This pull request:

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

github-actions[bot] commented 2 months ago

This pull request:

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

ben18785 commented 2 months ago

Hi @ntorresd,

This is so great. Thank you so so much for this work -- it lays the foundation for a whole lot more work for us all and does so in a really elegant way. (Well done team! Here's to Bogota 2025...)

A few thoughts. These are generally minor and could be discussed and handled in separate PRs if this isn't the right forum:

ntorresd commented 2 months ago

Hi @ben18785, thanks a lot for your feedback!

I'll address some of your points in this PR and open issues to address the others, as there are some closely related to user's feedback collected during the previous user test:

Right now, this variable is named survey_year. For the time being I'll leave it as it is, as we don't plan to implement seasonal models in the near future. We can change this easily once we have reached that point (if we decide to implement it in serofoi and not elsewhere anyway).

sample_size -> n_sample. We consistently use n_x to mean a count of x throughout the package, and I'd suggest we do so here.

I'll implement this one before merging for the sake of names consistency.

I've opened #205 to address this.

I love the example you have about reducing the numbers of parameters being estimated and again I think these should go into one of the package articles.

I think we should address this one in #204 , opened by one of our users @JDConejeros during the user test; it's a good chance to give a compelling explanation of this concept.

I added a comment in #202 about this, so we can discuss this further over there.

Yes, this is a must. I opened #206 to address this.

I'd rather keep for the time being. During the user test it was useful to have it exported.

I haven't run it yet, but it should be very low as the only part of the code covered by unit testing right now is the data simulation module. We've discussed unit testing with @jpavlich and we decided to start over as the structure of the packaged has changed so much. After merging this PR to dev, I'll open some issues about this. The idea is to have >90% coverage before submitting to CRAN.

  • Silly question but I couldn't find this on a quick look: do we unit-test the custom functions we use in Stan?

Not a silly question at all. As mentioned before, we are not testing for any of this right now. We should implement this before merging to main, so this is something we will discuss soon enough.

github-actions[bot] commented 2 months ago

This pull request:

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

ntorresd commented 1 month ago

This PR closes: