CDCgov / multisignal-epi-inference

Python package for statistical inference and forecast of epi models using multiple signals
https://cdcgov.github.io/multisignal-epi-inference/
10 stars 1 forks source link

Tutorial: Porting An Existing Model (Influenza-Epidemia) To MSR #162

Open AFg6K7h4fhy2 opened 3 weeks ago

AFg6K7h4fhy2 commented 3 weeks ago

NOTE: This is currently a work in progress. I've committed a skeleton of the Quarto tutorial but need to continuing finishing the implementation in Python before porting over more and then writing in detail. Consider this draft PR and its initial commit a placeholder.

Recall the goal from the corresponding issue:

Verify that DHM's influenza forecasting repository cfa-forecast-renewal-epidemia, which utilizes epidemia, can be instantiated via MSR.

And also recall the desired features:

Create quarto walk-through detailing the port-over steps and providing documentation.

Validate MSR implemented cfa-forecast-renewal-epidemia via simulation and fitting on NHSN data, which is the data cfa-forecast-renewal-epidemia has historically been used on.

codecov[bot] commented 3 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 92.39%. Comparing base (71228e5) to head (d5671aa). Report is 1 commits behind head on main.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #162 +/- ## ========================================== - Coverage 92.51% 92.39% -0.12% ========================================== Files 40 40 Lines 895 881 -14 ========================================== - Hits 828 814 -14 Misses 67 67 ``` | [Flag](https://app.codecov.io/gh/CDCgov/multisignal-epi-inference/pull/162/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=CDCgov) | Coverage Δ | | |---|---|---| | [unittests](https://app.codecov.io/gh/CDCgov/multisignal-epi-inference/pull/162/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=CDCgov) | `92.39% <ø> (-0.12%)` | :arrow_down: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=CDCgov#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

AFg6K7h4fhy2 commented 1 week ago

@dylanhmorris

Would appreciate some Why thoughts on the following settings:

day_of_week_effect_prior_modes = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
day_of_week_effect_prior_scales = [0.25, 0.25, 0.25, 0.25, 0.25, 0.25]
holiday_eff_prior_mode = -0.2
holiday_eff_prior_scale = 0.2
post_holiday_eff_prior_mode = 0.0
post_holiday_eff_prior_scale = 0.2
recency_eff_prior_mode = 0.0
recency_eff_prior_scale = 0.1
recency_effect_length = 0

generation_time_dist = [0.233, 0.359, 0.198, 0.103, 0.053, 0.027, 0.014, 0.007, 0.003, 0.002, 0.001]
ihr_intercept_prior_mode = -6.5
ihr_intercept_prior_scale = 0.5
inf_model_prior_infections_per_capita = 0.0001
inf_model_seeds = 8
inf_to_hosp_dist = [0.05, 0.1, 0.175, 0.225, 0.175, 0.125, 0.075, 0.05, 0.025]
max_rt = 3.0
n_pre_observation_days = 14
non_obs_effect_prior_mode = -50
non_obs_effect_prior_scale = 0.01
reciprocal_dispersion_prior_mode = 10
reciprocal_dispersion_prior_scale = 5
reference_day_of_week = 4
rt_intercept_prior_mode = -0.4054651
rt_intercept_prior_scale = 0.3
seed = 54321
susceptible_fraction_prior_mode = 0.9
susceptible_fraction_prior_scale = 0.1
weekly_rw_prior_scale = 0.25
first_fitting_date = "2023-09-15"

[mcmc]
adapt_delta = 0.9
max_treedepth = 12
n_chains = 1
n_warmup = 1000
n_iter = 2000

Apologies for not putting this question here sooner. I do not know why I bottled it in my markdown file.

AFg6K7h4fhy2 commented 1 week ago

Transmission Component

I would also appreciate some thoughts on my attempted description of the transmission component of cfaepim:


In cfaepim, the influenza hospitalizations for each US territory are modelled separately, as though influenza transmission, infection, and observation within each territory's boundaries is completely independent of other territories. This corresponds to an "unpooled" modelling case (see Epidemia's partial pooling page). For each US territory, $\mathscr{R}_t$ can be modeled as a transformed linear prediction:

$$\mathscr{R}_t = g^{-1}(\eta_t)$$

where $g$ is a link function and $\eta_t$ is the prediction on the transformed scale.

The link function for cfaepim is a scaled logit, so:

$$g^{-1}(x) = \frac{K}{1 + e^{-x}}$$

with $K = 3$ since max_rt = 3.0. As for each territory's transformed scale prediction $\eta_t$, we have

$$\eta_t = \beta_0 + W_t$$

where $W_t$ is a random walk satisfying $$Wt = W{t-1} + \gamma_t$$ with $t > 0$ and $W_0 = 0$, and $\gamma_t \sim \mathcal{N}(0, \sigma)$, where $\sigma \sim \mathcal{N}^{+}(0, 0.25)$ since weekly_rw_prior_scale = 0.25. Note that the $\mathscr{R}_t$ intercept, i.e. $\beta_0$, has a prior. Since rt_intercept_prior_mode = -0.4054651 and rt_intercept_prior_scale = 0.3, $\beta_0 \sim \mathcal{N}(-0.4054651, 0.3)$.


Original R code for reference:

build_light_rt <- function(rt_intercept_prior_mode,
                           rt_intercept_prior_scale,
                           max_rt,
                           rw_prior_scale) {
  rt_model <- epidemia::epirt(
    formula = as.formula(
      sprintf(
        paste0(
          "R(location, date) ~ 1 + ",
          "rw(time = week, gr = location, prior_scale = %f)"
        ),
        rw_prior_scale
      )
    ),
    prior_intercept = rstanarm::normal(
      location = rt_intercept_prior_mode,
      scale = rt_intercept_prior_scale
    ),
    link = epidemia::scaled_logit(K = max_rt)
  )
  return(rt_model)
}
AFg6K7h4fhy2 commented 1 week ago

Observation Component

Similar to above, would appreciate some thoughts on the veracity of the following description:


Proceeding, each US territory has is own time series of observed influenza hospitalizations $Y = (Y_1, \dotsc, Y_n)$. The observed influenza hospitalizations at time $t$ are given by $$Y_t \sim p(y_t, \phi)$$ where $p(y_t, \phi)$ is the sampling distribution with auxiliary parameter $\phi$. Note that since reciprocal_dispersion_prior_mode = 10 and reciprocal_dispersion_prior_scale = 5, we have $\phi \sim \mathcal{N}(10, 5)$. Since family = "neg_binom", $\phi$ represents the reciprocal dispersion of the sampling distribution.

The expected value for the observed influenza hospitalizations on day $t$, i.e. $y_t$, is given by

$$y_t = \alphat \sum{s < t} is \pi{t-s}$$

with $s < t$ and where

Within cfaepim, the linear predictor for $\alpha_t$ consists of

We have:

$$ \begin{aligned} \alpha_t &= g^{-1}(\eta_t)\ \etat &= \sum{i=0}^9 \beta_i \end{aligned} $$

where

Within cfaepim, we have the infection to hospitalization delay distribution as

$$\pi = \begin{bmatrix} 0.05 & 0.1 & 0.175 & 0.225 & 0.175 & 0.125 & 0.075 & 0.05 & 0.025\end{bmatrix}$$

given the selected value for inf_to_hosp_dist.


The corresponding code in cfaepim:

build_light_obs <- function(inf_to_hosp_dist,
                            ihr_intercept_prior_mode,
                            ihr_intercept_prior_scale,
                            day_of_week_eff_prior_modes,
                            day_of_week_eff_prior_scales,
                            holiday_eff_prior_mode,
                            holiday_eff_prior_scale,
                            post_holiday_eff_prior_mode,
                            post_holiday_eff_prior_scale,
                            non_obs_effect_prior_mode,
                            non_obs_effect_prior_scale,
                            inv_dispersion_prior_mode,
                            inv_dispersion_prior_scale,
                            link = "logit") {
  return(epidemia::epiobs(
    formula = as.formula(paste0(
      "hosp ~ 1 + day_of_week + ",
      "is_holiday + ",
      "is_post_holiday + ",
      "nonobservation_period"
    )),
    ## Add a covariate for the
    ## nonobservation window to
    ## leave an initial evolution
    ## period with no observations
    i2o = inf_to_hosp_dist,
    link = link,
    family = "neg_binom",
    prior_intercept = rstanarm::normal(
      location = ihr_intercept_prior_mode,
      scale = ihr_intercept_prior_scale
    ),
    prior = rstanarm::normal(
      location = c(
        day_of_week_eff_prior_modes,
        holiday_eff_prior_mode,
        post_holiday_eff_prior_mode,
        non_obs_effect_prior_mode
      ),
      ## a large negative non_obs_effect
      ## effectively conditions on detection
      ## prob = 0 outside the observation period
      scale = c(
        day_of_week_eff_prior_scales,
        holiday_eff_prior_scale,
        post_holiday_eff_prior_scale,
        non_obs_effect_prior_scale
        ## non-obs prior scale
        ## should be small to
        ## enforce non-obs effect
        ## close to (large negative) mode
      )
    ),
    prior_aux = rstanarm::normal(
      location = inv_dispersion_prior_mode,
      scale = inv_dispersion_prior_scale
    )
  ))
}
dylanhmorris commented 1 week ago

@AFg6K7h4fhy2 terminology edits only for the post on the $\mathcal{R}(t)$ model.

dylanhmorris commented 1 week ago

Comments on the observation component description

Proceeding, each US territory has is own time series of observed influenza hospitalizations $Y = (Y_1, \dotsc, Y_n)$. The observed influenza hospitalizations at time $t$ are given by $$Y_t \sim p(y_t, \phi)$$ where $p(y_t, \phi)$ is the sampling distribution with auxiliary parameter $\phi$. Note that since reciprocal_dispersion_prior_mode = 10 and reciprocal_dispersion_prior_scale = 5, we have $\phi \sim \mathcal{N}(10, 5)$. Since family = "neg_binom", $\phi$ represents the reciprocal dispersion of the sampling distribution.

Should state explicitly that family = "neg_binom" means that $p$ becomes a negative binomial parameterized by its mean and reciprocal dispersion. I also might not use $p$ to denote the distribution, since it often denotes a specific probability. Perhaps something like $\sim \mathcal{D}(y_t, \phi)$ or $\sim \mathrm{Dist}(y_t, \phi)$ if your goal is to indicate that the choice of distribution is not fixed but rather determined by the family argument.

  • $\alpha_t>0$ represents an instantaneous ascertainment rate, which in this instance means the rate of influenza infections that become hospitalizations at time $t$

Suggest reported hospitalizations

transformed by the logit-link (since link = "logit") function $g$, where $$g^{-1}(x) = \frac{1}{1 + e^{-x}}$$

Suggest modeled by a linear model with a logit link

Within cfaepim, the linear predictor for $\alpha_t$ consists of

linear prediction

We have:

$$ \begin{aligned} \alpha_t &= g^{-1}(\eta_t)\ \etat &= \sum{i=0}^9 \beta_i \end{aligned} $$

Need to include the predictor variables, not just the coefficients:

$$ \begin{aligned} \alpha_t &= g^{-1}(\eta_t)\ \etat &= \sum{i=0}^9 \beta_i x_i \end{aligned} $$

  • each of the covariates $\beta_1, \dotsc, \beta_6$ have priors, $\beta_i \sim \mathcal{N}(0, 0.25) \quad i = 1, \dotsc, 6$, corresponding, to the week from Sunday to Friday (where the reference date is Saturday)
  1. The $\beta_i$ are not covariates, they are coefficients. Covariate has a more or less precise meaning depending on who is talking, but the closer thing here would be the predictor variables $x_i$, not the corresponding coefficients $\beta_i$.
  2. The reference date is not necessarily Saturday. It is set in the config by the reference_day_of_week entry. The day of week values are then encoded relative to this reference day in the build_state_light_model() function here:

    ## make sure day_of_week is properly
    ## set up as a factor
    dow_levels <- levels(
    lubridate::wday("2023-01-01",
      label = TRUE,
      week_start = params$reference_day_of_week
    )
    )
    
    clean_data <- clean_data |>
    dplyr::mutate(
      day_of_week = factor(day_of_week,
        ordered = FALSE,
        levels = dow_levels
      )
    )

    @AFg6K7h4fhy2