Tutorial: Porting An Existing Model (Influenza-Epidemia) To MSR

NOTE: This is currently a work in progress. I've committed a skeleton of the Quarto tutorial but need to continuing finishing the implementation in Python before porting over more and then writing in detail. Consider this draft PR and its initial commit a placeholder.

Recall the goal from the corresponding issue:

Verify that DHM's influenza forecasting repository cfa-forecast-renewal-epidemia, which utilizes epidemia, can be instantiated via MSR.

And also recall the desired features:

Create quarto walk-through detailing the port-over steps and providing documentation.

Validate MSR implemented cfa-forecast-renewal-epidemia via simulation and fitting on NHSN data, which is the data cfa-forecast-renewal-epidemia has historically been used on.

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 92.39%. Comparing base (71228e5) to head (d5671aa). Report is 1 commits behind head on main.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #162 +/- ## ========================================== - Coverage 92.51% 92.39% -0.12% ========================================== Files 40 40 Lines 895 881 -14 ========================================== - Hits 828 814 -14 Misses 67 67 ``` | [Flag](https://app.codecov.io/gh/CDCgov/multisignal-epi-inference/pull/162/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=CDCgov) | Coverage Δ | | |---|---|---| | [unittests](https://app.codecov.io/gh/CDCgov/multisignal-epi-inference/pull/162/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=CDCgov) | `92.39% <ø> (-0.12%)` | :arrow_down: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=CDCgov#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

@dylanhmorris

Would appreciate some Why thoughts on the following settings:

day_of_week_effect_prior_modes = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
day_of_week_effect_prior_scales = [0.25, 0.25, 0.25, 0.25, 0.25, 0.25]
holiday_eff_prior_mode = -0.2
holiday_eff_prior_scale = 0.2
post_holiday_eff_prior_mode = 0.0
post_holiday_eff_prior_scale = 0.2
recency_eff_prior_mode = 0.0
recency_eff_prior_scale = 0.1
recency_effect_length = 0

generation_time_dist = [0.233, 0.359, 0.198, 0.103, 0.053, 0.027, 0.014, 0.007, 0.003, 0.002, 0.001]
ihr_intercept_prior_mode = -6.5
ihr_intercept_prior_scale = 0.5
inf_model_prior_infections_per_capita = 0.0001
inf_model_seeds = 8
inf_to_hosp_dist = [0.05, 0.1, 0.175, 0.225, 0.175, 0.125, 0.075, 0.05, 0.025]
max_rt = 3.0
n_pre_observation_days = 14
non_obs_effect_prior_mode = -50
non_obs_effect_prior_scale = 0.01
reciprocal_dispersion_prior_mode = 10
reciprocal_dispersion_prior_scale = 5
reference_day_of_week = 4
rt_intercept_prior_mode = -0.4054651
rt_intercept_prior_scale = 0.3
seed = 54321
susceptible_fraction_prior_mode = 0.9
susceptible_fraction_prior_scale = 0.1
weekly_rw_prior_scale = 0.25
first_fitting_date = "2023-09-15"

[mcmc]
adapt_delta = 0.9
max_treedepth = 12
n_chains = 1
n_warmup = 1000
n_iter = 2000

Apologies for not putting this question here sooner. I do not know why I bottled it in my markdown file.

Transmission Component

I would also appreciate some thoughts on my attempted description of the transmission component of cfaepim:

In cfaepim, the influenza hospitalizations for each US territory are modelled separately, as though influenza transmission, infection, and observation within each territory's boundaries is completely independent of other territories. This corresponds to an "unpooled" modelling case (see Epidemia's partial pooling page). For each US territory, $\mathscr{R}_t$ can be modeled as a transformed linear prediction:

$$\mathscr{R}_t = g^{-1}(\eta_t)$$

where $g$ is a link function and $\eta_t$ is the prediction on the transformed scale.

The link function for cfaepim is a scaled logit, so:

$$g^{-1}(x) = \frac{K}{1 + e^{-x}}$$

with $K = 3$ since max_rt = 3.0. As for each territory's transformed scale prediction $\eta_t$, we have

$$\eta_t = \beta_0 + W_t$$

where $W_t$ is a random walk satisfying $$Wt = W{t-1} + \gamma_t$$ with $t > 0$ and $W_0 = 0$, and $\gamma_t \sim \mathcal{N}(0, \sigma)$, where $\sigma \sim \mathcal{N}^{+}(0, 0.25)$ since weekly_rw_prior_scale = 0.25. Note that the $\mathscr{R}_t$ intercept, i.e. $\beta_0$, has a prior. Since rt_intercept_prior_mode = -0.4054651 and rt_intercept_prior_scale = 0.3, $\beta_0 \sim \mathcal{N}(-0.4054651, 0.3)$.

Original R code for reference:

build_light_rt <- function(rt_intercept_prior_mode,
                           rt_intercept_prior_scale,
                           max_rt,
                           rw_prior_scale) {
  rt_model <- epidemia::epirt(
    formula = as.formula(
      sprintf(
        paste0(
          "R(location, date) ~ 1 + ",
          "rw(time = week, gr = location, prior_scale = %f)"
        ),
        rw_prior_scale
      )
    ),
    prior_intercept = rstanarm::normal(
      location = rt_intercept_prior_mode,
      scale = rt_intercept_prior_scale
    ),
    link = epidemia::scaled_logit(K = max_rt)
  )
  return(rt_model)
}

Observation Component

Similar to above, would appreciate some thoughts on the veracity of the following description:

Proceeding, each US territory has is own time series of observed influenza hospitalizations $Y = (Y_1, \dotsc, Y_n)$. The observed influenza hospitalizations at time $t$ are given by $$Y_t \sim p(y_t, \phi)$$ where $p(y_t, \phi)$ is the sampling distribution with auxiliary parameter $\phi$. Note that since reciprocal_dispersion_prior_mode = 10 and reciprocal_dispersion_prior_scale = 5, we have $\phi \sim \mathcal{N}(10, 5)$. Since family = "neg_binom", $\phi$ represents the reciprocal dispersion of the sampling distribution.

The expected value for the observed influenza hospitalizations on day $t$, i.e. $y_t$, is given by

$$y_t = \alphat \sum{s < t} is \pi{t-s}$$

with $s < t$ and where

$\pi_{k}$ is the time distribution from an infection to an observation
$i_s$ are new infections as modelled via a renewal equation
$\alpha_t>0$ represents an instantaneous ascertainment rate, which in this instance means the rate of influenza infections that become hospitalizations at time $t$, transformed by the logit-link (since link = "logit") function $g$, where $$g^{-1}(x) = \frac{1}{1 + e^{-x}}$$

Within cfaepim, the linear predictor for $\alpha_t$ consists of

A day of the week effect on observation probability, relative to the reference day of the week (day_of_week).
A holiday effect, if the day is a holiday (is_holiday).
A post-holiday effect, if the day occurs after a holiday (is_post_holiday).

We have:

$$ \begin{aligned} \alpha_t &= g^{-1}(\eta_t)\ \etat &= \sum{i=0}^9 \beta_i \end{aligned} $$

where

the infection to hospitalization delay distribution intercept $\beta_0 \sim \mathcal{N}(-6.5, 0.5)$, since ihr_intercept_prior_mode = -6.5 and ihr_intercept_prior_scale = 0.5 .
each of the covariates $\beta_1, \dotsc, \beta_6$ have priors, $\beta_i \sim \mathcal{N}(0, 0.25) \quad i = 1, \dotsc, 6$, corresponding, to the week from Sunday to Friday (where the reference date is Saturday), respectively (see day_of_week_effect_prior_modes and day_of_week_effect_prior_scales).
the covariate $\beta_7 \sim \mathcal{N}(-0.2, 0.2)$, since holiday_eff_prior_mode = -0.2 and holiday_eff_prior_scale = 0.2, and refers to the holiday effect.
the covariate $\beta_8 \sim \mathcal{N}(0, 0.2)$, since post_holiday_eff_prior_mode = 0.0 and post_holiday_eff_prior_scale = 0.2, and refers to the post-holiday effect.
the covariate $\beta_9 \sim \mathcal{N}(-50, 0.01)$, since non_obs_effect_prior_mode = -50 and non_obs_effect_prior_scale = 0.01, and refers to the change in the observation probability during the nonobservation (seeding) period.

Within cfaepim, we have the infection to hospitalization delay distribution as

$$\pi = \begin{bmatrix} 0.05 & 0.1 & 0.175 & 0.225 & 0.175 & 0.125 & 0.075 & 0.05 & 0.025\end{bmatrix}$$

given the selected value for inf_to_hosp_dist.

The corresponding code in cfaepim:

build_light_obs <- function(inf_to_hosp_dist,
                            ihr_intercept_prior_mode,
                            ihr_intercept_prior_scale,
                            day_of_week_eff_prior_modes,
                            day_of_week_eff_prior_scales,
                            holiday_eff_prior_mode,
                            holiday_eff_prior_scale,
                            post_holiday_eff_prior_mode,
                            post_holiday_eff_prior_scale,
                            non_obs_effect_prior_mode,
                            non_obs_effect_prior_scale,
                            inv_dispersion_prior_mode,
                            inv_dispersion_prior_scale,
                            link = "logit") {
  return(epidemia::epiobs(
    formula = as.formula(paste0(
      "hosp ~ 1 + day_of_week + ",
      "is_holiday + ",
      "is_post_holiday + ",
      "nonobservation_period"
    )),
    ## Add a covariate for the
    ## nonobservation window to
    ## leave an initial evolution
    ## period with no observations
    i2o = inf_to_hosp_dist,
    link = link,
    family = "neg_binom",
    prior_intercept = rstanarm::normal(
      location = ihr_intercept_prior_mode,
      scale = ihr_intercept_prior_scale
    ),
    prior = rstanarm::normal(
      location = c(
        day_of_week_eff_prior_modes,
        holiday_eff_prior_mode,
        post_holiday_eff_prior_mode,
        non_obs_effect_prior_mode
      ),
      ## a large negative non_obs_effect
      ## effectively conditions on detection
      ## prob = 0 outside the observation period
      scale = c(
        day_of_week_eff_prior_scales,
        holiday_eff_prior_scale,
        post_holiday_eff_prior_scale,
        non_obs_effect_prior_scale
        ## non-obs prior scale
        ## should be small to
        ## enforce non-obs effect
        ## close to (large negative) mode
      )
    ),
    prior_aux = rstanarm::normal(
      location = inv_dispersion_prior_mode,
      scale = inv_dispersion_prior_scale
    )
  ))
}

@AFg6K7h4fhy2 terminology edits only for the post on the $\mathcal{R}(t)$ model.

Comments on the observation component description

Proceeding, each US territory has is own time series of observed influenza hospitalizations $Y = (Y_1, \dotsc, Y_n)$. The observed influenza hospitalizations at time $t$ are given by $$Y_t \sim p(y_t, \phi)$$ where $p(y_t, \phi)$ is the sampling distribution with auxiliary parameter $\phi$. Note that since reciprocal_dispersion_prior_mode = 10 and reciprocal_dispersion_prior_scale = 5, we have $\phi \sim \mathcal{N}(10, 5)$. Since family = "neg_binom", $\phi$ represents the reciprocal dispersion of the sampling distribution.

Should state explicitly that family = "neg_binom" means that $p$ becomes a negative binomial parameterized by its mean and reciprocal dispersion. I also might not use $p$ to denote the distribution, since it often denotes a specific probability. Perhaps something like $\sim \mathcal{D}(y_t, \phi)$ or $\sim \mathrm{Dist}(y_t, \phi)$ if your goal is to indicate that the choice of distribution is not fixed but rather determined by the family argument.

$\alpha_t>0$ represents an instantaneous ascertainment rate, which in this instance means the rate of influenza infections that become hospitalizations at time $t$

Suggest reported hospitalizations

transformed by the logit-link (since link = "logit") function $g$, where $$g^{-1}(x) = \frac{1}{1 + e^{-x}}$$

Suggest modeled by a linear model with a logit link

Within cfaepim, the linear predictor for $\alpha_t$ consists of

linear prediction

We have:

$$ \begin{aligned} \alpha_t &= g^{-1}(\eta_t)\ \etat &= \sum{i=0}^9 \beta_i \end{aligned} $$

Need to include the predictor variables, not just the coefficients:

$$ \begin{aligned} \alpha_t &= g^{-1}(\eta_t)\ \etat &= \sum{i=0}^9 \beta_i x_i \end{aligned} $$

each of the covariates $\beta_1, \dotsc, \beta_6$ have priors, $\beta_i \sim \mathcal{N}(0, 0.25) \quad i = 1, \dotsc, 6$, corresponding, to the week from Sunday to Friday (where the reference date is Saturday)

The $\beta_i$ are not covariates, they are coefficients. Covariate has a more or less precise meaning depending on who is talking, but the closer thing here would be the predictor variables $x_i$, not the corresponding coefficients $\beta_i$.

The reference date is not necessarily Saturday. It is set in the config by the reference_day_of_week entry. The day of week values are then encoded relative to this reference day in the build_state_light_model() function here:

## make sure day_of_week is properly
## set up as a factor
dow_levels <- levels(
lubridate::wday("2023-01-01",
  label = TRUE,
  week_start = params$reference_day_of_week
)
)

clean_data <- clean_data |>
dplyr::mutate(
  day_of_week = factor(day_of_week,
    ordered = FALSE,
    levels = dow_levels
  )
)

@AFg6K7h4fhy2

CDCgov / multisignal-epi-inference