Refactor hierarchical site-level estimation to estimate `n_subpops-1` deviations from the "reference" subpopulation

kaitejohnson commented 2 months ago

Edited to reflect additional intercept term and propose optional patchwork:

Goal

This should address #136, but also a more fundamental issue brought up by @damonbayer and @sbidari, which is that the current version of the "hospital admissions only" model is still going to estimate n_subpops infection dynamics, even though this will not be informed by wastewater concentration data and should not have a patchwork or hierarchical population structure. Instead, the desired behavior here is that:

- the "hospital admissions only" model reverts to estimating the R(t) of a single reference population (here the total population). This will then approximate the renewal_ww_hosp.stan model in the production/eval repo. Here in effect there will be $n_{\mathrm{sites}} = 0$, and the population of the "auxiliary site of those not captured by wastewater" is the total population, $n$.
- when $\sum\nolimits_{k=1}^{K_\mathrm{sites}} n_k \ge n$ we want to estimate n_subpops = n_sites R(t) estimates, whose per capita infections generate expected counts in the total pop. Here there will be no "auxiliary site", and the "reference" subpopulation will just be the largest wastewater catchment area.
- when $\sum\nolimits_{k=1}^{K_\mathrm{sites}} n_k < n$ (the standard case) we will estimate the site level R(t) estimates as deviations from the reference subpopulation, which is by default the auxiliary site (the subpopulation not captured by wastewater).

This results in the following changes to the model definition:

Currently, we estimate a global undamped effective reproductive number $\mathcal{R}^\mathrm{u}(t)$. We now instead will estimate a reference effective reproductive number $\mathcal{R}^0(t)$ with $K_{\mathrm{subpops}}-1$ deviations from the reference.

The number of subpopulations falls under a few distinct cases:

- there are $n_{\mathrm{sites}}$, $\sum\nolimits_{k=1}^{K_\mathrm{sites}} n_k < n$, therefore $K_{\mathrm{subpops}} = n_{\mathrm{sites}} + 1$. If $n_{\mathrm{sites}} = 0$, i.e. there is no wastewater data, then there is only $K_{\mathrm{subpops}} = 1$ and it is the reference subpopulation. The reference is the subpopulation not covered by wastewater by default.
- there are $n_{\mathrm{sites}}$, $\sum\nolimits_{k=1}^{K_\mathrm{sites}} n_k > n$, therefore $K_{\mathrm{subpops}} = n_{\mathrm{sites}}$. The reference effective reproductive number is that of the largest wastewater catchment area.
- there is no wastewater data, but the user can specify additional subpopulations $K_{\mathrm{subpops}} = n_{\mathrm{subpops}}$ and which of them is the reference (@seabbs I think I'd want to do this in a separate PR, but knowing it is a case we would want to be able to configure will be helpful).
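The first two cases can be sketched as a small helper. This is only an illustration of the rule described above, not code from the package: the function name `subpop_structure`, its arguments, and the use of `None` to mean "the auxiliary subpopulation is the reference" are all hypothetical. It treats $\sum n_k = n$ like the over-coverage case, following the $\ge$ in the Goal section, and it leaves out the third (user-specified) case.

```python
from typing import Optional, Sequence, Tuple


def subpop_structure(
    site_pops: Sequence[float], total_pop: float
) -> Tuple[int, Optional[int]]:
    """Return (k_subpops, reference) for the first two cases above.

    reference is None when the auxiliary subpopulation (people not covered
    by wastewater) is the reference; otherwise it is the index of the
    largest wastewater catchment in site_pops.
    """
    if total_pop <= 0:
        raise ValueError("total_pop must be positive")
    covered = sum(site_pops)
    if covered < total_pop:
        # Standard case. With no sites, covered == 0 < total_pop, so this
        # branch also gives K_subpops = 1 and no deviations to estimate.
        return len(site_pops) + 1, None
    # Catchments cover at least the total population: no auxiliary site,
    # and the largest catchment serves as the reference.
    largest = max(range(len(site_pops)), key=lambda k: site_pops[k])
    return len(site_pops), largest


# Number of deviations to estimate is k_subpops - 1 in every case.
k_subpops, reference = subpop_structure([200, 300], 1000)
n_deviations = k_subpops - 1
```

Note that the hospital-admissions-only model needs no separate branch here: an empty list of sites falls through the standard case.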

Thus we have the following proposed rewrite to the "Subpopulation level infections":

Subpopulation-level infections

We couple the subpopulation and total population infection dynamics at the level of the un-damped instantaneous reproduction number $\mathcal{R}^\mathrm{u}_ {0}(t)$.

We model the subpopulations as having infection dynamics that are similar to one another but can differ from the reference dynamic.

We represent this with a hierarchical model where we estimate a reference un-damped effective reproductive number $\mathcal{R}^\mathrm{u}_{0}(t)$ and then estimate the individual subpopulation $k$ deviations from the reference value, $\mathcal{R}^{\mathrm{u}}_{k}(t)$.

The reference value for the undamped instantaneous reproductive number $\mathcal{R}^\mathrm{u}_0(t)$ follows the time-evolution described above. Subpopulation deviations from the reference reproduction number are modeled via a log-scale AR(1) process. Specifically, for subpopulation $k$:

$$ \log[\mathcal{R}^\mathrm{u}_{k}(t)] = \log[\mathcal{R}^\mathrm{u}_0(t)] + m +\delta_k(t) $$

where $m$ is an "intercept" for the reference subpopulation, which is a fixed inferred parameter and allows for the fact that $\log[\mathcal{R}^\mathrm{u}_0(t)]$ may be the reference value but doesn't have to be the central value.

$\delta_k(t)$ is the time-varying subpopulation effect on $\mathcal{R}^\mathrm{u}_0(t)$, modeled as,

$$\delta_k(t) = \varphi_{R(t)} \delta_k(t-1) + \epsilon_{kt}$$

where $0 < \varphi_{R(t)} < 1$ and $\epsilon_{kt} \sim \mathrm{Normal}(0, \sigma_{R(t)\delta})$.

We chose a prior of $\varphi_{R(t)} \sim \mathrm{beta}(2,40)$ to impose limited autocorrelation in the week-by-week deviations. We set a weakly informative prior $\sigma_{R(t)\delta} \sim \mathrm{Normal}(0, 0.3)$ to allow for either limited or substantial site-site heterogeneity in $\mathcal{R}^\mathrm{u}_0(t)$, with the degree of heterogeneity inferred from the data.
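To make the deviation process concrete, here is a minimal forward simulation of $\log[\mathcal{R}^\mathrm{u}_{k}(t)] = \log[\mathcal{R}^\mathrm{u}_0(t)] + m + \delta_k(t)$ in plain Python. It is a sketch under stated assumptions, not the Stan implementation: the function names are hypothetical, the second argument of $\mathrm{Normal}$ is read as a standard deviation, and $\delta_k$ is started at 0 before the first time step because the text above does not specify an initial condition.

```python
import math
import random


def simulate_log_r_subpops(log_r0, n_deviations, m, phi, sigma, seed=0):
    """Simulate log R_k(t) for each non-reference subpopulation.

    log_r0: list of log R_0(t) values for the reference subpopulation.
    n_deviations: number of non-reference subpopulations (K_subpops - 1).
    m: intercept added to every non-reference subpopulation.
    phi, sigma: AR(1) coefficient and innovation standard deviation.
    """
    rng = random.Random(seed)
    trajectories = []
    for _ in range(n_deviations):
        delta = 0.0  # assumption: initial condition is not given in the text
        path = []
        for log_r0_t in log_r0:
            # delta_k(t) = phi * delta_k(t-1) + eps_kt, eps_kt ~ Normal(0, sigma)
            delta = phi * delta + rng.gauss(0.0, sigma)
            path.append(log_r0_t + m + delta)
        trajectories.append(path)
    return trajectories


def stationary_sd(phi, sigma):
    """Long-run standard deviation of a stationary AR(1) with |phi| < 1."""
    return sigma / math.sqrt(1.0 - phi**2)


# Prior mean of beta(2, 40) is 2 / (2 + 40), i.e. weak autocorrelation.
phi_prior_mean = 2.0 / (2.0 + 40.0)
paths = simulate_log_r_subpops(
    log_r0=[0.0, 0.05, 0.1], n_deviations=2, m=0.02,
    phi=phi_prior_mean, sigma=0.1,
)
```

With $\varphi_{R(t)}$ near its prior mean, the stationary standard deviation of $\delta_k(t)$ is almost equal to $\sigma_{R(t)\delta}$, so the heterogeneity between subpopulations is governed mainly by $\sigma_{R(t)\delta}$.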

@dylanhmorris let me know if this reflects accurately our conversation, and if others agree with this approach.

@gvegayon @SamuelBrand1 @seabbs would love your thoughts as well

dylanhmorris commented 2 months ago

Looks largely good, but I think it can be simplified by writing more in terms of subpopulations (which may or may not have observed wastewater) and less in terms of wastewater sites.

Currently, we estimate a global undamped effective reproductive number $\mathcal{R}^\mathrm{u}(t)$. We now instead will estimate a single reference effective reproductive number $\mathcal{R}^0(t)$ with $K_{\mathrm{subpops}}-1$ deviations from the reference in the case of the wastewater-informed model, where $K_{\mathrm{subpops}} = n_{\mathrm{sites}} + 1$ if $\sum\nolimits_{k=1}^{K_\mathrm{sites}} n_k < n$ and $K_{\mathrm{subpops}} = n_{\mathrm{sites}}$ if $\sum\nolimits_{k=1}^{K_\mathrm{sites}} n_k > n$ or if there are no sites (in which case, no deviations are estimated).

Also, note that "if there are no sites (in which case, no deviations are estimated)" is a case of $K_{\mathrm{subpops}} = n_{\mathrm{sites}} + 1$ if $\sum\nolimits_{k=1}^{K_\mathrm{sites}} n_k < n$ with $K_{\mathrm{subpops}} = 0 + 1 = 1$, so this sentence needs revision regardless:

where $K_{\mathrm{subpops}} = n_{\mathrm{sites}} + 1$ if $\sum\nolimits_{k=1}^{K_\mathrm{sites}} n_k < n$ and $K_{\mathrm{subpops}} = n_{\mathrm{sites}}$ if $\sum\nolimits_{k=1}^{K_\mathrm{sites}} n_k > n$ or if there are no sites (in which case, no deviations are estimated).
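One possible way to write the rule so that the no-site case needs no separate clause is as a piecewise definition. This is only a sketch, and it assumes the equality case $\sum n_k = n$ is treated like over-coverage (no auxiliary subpopulation):

$$ K_{\mathrm{subpops}} = \begin{cases} n_{\mathrm{sites}} + 1 & \text{if } \sum\nolimits_{k=1}^{K_\mathrm{sites}} n_k < n \\ n_{\mathrm{sites}} & \text{otherwise} \end{cases} $$

With $n_{\mathrm{sites}} = 0$ the sum is empty and equals $0 < n$, so $K_{\mathrm{subpops}} = 1$ and $K_{\mathrm{subpops}} - 1 = 0$ deviations are estimated.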

dylanhmorris commented 2 months ago

Also, I thought we had discussed additionally inferring an intercept for the "reference" subpopulation $\mathcal{R}(t)$:

$$ \log[\mathcal{R}^\mathrm{u}_{k}(t)] = \log[\mathcal{R}^\mathrm{u}_0(t)] + m + \delta_k(t) $$

Where $m$ is a fixed inferred parameter and allows for the fact that $\log[\mathcal{R}^\mathrm{u}_0(t)]$ may be the reference value but doesn't have to be the central value (except in the special case of a single (sub)population).

seabbs commented 2 months ago

... concentration data and should not have a patchwork or hierarchical population structure.

I think this should be "could not" and not "should not". There are lots of reasons one might want a patch-based outbreak regardless of whether you have ww data.

kaitejohnson commented 2 months ago

Also, I thought we had discussed additionally inferring an intercept for the "reference" subpopulation $\mathcal{R}(t)$:

$$ \log[\mathcal{R}^\mathrm{u}_{k}(t)] = \log[\mathcal{R}^\mathrm{u}_0(t)] + m + \delta_k(t) $$

Where $m$ is a fixed inferred parameter and allows for the fact that $\log[\mathcal{R}^\mathrm{u}_0(t)]$ may be the reference value but doesn't have to be the central value (except in the special case of a single (sub)population).

Whoops yes will edit and add this! Also updated to clarify based on number of subpopulations

kaitejohnson commented 2 months ago

... concentration data and should not have a patchwork or hierarchical population structure.

I think this should be "could not" and not "should not". There are lots of reasons one might want a patch-based outbreak regardless of whether you have ww data.

Would your thought be that the user can specify the number of patches in the absence of "patched" data @seabbs ?

dylanhmorris commented 2 months ago

Would your thought be that the user can specify the number of patches in the absence of "patched" data @seabbs ?

One might have other subpopulation-based observables (e.g. subpopulation-level admissions data) or aggregate-level ww. Supporting this in the package is beyond the scope of this particular change/feature, but it is relevant to how we think about and write up the model.

CDCgov / ww-inference-model

Refactor hierarchical site-level estimation to estimate `n_subpops-1` deviations from the "reference" subpopulation #149
