BYM2 island implementation flawed; fix and writeup

mitzimorris commented 3 years ago

The BYM2 islands implementation isn't doing the right thing. I have an implementation which seems to be working here: https://github.com/stan-dev/example-models/tree/master/knitr/car-iar-poisson/update_2021_02

Jupyter notebook, still in progress.

further testing is required, in particular:

run the data through INLA for all 3 Scotland graphs and compare with Stan results
recreate the plots from the Freni-Sterrantino et al 2017 paper and compare

your review and comments welcome. happy to move this over to this repo so that folks can easily find it.

the pain points for the BYM2 model are:

converting GIS data from spatial polygons to R spdep's package nb objects to data arrays used in the Stan programs
computing the scaling factor

I checked a bunch of R scripts and data files into GitHub; computing the scaling factor requires magic from R package INLA. It would be nice to have corresponding Python helpers, using numpy, or SciPy and whatever GIS packages exist. not sure if they do - only INLA has the magic foo to compute the geometric mean of a sparse matrix.

mitzimorris commented 3 years ago

just updated notebook, also minor changes to models to improve readability. outputs of diagnose command show both models fitting the data; interestingly, the islands models requires upping the max_treedepth to 11 (default is 10).

revised printout of notebook here: BYM2 islands.pdf

ConnorDonegan commented 3 years ago

Thanks Mitzi! I've been hoping that some nice stan developer would tell me they've found a way to make the scale factor without INLA (cause INLA will break the dependency structure of my R package). Too bad.

On the scale for singletons from Freni-Sterrantino, I read it to be switching the prior from ~ N(0, 1) to N(0, phi_scale) with phi_scale being the same scale parameter from the ICAR model (where it gets scaled by the data 1/k for k neighbors or whatever).

Is there an R script also? (There doesn't have to be). For the pdf, I notice there's no author listed.

Do you want to make a pull request? If there's a bunch of stuff could you put it into a stand-alone folder?

mitzimorris commented 3 years ago

a way to make the scale factor without INLA

wouldn't that be nice? according to Dan Simpson, no other package provides the equivalent of qinv - https://rdrr.io/github/andrewzm/INLA/man/qinv.html

the problem is the need to keep working with sparse matrices because that's the only thing that scales.

mitzimorris commented 3 years ago

On the scale for singletons from Freni-Sterrantino, I read it to be switching the prior from ~ N(0, 1) to N(0, phi_scale) with phi_scale being the same scale parameter from the ICAR model (where it gets scaled by the data 1/k for k neighbors or whatever).

I think I see this - working on it now.

mitzimorris commented 3 years ago

Is there an R script also? (There doesn't have to be). For the pdf, I notice there's no author listed.

you mean, running this using CmdStanR? will do. can provide both notebooks and scripts in R and Python.

hoping to get insight on why islands model requires upping max_treedepth; for the connected BYM2 model, with the Scotland data, almost all iterations have treedepth 8; for islands model, almost all iterations require treedepth 11. (insert Spinal Tap joke here).

mitzimorris commented 3 years ago

checked in a bunch of fixes; the BYM2_islands model fits the data without problems. added authorship to the writeup and an introduction which summarizes the essentials of the model and recommendations from Freni-Sterrantino. however, Stan model, as implemented, doesn't match INLA results on the same dataset; questions remain. @ConnorDonegan - any insights?

latest is here, including pdf version of notebook - https://github.com/stan-dev/example-models/tree/master/knitr/car-iar-poisson/update_2021_02

ConnorDonegan commented 3 years ago

I wouldn't be so sure the problem is with your code rather than INLA's, your implementation could still be the more reasonable one. I think I'll be able to take a look this weekend

mitzimorris commented 3 years ago

there are a bunch of checks just playing with the dataset -

run the INLA code and compare all estimates - I was just checking the 3 islands as reported in the paper itself, but there's an appendix that gives the INLA forumlas. how do estimates for counties in the south look, e.g., Edinburgh and Glasgow
run the scotland connected data as if it were a multi-component dataset - estimates should be the same from both bym2 and bym2_islands models

drop the islands entirely from the dataset - "mainland only" - compare

I tried a bunch of tweaks to the model and the set of scaling factors - the most sensible one being changing the prior on rho - beta(0.5, 0.5) makes more sense than the weakly informative normal(0,1)

ConnorDonegan commented 3 years ago

I took the lazy path and tried my own code first. I got slightly different results from both of you. I also read over your pdf pretty closely and didn't see anything that looked wrong, though it looks like in that document you have the prior for the partial-pooling parameters on the islands as N(0,1) rather than N(0, spatial_scale_parameter).

They say this about the prior on the precision parameter: "prior π (κ ) (gamma with shape 1 and rate 5e-5)"...not sure why they would do that, but its super diffuse and doesn't really matter. (I did try it to make sure). Otherwise I don't see anything about their priors.

functions {
#include icar-functions.stan
}

data {
  int n;    // no. observations
  int<lower=1> k; // no. of groups
  int group_size[k]; // observational units per group
  int group_idx[n]; // index of observations, ordered by group
  int<lower=1> n_edges; 
  int<lower=1, upper=n> node1[n_edges];
  int<lower=1, upper=n> node2[n_edges];
  int<lower=1, upper=k> comp_id[n]; 
  vector[k] inv_sqrt_scale_factor; // ICAR scale factor, with singletons represented by 1
  int y[n];
  vector[n] offset;
  vector[n] x;
}

transformed data {
  int<lower=0,upper=1> has_theta=1;
}

parameters {
  real beta;
  real alpha;
  vector[n] phi_tilde;
  vector[n] theta_tilde;  
  real<lower=0> spatial_scale;
  real<lower=0,upper=1> rho;
}

transformed parameters {
  vector[n] convolution = convolve_bym2(phi_tilde, theta_tilde, spatial_scale, n, k, group_size, group_idx, rho, inv_sqrt_scale_factor);
  vector[n] eta = offset + alpha + x * beta + convolution;
}

model {
   phi_tilde ~ icar_normal(spatial_scale, node1, node2, k, group_size, group_idx, has_theta);
   theta_tilde ~ std_normal();
  spatial_scale ~ std_normal();
   rho ~ beta(1,1);
   alpha ~ std_normal();
   beta ~ std_normal();
   y ~ poisson_log(eta);
}

generated quantities {
  vector[n] SMR;
  vector[n] yrep;
  for (i in 1:n) {
    SMR[i] = exp(eta[i]) / exp(offset[i]);
    yrep[i] = poisson_log_rng(eta[i]);
  }
}

In R:

pkgs <- c("SpatialEpi", "rstan", "spdep", "INLA")
lapply(pkgs, require, character.only = TRUE)
rstan_options(auto_write = TRUE)
source("icar-functions.R")
data(scotland)

# three islands disconnected
nb <- poly2nb(scotland$spatial.polygon, snap = 1)
C <- nb2mat(nb, style = "B", zero.policy=T)
icar.data <- prep_icar_data(C)
k <- icar.data$k
scale_factor <- vector(mode = "numeric", length = k)
for (j in 1:k) {
  g.idx <- which(icar.data$comp_id == j) 
  if (length(g.idx) == 1) {
    scale_factor[j] <- 1
    next
  }    
  Cg <- C[g.idx, g.idx] 
  scale_factor[j] <- scale_c(Cg) 
}
icar.data$inv_sqrt_scale_factor <- 1 / sqrt( scale_factor )

dl <- list(
    y =  scotland$data$cases,
    offset = log(scotland$data$expected),
    x = 10 * scotland$data$AFF,
    n = nrow(scotland$data)
)

dl <- c(icar.data, dl)

bym2 <- stan_model("bym2-tmp.stan")

fit = sampling(bym2,
               data = dl,
               cores = 6,
               refresh = 250,
               control = list(max_treedepth = 13)
               )

Inference for Stan model: bym2-tmp.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

               mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
alpha         -0.31       0 0.13 -0.56 -0.39 -0.31 -0.22 -0.05  1732    1
beta           0.44       0 0.14  0.17  0.35  0.44  0.53  0.70  1388    1
spatial_scale  0.54       0 0.09  0.39  0.48  0.53  0.60  0.73  1043    1

Samples were drawn using NUTS(diag_e) at Sun Feb 28 21:37:15 2021.

They report alpha (intercept) = -0.25 and precision k = 3.97 where I got precision0.54^-2=3.43; and they got β (AFF)= 0.37 (SD = 0.13); [95% CI: 0.09, 0.62]. Some of the results I'm getting leave a little more uncertainty, including on the predictions for the islands, than theirs do.

Their map is like a mirror image or something so its kind of annoying to look at the nb plot. But the connections are really sensitive to the distance threshold set. I thought this one looked okay but I'm not sure what they did, and that could be a source of a lot of the discrepancy. Or maybe the loose sum to zero constraint? Seems unlikely but could try tightening it. I'm not inclined to think that their implementation is perfect unless I see something fishy here.

nb <- poly2nb(scotland$spatial.polygon, snap = 1)
> nb
Neighbour list object:
Number of regions: 56 
Number of nonzero links: 234 
Percentage nonzero weights: 7.461735 
Average number of links: 4.178571 
3 regions with no links:
orkney shetland western.isles

mitzimorris commented 3 years ago

I took the lazy path and tried my own code first

not lazy at all!

yes, nb plots for islands don't make sense - islands just float around the edges - don't respect lat/lon

mitzimorris commented 3 years ago

looked at your code - I don't think it's quite right. I've taken another pass at the papers and have updated the notebook, model, and R helpers accordingly. I think the implementation I've got is correct. happy to contribute this here.

also, given plots from all regions, there's plenty of other regions which show as much difference as the set 1,3,5... - but the islands have markedly different density plots

mitzimorris commented 3 years ago

partial-pooling parameters on the islands as N(0,1) rather than N(0, spatial_scale_parameter)

unclear what you mean by "spatial_scale_parameter" - what is this in the model? do we expect this to be very different from 1, and if so, in which direction?

ConnorDonegan commented 3 years ago

"spatial scale parameter" being from this here:

parameters {
  real beta;
  real alpha;
  vector[n] phi_tilde;
  vector[n] theta_tilde;  
  real<lower=0> spatial_scale;
  real<lower=0,upper=1> rho;
}

rather than the "scale_factor" given as data. In that paper, they use k for the precision parameter

So in this case it would be like N(0, 0.54)

ConnorDonegan commented 3 years ago

Can you share a link to the updated version of the notebook?

mitzimorris commented 3 years ago

everything is checked into github - in repo 'example-models' working on master branch.

https://github.com/stan-dev/example-models/tree/master/knitr/car-iar-poisson/update_2021_02

ConnorDonegan commented 3 years ago

Sorry for such a long delay, I'll get back on this really soon!

ConnorDonegan commented 3 years ago

Alright I've been working with both implementations to see how they differ.

The convolve_bym2 function removes phi from the convolution if there are no neighbors:

   if (group_size[j] == 1) {
        convolution[ segment(group_idx, pos, group_size[j]) ] = spatial_scale * theta_tilde[ segment(group_idx, pos, group_size[j]) ];
    }

because by definition there is no spatial trend (no Markov random field) when you have no neighbors. Or, you could say that the proportion spatial rho must equal zero, which should be the same as the above code.

The corresponding code from your model adds phi to theta for the islands:

  vector[I] gamma;
  for (k in 1:K)
    gamma[K_node_idxs[k, 1:K_node_cts[k]]] = 
      (sqrt(1 - rho) * theta[K_node_idxs[k, 1:K_node_cts[k]]]
       +
       sqrt(rho / tau[k]) * phi[K_node_idxs[k, 1:K_node_cts[k]]])
      * sigma;
}

since phi[K_node_idxs[k, 1:K_node_cts[k]]] includes the index of the island (phi[ island_index ]). And the parameter phi[ island_index ] is given N(0, 1) here:

    for (n in 1:num_comps) {
         if (node_cts[n] > 1)
    total += -0.5 * dot_self(phi[adjacency[1, edge_idxs[n, 1:edge_cts[n]]]] -
                 phi[adjacency[2, edge_idxs[n, 1:edge_cts[n]]]])
      + normal_lpdf(sum(phi[node_idxs[n, 1:node_cts[n]]]) | 0, 0.001 * node_cts[n]);
          else
              total += normal_lpdf(phi[node_idxs[n, 1]] | 0, 1); // this is the N(0, 1) on phi[ island_index ] 
    }

I realize now that I wasn't communicating correctly about this part. What happens to the island term depends on the model specification. If you're just fitting an MRF over some graph structure (there is no theta), then the way to use partial pooling on the disconnected nodes is to set the prior on that node to N(0, spatial_scale). But with the convolution term theta + phi (or similar), including phi at all is redundant in the sense that both phi and theta represent the same information. So what I learned from your code earlier was that we can drop phi from the convolution term whenever necessary rather than set it to zero (though I do that too just to be thorough about it and prevent anyone from making mistakes when looking at the vector phi)

ConnorDonegan commented 3 years ago

also, the first thing I did was try to get the two models on the same page (as far as was apparent to me) and see if the results match, making sure that they're doing the indexing the same way; after a couple little changes to the priors, they became identical except for a small difference in the posterior means of the convolution term for the islands. That's when I started looking at the above stuff on how the islands are being handled in each model. So I'm happy to update what you have here as is and just note any differences between the implementations, but will wait to hear from you

mitzimorris commented 3 years ago

hi Connor, will take a look at this soon - working on something else right now.

ConnorDonegan / Stan-IAR

BYM2 island implementation flawed; fix and writeup #3