prep_icar_data() issues with multiple singletons and singleton-first connectivity matrices (plus possible fix?)

amytims commented 8 months ago

I've been trying to run some spatial models with multiple singletons in the matrix, and found that the stan_icar() function was giving me errors. I think I've traced the problem to the prep_icar_data() function, which sometimes returns matrices with unexpected dimensions, throwing a mismatch in dimension declared and found error when stan_icar() is run.

The source of the problem is in the prep_icar_data() function at the line ID[which(ID == change.to.one)] <- 1. If you have more than one singleton in your connectivity matrix (i.e., if change.to.one has length > 1), it doesn't change the values in ID properly, so the next couple of lines where the matrix is supposed to have singletons removed don't function as intended:

A = model.matrix(~factor(ID)) A <- as.matrix(A[, -1])

When this happens inside the icar_stan() function, it throws the following error and warning message:

Error : Exception: mismatch in dimension declared and found in context; processing stage=data initialization; variable name=A; position=1; ... In addition: Warning message: In ID == change.to.one : longer object length is not a multiple of shorter object length

I think changing to ID[which(ID %in% change.to.one)] <- 1 would fix this so all singletons get changed to 1.

Also, I noticed that these lines in prep_icar_data() behave differently depending on whether your first group is a singleton or not. ID[which(ID == change.to.one)] <- 1 A = model.matrix(~factor(ID)) A <- as.matrix(A[, -1])

If it isn't, the code removes the first set of connected points as well as all the singletons that get lumped into group 1, giving A a size of n x m, where n = nrow(C) and m = sum(group_size > 1) - 1, as defined within the function. However, if the first group is just a singleton, only the singletons are removed and all larger groups remain, giving A a size of n x m+1, and triggering the dimension mismatch error in icar_stan():

Error : Exception: mismatch in dimension declared and found in context; processing stage=data initialization; variable name=A; position=1; dims declared=(n,m); dims found=(n,m+1) (in 'string', line 480, column 2 to column 17) failed to create the sampler; sampling not done

You can find the group number of the first non-singleton group with group <- ID[which(rowSums(as.matrix(C))!=0)[1]]

and change the singletons to match that group number instead of defaulting to 1 with ID[which(ID %in% change.to.one)] <- group

Then when you run A = model.matrix(~factor(ID)) A <- as.matrix(A[, -1])

it'll remove the first column with all singletons and the members of the first cluster, which will give A the right dimensions and avoid triggering the dimension mismatch error

ConnorDonegan commented 8 months ago

Thanks for sharing this!

I see the problem. Is this what you're suggesting then?

prep_icar_data <- function (C, inv_sqrt_scale_factor = NULL) {
  n <- nrow(C)
  E <- edges(C)
  G <- list(np = nrow(C), from = E$node1, to = E$node2, nedges = nrow(E))
  class(G) <- "Graph"
  nb2 <- spdep::n.comp.nb(spdep::graph2nb(G))
  k = nb2$nc
  if (inherits(inv_sqrt_scale_factor, "NULL")) inv_sqrt_scale_factor <- array(rep(1, k), dim = k)
  group_idx = NULL
  for (j in 1:k) group_idx <- c(group_idx, which(nb2$comp.id == j))
  group_size <- NULL
  for (j in 1:k) group_size <- c(group_size, sum(nb2$comp.id == j))
  # intercept per connected component of size > 1, if multiple.
  m <- sum(group_size > 1) - 1
  if (m) {
   ############################## changes here
    GS <- group_size
    ID <- nb2$comp.id
    # identify the first non-singleton part of the graph
    main.group <- ID[which(rowSums(as.matrix(C))!=0)[1]]
    # lump singletons together with main.group
    change.group <- which(GS == 1)
    ID[which(ID %in% change.group)] <- main.group
    # create dummy variables for additional parts of the graph
    A = model.matrix(~ factor(ID))
    # drop the constant term
    A <- as.matrix(A[,-1])
  ############################## 
  } else {
    A <- model.matrix(~ 0, data.frame(C))
  }
  l <- list(k = k, 
            group_size = array(group_size, dim = k), 
            n_edges = nrow(E), 
            node1 = E$node1, 
            node2 = E$node2, 
            group_idx = array(group_idx, dim = n), 
            m = m,
            A = A,
            inv_sqrt_scale_factor = inv_sqrt_scale_factor, 
            comp_id = nb2$comp.id)
  return(l)
}

I'm going to update the same function in the geostan R package too. Are you interested in submitting a pull request there (so that your contribution is recorded)? https://github.com/ConnorDonegan/geostan/blob/main/R/convenience-functions.R

amytims commented 8 months ago

Hi Connor,

Yeah, I think that'll work :) I'll submit the pull request now.

Thanks so much for all your work on this. I'm not great with stan syntax so the R helper functions have been a lifesaver for me!

Cheers, Amy

ConnorDonegan commented 8 months ago

I'm glad its useful for you! I updated the function in icar-functions.R here, and your pull request will be released with geostan 0.6.0, maybe within a few weeks. Thanks again for your contribution!

ConnorDonegan / Stan-IAR

prep_icar_data() issues with multiple singletons and singleton-first connectivity matrices (plus possible fix?) #4