ScottClaessens / coevolve

coevolve R package for Bayesian dynamic coevolutionary models using Stan
GNU General Public License v2.0
7 stars 0 forks source link

Repeated measures for species: conceptual problems with the random effects #84

Open ErikRingen opened 3 days ago

ErikRingen commented 3 days ago

I was thinking more about the use of specific random effect parameters for repeated measures at the species level, and I am not sure it makes sense given the rest of the model. Because each tip species already gets its own parameter eta with phylogenetically independent drift. So just thinking from a generative modelling perspective, does it really make sense that species change along the terminal branch, and then some extra species-specific variance is added on top of that? I don't think ignoring repeated measures is problematic at the species level because each species already has its own parameter with unique drift terms. The dynamical model is not really amenable to the same type of variance decomposition as GLMMs.

In contrast, if there were repeated measures of the same individual within species, or different subgroups/populations measured within a species, we would want to have random effects for that nested structure.

ScottClaessens commented 2 days ago

Hmm, so are you basically saying that even without the additional random effect, the model implies some species-specific variance anyway, due to the terminal drift?

I guess I'm viewing it more like a standard phylogenetic multilevel model, where the extra random effect just captures all the species-specific variance that is not due to phylogenetic history. Like what Paul says in his brms vignette on the topic here.

ErikRingen commented 2 days ago

Yeah in standard PGLMMs you can partition this way because there is no explicit time series just a covariance matrix. But the change that happens on the terminal segments is already species-specific variance that is not shared with other species, so I don't think it makes sense conceptually. Have you assessed how well such models fit in practice?

ScottClaessens commented 2 days ago

I've only tested a few. The model from the vignette seems to fit reasonably well:

coev_fit(
  data = repeated$data,
  variables = list(
    x = "normal",
    y = "normal"
  ),
  id = "species",
  tree = repeated$phylogeny,
  # additional arguments for cmdstanr
  parallel_chains = 4,
  iter_warmup = 2000,
  iter_sampling = 2000,
  refresh = 0,
  seed = 1
)
ErikRingen commented 2 days ago

Thanks for sharing the example. It converges fine but the results are telling of the problem. Imagine first we just took the species means:

repeated_mean <- repeated$data %>% 
  group_by(species) %>% 
  summarise(x = mean(x), y = mean(y))

fit_mean <- coev_fit(
  data = repeated_mean,
  variables = list(
    x = "normal",
    y = "normal"
  ),
  id = "species",
  tree = repeated$phylogeny,
  # additional arguments for cmdstanr
  parallel_chains = 4,
  iter_warmup = 250,
  iter_sampling = 250,
  seed = 1
)

In this model, there is a strong drift correlation and a cross-selection effect from x -> y. However, if you fit your example above, the species random effect correlation consumes both of those effects, driving them to zero. In contrast, this does not happen if you fit a typical PGLMM in brms:

library(brms)
A <- ape::vcv.phylo(repeated$phylogeny, corr = T)

model_repeat1 <- brm(
  bf(mvbind(scale(x), scale(y)) ~ (1|sp_rep) + (1|p|gr(species, cov = A))),
  data = repeated$data %>% mutate(sp_rep = species),
  family = gaussian(),
  data2 = list(A = A),
  chains = 4, cores = 4,
  iter = 500, warmup = 250,
  backend = "cmdstanr"
)
ScottClaessens commented 2 days ago

Huh, okay I see! So, what do we think the solution is? We could just remove the random effect.

ErikRingen commented 2 days ago

I still like the functionality. Perhaps we could allow the grouping variables to be added as named variables rather than automatically detected? And then generate random intercepts for each?

ScottClaessens commented 2 days ago

Hmm, with arbitrary group effects, we start getting into the kind of territory where it would be preferable to have actual formulas for each response variable. Like brms, we would add random effects when (1 | group) and could even add arbitrary predictor variables. But this starts to get complicated quite quickly, and would require a change in the way we have set things up.

But yes, as you say, it could just be a vector/list of named categorical variables in the dataset:

coev_fit(
  data = d,
  variables = list(
    x = "bernoulli_logit",
    y = "normal"
  ),
  id = "id",
  tree = tree,
  random = c("group1", "group2")
)
ErikRingen commented 2 days ago

Yeah I dont think we should accommodate the more complex formulas at this time. The syntax up propose above looks good, perhaps we disable the repeated measures for now until we can implement the new system.

ScottClaessens commented 1 day ago

Okay, sounds good. I'll disable the repeated measures functionality first, then implement these changes later.