biodiverse / ubms

Fit models to data from unmarked animals using Stan. Uses a similar interface to the R package 'unmarked', while providing the advantages of Bayesian inference and allowing estimation of random effects.
https://hmecology.github.io/ubms/
GNU General Public License v3.0
35 stars 8 forks source link

Random slopes not woring #73

Closed tapirus07 closed 9 months ago

tapirus07 commented 11 months ago

When I read ubms paper in Methods in Ecology and Evolution, it was clear for that it is possible to estimate random slopes, not only random intercepts, even though I just found examples with random intercepts.

When I try the model with only random intercet, it runs perfectly: stan_occu(~method+hort+wash+(1|parasite)+(1|id)~hort+type+(1|parasite)+(1|id),input,chains=4,iter=4000,cores=4,control=list(adapt_delta=.90))

But when I include any random slope (I want to see effect to very among parasites), it fails: stan_occu(~method+(1+method|parasite)+(1|id)~hort+(1+hort|parasite)+(1|id),input,chains=4,iter=4000,cores=4,control=list(adapt_delta=.90))

Error: Failed to create random effects model matrix. Possible reasons: (1) Correlated slopes and intercepts are not supported. Replace | with || in your formula(s). (2) You have specified random slopes for an R factor variable. Try converting the variable to a series of indicator variables instead.

Could you gently guide me. I appreciate any advice. G.

kenkellner commented 11 months ago

See the first error message:

(1) Correlated slopes and intercepts are not supported. Replace | with || in your formula(s).

You need to change your formula to

~method+(1+method||parasite)+(1|id)~hort+(1+hort||parasite)
tapirus07 commented 11 months ago

I did exacctly this, but I still receive the same error

kenkellner commented 11 months ago

Well, what about the 2nd message? Is either method or hort a categorical/factor covariate?

tapirus07 commented 11 months ago

Method and hort are two factor variable:

Method (hoffman and faust) Hort (lettuce, aragula, parsley)

I am modeling the occupancy of intestinal parasites on vegetables selled in urban garden plots. Each vegetable were searched through 8 microscope slides/samples using two detection methods (hoffman and faust). Since we have imperfect detection, I though it was a good idea to use the occupancy framework to model prevalence. Rather to use a naive estimation. Still, rather than to run one model for parasite, I am trying to use parasite as random efefct.

You reccomend to create a dummy variable?

kenkellner commented 11 months ago

Yes, I would replace both with dummy variables. It's a bit clunky but I haven't figured out a workaround yet. Also, you'll still need to replace | with ||.

tapirus07 commented 11 months ago

About the dummy variable. Faust and hoffman methods were inserted as a matrix "ObsCovs". I have 8 sampling occasions for each vegetable: the first 4 microscope slides came from faust method (occasion 1 to 4), and 4 last (occasion 5 to 8) come from hoffman method.

I did not get how to do it as dummy. Some advice?

tapirus07 commented 11 months ago

I appreciate a simple example using dummy variables as site covariates, and as obs covariates.

kenkellner commented 11 months ago

See below

# simulate dataset

M <- 270 # "sites"
J <- 8   # obs

# response
y <- matrix(rbinom(M*J, 1, 0.5), M, J)

# random factor (parasite)
parasite <- as.factor(rep(letters[1:3], each=M/3))

# hort
hort <- as.factor(rep(c("lettuce", "arugla", "parsley"), M/3))

# site covs
site_covs <- data.frame(hort = hort, parasite = parasite)

# method
method <- cbind(matrix("hoffman", M, J/2), matrix("faust", M, J/2))
head(method)

# obs covs
obs_covs <- list(method = method)

library(ubms)

umf <- unmarkedFrameOccu(y = y, siteCovs = site_covs, obsCovs = obs_covs)

# Doesn't work
mod <- stan_occu(~method + (1 + method|parasite) ~hort + (1+hort|parasite), umf)

# Create dummy variables
# argula is reference class by default
# 3 levels, so 2 dummy variables required (called lettuce and parsley)
hort_dummy <- model.matrix(~hort, site_covs)[,2:3]
colnames(hort_dummy) <- c("lettuce", "parsley")

site_covs2 <- data.frame(hort_dummy, parasite = parasite)

# set hoffman to reference class
# there's only two levels so only one dummy variable is needed (called faust)
# Just replacing "hoffman" with 0 and "faust" with 1, now it's a dummy variable
faust_dummy <- apply(method, c(1,2), function(x) as.numeric(x == "faust"))

obs_covs2 <- list(faust = faust_dummy)

umf2 <- unmarkedFrameOccu(y = y, siteCovs = site_covs2, obsCovs = obs_covs2)

# Change formulas to use the new dummy covs
# this still won't work because ubms can't do correlated random slopes and intercepts
# which is what | implies in lme4 syntax
mod2 <- stan_occu(~faust + (1 + faust|parasite) ~lettuce + parsley + (1+lettuce+parsley|parasite), umf2)

# now fix formulas by changing | to || to indicate non-correlated random slopes and intercepts
# this will run
mod2 <- stan_occu(~faust + (1 + faust||parasite) ~lettuce + parsley + (1+lettuce+parsley||parasite), umf2,
                  chains = 3, iter = 300, warmup = 200)
tapirus07 commented 11 months ago

I am very thankfull. It is working.

tapirus07 commented 11 months ago

Ken,

An additional question related to random effects or more complex nested structures. I already understood that nested structure is not available. I sampled 50 urban garden plots, in each garden I collected 3 replicates of 3 vegetable species (arugula, lettuce, parsley). For each sampled vegetable I looked for 8 intestinal parasite species.

I would say I have a nested structure as parasites are nested in the sampled vegetable, and the vegetables are nested in the garden plots. (1|id/parasite).

In my model I included (1+effects||parasite)+(1|gardenID) because nested structure was not possible. It worked perfectly after your advice. Now I would like to summarize the results in some graphs. I am mostly interested in the parasite differences in prevalence. GardenID is just a statistical control to make sure vegetables are compared within each garden plot.

However, when I use predict, especially the re.form argument, I was not able to estimate predictions for different parasites (parasite group-level), ignoring the gardenID group-level. If I set re.form=NA, predictions ignore both group levels. There are some ways to make predictions considering just parasite level, and average/ignore gardenID level? Like population predictions for each parasite.

kenkellner commented 11 months ago

At present predict in ubms requires you to specify some specific existing level for every random factor, as you discovered. I believe this is generally how similar R packages such as lme4 work - you either have to completely ignore all random effects, or keep them all.

If I am understanding correctly, you basically want the value of the gardenID random effect to be fixed at 0. I think the only way you can currently do this is by calculating the predictions yourself manually. To do that you will need the full posterior distributions of the model parameters, and then using these actually calculate the linear predictor only including the terms you are interested in.

This requires a good understanding of the structure of the model, and also dealing with the underlying Stan object from ubms which is not necessarily very user friendly. You can get the full set of posterior samples for all parameters using the extract function.

On the other hand, I don't think it's such a bad thing to just make the plots while holding gardenID at some arbitrarily-chosen level. Yes, the overall predicted occupancy (or whatever) will shift up and down depending on which specific gardenID you choose, but if you manually force this random effect to be 0, you're creating a scenario that couldn't actually happen in real life (because you always have to be in some garden).

tapirus07 commented 11 months ago

Thanks. Trying to deal with it. Just to inform, nlme package allows to choose the grouping level to perform predictions, and lme4 allows to include a formula in re.form argument, so you can also make populational predictions for different group levels. I got your point about to force artificially a random effect to 0. However managers need some summary stats like what is the overall prevalence of a intestinal parasite species by vegetable for a city.

Thanks for your time and helping.