Returning log-likelihood optional to avoid huge model files - Githubissues

biodiverse / ubms

Fit models to data from unmarked animals using Stan. Uses a similar interface to the R package 'unmarked', while providing the advantages of Bayesian inference and allowing estimation of random effects.

https://hmecology.github.io/ubms/

GNU General Public License v3.0

35 stars 8 forks source link

Returning log-likelihood optional to avoid huge model files #66

Closed jniedballa closed 2 weeks ago

jniedballa commented 2 years ago

Hi Ken, saving a large stan_occu() model (>100k sites, 6 occasions, 2500 iterations without warmup) results in an .RData file of almost 10Gb size, a bit unwieldy on an average computer and also very slow to process (even printing the model summary can take several minutes). It is caused by the posterior samples for log_lik in each chain within model@stanfit@sim$samples. There are around 100 actual model parameters in my example, but almost 90,000 log_lik parameters.

Returning log_lik is currently hard-coded and one cannot pass the pars argument to stan() via ... since stan_occu() defines it already. Removing log_lik from the posterior samples manually breaks print() and possibly other methods, so that's not an option either.

So, could computing / returning log_lik be made optional please to avoid such huge model files? From what I understand Stan does not automatically compute and store the log-likelihood, so it might be nice if ubms would also give that option (returning log_lik should be on by default though in my opinion).

Thank you!

kenkellner commented 2 years ago

This definitely can be a problem. It's not quite as simple as just dropping log_lik from the parameter list, since it is required for loo/waic and many functions as you note are expecting that output. You might want to take a look at this pull request - as part of work on K-fold cross validation I wrote code to calculate the log likelihood outside Stan (meaning it doesn't need to be stored in the model object). I haven't merged it because I wanted to test it a bit more but got distracted with other things.

jniedballa commented 2 years ago

Thank you. Sampling concluded, but then stan_occu() failed with an error: Error in lp + Z %*% t(b) : non-conformable arrays I only tried with the big model containing several random effects and did not investigate further. Please let me know if I can help find the issue.

kenkellner commented 2 years ago

Thanks for checking it out, good thing I didn't merge it! I'll do some tests on my end first to see if I can replicate.

kenkellner commented 2 years ago

I can't replicate it. I've tried models with multiple random effects, random intercepts and slopes, etc. Can you describe the model structure and/or maybe show the formula setup you are using so I can try to match it more closely?

jniedballa commented 2 years ago

The error seems to only occur when there is at least one NA in the detection history and there is a random effect in the detection part of the model. Example below:

library(ubms)
set.seed(123)
dat_occ <- data.frame(x1=rnorm(500))
dat_p <- data.frame(x2=rnorm(500*5))

y <- matrix(NA, 500, 5)
z <- rep(NA, 500)

b <- c(0.4, -0.5, 0.3, 0.5)

re_fac <- factor(sample(letters[1:26], 500, replace=T))
dat_occ$group <- re_fac
re <- rnorm(26, 0, 1.2)
re_idx <- as.numeric(re_fac)

idx <- 1
for (i in 1:500){
  z[i] <- rbinom(1,1, plogis(b[1] + b[2]*dat_occ$x1[i] + re[re_idx[i]]))
  for (j in 1:5){
    y[i,j] <- z[i]*rbinom(1,1, 
                          plogis(b[3] + b[4]*dat_p$x2[idx]))
    idx <- idx + 1
  }
}

umf <- unmarkedFrameOccu(y=y, siteCovs=dat_occ, obsCovs=dat_p)

# add  NA
y[1,1] <- NA
dat_p$x2[1] <- NA

umf_na <- unmarkedFrameOccu(y=y, siteCovs=dat_occ, obsCovs=dat_p)

# works when no NAs present
(fm4 <- stan_occu(~ (1|group)  ~1,  umf, chains=3, cores=3, iter = 200))

# fails with one NA present
(fm4 <- stan_occu(~ (1|group)  ~1,  umf_na, chains=3, cores=3, iter = 200))

Error in lp + Z %*% t(b) : non-conformable arrays

kenkellner commented 2 years ago

That's very helpful, thanks. Of course it's an NA thing.

kenkellner commented 2 years ago

This should be fixed in 4cef59d0d2.

jniedballa commented 2 years ago

Thank you, it works now. The saved models are also much smaller (about 20% of what it was in the big model mentioned above).

kenkellner commented 2 years ago

Great. I've also now added an argument log_lik which controls whether the posteriors of the parameter are saved.

jniedballa commented 2 years ago

Nice, thank you very much. Does log_lik being FALSE by default now affect other functions in the package, or break existing code?

kenkellner commented 2 years ago

setting log_lik to FALSE causes ubms to calculate the log-likelihood from the posteriors, rather than saving and using the values from Stan. So, it shouldn't affect other functions (although there may be very small differences in values due to rounding etc). Models with a spatial random effect will ignore the argument since they aren't supported yet, same thing with stan_distsamp.