JenniNiku / gllvm

Generalized Linear Latent Variable Models
https://jenniniku.github.io/gllvm/
48 stars 20 forks source link

se in predict? #24

Open jebyrnes opened 3 years ago

jebyrnes commented 3 years ago

HI! Loving this. Trying to put together something for a manuscript that shows the cover of different species based on models, but, I notice predict.gllvm() doesn't have an option to get fit error on predictions. Or prediction error (although I know that's a different thing altogether!)

Is this possible to extract fit CIs or just SEs?

BertvanderVeen commented 3 years ago

Hello! Standard errors for the coefficients (not the prediction) are present in model$sd if the model converged and the standard error argument was set to TRUE. Alternatively, confidence intervals for parameters can be extracted with confint(model) under the same conditions.

The standard errors can of course be used to construct intervals. However, how should the latent variables be treated? I suppose this depends on the question at hand, one way to go would be to marginalize over the random-effects.

Currently, predictions from predict(.) are performed conditional on the estimated site scores (VA means or last best realization from the Laplace approximation), meaning that standard errors for site scores of the latent variables would need to be constructed, similarly as now is the case in ordination diagrams I suppose. It's unclear to me how (VA) covariances between sites (and/or latent variables) would be used in this..

So, @jebyrnes, I would say, if you're interested in the former (marginalizing), which probably makes sense if you are interested in some more generic phenomena rather than predictions for specific sites, it might be faster for the moment to cook something up in R yourself (unless @JenniNiku happens to be working on implementing this!).

JenniNiku commented 3 years ago

Calculating errors for the predictions/ predicted means is definitely in the TO DO-list. However, it is not yet completely clear to me how I carry out this, so probably this will not be implemented in the very near future, unfortunately.

jebyrnes commented 3 years ago

Ah - to clear up, is it at least posibble to get the vcov between coefficients? vcov() doesn't give anything back.

JenniNiku commented 3 years ago

I think it is available in fit$Hess$cov.mat.mod. Actually it includes all parameters and the corresponding parameters are these fit$TMBfn$par[fit$Hess$incl], but it might be hard to know what parameters these are as all of them are not named similarly to the parameter names in fit$params. But I noticed that I didn't return this for LA method so I will fix that as well.

mhensel commented 3 years ago

Hi @JenniNiku Working with @jebyrnes on this... thanks for all your help! Very much enjoying the package as a whole.

Do you know of a straightforward way to figure out where the vcov between coefficients is inside of fit$Hess$cov.mat.mod? It is indeed hard to know what parameters are what. I'm sleuthing my way through it now but it's slow, so wanted to check with you. Thanks!

BertvanderVeen commented 3 years ago

@mhensel which coefficients do you mean? If fixed-effects, they are the b's, but beware: that also includes species intercepts.

mhensel commented 3 years ago

thanks @BertvanderVeen. I'm looking for the variance-covariance matrix for the $Xcoef s. Unless I'm missing something, fit$Hess$cov.mat.mod doesnt have any headings. I see the b's, phi, and lamdas in fit$TMBfn$par[fit$Hess$incl], but not for $Xcoefs.

JenniNiku commented 3 years ago

Xcoefs are in b:s. For example parameters named "b" are for a model with two covariates a vector of c(beta0[1], Xcoefs[1,1], Xcoefs[1,2], beta0[2], Xcoefs[2,1], Xcoefs[2,2], beta0[3], Xcoefs[3,1], Xcoefs[3,2], ....) if beta0[j], j=1, ..., m, are species specific intercepts,

BertvanderVeen commented 3 years ago

Assigning the names of the parameters to the Hessian will help your search; colnames(fit$Hess$cov.mat.mod) <- names(fit$TMBfn$par[fit$Hess$incl), as well as the rows.

mhensel commented 3 years ago

Following up on this, in my manuscript I reported: (in the results text) the difference in the trace of the residual covariance matrix from a null (no environmental variables) model to a model with environmental variables, as recommended by the Niku et al vignette; 1 - covnoEnvmod$trace / covEnvmod$trace, where the covEnvmod is the residual covariance; getResidualCov(mod, adjust =0)`` I had experimental treatments, sites, and their interaction as the three different models to calculate this % variation explained by the model. (in results tables) I also built a table reporting D (difference in log-likelihood), residual DF, and a p-value estimate using the anova.gllvm function to do the type II marginal comparisons (in Supplement) I was not sure if this information would be valuable to reviewers/readers, so I also included a information criteria table, presenting some of the values that are printed when viewing the gllvm itself. AIC, AICc, BIC, Log-Like ratio

I'll consider this issue closed! Thank you both @BertvanderVeen and @JenniNiku for the help here and for your amazing work on this package! If you are interested, I will keep you updated on how these analyses are received by reviewers.

BertvanderVeen commented 3 years ago

Glad it worked out, and good luck with the review process! Very interested to hear how your publication progresses.

fhui28 commented 2 years ago

I was talking to Bert recently, and think adding SEs to prediction should be something pushed up either on the to-do list. Even if it is something relatively "brain-dead" like adapting what is out of predict.boral and simulating out prediction intervals, then at least something is offered. At the moment I am unsure what complication VA brings into this.

jebyrnes commented 2 years ago

I'm going to second this, as it might be really nice to be able to include methods for gllvm objects for use by the emmeans package for post-hoc evaluations of differences. Should I open a new issue for that?

BertvanderVeen commented 2 years ago

I'm going to second this, as it might be really nice to be able to include methods for gllvm objects for use by the emmeans package for post-hoc evaluations of differences. Should I open a new issue for that?

Yes please. I've considered this in the past but haven't gotten around to doing it, perhaps after I've handed in my thesis.

jebyrnes commented 2 years ago

Good luck!

jebyrnes commented 2 years ago

As a halfway-house, while I can calculate pairwise differences between modeled response variables, is there a straightforward way to also calculate the CI of that difference - for example.....

Let's say you had the palmerpenguins data and wanted to look at the difference between males and females in a variety of measurements based on modeled results.


library(palmerpenguins)
penguins <- na.omit(penguins)

Y <- with(penguins, data.frame(bill_length_mm, bill_depth_mm, 
                          flipper_length_mm, body_mass_g))

X <- with(penguins, data.frame(species, sex))

mod_gllvm <- gllvm(y = Y, X = X,
                   formula = ~species*sex,
                   family = gaussian(link = "identity"),
                   num.lv=2,
                   starting.val = 'zero')

OK, you can generate predictions..... but not get error at this point.


# Get the difference between sexes for each species
pred_frame <- expand.grid(species = levels(X$species),
                          sex = levels(X$sex))

#get predictions - but, no SE on predict
pred_frame <- cbind(pred_frame,
                    predict(mod_gllvm, newX = pred_frame,
                      level = 0) )

You can then generate differences and plot them


# no way around using dplyr and tidyr here!
library(dplyr)
library(tidyr)

diff_frame <- pred_frame %>%
    pivot_longer(bill_length_mm:body_mass_g) %>%
    pivot_wider(names_from = sex, values_from = value) %>%
    mutate(diff = female - male)

library(ggplot2)
ggplot(diff_frame, 
       aes(x = species, y = diff)) +
    facet_wrap(vars(name), scale = "free_x") +
    geom_point() +
    geom_hline(yintercept = 0, lty = 2) +
    coord_flip() +
    labs(x = "Difference between Female and Male")

image

But........ you don't have any SE or CI here.......

Is there a way to generate those somewhere along the way? Even getting SEs on the prediction, one could at least sum those as an overestimate of the variance in the difference.

Thoughts?

REPREX at https://gist.github.com/jebyrnes/bdc69e611913ea496a21d62edb36aa23

fhui28 commented 2 years ago

A somewhat ad-hoc but not completely mindless approach would be to generate lots of parameter samples from their approximate normal distribution. Then construct predictions from these to have a sample of predictions, from which you can then empirical quantiles or whatever jazz you want to do with them. Think of as the frequentist analogue having MCMC samples had you applied a Bayesian approach e.g., using boral...not to blow my own trumpet.

Particularly with the Palmer Penguin example where the predictions are defined with the LVs set to zero, this ends being relatively straightforward. Below is some code that I think should work, building on after pred_frame has been constructed

# Basic naming of dimensions of covariance matrix to make life easier [optional]
colnames(mod_gllvm$Hess$cov.mat.mod) <- rownames(mod_gllvm$Hess$cov.mat.mod) <- names(mod_gllvm$TMBfn$par[mod_gllvm$Hess$incl])

# Begin advanced naming
getbeta_names <- data.frame(intercept = mod_gllvm$params$beta0, mod_gllvm$params$Xcoef) %>% 
   rownames_to_column(var = "measurements") %>%
   pivot_longer(-measurements) %>% 
   dplyr::select(measurements:name) %>% 
   apply(., 1, function(x) paste0(x, collapse = ":"))   
num_beta <- length(getbeta_names)

colnames(mod_gllvm$Hess$cov.mat.mod)[1:num_beta] <- rownames(mod_gllvm$Hess$cov.mat.mod)[1:num_beta] <- getbeta_names

# Check names are OK by confirming standard errors match
mod_gllvm$Hess$cov.mat.mod %>% 
   diag %>% 
   sqrt %>% 
   {.[1:num_beta]}

cbind(mod_gllvm$sd$beta0, mod_gllvm$sd$Xcoef)

# Simulate from approximate large sample distributon of beta/Xcoef
# Due to normality and the fact that your predictions are based on setting the LVs to zero, then it does not better than you only generate a subset of all the parameters. 
# Other types of predictions would be more complicated though.
num_sims <- 500
simcoefs <- rmvnorm(num_sims, mean = mod_gllvm$TMBfn$par[mod_gllvm$Hess$incl][1:num_beta], sigma = mod_gllvm$Hess$cov.mat.mod[1:num_beta, 1:num_beta])
colnames(simcoefs) <- getbeta_names
pred_model_mat <- model.matrix(mod_gllvm$formula, pred_frame)

# Construct predictions for each simulated set of parameters to get a set of simulated predictions
all_preds <- array(NA, dim = c(num_sims, nrow(getpreds), ncol(Y)), dimnames = list(sims = 1:num_sims, unit = 1:nrow(pred_frame), measurement = colnames(Y))) 
for(k0 in 1:num_sims) {
   cwcoefs_mat <- matrix(simcoefs[k0,], nrow = ncol(Y), byrow = TRUE)
   all_preds[k0,,] <- tcrossprod(pred_model_mat, cwcoefs_mat)
   }

# Get mean and quantiles of the predictions. If one want differences or other jazz, then you need to do some manipulation before taking quantiles. But hopefully one gets the gist
apply(all_preds, c(2,3), mean)
apply(all_preds, c(2,3), quantile, prob = c(0.025, 0.975))

Hopefully the above code works.

If the predictions involved LVs then things get a bit messier, and some more thought might be needed as per what Bert and Jenni mention above.

jebyrnes commented 2 years ago

Oh, that's really nice, as you can then manipulate the rest!


a <- as_tibble(all_preds) %>%
    mutate(sim = 1:n()) %>%
    pivot_longer(-sim) %>%
    mutate(name = gsub("^([[:digit:]])+\\.", "", name))

b <- rep_data_frame(pred_frame, dim(all_preds)[1] *dim(all_preds)[3])

p <- bind_cols(a,b)

#posthoc calc
diffs <- p %>%
    group_by(sim, name, species) %>%
    arrange(sex) %>%
    summarize(diff = value[1] - value[2]) %>%
    ungroup() %>%
    group_by(species, name) %>%
    summarize(mean_diff = mean(diff),
              lq = quantile(diff, prob = 0.025),
              uq = quantile(diff, prob = 0.975))

ggplot(diffs,
       aes(x = species,
           y = mean_diff,
           ymin = lq,
           ymax = uq)) +
    geom_point() +
    geom_linerange() +
    coord_flip() +
    facet_wrap(vars(name), scale = "free") +
    geom_hline(yintercept = 0, lty = 2) 

Hrm- I might wrap the prediction simulations into a function that gives an output akin to tidybayes. I've been playing around with a generalized package for this kind of simulation - https://github.com/jebyrnes/sinterval - although I haven't worked on it in a bit. This seems like a good general method for it! Just need a way to get the correct prediction intervals (a pain for non-gaussians, I know), and I could wrap it all in there?

jebyrnes commented 2 years ago

Gist for a function get_sim_fit_gllvm() - https://gist.github.com/jebyrnes/a60ec172a391ed8dd3d44cbe1eed0bed

fhui28 commented 2 years ago

Yeah go for it @jebyrnes you can use the code however you like =D As mentioned before though, it has not dealt with the issue of predictions involving LVs, and in general what one does with a simulation-based approach in random effects is a good question to think about [if I had more time...sigh].

Nice going with https://github.com/jebyrnes/sinterval ! I know of a few places/packages that have done this for specific models, but nothing more general so an awesome idea to supply one to practitioners! Whether you call this approach simulation or parametric bootstrap or dirty Bayes or something else I will leave to future pub chat. You probably also want to check out the relative new gam.mh function in the mgcv package.

jebyrnes commented 2 years ago

Thanks! Yeah, I think there's a deeper conversation to be had about LVs versus observed variables.... and modeling effects on LVs versus OVs (which warms my SEM heart). It would be great to get deeper into it..... particularly for calculating posthocs here. I'm guessing that the CIs will be a bit wide, as they don't take those correlations into account, which means that the results will be conservative, but, so it goes for the moment in methodological development? (Yes, I am thinking about these issues because of a project - looking at change in kelp forest communities from surveys in 2014 and 2018).

Thanks for your kind comments about sinterval (and I'm on board for dirty Bayes). I'm waffling on pushing it to CRAN as-is, or trying to at least get lme4 into it via merTools::predict_interval before I push it in. I'm hoping that I can get some community involvement to then keep pushing it forward for other model types, like broom. A guy can dream, I suppose?

BertvanderVeen commented 2 years ago

With respect to the emmeans subject above.

I have now looked into the possibility of emmeans support. However, unlike in many other r packages, gllvm has multiple respondes and that concept is difficult to unify with the emmeans approach. I don't see this happening in the near future.

jebyrnes commented 2 years ago

Drat - no way to do this with an lapply or somesuch? Or even choosing a specific target response variable?

BertvanderVeen commented 2 years ago

Yes, that would be the way to go, but emmeans does not provide an interface for that.. So that we would have to build a wrapper around emmeans, and fool it into thinking that we have a univariate response.

Besides that it's the recent addition of reduced rank regression that makes things difficult, as it requires calculating the covariance matrix of the reduced rank approximated fixed effects, and the usual fixed effects, which needs more time. Specifically, the covariance matrix for quadratic response model + reduced rank regression includes a very tedious calculation that I have yet to code up.

So supporting emmeans is far from impossible, but it's also definitely not low hanging fruit.