GLMMs - Githubissues

kleinschmidt commented 9 years ago

I'm wondering what, if any, plans there are for support of GLMMs in this package.

I'd be happy to help, but after doing some reading about the general methods that are used to approximate the likelihood for GLMMs in lme4 I'm thinking that honestly it will take a lot to get myself up to speed. However, much of the data I deal with is binary or count data, so I'm highly motivated (as are some of my colleagues) and have a bit more free time over the summer to work on it.

The most obvious place, to me, to start is to look at the source code for lme4, but as far as I know that's verboten because it's GPL-licensed and so anything derived from it can't be included in this package which is MIT-licensed. Is that correct?

dmbates commented 9 years ago

Thanks for your interest in this package.

GLMMs are definitely on the "To Do" list but not at the top yet. I am currently working on reformulating the underlying penalized least squares (PLS) problem for linear mixed models (LMMs). To avoid making too big a mess of this package I created another package, ReTerms, for that work.

As you point out, the lme4 code is under the GPL and hence code derived from it would also need to fall under the GPL. Thus it is best to avoid reading the GPLd code. I can base my code on those methods because I wrote much of the lme4 code and developed the methods. In any case, the lme4 code is not easy to read and I wouldn't start with that.

One thing that would help is to get an idea of what kinds of models and data sets you use. There are always trade-offs in where to concentrate and whether to aim for full generality or to concentrate on a few special cases. I have the feeling that the ability to fit Bernoulli, Binomial and Poisson families with the canonical links would encompass most applications but I have no objective evidence to back this up.

We can carry on this discussion via email if you prefer.

kleinschmidt commented 9 years ago

Let's continue via email, but I'll note here for the record the kind of data I and my colleagues are typically dealing with.

All of the data that I (and, AFAIK, my colleagues) are analyzing with GLMMs is covered by that case (canonical link functions and Bernoulli, Binomial, & Poisson families). Here's an example of a dataset I'm working with at the moment, in the form of an R package: https://github.com/kleinschmidt/phonetic-sup-unsup

In the full dataset there are ~90k observations, and in the GLMMs I think I had something like four fixed effect predictors (plus all their interactions) with random slopes for two of them (plus one interaction) at one grouping level. It wasn't an issue in this experiment, but crossed random effects (with random slopes) are very common in our work, which is psycholinguistics and has subjects and items as grouping factors. So that's the main, non-trivial concern for us.

mguzmann commented 9 years ago

Would it be possible to follow the progress of this? I am very interested since I use GLMMs in a lot of my work...

2015-06-15 15:25 GMT+02:00 Dave Kleinschmidt notifications@github.com:

Let's continue via email, but I'll note here for the record the kind of data I and my colleagues are typically dealing with.

All of the data that I (and, AFAIK, my colleagues) are analyzing with GLMMs is covered by that case (canonical link functions and Bernoulli, Binomial, & Poisson families). Here's an example of a dataset I'm working with at the moment, in the form of an R package: https://github.com/kleinschmidt/phonetic-sup-unsup

In the full dataset there are ~90k observations, and in the GLMMs I think I had something like four fixed effect predictors (plus all their interactions) with random slopes for two of them (plus one interaction) at one grouping level. It wasn't an issue in this experiment, but crossed random effects are very common in our work, which is psycholinguistics and has subjects and items as grouping factors. So that's the main, non-trivial concern for us.

— Reply to this email directly or view it on GitHub https://github.com/dmbates/MixedModels.jl/issues/31#issuecomment-112069214 .

dmbates commented 9 years ago

Some thoughts on what needs to be done to incorporate GLMMs. These are technical considerations that may seem gibberish but do relate to the overall difficulty of the task.

Tasks for extending the MixedModels package to fit GLMMs:

[ x] Incorporate some form of distribution and link types.
- [x ] Either require the GLM package or reimplement the necessary types and methods in this package.
- [x ] Should use of the canonical link be treated as a special case?
Most of the applications of GLMs and GLMMs are for binary responses or responses representing counts. The distributions used are Bernoulli (Binomial when trials are amalgamated) for binary and Poisson for counts. The logit link for Bernoulli and the log link for Poisson are canonical links that can be derived from the form of the distribution. Part of the IRLS algorithm for general link functions, such as a probit link for binary data, can be by-passed for canonical links. Restricting to canonical links makes things a bit simpler but is, well, a restriction. Is it important to allow for non-canonical link functions?
[ x] Extend methods for penalized least squares (PLS) to penalized iteratively reweighted least squares (PIRLS).

The PLS problem has a direct solution. PIRLS is an iterative algorithm in which each iteration requires a PLS solution with a new set of weight and an offset. The PIRLS optimization is performed within each evaluation of the log-likelihood to be optimized w.r.t. the covariance parameters and the fixed effects. An important consideration is the convergence criterion to be used in PIRLS. It is probably a waste to spend too much time getting to the best possible answer of the PIRLS problem, only to change parameters and start over again.
[x ] Select the approximation(s) to the log-likelihood to be used at the conditional modes of the random effects

The log-likelihood is defined as an integral of the unscaled conditional density of the random effects. In general this is a high dimensional integral. The "conditional modes" of the random effects, determined by PIRLS, are the values that maximize this unscaled conditional density. (When determining the location of the maximum it doesn't matter that you only know the density up to a constant factor.) If the random effects are determined by one factor only, say Subject, then it is possible to factor this integral into a product of univariate or at least low-dimensional integrals for which Gauss-Hermite methods can be used. If the random effects are defined with respect to, say, Subject and Item, then you are stuck with the high-dimensional integral for which the Laplace approximation is about the best you can do.

(Update: Currently using the Laplace approximation)

[x ] Shall the fixed-effects parameters be part of the general optimization or part of the PIRLS evaluation?

(Update: Currently, fixed-effects are always part of the general optimization)

In the [lme4 package](https://github.com/lme4/lme4) for [R](http:www.R-project.org) we choose to do both.  In the first stage the fixed-effects parameters are part of the PIRLS optimization and only the covariance parameters are in the general optimizer.  In the second stage only the random effects are in the PIRLS optimization and the fixed effects are in the general optimizer.  I think it would be simpler to omit the first stage for the time being and keep only the random effects in the PIRLS optimization.

[ x] Create appropriate types and extractors for GLMMs

kleinschmidt commented 9 years ago

Thanks, having that list is very helpful. I think that requiring GLM.jl seems like the best way to go, rather than duplicating code between the two packages. I also think that, to start, it might be good to just always assume the canonical link. That will simplify development and in no way precludes us from including support for other link functions later on.

I've been reading the Pinheiro & Chao (2006) paper that describes the Laplacian and AGQ approximations to the likelihood and includes a discussion of finding the conditional modes via PIRLS. I take it this is what you're suggesting?

I don't have a strong opinion about whether to include the fixed effects themselves in the PIRLS optimization. I take it that, if they're not, then they will be optimized by the general optimizer used to update the variance-covariance parameters based on the approximate likelihood?

dmbates commented 8 years ago

GLMMs are now under active development.

I have created a branch called glmms which includes the GLM package. I'm currently working on the PIRLS algorithm.

dmbates commented 8 years ago

I have merged the glmms branch. Still working on the PIRLS algorithm implementation.

JuliaStats / MixedModels.jl

GLMMs #31