palday opened this issue 3 years ago
I can work on some of these if that would help.
That would be great.
I'm wondering how far down the rabbit hole to jump with this one. (It will probably be very far, if past experience is a guide.) It seems we are inconsistent on some methods between LMMs and GLMMs. In particular, for `dof` an LMM doesn't add in `nθ`, but the method for a GLMM does.
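To make the inconsistency concrete, here is a minimal sketch of the two counting schemes as described above, where `p` stands for the rank of the fixed-effects model matrix and `nθ` for the number of covariance parameters (illustrative arithmetic only, not the package source):

```julia
# Illustrative only: how the two methods are described as counting parameters.
# p  = rank of the fixed-effects model matrix
# nθ = number of covariance parameters
dof_lmm(p, nθ)  = p + 1        # LMM method: omits nθ
dof_glmm(p, nθ) = p + nθ + 1   # GLMM method: includes nθ
```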
So my plan is: define shared methods in `src/mixedmodel.jl` instead of separate methods in `src/linearmixedmodel.jl` and `src/generalizedlinearmixedmodel.jl`.
Does that seem to be giving myself enough rope to mess things up?
Yes. And I think it makes sense to have the StatsBase API just redirect to various properties. That way people can still abstract away via the functional interface when e.g. writing plotting recipes and the like.
I'm not sure how well some of these are defined or how meaningful they are for the mixed-model case. The RSS is clearly defined, but the PWRSS (penalized weighted residual sum of squares) is much more interesting. I almost feel like adding these functions is as much a docs issue as writing a few short functions.
One other thing I would like to add to the docs is a more explicit discussion of why we have the actual deviance and log likelihood for GLMM, but only -2 log likelihood for LMM.
Should the functions for which we have defined methods be re-exported?
Yes.
What do you think we should do about our inconsistency on whether the covariance parameters should be counted in `dof_residual`? We do count them consistently in `dof`, so the question is whether we think the relationship `dof_residual(m) = nobs(m) - dof(m)` should be preserved (I don't think so, because `dof` counts 1 for the per-observation noise variance) or whether we should just use `dof_residual(m) = nobs(m) - m.feterm.rank`.
The more ambitious definition would be to evaluate `sum(leverage(m))` as the trace of the hat matrix. Of course, then we would need to explain and contend with all the "but the right answer from SAS/lme4 is ..." comments.
I think the question should be: what do we want as the default? Because we could also support additional methods like `dof_residual(::LinearMixedModel; variant::Symbol)` where `variant in (:leverage, :ferank, :nparam)`.
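A hedged sketch of what that keyword-variant method might look like (the dispatch and `variant` names are taken from this thread; this is a proposal sketch, not an implemented API):

```julia
# Hypothetical sketch only. `nobs`, `dof`, `leverage`, and `m.feterm.rank`
# are the accessors discussed above; the default variant is an open question.
function dof_residual(m::LinearMixedModel; variant::Symbol=:nparam)
    variant === :nparam   && return nobs(m) - dof(m)
    variant === :ferank   && return nobs(m) - m.feterm.rank
    variant === :leverage && return nobs(m) - sum(leverage(m))
    throw(ArgumentError("unknown variant: $variant"))
end
```

The default could then stay whichever definition we pick, while users who know what they want can opt into one of the other counting schemes explicitly.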
At one point you made a pretty compelling argument for `sum(leverage(m))`. If I'm not mistaken, that would also allow us to use `t` instead of `z` in our `CoefTable` (of course, we would then need a direct dependency on Distributions.jl to compute the p-values). (If I am mistaken, please do explain which things I'm mixing up!) But then we run into computational complexity issues with `leverage`.
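For concreteness, the p-value computation that would require Distributions.jl might look something like this (a sketch under the assumptions above; the degrees of freedom `ν` would come from some effective residual count such as `nobs(m) - sum(leverage(m))`, and `pvalue_t` is a hypothetical helper name):

```julia
using Distributions

# Two-sided p-value from a t reference distribution instead of the
# normal approximation currently used for the z statistic.
pvalue_t(t, ν) = 2 * ccdf(TDist(ν), abs(t))
```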
But regardless of the path we take here, we should probably add a page to the docs on degrees of freedom. You have some good old posts in various mailing lists and fora that could serve as the basis, once again driving home the point that a lot of "obvious" things in classical regression just break down in the mixed-effects context. This would also give us a place to point to on why Satterthwaite and Kenward-Roger approximations may not help as much as people think.
Looking at the remaining items, here are my thoughts:
- `A` (with the response trimmed off)?
- `pwrss` instead. Or have `rss` be a 'deprecated' (before ever being defined) alias to `pwrss`.
- `loglikelihood(model, :)`. I got this added to the StatsBase API a while back so that I could implement Vuong's Test in MixedModelsExtras, but I consider this fairly low priority.
- `shortestcovint`? Or just error and say "check out the bootstrap + `shortestcovint`"?

Several of these may require us to decide if we want to generalize the evaluation of the terms in a model or generalize the concept.
For example, the formula for Cook's Distance involves the standardized residuals and the leverage values. We need to make sure that those are evaluated in ways that are consistent. In other words, if we use the penalized residual sum of squares in standardizing the residuals (I'm not even sure that would make sense) then we need to decide what the effective degrees of freedom for residuals should be, etc. So I think more will be involved here than trying to generalize terms in a formula.
For the cross model matrix there is a simple answer of `p = m.feterm.rank; Symmetric(view(last(m.A), 1:p, 1:p))` if we ignore the blocking factors, which I think we probably should.
For `confint` I think we should use the Wald approximation and a docstring to promote `shortestcovint`. For the fixed-effects parameters this will likely be sufficient. For confidence intervals on the variance components or correlation parameters, I think we need to insist on using a bootstrap sample or the yet-to-be-written profiling results.
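A minimal sketch of the Wald interval for the fixed effects (assuming the usual coefficient and standard-error vectors are available; `wald_confint` is a hypothetical helper name, and a normal reference is used here, per the z-vs-t discussion above):

```julia
using Distributions

# Wald interval: β̂ ± z_{1-α/2} * SE(β̂), elementwise over the coefficients.
function wald_confint(β, se; level=0.95)
    z = quantile(Normal(), 1 - (1 - level) / 2)
    return (β .- z .* se, β .+ z .* se)
end
```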
For the information matrix I think we should pass. If attention is only on the fixed-effects parameters this is essentially like evaluating the cross model matrix. If you throw in variance components and correlation parameters then it is probably not a good idea to report it because it assumes that the log-likelihood is quadratic in the parameters of interest, which it isn't.