JuliaStats / MixedModels.jl

A Julia package for fitting (statistical) mixed-effects models
http://juliastats.org/MixedModels.jl/stable
MIT License

StatsBase Modelling API #514

Open palday opened 3 years ago

palday commented 3 years ago
dmbates commented 3 years ago

I can work on some of these if that would help.

palday commented 3 years ago

That would be great.

dmbates commented 3 years ago

I'm wondering how far down the rabbit hole to jump with this one. (It will probably be very far if past experience is a guide.) It seems we are inconsistent on some methods between LMMs and GLMMs. In particular, for `dof` an LMM doesn't add in but the method for a GLMM does.

So my plan is:

Does that seem to be giving myself enough rope to mess things up?

palday commented 3 years ago

Yes. And I think it makes sense to have the StatsBase API just redirect to various properties. That way people can still abstract away via the functional interface when e.g. writing plotting recipes and the like.
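A minimal sketch of that "redirect to properties" pattern, assuming hypothetical property names (`m.β`, `m.loglikelihood`) that may not match the actual internals:

```julia
import StatsBase

# Sketch only: forward each StatsBase generic to a model property, so that
# downstream code (plotting recipes, tables, etc.) can stay with the
# functional interface. The property names here are illustrative
# assumptions, not the real fields of LinearMixedModel.
StatsBase.coef(m::LinearMixedModel) = m.β
StatsBase.loglikelihood(m::LinearMixedModel) = m.loglikelihood
```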

I'm not sure how well some of these are defined / meaningful for the mixed-model case. The RSS is clearly defined, but the PWRSS is much more interesting. I almost feel like adding these functions is as much a docs issue as writing a few short functions.

One other thing I would like to add to the docs is a more explicit discussion of why we have the actual deviance and log likelihood for GLMM, but only -2 log likelihood for LMM.

dmbates commented 3 years ago

Should the functions for which we have defined methods be re-exported?

palday commented 3 years ago

Yes.

dmbates commented 3 years ago

What do you think we should do about our inconsistency on whether the covariance parameters should be counted in `dof_residual`? We do count them consistently in `dof`, so the question is whether we think the relationship

```julia
dof_residual(m) = nobs(m) - dof(m)
```

should be preserved (I don't think so, because `dof` counts 1 for the per-observation noise variance) or we should just use

```julia
dof_residual(m) = nobs(m) - m.feterm.rank
```

The more ambitious definition would be to evaluate `sum(leverage(m))` as the trace of the hat matrix. Of course, then we would need to explain and contend with all the "but the right answer from SAS/lme4 is ..." comments.

palday commented 3 years ago

I think the question should be: what do we want for the default? Because we can also support additional methods `dof_residual(::LinearMixedModel; variant::Symbol)` where `variant` is one of `(:leverage, :ferank, :nparam)`.
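A hedged sketch of how that keyword method could look, reusing `m.feterm.rank` from earlier in this thread; the default variant chosen here is only a placeholder, since that default is exactly the open question:

```julia
import StatsBase

# Sketch of the variant keyword; which variant should be the default is
# the open design question, :ferank is a placeholder.
function StatsBase.dof_residual(m::LinearMixedModel; variant::Symbol=:ferank)
    n = nobs(m)
    variant === :nparam   && return n - dof(m)            # count all parameters
    variant === :ferank   && return n - m.feterm.rank     # fixed-effects rank only
    variant === :leverage && return n - sum(leverage(m))  # trace of the hat matrix
    throw(ArgumentError("unknown variant: $variant"))
end
```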

At one point you made a pretty compelling argument for `sum(leverage(m))`. If I'm not mistaken, that would also allow us to use t instead of z in our `CoefTable` (of course, we would then need a direct dependency on Distributions.jl to compute the p-values). (If I am mistaken, please do explain which things I'm mixing up!) But then we run into computational complexity issues with `leverage`.

But regardless of the path we go here, we should probably add a page to the docs on degrees of freedom. You have some good old posts in various mailing lists and fora that could serve as the basis, once again driving home the point that a lot of "obvious" things in classical regression just break down in the mixed-effects context. This will also hopefully give us a place to point to on why Satterthwaite and Kenward-Roger approximations may not help as much as people think.

palday commented 3 years ago

Looking at the remaining items, here are my thoughts:

dmbates commented 3 years ago

Several of these may require us to decide if we want to generalize the evaluation of the terms in a model or generalize the concept.

For example, the formula for Cook's Distance involves the standardized residuals and the leverage values. We need to make sure that those are evaluated in ways that are consistent. In other words, if we use the penalized residual sum of squares in standardizing the residuals (I'm not even sure that would make sense) then we need to decide what the effective degrees of freedom for residuals should be, etc. So I think more will be involved here than trying to generalize terms in a formula.
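For reference, the classical (fixed-effects-only) Cook's distance shows exactly the two ingredients that would need consistent mixed-model analogues, the residual scaling and the leverage:

```julia
# Classical Cook's distance for observation i:
#   D_i = (e_i^2 / (p * s²)) * h_i / (1 - h_i)^2
# where e_i is the residual, h_i the leverage, p the number of coefficients,
# and s² the residual variance estimate. The mixed-model question above is
# which residual scaling (RSS vs. PWRSS) and which effective residual
# degrees of freedom to plug in here.
cooks_distance(e, h, p, s²) = (e^2 / (p * s²)) * h / (1 - h)^2
```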

For the cross model matrix there is a simple answer of `p = m.feterm.rank; Symmetric(view(last(m.A), 1:p, 1:p))` if we ignore the blocking factors, which I think we probably should.
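That one-liner could be wrapped as a method for the StatsBase generic `crossmodelmatrix`; this is a sketch assuming the block layout of `m.A` described above:

```julia
import StatsBase
using LinearAlgebra: Symmetric

# Sketch: take the upper-left p×p block of the last block of m.A, per the
# suggestion above, ignoring the blocking factors.
function StatsBase.crossmodelmatrix(m::LinearMixedModel)
    p = m.feterm.rank
    return Symmetric(view(last(m.A), 1:p, 1:p))
end
```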

For `confint` I think we should use the Wald approximation and a docstring to promote `shortestcovint`. For the fixed-effects parameters this will likely be sufficient. For confidence intervals on the variance components or correlation parameters I think we need to insist on using a bootstrap sample or the yet-to-be-written profiling results.
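A sketch of the Wald interval for the fixed effects, assuming the standard `coef`/`stderror` accessors and hard-coding the 95% normal quantile to keep the example free of a Distributions.jl dependency:

```julia
# Wald 95% CI sketch: β̂ ± z_{0.975} * SE. As noted above, this is only
# appropriate for the fixed-effects parameters; for variance components or
# correlation parameters, use a bootstrap sample or profiling instead.
function wald_confint95(m)
    z = 1.959963984540054  # quantile(Normal(), 0.975), hard-coded for the sketch
    β, se = coef(m), stderror(m)
    return β .- z .* se, β .+ z .* se
end
```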

For the information matrix I think we should pass. If attention is only on the fixed-effects parameters this is essentially like evaluating the cross model matrix. If you throw in variance components and correlation parameters then it is probably not a good idea to report it because it assumes that the log-likelihood is quadratic in the parameters of interest, which it isn't.