Open MaximilianNuber opened 5 months ago
P.S. I am posting this question in MixedModels.jl because in GLM.jl I would be able to use X, y as input for lm or glm. I haven't found this possibility for MixedModels, but I do see that for specifying random effects we need the formula language. I just don't know how to do it in this case.
What you want to do is related to profiling the log-likelihood with respect to the fixed-effects coefficients, for which the code is in src/profile/fixefpr.jl. That code is more general than you need but it shows that you can create a reduced model by creating a fixed-effects term and then copying the response, the reterm (which needs a deepcopy or the special copy method in the file mentioned above) and the formula. Try this.
using MixedModels
function dropcol(m::LinearMixedModel, j::Integer)
    fe = m.feterm                             # model matrix X, column names, rank, etc.
    x = fe.x                                  # model matrix
    fe.rank == size(x, 2) || throw(ArgumentError("m.X is not of full column rank"))
    notj = deleteat!(collect(axes(x, 2)), j)  # indirectly checks that j is in range
    feterm = MixedModels.FeTerm(x[:, notj], fe.cnames[notj])
    θbak = m.θ                                # save a copy of the parameter estimates
    setθ!(m, m.optsum.initial)                # ensure m is at initial values
    mnew = LinearMixedModel(collect(m.y), feterm, deepcopy(m.reterms), m.formula)
    objective!(m, θbak)                       # restore parameter values in m
    return mnew
end
m1 = let f = @formula y ~ 1 + dept + service + (1|s) + (1|d)
    fit(MixedModel, f, MixedModels.dataset(:insteval))
end
m1a = fit!(dropcol(m1, 2))
MixedModels.likelihoodratiotest(m1a, m1)
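Before choosing j, it can help to check which column index corresponds to which fixed-effects coefficient. A small sketch using coefnames on the fitted model:

```julia
# List the fixed-effects coefficient names with their column indices,
# so that the j passed to dropcol refers to the intended coefficient.
for (j, name) in enumerate(coefnames(m1))
    println(j, " => ", name)
end
```

The index printed next to the coefficient of interest is then the j to pass to dropcol.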
Dear @dmbates,
Thank you so much, this works perfectly.
If I may, I would like to ask a follow-up question about the correctness of my application to GeneralizedLinearMixedModel: I applied your dropcol to GeneralizedLinearMixedModel by adjusting the function to take the response vector as an argument. I assume that the LMM within a GLMM carries a response vector adjusted to the Distribution of the GLMM, so we need to pass in the original response.
function dropcol2(m::LinearMixedModel, j::Integer, y2)
    fe = m.feterm                             # model matrix X, column names, rank, etc.
    x = fe.x                                  # model matrix
    fe.rank == size(x, 2) || throw(ArgumentError("m.X is not of full column rank"))
    notj = deleteat!(collect(axes(x, 2)), j)  # indirectly checks that j is in range
    feterm = MixedModels.FeTerm(x[:, notj], fe.cnames[notj])
    θbak = m.θ                                # save a copy of the parameter estimates
    setθ!(m, m.optsum.initial)                # ensure m is at initial values
    mnew = LinearMixedModel(collect(y2), feterm, deepcopy(m.reterms), m.formula)
    MixedModels.objective!(m, θbak)           # restore parameter values in m
    return mnew
end
Then we construct a new GeneralizedLinearMixedModel, following the constructor in MixedModels.jl/src/generalizedlinearmixedmodel.jl:
function dropcol_glmm(glmm::GeneralizedLinearMixedModel, j::Integer)
    MixedModels.unfit!(glmm)
    LMM = dropcol2(glmm.LMM, j, glmm.y)
    wts = []                     # no prior weights in this application
    offset = []                  # no offset in this application
    T = eltype(LMM.Xymat)
    y = copy(LMM.y)
    d = Bernoulli()              # hard-coded for my use case
    l = canonicallink(d)
    dofit = size(LMM.X, 2) != 0  # GLM.jl kwarg
    gl = glm(LMM.X, y, d, l;
        wts = convert(Vector{T}, wts),
        dofit,
        offset = convert(Vector{T}, offset),
    )
    β = dofit ? coef(gl) : T[]
    u = [fill(zero(eltype(y)), MixedModels.vsize(t), MixedModels.nlevs(t)) for t in LMM.reterms]
    vv = length(u) == 1 ? vec(first(u)) : similar(y, 0)
    res = GeneralizedLinearMixedModel{T,typeof(d)}(
        LMM,
        β,
        copy(β),
        LMM.θ,
        copy.(u),
        u,
        zero.(u),
        gl.rr,
        similar(y),
        oftype(y, wts),
        similar(vv),
        similar(vv),
        similar(vv),
        similar(vv),
    )
    return res
end
As an ad hoc example:
using DataFrames

data = DataFrame(MixedModels.dataset(:insteval))
data.service = map(x -> ifelse(x == "N", 0, 1), data.service)
fm = @formula(service ~ studage + lectage + (1|s))
m = GeneralizedLinearMixedModel(fm, data, Bernoulli())
fit!(m; fast = true)
Generalized Linear Mixed Model fit by maximum likelihood (nAGQ = 1)
service ~ 1 + studage + lectage + (1 | s)
Distribution: Bernoulli{Float64}
Link: LogitLink()
logLik deviance AIC AICc BIC
-47205.8133 94411.6267 94431.6267 94431.6297 94523.6663
Variance components:
Column Variance Std.Dev.
s (Intercept) 0.437793 0.661659
Number of obs: 73421; levels of grouping factors: 2972
Fixed-effects parameters:
────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|)
────────────────────────────────────────────────────
(Intercept) -0.123563 0.0285926 -4.32 <1e-04
studage: 4 -0.387704 0.0419093 -9.25 <1e-19
studage: 6 -1.04746 0.0421811 -24.83 <1e-99
studage: 8 -1.27521 0.0452258 -28.20 <1e-99
lectage: 2 0.184468 0.0233174 7.91 <1e-14
lectage: 3 0.530911 0.0279009 19.03 <1e-80
lectage: 4 0.718081 0.0297538 24.13 <1e-99
lectage: 5 1.16544 0.0340265 34.25 <1e-99
lectage: 6 1.52207 0.0301753 50.44 <1e-99
────────────────────────────────────────────────────
m2 = dropcol_glmm(m, 2)
fit!(m2, fast = true)
Generalized Linear Mixed Model fit by maximum likelihood (nAGQ = 1)
service ~ 1 + studage + lectage + (1 | s)
Distribution: Bernoulli{Float64}
Link: LogitLink()
logLik deviance AIC AICc BIC
-47294.2634 94588.5267 94606.5267 94606.5292 94689.3624
Variance components:
Column Variance Std.Dev.
s (Intercept) 0.406962 0.637936
Number of obs: 73421; levels of grouping factors: 2972
Fixed-effects parameters:
────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|)
────────────────────────────────────────────────────
(Intercept) -0.298664 0.0107902 -27.68 <1e-99
studage: 6 -0.926317 0.0167307 -55.37 <1e-99
studage: 8 -1.0441 0.0178362 -58.54 <1e-99
lectage: 2 0.193725 0.0108125 17.92 <1e-71
lectage: 3 0.475043 0.0127252 37.33 <1e-99
lectage: 4 0.674609 0.0137594 49.03 <1e-99
lectage: 5 1.11943 0.0158697 70.54 <1e-99
lectage: 6 1.46943 0.0137901 106.56 <1e-99
────────────────────────────────────────────────────
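The reduced fit can then be compared to the full model. MixedModels.likelihoodratiotest(m2, m) should work as in the LMM case, but the statistic can also be checked by hand from the deviances printed above (a sketch, assuming a 1-df chi-square for the single dropped column):

```julia
using Distributions  # for Chisq and ccdf

dev_reduced = 94588.5267  # deviance of m2 (column 2, "studage: 4", dropped)
dev_full    = 94411.6267  # deviance of m
chisq = dev_reduced - dev_full          # likelihood-ratio statistic
pval = ccdf(Chisq(1), chisq)            # upper-tail chi-square probability
```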
My point in all of this is to recreate the approach of the MAST package for single-cell analysis, where I need to fit both a LinearMixedModel and a logistic regression.
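For context, my understanding is that MAST's hurdle test combines the likelihood-ratio statistics of the continuous and discrete components by summing the chi-square statistics and their degrees of freedom. A hedged sketch with hypothetical component statistics:

```julia
using Distributions

# Hypothetical chi-square statistics and degrees of freedom from the
# two component tests (these would come from the LMM and the logistic fits).
chisq_cont, df_cont = 12.3, 1
chisq_disc, df_disc = 4.7, 1

chisq_hurdle = chisq_cont + chisq_disc
p_hurdle = ccdf(Chisq(df_cont + df_disc), chisq_hurdle)
```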
Again, thank you for your help.
Best, Max
I think that will work. We could try to make it easier to construct a GeneralizedLinearMixedModel from the feterm and reterms, but I am a bit reluctant to make that too easy. In general it is not a good idea to concentrate on individual coefficients from a categorical covariate, but I can understand that in single-cell analysis it may be a convenient way to phrase the problem.
I did write a reduced set of types and methods to fit a logistic GLMM in https://embraceuncertaintybook.com/aGHQ.html. If you are interested, we could try to modify those for a special-purpose GLMM implementation. If this seems worthwhile to you, could you suggest where I should start reading about the MAST approach? Is the 2015 Genome Biology paper the best place to start, or should I start with the vignettes in the Bioconductor package?
Feel free to contact me at my gmail.com (same username as on github) address to take this discussion offline.
@MaximilianNuber Is it correct that there is only one random-effects term in the LMMs and GLMMs that you wish to fit, and that it is of the form (1|samp)? If so, there are considerable simplifications that can be made in fitting the model and in refitting it with individual columns of X removed.
@dmbates Yes, the only random effect in the model is of the form (1|sample), with the goal of retaining the sample structure. For larger datasets, or when combining several datasets such as my own, I have wondered whether the pool, sequencing run, or dataset should be added as well, but I haven't seen this attempted elsewhere, so I would keep it at the sample level for now.
I am preparing an email with more in-depth information on your question.
Dear all,
I am fitting mixed models for large single-cell datasets. One approach to testing a covariate is to drop a single coefficient from the model matrix, namely the one belonging to one level of a categorical variable. I have not been able to find this possibility in Julia.
As an example, let me use the iris dataset. It is not a mixed model, but in principle the problem is the same.
We have the full model:
full = lm(@formula(SepalLength ~ Species), iris)
My goal would be to drop Species: setosa from the formula or model matrix for another linear model. This nested model would then be compared to the full model by a likelihood ratio test. Everything I have found so far drops the Species variable entirely, which is not what I want. I have also played around with contrasts, so that in this example the model-matrix column for Species: setosa would be all zeros, but I got the warning that the model is not of full rank. Is there any solution to this, or possibly a workaround?
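In code, what I am after is roughly the following, sketched with GLM.jl's matrix interface and a hand-rolled likelihood-ratio test (the column index 2 is only for illustration; which column corresponds to which level depends on the contrast coding):

```julia
using GLM, RDatasets, Distributions

iris = dataset("datasets", "iris")
full = lm(@formula(SepalLength ~ Species), iris)

X = modelmatrix(full)
y = response(full)

# Refit with one column of the model matrix removed,
# using the lm(X, y) matrix method of GLM.jl.
keep = setdiff(1:size(X, 2), [2])
reduced = lm(X[:, keep], y)

# Hand-rolled likelihood-ratio test with 1 degree of freedom.
lrstat = 2 * (loglikelihood(full) - loglikelihood(reduced))
pval = ccdf(Chisq(1), lrstat)
```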
Thank you for any help, Max