JuliaStats / MixedModels.jl

A Julia package for fitting (statistical) mixed-effects models
http://juliastats.org/MixedModels.jl/stable
MIT License
407 stars 48 forks source link

Inconsistent GeneralizedLinearMixedModel and LinearMixedModel behavior #185

Closed Nosferican closed 5 years ago

Nosferican commented 5 years ago

Example data MixedModels is master.

using CSV, DataFrames, MixedModels
data = CSV.read(data)
categorical!(data, [:id, :sequence, :period, :formulation])
# This errors
fit!(GeneralizedLinearMixedModel(@formula(AUC ~ formulation + sequence + period + (1|id)), data, Normal()))
ArgumentError: invalid NLopt arguments: zero step size

# This works
fit!(LinearMixedModel(@formula(AUC ~ formulation + sequence + period + (1|id)), data))

is that intended?

Nosferican commented 5 years ago

Bump. I was waiting on trying to understand this issue, but if the release might happen before that gets addressed I can add one PR for fit(::LinearMixedModel, formula, data, args..., kw...).

dmbates commented 5 years ago

Now that the v2.0.0 release has occurred, would you like to revisit this issue?

It is a real problem to try to fit a linear model or linear mixed model as a generalized linear model or a generalized linear mixed model because the direct calculation of the least squares or penalized least squares solution is phrased as a iterative calculation. Evaluation of convergence criteria, etc. can get tricky.

One workaround is to catch attempts to fit a GLM or GLMM with the Normal() family and its canonical Identity() link and fit it as a LM or LMM. That is unsatisfactory because you shouldn't start with a GeneralizedLinearMixedModel object, invoke fit! on it, and return a LinearMixedModel object.

Another workaround is to return an error that effectively says "Don't do that" and refer the user to LinearMixedModel. That tends to make the user a bit grumpy.

Another approach is to change the fit! method to catch the combination of Normal() and IdentityLink, fit the corresponding LinearMixedModel (a GeneralizedLinearMixedModel actually contains a LinearMixedModel structure) and dress it up as if it had been fit as a GLMM. I would probably also warn the user that this is a silly way of doing things.

Nosferican commented 5 years ago

I think I like the third approach the best. I agree with the assessment of the first and second approaches.

Would it be possible to have the GeneralizedLinearMixedModel fields with sensible values given that it is a linear mixed model? Doing a warning is fine. I guess since there isn't a straight way of treating the linear model as a generalized linear model, we could do a minor or patch version to support

fit(LinearMixedModel, formula, data, args..., kw...)

I can make the PR for it.

dmbates commented 5 years ago

What I was thinking of doing is supporting

fit!(GeneralizedLinearMixedModel(formula, data, Normal()))

by detecting that case in the constructor for GeneralizedLinearMixedModel and short-circuiting the fit! procedure. It would be necessary to fill out the fields in the GLMM struct but there is no need to worry about using a two stage optimization.

Let me draft something and you can take a look.

dmbates commented 5 years ago

I started to fake a GeneralizedLinearMixedModel for LinearMixedModel cases and it got sufficiently complicated and convoluted that I decided to punt and just throw an ArgumentError for that case. See the LMMasGLMM branch. Ler me know if you think it is important to patch this up rather than throw an error. I am perhaps less inclined than most to try to produce a result for a user, no matter how wrong-headed the request may be.

Nosferican commented 5 years ago

How about providing a generic dispatch for returning the appropriate model based on the model characteristics? For example,

MixedModel(formula, data, args..., kw...)

would return either a GeneralizedLinearMixedModel or a LinearMixedModel depending on whether the distribution is Normal with canonical link? I took a similar approach in Econometrics.jl (see example based on response type).

That way, users doesn't get the drawback of the first approach (i.e., calling one type struct and getting a different one), we can implement an error for "wrong-headed" approaches, but it would be a gentle push towards if you want this kind of behavior, use the generic approach.

dmbates commented 5 years ago

That's a good idea. I just committed a method for fit(MixedModel, f::FormulaTerm, dat, d::Distribution=Normal(); kw...) which seems to work for both LinearMixedModel and GeneralizedLinearMixedModel. Take a look at the changes in the tests.

On Fri, Aug 9, 2019 at 4:43 AM José Bayoán Santiago Calderón < notifications@github.com> wrote:

How about providing a generic dispatch for returning the appropriate model based on the model characteristics? For example,

MixedModel(formula, data, args..., kw...)

would return either a GeneralizedLinearMixedModel or a LinearMixedModel depending on whether the distribution is Normal with canonical link? I took a similar approach in Econometrics.jl (see example https://nosferican.github.io/Econometrics.jl/dev/estimators/#Ordinal-Response-Models-1 based on response type).

That way, users doesn't get the drawback of the first approach (i.e., calling one type struct and getting a different one), we can implement an error for "wrong-headed" approaches, but it would be a gentle push towards if you want this kind of behavior, use the generic approach.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/dmbates/MixedModels.jl/issues/185?email_source=notifications&email_token=AAC2UOXW4J3MW4ARXMZCMM3QDU32JA5CNFSM4IITKT2KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD36FZTI#issuecomment-519855309, or mute the thread https://github.com/notifications/unsubscribe-auth/AAC2UOSJXSVPI6HS7N4U2Y3QDU32JANCNFSM4IITKT2A .

dmbates commented 5 years ago

@Nosferican How does the code in the LMMasGLMM branch look?

Nosferican commented 5 years ago

It looks good. I am going to add something to the docs about the feature and add any missing fit method. I can commit to it and open the PR.

dmbates commented 5 years ago

@Nosferican I have given you write permission on the repository if you want to work on this repository instead of on your own fork.

Nosferican commented 5 years ago

A few things I found,

I am leaning towards making all arguments keyword arguments except for the Distribution/Link.

dmbates commented 5 years ago

It is possible to use a Normal distribution with a non-Identity link although it is more of a curiosity than a common occurrence. Cook and Weisberg used a Normal distribution and an inverse link to fit the Michaelis-Menten model, Vm*c/(K+c) where Vm and K are the parameters to be fit and c, the concentration, is a covariate, in one of their books but I haven't seen other cases.

I agree that it would be best to standardize the arguments. I noticed weights vs wt myself. I guess the question is whether such a change would require a major version bump. I think going to v3.0.0 so soon after v2.0.0 would be exaggerating the pace of development. In some ways these changes are simply fixing "infelicities" in the original code. Is it possible to deprecate the original arguments, like wt and what does that entail for semantic versioning?

dmbates commented 5 years ago

I just saw something else that will eventually need to be addressed. The BlockArrays package changed after v0.7.0 to require the underlying arrays to be the same type. This package uses heterogeneous BlockArrays as shown by the output of describeblocks and the version of BlockArrays must be capped at v0.7.0.

Only a small portion of the capabilities of BlockArrays are used for A and L in a LinearMixedModel. It is probably best to remove the dependency on BlockArrays and write custom code to organize and access the blocks.

Nosferican commented 5 years ago

I guess it would be proper to handle the Normal distribution and canonical links as I think canonical links are more of exceptions rather than the rule (maybe Probit being the most widely used non-canonical link). I did thought about the implication of breaking major. On that issue, technically nothing in the docs and public API had documented the weights with LinearMixedModel and in one instance it actually gets ignored (which would be considered a bug). We could do a deprecation stage just in case it breaks anyone's code, but it might also be fair game since it wasn't explicitly documented and would only affect the v2.0.0 users. If treated as a non public API feature it could be a minor version as it gets new features or patch if it was meant to be the same name as the GeneralizedLinearModel, but wasn't.

Nosferican commented 5 years ago

As for the BlockArrays, it seems like capping it with BlockArrays = "0.7" should be good enough for now. That could be a patch release.

dmbates commented 5 years ago

@Nosferican I think you meant "non-canonical links are more the exception than the rule", in which case I agree

I guess it would be proper to handle the Normal distribution and canonical links as I think canonical links are more of exceptions rather than the rule (maybe Probit being the most widely used non-canonical link). I did thought about the implication of breaking major. On that issue, technically nothing in the docs and public API had documented the weights with LinearMixedModel and in one instance it actually gets ignored (which would be considered a bug). We could do a deprecation stage just in case it breaks anyone's code, but it might also be fair game since it wasn't explicitly documented and would only affect the v2.0.0 users. If treated as a non public API feature it could be a minor version as it gets new features or patch if it was meant to be the same name as the GeneralizedLinearModel, but wasn't.

dmbates commented 5 years ago

As for the BlockArrays, it seems like capping it with BlockArrays = "0.7" should be good enough for now. That could be a patch release.

It's already the case that BlockArrays is capped at "0.7". I'm concerned that at some point other packages may require later versions of BlockArrays creating version resolution problems. It seems that BlockArrays is going to stay with homogeneous array types so the code here should be modified to remove the dependency. All that is necessary is to create Matrix of AbstractMatrix elements for A and L.

Nosferican commented 5 years ago

Good point. It seems our dependencies might not use it, but it could be that some application uses MixedModels and BlockArrays.

Nosferican commented 5 years ago

I am leaning towards dispatching between LinearMixedModel and GeneralizedLinearMixedModel on whether there is a positional argument with a Distribution. That would be the distinction while it supports the explicit constructors as well. When using the explicit, the distribution positional argument will default to Normal() as it does currently.

dmbates commented 5 years ago

I just pushed a commit to the LMMasGLMM branch with this type of logic. There are warnings during precompilation

julia> using MixedModels
[ Info: Recompiling stale cache file /home/bates/.julia/compiled/v1.2/MixedModels/tBiYK.ji for MixedModels [ff71e718-51f3-5ec2-a782-8ffcbfa3c316]
WARNING: Method definition fit(Type{MixedModels.MixedModel{T} where T}, StatsModels.FormulaTerm{L, R} where R where L, Any) in module MixedModels at /home/bates/.julia/dev/MixedModels/src/mixed.jl:3 overwritten at /home/bates/.julia/dev/MixedModels/src/mixed.jl:3.
  ** incremental compilation may be fatally broken for this module **

WARNING: Method definition fit(Type{MixedModels.MixedModel{T} where T}, StatsModels.FormulaTerm{L, R} where R where L, Any, Distributions.Distribution{F, S} where S<:Distributions.ValueSupport where F<:Distributions.VariateForm) in module MixedModels at /home/bates/.julia/dev/MixedModels/src/mixed.jl:3 overwritten at /home/bates/.julia/dev/MixedModels/src/mixed.jl:3.
  ** incremental compilation may be fatally broken for this module **

WARNING: Method definition #fit(Any, typeof(StatsBase.fit), Type{MixedModels.MixedModel{T} where T}, StatsModels.FormulaTerm{L, R} where R where L, Any) in module MixedModels overwritten.
  ** incremental compilation may be fatally broken for this module **

WARNING: Method definition #fit(Any, typeof(StatsBase.fit), Type{MixedModels.MixedModel{T} where T}, StatsModels.FormulaTerm{L, R} where R where L, Any, Distributions.Distribution{F, S} where S<:Distributions.ValueSupport where F<:Distributions.VariateForm) in module MixedModels overwritten.
  ** incremental compilation may be fatally broken for this module **

which I haven't worked out. Does this seem workable to you?

One of the problems with the methods may be the untyped tbl argument. Would it work better if it was {<:Tables.Table} in the method signature?

Nosferican commented 5 years ago
StatsBase.fit(::Type{LinearMixedModel}, f::FormulaTerm, tbl; kwargs...) =
    fit(::Type{LinearMixedModel}, f::FormulaTerm, Tables.columntable(tbl), kwargs...)
StatsBase.fit(::Type{LinearMixedModel},
              f::FormulaTerm,
              tbl::Tables.ColumnTable;
              kwargs...)

I think we should always have the main method work with tbl::Tables.ColumnTable.

dmbates commented 5 years ago

At least with current versions of DataFrames and Tables

julia> DataFrame <: Tables.ColumnTable
false
Nosferican commented 5 years ago

Per current behavior when the functions call Tables.columntable,

julia> data = DataFrame(a = 1:3, b = 2:4) |> Tables.columntable |> typeof
NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Int64,1}}}

So for design would this work?

# LinearMixedModel
fit(MixedModel, f::FormulaTerm, tbl::Tables.ColumnTable;
    wt, contrasts, verbose, REML)
# GeneralizedLinearMixedModel
fit(MixedModel, f::FormulaTerm, tbl::Tables.ColumnTable, d::Distribution,
    l::AbstractLink = canonicallink(d);
    wt, contrasts, offset, verbose, fast, nAGQ)
# LinearMixedModel
fit(MixedModel, f::FormulaTerm, tbl::Tables.ColumnTable, d::Normal, l::IdenitityLink;
    wt, contrasts, verbose, REML,
    offset, fast, nAGQ # These would be ignored
    )

I am not sure whether being explicit about the ColumnTable has a big effect on performance or consistency. In Econometrics.jl, I kept it as any and just assume it complies with the Tables.jl API.

Nosferican commented 5 years ago

I tried doing a constructor with MixedModel, but had some issues with the {T} so I skipped that for now and worked on the fit(::Type{MixedModel},...).

dmbates commented 5 years ago

I pushed a couple of changes to the LMMasGLMM to fix a failing test and to re-export some of the contrasts coding functions from StatsModels.

Nosferican commented 5 years ago

Sorry for the delay, I have pushed some changes to LMMasGLMM based on what we talked.

Nosferican commented 5 years ago

Closed by #186

dmbates commented 5 years ago

Thanks for closing these, @Nosferican . I am ready to make a new release. I just cleaned up some of the dependencies in the main Project.toml file and dropped the Appveyor checks for 32-bit. Is it okay with you to make a new release? I have it tagged as v2.0.1. Do you think it should be v2.1.0 because the recommended way of fitting models has changed?

Nosferican commented 5 years ago

Sure. I think since it is a new feature I would go for a minor release rather a patch.

dmbates commented 5 years ago

Okay. Anything else you want to put in? I would like to push this out soon so I can have a bit of time if problems arise after it hits the repository. I will be offline starting this weekend for a few days.

Nosferican commented 5 years ago

Sounds good. You might want to check the Normal and non-canonical link combinations which seem to be broken in some cases, but that can be a patch release whenever.