JuliaStats / GLM.jl

Generalized linear models in Julia

Support for a categoricalArray as the dependent variable for GLM models #240

Open mwsohn opened 6 years ago

mwsohn commented 6 years ago

Currently, a categorical array on the left-hand side of a GLM model generates an error message:

MethodError: no method matching zero(::Type{CategoricalArrays.CategoricalValue{Int8,UInt32}})
Closest candidates are:
  zero(::Type{Base.LibGit2.GitHash}) at libgit2\oid.jl:106
  zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuildItem}) at pkg\resolve\versionweight.jl:82
  zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuild}) at pkg\resolve\versionweight.jl:118
  ...
#fit#14(::Array{Any,1}, ::Function, ::Type{GLM.GeneralizedLinearModel}, ::Array{Float64,2}, ::Array{CategoricalArrays.CategoricalValue{Int8,UInt32},1}, ::Distributions.Bernoulli{Float64}, ::GLM.LogitLink) at glmfit.jl:308
fit(::Type{GLM.GeneralizedLinearModel}, ::Array{Float64,2}, ::Array{CategoricalArrays.CategoricalValue{Int8,UInt32},1}, ::Distributions.Bernoulli{Float64}) at glmfit.jl:308
#fit#44(::Dict{Any,Any}, ::Array{Any,1}, ::Function, ::Type{GLM.GeneralizedLinearModel}, ::StatsModels.Formula, ::DataFrames.DataFrame, ::Distributions.Bernoulli{Float64}, ::Vararg{Any,N} where N) at statsmodel.jl:72
fit(::Type{GLM.GeneralizedLinearModel}, ::StatsModels.Formula, ::DataFrames.DataFrame, ::Distributions.Bernoulli{Float64}) at statsmodel.jl:66
#glm#15(::Array{Any,1}, ::Function, ::StatsModels.Formula, ::DataFrames.DataFrame, ::Distributions.Bernoulli{Float64}, ::Vararg{Any,N} where N) at glmfit.jl:311
glm(::StatsModels.Formula, ::DataFrames.DataFrame, ::Distributions.Bernoulli{Float64}, ::Vararg{Distributions.Bernoulli{Float64},N} where N) at glmfit.jl:311
include_string(::String, ::String) at loading.jl:522
include_string(::String, ::String, ::Int64) at eval.jl:30
include_string(::Module, ::String, ::String, ::Int64, ::Vararg{Int64,N} where N) at eval.jl:34
(::Atom.##102#107{String,Int64,String})() at eval.jl:82
withpath(::Atom.##102#107{String,Int64,String}, ::Void) at utils.jl:30
withpath(::Function, ::String) at eval.jl:38
hideprompt(::Atom.##101#106{String,Int64,String}) at repl.jl:67
macro expansion at eval.jl:80 [inlined]
(::Atom.##100#105{Dict{String,Any}})() at task.jl:80

Shouldn't there be support for categorical arrays in regressions that take binary dependent variables?

nalimilan commented 6 years ago

I think we've mentioned this somewhere before. The question is whether we should assume that the first level is the reference and that the second level means the event occurred. It might be clearer to require people to specify this explicitly. We should probably check whether other software follows a standard behavior in that situation.
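
To make the ambiguity concrete, here is a minimal sketch of that assumed convention (first level = reference, second level = event) using CategoricalArrays — this is not something GLM.jl implements, just an illustration of what the implicit recoding would do:

```julia
using CategoricalArrays

# A hypothetical binary categorical response
y = categorical(["no", "yes", "yes", "no"])

# Assumed convention: first level is the reference (0),
# second level means the event occurred (1).
event = levels(y)[2]          # "yes" (levels are sorted by default)
y01 = Int.(y .== event)       # [0, 1, 1, 0]
```

The open design question is whether GLM.jl should apply this silently or require the user to state which level is the event.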

dmbates commented 6 years ago

See https://github.com/JuliaStats/StatsModels.jl/pull/28

haberdashPI commented 3 years ago

In R, the default is based on the ordering of the factor levels: the first level is the reference. That seems reasonable, and it provides an obvious way to change the reference level. If I recall correctly, R also lets you customize the reference level as a parameter of the contrast coding scheme (I don't remember the details, as I haven't used that option recently; I usually just reorder my categorical levels when I need to change the reference). It seems like a reasonable design choice.
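
For reference, the reordering idiom mentioned above has a direct analogue in CategoricalArrays.jl via `levels!` (this only reorders the levels; GLM.jl would still need to interpret that ordering for the response):

```julia
using CategoricalArrays

y = categorical(["treated", "control", "treated"])
levels(y)                          # ["control", "treated"] (sorted by default)

# Make "treated" the first (reference) level, mirroring R's relevel():
levels!(y, ["treated", "control"])
levels(y)                          # ["treated", "control"]
```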

gaballench commented 3 years ago

Hello!

I am trying to use a binary categorical dependent variable with glm() through @formula(Y ~ X_1 + X_2 + ... + X_n) and am getting errors. First, I would like to cite the documentation (https://juliastats.org/GLM.jl/stable/manual/#Categorical-variables-1), which says that categorical variables are allowed, with examples showing their use as independent variables. The documentation also says that "The response (dependent) variable may not be categorical", and indeed it cannot be, as shown below:

# Y needs to be Int64

julia> minidata = DataFrame(X=[1,2,2], Y=[1,0,1]) # Y is Array{Int64, 1}
3×2 DataFrame
 Row │ X      Y
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      0
   3 │     2      1

julia> typeof(minidata.Y)
Vector{Int64} (alias for Array{Int64, 1})

julia> model = glm(@formula(Y ~ X), minidata, Binomial(), ProbitLink())
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Binomial{Float64}, ProbitLink}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}

Y ~ 1 + X

Coefficients:
────────────────────────────────────────────────────────────────────────
                Coef.  Std. Error      z  Pr(>|z|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept)   9.63839     293.909   0.03    0.9738   -566.414    585.69
X            -4.81919     146.957  -0.03    0.9738   -292.849    283.211
────────────────────────────────────────────────────────────────────────

# Y is CategoricalArray
julia> minidata = DataFrame(X=[1,2,2], Y=categorical([1,0,1]))
3×2 DataFrame
 Row │ X      Y
     │ Int64  Cat…
─────┼─────────────
   1 │     1  1
   2 │     2  0
   3 │     2  1

julia> typeof(minidata.Y)
CategoricalVector{Int64, UInt32, Int64, CategoricalValue{Int64, UInt32}, Union{}} (alias for CategoricalArray{Int64, 1, UInt32, Int64, CategoricalValue{Int64, UInt32}, Union{}})

julia> model = glm(@formula(Y ~ X), minidata, Binomial(), ProbitLink())
ERROR: MethodError: no method matching fit(::Type{GeneralizedLinearModel}, ::Matrix{Float64}, ::Matrix{Float64}, ::Binomial{Float64}, ::ProbitLink)
Closest candidates are:
  fit(!Matched::StatisticalModel, ::Any...) at /home/balleng/.julia/packages/StatsBase/DU1bT/src/statmodels.jl:178
  fit(::Type{M}, ::Union{Matrix{T}, SparseArrays.SparseMatrixCSC{T, Ti} where Ti<:Integer}, !Matched::AbstractVector{var"#s52"} where var"#s52"<:Real, ::UnivariateDistribution{S} where S<:ValueSupport, ::GLM.Link; dofit, wts, offset, fitargs...) where {M<:GLM.AbstractGLM, T<:AbstractFloat} at /home/balleng/.julia/packages/GLM/Hvwti/src/glmfit.jl:452
  fit(::Type{M}, ::Union{Matrix{T} where T, SparseArrays.SparseMatrixCSC}, !Matched::AbstractVector{T} where T, ::UnivariateDistribution{S} where S<:ValueSupport, ::GLM.Link; kwargs...) where M<:GLM.AbstractGLM at /home/balleng/.julia/packages/GLM/Hvwti/src/glmfit.jl:476
  ...
Stacktrace:
 [1] fit(::Type{GeneralizedLinearModel}, ::FormulaTerm{Term, Term}, ::DataFrame, ::Binomial{Float64}, ::Vararg{Any, N} where N; contrasts::Dict{Symbol, Any}, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ StatsModels ~/.julia/packages/StatsModels/MDeyQ/src/statsmodel.jl:88
 [2] fit(::Type{GeneralizedLinearModel}, ::FormulaTerm{Term, Term}, ::DataFrame, ::Binomial{Float64}, ::ProbitLink)
   @ StatsModels ~/.julia/packages/StatsModels/MDeyQ/src/statsmodel.jl:82
 [3] glm(::FormulaTerm{Term, Term}, ::DataFrame, ::Binomial{Float64}, ::Vararg{Any, N} where N; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ GLM ~/.julia/packages/GLM/Hvwti/src/glmfit.jl:485
 [4] glm(::FormulaTerm{Term, Term}, ::DataFrame, ::Binomial{Float64}, ::Vararg{Any, N} where N)
   @ GLM ~/.julia/packages/GLM/Hvwti/src/glmfit.jl:485
 [5] top-level scope
   @ none:1

Then, I found that Y also needs to take the values {0,1}:

# Y needs to take values {0,1}

julia> minidata = DataFrame(X=[1,2,2], Y=[1,0,1])
3×2 DataFrame
 Row │ X      Y
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      0
   3 │     2      1

# Y as dependent works because it takes {0,1}
julia> model = glm(@formula(Y ~ X), minidata, Binomial(), ProbitLink())
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Binomial{Float64}, ProbitLink}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}

Y ~ 1 + X

Coefficients:
────────────────────────────────────────────────────────────────────────
                Coef.  Std. Error      z  Pr(>|z|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept)   9.63839     293.909   0.03    0.9738   -566.414    585.69
X            -4.81919     146.957  -0.03    0.9738   -292.849    283.211
────────────────────────────────────────────────────────────────────────

# swapped X and Y because X takes the values {1,2} and fails as the dependent variable
julia> model = glm(@formula(X ~ Y), minidata, Binomial(), ProbitLink())
ERROR: ArgumentError: 2.0 in y is not in [0,1]
Stacktrace:
  [1] checky(y::Vector{Float64}, d::Binomial{Float64})
    @ GLM ~/.julia/packages/GLM/Hvwti/src/glmfit.jl:623
  [2] GLM.GlmResp(y::Vector{Float64}, d::Binomial{Float64}, l::ProbitLink, η::Vector{Float64}, μ::Vector{Float64}, off::Vector{Float64}, wts::Vector{Float64})
    @ GLM ~/.julia/packages/GLM/Hvwti/src/glmfit.jl:34
  [3] GLM.GlmResp(y::Vector{Float64}, d::Binomial{Float64}, l::ProbitLink, off::Vector{Float64}, wts::Vector{Float64})
    @ GLM ~/.julia/packages/GLM/Hvwti/src/glmfit.jl:55
  [4] GlmResp
    @ ~/.julia/packages/GLM/Hvwti/src/glmfit.jl:63 [inlined]
  [5] fit(::Type{GeneralizedLinearModel}, X::Matrix{Float64}, y::Vector{Int64}, d::Binomial{Float64}, l::ProbitLink; dofit::Bool, wts::Vector{Int64}, offset::Vector{Int64}, fitargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ GLM ~/.julia/packages/GLM/Hvwti/src/glmfit.jl:467
  [6] fit(::Type{GeneralizedLinearModel}, X::Matrix{Float64}, y::Vector{Int64}, d::Binomial{Float64}, l::ProbitLink)
    @ GLM ~/.julia/packages/GLM/Hvwti/src/glmfit.jl:463
  [7] fit(::Type{GeneralizedLinearModel}, ::FormulaTerm{Term, Term}, ::DataFrame, ::Binomial{Float64}, ::Vararg{Any, N} where N; contrasts::Dict{Symbol, Any}, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ StatsModels ~/.julia/packages/StatsModels/MDeyQ/src/statsmodel.jl:88
  [8] fit(::Type{GeneralizedLinearModel}, ::FormulaTerm{Term, Term}, ::DataFrame, ::Binomial{Float64}, ::ProbitLink)
    @ StatsModels ~/.julia/packages/StatsModels/MDeyQ/src/statsmodel.jl:82
  [9] glm(::FormulaTerm{Term, Term}, ::DataFrame, ::Binomial{Float64}, ::Vararg{Any, N} where N; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ GLM ~/.julia/packages/GLM/Hvwti/src/glmfit.jl:485
 [10] glm(::FormulaTerm{Term, Term}, ::DataFrame, ::Binomial{Float64}, ::Vararg{Any, N} where N)
    @ GLM ~/.julia/packages/GLM/Hvwti/src/glmfit.jl:485
 [11] top-level scope
    @ none:1

It seems that in order to run, e.g., a logistic regression, we need Y as a Vector{Int64} (or another Real vector) coded with 0/1 only. This could probably be mentioned in the documentation describing categorical variables, perhaps with examples of how to handle a categorical Y (unless I'm missing something), or maybe it's worth figuring out whether supporting Y as a CategoricalArray in @formula(Y ~ .) can be implemented safely.
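
In the meantime, a workaround sketch that recodes the categorical response to 0/1 before fitting (using `levelcode` from CategoricalArrays; the `Y01` column name is just for illustration, and this assumes exactly two levels, with the second level counted as the "event"):

```julia
using DataFrames, CategoricalArrays, GLM

minidata = DataFrame(X = [1, 2, 2], Y = categorical([1, 0, 1]))

# levelcode is 1-based (1 for the first level, 2 for the second),
# so subtracting 1 yields the 0/1 coding that Binomial() expects.
minidata.Y01 = levelcode.(minidata.Y) .- 1   # [1, 0, 1]

model = glm(@formula(Y01 ~ X), minidata, Binomial(), ProbitLink())
```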

Thanks,

Gustavo