JuliaStats / GLM.jl

Generalized linear models in Julia

Errors with >=30 variable logistic regression #212

Closed · dsweber2 closed this issue 6 years ago

dsweber2 commented 6 years ago

I initially ran into the problem on a more involved example, but here's an MWE (minimal working example):

using GLM, DataFrames
m = 60
vlist = [Symbol("x$(i)") for i=1:m]             # column names x1, ..., x60
fml = Formula(:Y, Expr(:call, :+, vlist...))    # Y ~ x1 + x2 + ... + x60
X = [randn(40,m-1) zeros(40,1); randn(40,m)]    # last column is zero in the first 40 rows
y = cat(1, ones(Int,40), zeros(Int,40))         # Julia 0.6 syntax for vcat
X = DataFrame(X)
X[:Y] = y
pred = glm(fml, X, Binomial(), LogitLink())

The same code with m=20 works just fine, but when I increase it to m=40 (and give it maxIter=100), I get an isfinite(crit) assertion failure. When I increase the number of variables further to m=60, it gives a PosDefException instead.

Incidentally, if someone has a more elegant way of making a formula of arbitrary length, I'd be glad to see it.
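
For reference, more recent StatsModels/GLM releases let you build a formula of arbitrary length programmatically from Term objects instead of splicing a raw Expr. A rough sketch, assuming a current StatsModels (which exports term and Term) rather than the Julia 0.6 packages used above:

using DataFrames, GLM, StatsModels

vlist = [Symbol("x$(i)") for i = 1:60]
fml = Term(:Y) ~ sum(term.(vlist))               # Y ~ x1 + x2 + ... + x60
# pred = glm(fml, X, Binomial(), LogitLink())    # same call as above, X being the DataFrame from the MWE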

dmbates commented 6 years ago

First, there is no need to use a formula in a generated example like this. Just leave X as a matrix and y as a vector, then call

Main> fit(GeneralizedLinearModel, hcat(ones(size(X, 1)), X), y, Bernoulli())
ERROR: failure to converge after 30 iterations.
Stacktrace:
 [1] _fit!(::GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Bernoulli{Float64},GLM.LogitLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}}, ::Bool, ::Int64, ::Float64, ::Float64, ::Void) at /home/bates/.julia/v0.6/GLM/src/glmfit.jl:214
 [2] fit!(::GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Bernoulli{Float64},GLM.LogitLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}}) at /home/bates/.julia/v0.6/GLM/src/glmfit.jl:220
 [3] #fit#15(::Bool, ::Array{Float64,1}, ::Array{Float64,1}, ::Array{Any,1}, ::Function, ::Type{GLM.GeneralizedLinearModel}, ::Array{Float64,2}, ::Array{Float64,1}, ::Distributions.Bernoulli{Float64}, ::GLM.LogitLink) at /home/bates/.julia/v0.6/GLM/src/glmfit.jl:278
 [4] #fit#16(::Array{Any,1}, ::Function, ::Type{GLM.GeneralizedLinearModel}, ::Array{Float64,2}, ::Array{Int64,1}, ::Distributions.Bernoulli{Float64}, ::GLM.LogitLink) at /home/bates/.julia/v0.6/GLM/src/glmfit.jl:285
 [5] fit(::Type{GLM.GeneralizedLinearModel}, ::Array{Float64,2}, ::Array{Int64,1}, ::Distributions.Bernoulli{Float64}) at /home/bates/.julia/v0.6/GLM/src/glmfit.jl:285
 [6] eval(::Module, ::Any) at ./boot.jl:235
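
For completeness, here is that matrix-based call written out with the simulated data from the original post and an explicit intercept column; a sketch under the same package versions as above (and, as the trace shows, it still fails to converge on this data):

using GLM

m = 60
X = [randn(40, m-1) zeros(40, 1); randn(40, m)]   # same design as the MWE, kept as a plain matrix
y = vcat(ones(40), zeros(40))                     # 0/1 response as a vector
# Prepend a column of ones so the intercept is estimated explicitly.
mdl = fit(GeneralizedLinearModel, hcat(ones(size(X, 1)), X), y, Bernoulli())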

Second, what do you expect to happen when the parameter vector is this large? You have exactly 80 bits of information in the response, y, and you expect to fit 61 coefficients on a continuous scale. It is unlikely that such a model can produce reliable parameter estimates.
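
Put concretely, for the m = 60 example above:

n = 40 + 40    # binary observations, one bit of information each
p = 60 + 1     # one coefficient per predictor plus an intercept
n / p          # roughly 1.3 observations per estimated coefficient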

dsweber2 commented 6 years ago

The response is 1 whenever the last variable is zero, so I was expecting it to find that without difficulty. I'm surprised that even if I make the data linearly separable, I get the same errors. However, given that the ratio of data to parameters in my actual problem is still skewed in favor of the data, I'm fine with closing this.
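
As an aside on the separability point: when the classes are completely separated, the logistic maximum-likelihood estimate does not exist, so the coefficients grow without bound during the IRLS iterations, and that divergence is what can surface as non-finite convergence criteria or Cholesky failures. A minimal sketch with hypothetical toy data (modern @formula syntax, not the Julia 0.6 code above):

using GLM, DataFrames

toy = DataFrame(x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0],
                y = [0, 0, 0, 1, 1, 1])    # y == 1 exactly when x > 0: complete separation
# Depending on the GLM.jl version this either fails to converge or reports an
# enormous slope with a huge standard error; either way no finite MLE exists.
mdl = glm(@formula(y ~ x), toy, Binomial(), LogitLink())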

nalimilan commented 6 years ago

Do you know of other software which handles this better? Just curious.