JuliaStats / GLMNet.jl

Julia wrapper for fitting Lasso/ElasticNet GLM models using glmnet

Logistic regression fails if y is a vector of strings #62

Open biona001 opened 2 years ago

biona001 commented 2 years ago

From the README:

For logistic models, y is either a string vector or a m x 2 matrix

But the following doesn't work:

using GLMNet
y = ["M", "B", "M", "B"]
X = rand(4, 10)
glmnet(X, y, Binomial())

MethodError: no method matching glmnet(::Matrix{Float64}, ::Vector{String}, ::Binomial{Float64})
Closest candidates are:
  glmnet(::AbstractMatrix{T} where T, ::AbstractVector{T} where T, ::AbstractVector{T} where T) at /home/users/bbchu/.julia/packages/GLMNet/C8WKF/src/CoxNet.jl:151
  glmnet(::AbstractMatrix{T} where T, ::AbstractVector{T} where T, ::AbstractVector{T} where T, ::CoxPH; kw...) at /home/users/bbchu/.julia/packages/GLMNet/C8WKF/src/CoxNet.jl:151
  glmnet(::Matrix{Float64}, ::Vector{Float64}, ::Distribution; kw...) at /home/users/bbchu/.julia/packages/GLMNet/C8WKF/src/GLMNet.jl:485
  ...

Fortunately, if y is a matrix with 2 columns, it does work:

y = [1 0; 0 1; 0 1; 1 0]
X = rand(4, 10)
glmnet(X, y, Binomial())

Logistic GLMNet Solution Path (100 solutions for 10 predictors in 833 passes):
────────────────────────────────
       df    pct_dev           λ
────────────────────────────────
  [1]   0  0.0        0.476672
  [2]   1  0.0582906  0.455006
  [3]   1  0.11166    0.434325
  [4]   1  0.160737   0.414585
  [5]   1  0.206039   0.395741
  [6]   1  0.248      0.377754
  [7]   1  0.286986   0.360585
  ...
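
As a stopgap for the matrix form above, a string vector can be converted by hand into an m x 2 indicator matrix. This is a minimal sketch in plain Julia: `indicator_matrix` is a hypothetical helper, not part of GLMNet.jl, and which column glmnet treats as the positive class should be checked against the README before relying on it.

```julia
# Hypothetical helper (not part of GLMNet.jl): turn a two-level label vector
# into an m x 2 indicator matrix, one column per level in sorted order.
function indicator_matrix(labels::AbstractVector)
    levels = sort(unique(labels))
    length(levels) == 2 ||
        throw(ArgumentError("expected exactly 2 distinct labels, got $(length(levels))"))
    # Entry (i, j) is 1.0 where labels[i] equals levels[j], otherwise 0.0
    Y = Float64[label == level for label in labels, level in levels]
    return Y, levels
end

y = ["M", "B", "M", "B"]
Y, levels = indicator_matrix(y)
# levels is ["B", "M"], so column 1 marks "B" and column 2 marks "M"
```

The resulting Y could then presumably be passed as `glmnet(X, Y, Binomial())`, in the same way as the hand-written matrix above.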

JackDunnNZ commented 2 years ago

It looks like the method that supports the string-vector input is this one:

https://github.com/JuliaStats/GLMNet.jl/blob/8eff4c4f07374c6f6f7878b16dc02e90d444e9a1/src/Multinomial.jl#L191-L203

So this works:

using GLMNet
y = ["M", "B", "M", "B"]
X = rand(4, 10)
glmnet(X, y)

It doesn't need a distribution because it chooses between Binomial and Multinomial based on the number of unique values in y. This method could probably be extended to accept a distribution, throwing an error if the distribution and y are incompatible.
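
The selection and compatibility check described here could look roughly like the sketch below. It is not GLMNet.jl's actual code: `choose_family` is a hypothetical name, the families are stood in for by the symbols `:binomial` and `:multinomial` so the example has no dependencies, and treating every mismatch as an error is just one possible policy.

```julia
# Hypothetical sketch, not GLMNet.jl's implementation: infer the family from
# the number of distinct labels, and reject a conflicting explicit request.
function choose_family(y::AbstractVector{<:AbstractString},
                       requested::Union{Symbol,Nothing}=nothing)
    nlevels = length(unique(y))
    nlevels >= 2 ||
        throw(ArgumentError("y must contain at least 2 distinct labels"))
    # Two labels means a binomial model; more than two means multinomial
    inferred = nlevels == 2 ? :binomial : :multinomial
    if requested !== nothing && requested != inferred
        throw(ArgumentError(
            "requested family $requested is incompatible with $nlevels distinct labels in y"))
    end
    return inferred
end

choose_family(["M", "B", "M", "B"])             # infers :binomial
choose_family(["a", "b", "c"])                  # infers :multinomial
choose_family(["M", "B", "M", "B"], :binomial)  # explicit request agrees
```

With a check like this, a call such as `glmnet(X, y, Binomial())` on a two-label string vector could be accepted, while a mismatched distribution would fail with a clear message rather than a MethodError.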

At the very least, the README should be updated.