JuliaStats / Lasso.jl

Lasso/Elastic Net linear and generalized linear models
Other
143 stars 31 forks source link

The first time selectmodel is called it returns an invalid model, the second time fixes it #35

Closed oxinabox closed 4 years ago

oxinabox commented 4 years ago

A researcher came to me with a weird problem. It seems that the first model they get from selectmodel is invalid -- calling show on ti makes it throw an error. but if they call selectmodel again not only does the new model the get back show correctly but the first one starts working right too!

Setup

using Lasso

X = rand(9,8)
y = rand(9)

# this is to run this step: https://github.com/JuliaStats/Lasso.jl/blob/9d099f512afb6042b1094714e0ab280cf10f3863/src/segselect.jl#L276
#   I set λ to a single variable to simplify things
path = fit(LassoPath, X,y; λ=[0.005])

demo of errors

julia> # select model to replicate this step: https://github.com/JuliaStats/Lasso.jl/blob/9d099f512afb6042b1094714e0ab280cf10f3863/src/segselect.jl#L278
       m1 = selectmodel(path, MinAICc());

julia> show(m1)
ERROR: ArgumentError: FDist: the condition ν1 > zero(ν1) && ν2 > zero(ν2) is not satisfied.
Stacktrace:
 [1] macro expansion at /Users/oxinabox/.julia/packages/Distributions/wRw5p/src/utils.jl:6 [inlined]
 [2] Type at /Users/oxinabox/.julia/packages/Distributions/wRw5p/src/univariate/continuous/fdist.jl:30 [inlined]
 [3] Type at /Users/oxinabox/.julia/packages/Distributions/wRw5p/src/univariate/continuous/fdist.jl:35 [inlined]
 [4] Type at /Users/oxinabox/.julia/packages/Distributions/wRw5p/src/univariate/continuous/fdist.jl:37 [inlined]
 [5] #coeftable#1(::Float64, ::typeof(coeftable), ::LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredQR{Float64}}) at /Users/oxinabox/.julia/packages/GLM/ci8Mp/src/lm.jl:186
 [6] coeftable at /Users/oxinabox/.julia/packages/GLM/ci8Mp/src/lm.jl:183 [inlined]
 [7] show(::Base.TTY, ::LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredQR{Float64}}) at /Users/oxinabox/.julia/packages/GLM/ci8Mp/src/linpred.jl:227
 [8] show(::LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredQR{Float64}}) at ./show.jl:325
 [9] top-level scope at REPL[51]:1

julia> m2 = selectmodel(path, MinAICc());

julia> show(m2)
LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredQR{Float64}}:

Coefficients:
──────────────────────────────────────────────────────────────────────
      Estimate  Std. Error     t value  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────
x1  -0.110053    10.0199    -0.0109834    0.9913  -20.0845    19.8644
x2  -0.116832     1.45942   -0.0800535    0.9364   -3.02614    2.79248
x3   0.478205     0.864751   0.552997     0.5820   -1.24565    2.20205
x4  -0.0911054    1.31273   -0.0694015    0.9449   -2.70799    2.52578
x5   0.0          1.72254    0.0          1.0000   -3.43382    3.43382
x6   0.285712     1.76747    0.161651     0.8720   -3.23767    3.80909
x7   0.114095     0.68929    0.165525     0.8690   -1.25998    1.48817
x8   0.380283     0.732736   0.51899      0.6054   -1.0804     1.84097
x9  -0.164557     1.53326   -0.107325     0.9148   -3.22106    2.89195
──────────────────────────────────────────────────────────────────────

julia> show(m1)
LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredQR{Float64}}:

Coefficients:
──────────────────────────────────────────────────────────────────────
      Estimate  Std. Error     t value  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────
x1  -0.110053    10.0199    -0.0109834    0.9913  -20.0845    19.8644
x2  -0.116832     1.45942   -0.0800535    0.9364   -3.02614    2.79248
x3   0.478205     0.864751   0.552997     0.5820   -1.24565    2.20205
x4  -0.0911054    1.31273   -0.0694015    0.9449   -2.70799    2.52578
x5   0.0          1.72254    0.0          1.0000   -3.43382    3.43382
x6   0.285712     1.76747    0.161651     0.8720   -3.23767    3.80909
x7   0.114095     0.68929    0.165525     0.8690   -1.25998    1.48817
x8   0.380283     0.732736   0.51899      0.6054   -1.0804     1.84097
x9  -0.164557     1.53326   -0.107325     0.9148   -3.22106    2.89195
──────────────────────────────────────────────────────────────────────

I have verified this with selectmodel(path, MinBIC()) and selectmodel(path, MinCVmse(path, 5)) also.

Here is the line that is erroring in GLM.jl GLM.jl: https://github.com/JuliaStats/GLM.jl/blob/master/src/lm.jl#L186

but I think this is not GLMs fault, I think it is right to be erroring?

oxinabox commented 4 years ago

Note that this only happens if number of features >= number observations -1 see

using Lasso

X = rand(20,8)
y = rand(20)

# this is to run this step: https://github.com/JuliaStats/Lasso.jl/blob/9d099f512afb6042b1094714e0ab280cf10f3863/src/segselect.jl#L276
#   I set λ to a single variable to simplify things
path = fit(LassoPath, X,y; λ=[0.005])

# select model to replicate this step: https://github.com/JuliaStats/Lasso.jl/blob/9d099f512afb6042b1094714e0ab280cf10f3863/src/segselect.jl#L278
m1 = selectmodel(path, MinAICc());
show(m1)
m2 = selectmodel(path, MinAICc());
show(m2)
show(m1)

works completely fine

AsafManela commented 4 years ago

Calculating standard errors for a lasso selected model are not a resolved problem AFAIK. I think show fails the first time because the standard errors are all 0. You can check by running stderror(m1). That it works on subsequent tries is a bug, that can be fixed by deep copying m.rr used in https://github.com/JuliaStats/Lasso.jl/blob/9d099f512afb6042b1094714e0ab280cf10f3863/src/segselect.jl#L169 instead of changing it with each call. Is there a way to make show display something sensible without taking a stand on standard errors?