JuliaStats / GLM.jl

Generalized linear models in Julia

predict() gives wrong results when dropcollinear = true #431

Closed · githubtomtom closed this issue 3 years ago

githubtomtom commented 3 years ago

The following example shows that predict() gives wrong results when dropcollinear = true:

  1. :confidence and :prediction return the same intervals
  2. the lower and upper bounds are unreasonably wide

using GLM, Random

Random.seed!(1)
X = randn(10, 3);
X[:, 2] .= X[:, 1] .- X[:, 3];  # column 2 is an exact linear combination of columns 1 and 3
y = randn(10);

julia> mod1 = GLM.lm(X, y; dropcollinear = false)
GLM.LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}:

Coefficients:
─────────────────────────────────────────────────────────────────
        Coef.  Std. Error      t  Pr(>|t|)   Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────
x1   1.22558    3.65892e7   0.00    1.0000  -8.65198e7  8.65198e7
x2  -0.982437   3.65892e7  -0.00    1.0000  -8.65198e7  8.65198e7
x3  -0.5        3.65892e7  -0.00    1.0000  -8.65198e7  8.65198e7
─────────────────────────────────────────────────────────────────

julia> mod2 = GLM.lm(X, y; dropcollinear = true)
GLM.LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}:

Coefficients:
─────────────────────────────────────────────────────────────────
       Coef.  Std. Error       t  Pr(>|t|)   Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────
x1  0.0       NaN         NaN       NaN     NaN         NaN
x2  0.243147    0.302187    0.80    0.4443   -0.453697    0.93999
x3  0.725583    0.439743    1.65    0.1375   -0.288465    1.73963
─────────────────────────────────────────────────────────────────

julia> GLM.predict(mod1, X[1:1, :], interval = :confidence)
(prediction = [0.13054650028641046], lower = [-0.10362885085068049], upper = [0.36472185142350144])
julia> GLM.predict(mod1, X[1:1, :], interval = :prediction)
(prediction = [0.13054650028641046], lower = [-2.4585555782536694], upper = [2.71964857882649])

julia> GLM.predict(mod2, X[1:1, :], interval = :confidence)
(prediction = [0.1305465002864105], lower = [-3.198245410614131e14], upper = [3.198245410614134e14])
julia> GLM.predict(mod2, X[1:1, :], interval = :prediction)
(prediction = [0.1305465002864105], lower = [-3.198245410614131e14], upper = [3.198245410614134e14])
ThuleanInsight commented 3 years ago

To reinforce this issue and add some detail, here is a worked example using the Cars93 data.

using RDatasets, GLM

data = dataset("MASS", "Cars93")
lrm1 = lm(@formula(Weight ~ 1), data)
lrm2 = lm(@formula(Weight ~ 1 + Horsepower), data)
lrm3 = lm(@formula(Weight ~ 1 + Horsepower + Wheelbase), data)
lrm4 = lm(@formula(Weight ~ 1 + Horsepower + Wheelbase + RPM), data)
lrm5 = lm(@formula(Weight ~ 1 + Horsepower + Wheelbase + RPM + MPGhighway), data)

pred = predict(lrm1, data, interval = :confidence)  # does not work for the intercept-only model, interestingly; arguably reasonable

pred = predict(lrm2, data, interval = :confidence)
#=
 Row │ prediction  lower     upper    
     │ Float64?    Float64?  Float64? 
─────┼────────────────────────────────
   1 │    3041.05   2958.49   3123.61
=#

pred = predict(lrm3, data, interval = :confidence)
#=
  Row │ prediction  lower            upper         
     │ Float64?    Float64?         Float64?      
─────┼────────────────────────────────────────────
   1 │    2942.12   -90050.4        95934.7
=#

pred = predict(lrm4, data, interval = :confidence)
#=
 Row │ prediction  lower     upper    
     │ Float64?    Float64?  Float64? 
─────┼────────────────────────────────
   1 │    2770.86   2692.9    2848.83
=#

pred = predict(lrm5, data, interval = :confidence)
#=
 Row │ prediction  lower       upper     
     │ Float64?    Float64?    Float64?  
─────┼───────────────────────────────────
   1 │    2785.23  -5.61804e6  5.62361e6
=#

I tried reproducing the predict() functionality line by line, and the issue seems to be the Cholesky decomposition. I calculated it manually using the most naive method, and that worked fine.

using LinearAlgebra, Distributions

newx = hcat(ones(size(data, 1)), data.Horsepower, data.Wheelbase)  # training design matrix
residvar = ones(size(newx, 2), 1) * (deviance(lrm3) / dof_residual(lrm3))  # 3×1 matrix filled with σ̂²
rr = (newx / cholesky(newx'newx).U).^2  # row-wise x'(X'X)⁻¹x via the unpivoted Cholesky
conf = rr * residvar
retconf = quantile(TDist(dof_residual(lrm3)), (1. - .95) / 2) * sqrt.(conf)  # negative t-quantile

newx * coef(lrm3) .+ retconf # lower bound; first value: 2899.5565296155874
newx * coef(lrm3) .- retconf # upper bound; first value: 2984.692350159791
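To make that naive check reusable, it can be wrapped up like this (a sketch of the textbook formula ŷ ± t · sqrt(σ̂² · x'(X'X)⁻¹x), with X the training design matrix; naive_confint is my own name, not GLM API):

using GLM, LinearAlgebra, Distributions

function naive_confint(model, newx; level = 0.95)
    X = modelmatrix(model)                        # training design matrix
    σ² = deviance(model) / dof_residual(model)    # residual variance estimate
    R = cholesky(Symmetric(X'X)).U                # unpivoted Cholesky: X'X = R'R
    # row-wise x'(X'X)⁻¹x = ‖x'R⁻¹‖², without forming an explicit inverse
    se = sqrt.(σ² .* vec(sum(abs2, newx / R, dims = 2)))
    t = quantile(TDist(dof_residual(model)), 1 - (1 - level) / 2)
    ŷ = newx * coef(model)
    (prediction = ŷ, lower = ŷ .- t .* se, upper = ŷ .+ t .* se)
end

naive_confint(lrm3, newx).lower[1]  # ≈ 2899.56, matching dropcollinear = false below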

This is exactly the same result I get when refitting lrm3 with dropcollinear = false.

lrm3 = lm(@formula(Weight ~ 1 + Horsepower + Wheelbase), data, dropcollinear = false)
predict(lrm3, data, interval = :confidence)

#=
93×3 DataFrame
 Row │ prediction  lower     upper    
     │ Float64?    Float64?  Float64? 
─────┼────────────────────────────────
   1 │    2942.12   2899.56   2984.69
=#

I'm not sure precisely why the Cholesky decomposition extracted by this code in the predict() function

chol = cholesky!(mm.pp)
# get the R matrix from the QR factorization
R = chol isa CholeskyPivoted ? chol.U[chol.p, chol.p] : chol.U

gives such a wrong answer, but the matrix it produces is completely different from the one calculated the naive way -- and only in those examples where the confidence intervals fail to match reasonable expectations.

nalimilan commented 3 years ago

This looks like a duplicate of https://github.com/JuliaStats/GLM.jl/issues/426 to me. Estimated coefficients are different when dropcollinear=true. AFAICT the fact that prediction intervals are different is a consequence of this. So the root of the issue is that the problem is ill-conditioned and pivoted Cholesky isn't very good at handling that.

ThuleanInsight commented 3 years ago

Edit: I should clarify that githubtomtom's example does look like #426, but the supporting example I provided does not show the same behavior in the coefficient estimates at all -- it could be a distinct issue or a related one.

There is definitely something interesting going on in #426, but from what I can tell this is a different issue. If by 'estimated coefficients' you mean the linear coefficients of the features, then a different causal hypothesis may be needed to track this problem down. An addition to my example above:

lrm3 = lm(@formula(Weight ~ 1 + Horsepower + Wheelbase), data)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

Weight ~ 1 + Horsepower + Wheelbase

Coefficients:
────────────────────────────────────────────────────────────────────────────────
                   Coef.  Std. Error       t  Pr(>|t|)    Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────────
(Intercept)  -3630.58      334.259    -10.86    <1e-17  -4294.64     -2966.51
Horsepower       4.63989     0.45144   10.28    <1e-16      3.74303      5.53676
Wheelbase       58.0698      3.46701   16.75    <1e-28     51.1819      64.9576
────────────────────────────────────────────────────────────────────────────────
lrm3_2 = lm(@formula(Weight ~ 1 + Horsepower + Wheelbase), data, dropcollinear = false)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}

Weight ~ 1 + Horsepower + Wheelbase

Coefficients:
────────────────────────────────────────────────────────────────────────────────
                   Coef.  Std. Error       t  Pr(>|t|)    Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────────
(Intercept)  -3630.58      334.259    -10.86    <1e-17  -4294.64     -2966.51
Horsepower       4.63989     0.45144   10.28    <1e-16      3.74303      5.53676
Wheelbase       58.0698      3.46701   16.75    <1e-28     51.1819      64.9576
────────────────────────────────────────────────────────────────────────────────

As far as I can tell, the two fits are extremely close. The printed tables are identical, with exactly the same values in every field.

coef(lrm3) - coef(lrm3_2)
3-element Vector{Float64}:
 -4.3155523599125445e-10
 -2.1405099914773018e-13
  4.433786671143025e-12

It does not seem like such a difference would even be easily detectable.

And just as a sanity check:

predict(lrm3) - predict(lrm3_2)
93-element Vector{Float64}:
 -9.549694368615746e-12
  3.501554601825774e-11
 -1.6370904631912708e-11
  1.8189894035458565e-12
  ⋮
 -3.956301952712238e-11
  5.002220859751105e-12
 -1.8189894035458565e-12

(extremely close again)

And we still have this discrepancy in the confidence interval calculations:

 predict(lrm3,data,interval=:confidence)
 93×3 DataFrame
 Row │ prediction  lower            upper         
     │ Float64?    Float64?         Float64?      
─────┼────────────────────────────────────────────
   1 │    2942.12   -90050.4        95934.7
   2 │    3975.42       -1.28865e5      1.36816e5
  ⋮  │     ⋮              ⋮               ⋮
  93 │    3246.25  -108341.0            1.14833e5
predict(lrm3_2,data,interval=:confidence)
  93×3 DataFrame
 Row │ prediction  lower     upper    
     │ Float64?    Float64?  Float64? 
─────┼────────────────────────────────
   1 │    2942.12   2899.56   2984.69
   2 │    3975.42   3896.27   4054.58
  ⋮  │     ⋮          ⋮         ⋮
  93 │    3246.25   3201.14   3291.36

If the linear coefficients of the features changed slightly, e.g. from fitting to a different sub-sample of the data, then I would expect a correspondingly slight change in the confidence intervals. Or if some fundamental numerical instability were making my coefficients drastically different (with drastically different p-values), I might expect nonsense confidence intervals to go along with it.

What I would not expect at all is for tiny, tiny changes in the linear coefficients to produce dramatic changes in the confidence intervals. The problem can be traced precisely to lines 250-252 of lm.jl (on the current master), where the Cholesky decomposition is extracted in a way I found difficult to follow. If I reproduce the code in the predict() function line by line (as I did above) and replace that Cholesky extraction with the naive calculation above, I get exactly the same answer as with the dropcollinear = false option.
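For anyone digging further, one way to see which fits go through the pivoted path is to poke at the fitted model's internals (these are private fields in GLM's current layout, so this may break across versions):

# Not public API: model.pp holds the predictor state and pp.chol the
# factorization that predict() starts from, in GLM's internal layout today.
julia> lrm3.model.pp.chol isa LinearAlgebra.CholeskyPivoted
true

julia> lrm3.model.pp.chol.p  # the pivot permutation that predict() reindexes with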

I regret I do not have the time to research what the issue is to its root, but I hope this is at least helpful.

A pairwise correlation plot of these particular features, just in case it is informative about collinearity:

[pairwise correlation plot omitted]

nalimilan commented 3 years ago

Thanks. I've found the problem. The previous fix https://github.com/JuliaStats/GLM.jl/pull/410, which introduced these lines, incorrectly took the permutation instead of the inverse permutation. That works in simple cases where the two are the same, so we didn't catch it immediately. See https://github.com/JuliaStats/GLM.jl/pull/436. It fixes your example, and it makes the results for the original example closer to dropcollinear = false (they are not completely identical, but given the size of the confidence intervals around the coefficients I think that's OK).
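For anyone landing here later, here is a minimal sketch of the permutation pitfall using plain LinearAlgebra (RowMaximum() requires Julia ≥ 1.8; this illustrates the identity involved, it is not the literal patch in #436):

using LinearAlgebra, Random

Random.seed!(1)
X = randn(10, 3)
A = Symmetric(X'X)               # a small positive definite matrix

C = cholesky(A, RowMaximum())    # pivoted Cholesky: A[C.p, C.p] ≈ C.U' * C.U
p = C.p

# Undoing the pivoting needs the *inverse* permutation on the columns of U:
R_good = C.U[:, invperm(p)]
R_good' * R_good ≈ A             # true

# Indexing with p itself reconstructs A[p[p], p[p]] rather than A, so it only
# agrees with A when p happens to be its own inverse (as in simple cases):
R_bad = C.U[p, p]
R_bad' * R_bad ≈ A[p[p], p[p]]   # true; but A[p[p], p[p]] != A in general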