Closed githubtomtom closed 3 years ago
To reinforce this issue and add some detail, the following example, worked out with the Cars93 data from MASS, illustrates the problem.
using RDatasets, GLM
data = dataset("MASS", "Cars93")
lrm1 = lm(@formula(Weight ~ 1), data)
lrm2 = lm(@formula(Weight ~ 1 + Horsepower), data)
lrm3 = lm(@formula(Weight ~ 1 + Horsepower + Wheelbase), data)
lrm4 = lm(@formula(Weight ~ 1 + Horsepower + Wheelbase + RPM), data)
lrm5 = lm(@formula(Weight ~ 1 + Horsepower + Wheelbase + RPM + MPGhighway), data)
pred = predict(lrm1, data, interval=:confidence) # does not work for the intercept-only model, interestingly; arguably reasonable
pred = predict(lrm2, data, interval=:confidence)
#=
Row │ prediction lower upper
│ Float64? Float64? Float64?
─────┼────────────────────────────────
1 │ 3041.05 2958.49 3123.61
=#
pred=predict(lrm3, data,interval=:confidence)
#=
Row │ prediction lower upper
│ Float64? Float64? Float64?
─────┼────────────────────────────────────────────
1 │ 2942.12 -90050.4 95934.7
=#
pred=predict(lrm4, data,interval=:confidence)
#=
Row │ prediction lower upper
│ Float64? Float64? Float64?
─────┼────────────────────────────────
1 │ 2770.86 2692.9 2848.83
=#
pred=predict(lrm5, data,interval=:confidence)
#=
Row │ prediction lower upper
│ Float64? Float64? Float64?
─────┼───────────────────────────────────
1 │ 2785.23 -5.61804e6 5.62361e6
=#
I tried reproducing the predict() functionality line by line, and the issue seems to be the Cholesky decomposition. I calculated it manually using the most naive method, and that worked fine.
using Distributions, LinearAlgebra # for TDist, quantile, and cholesky
newx = hcat(ones(length(data[:, 1])), data.Horsepower, data.Wheelbase)
# residual variance, replicated once per column of newx
residvar = ones(size(newx, 2), 1) * deviance(lrm3) / dof_residual(lrm3)
# squared entries of newx * inv(U), with U the upper factor of an unpivoted Cholesky
rr = (newx / cholesky(newx'newx).U) .^ 2
conf = rr * residvar
retconf = quantile(TDist(dof_residual(lrm3)), (1 - 0.95) / 2) * sqrt.(conf)
newx * coef(lrm3) .+ retconf # first value: 2899.5565296155874 (retconf is negative, so this is the lower bound)
newx * coef(lrm3) .- retconf # first value: 2984.692350159791
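The naive calculation above computes each row's leverage x_i' (X'X)^{-1} x_i via the upper Cholesky factor of X'X. As a cross-check of the same algebra, here is a pure-Python sketch (toy data and helper names are mine, not from GLM.jl) for a one-predictor model, where the Cholesky route must reproduce the textbook closed form 1/n + (x_i - x̄)² / Sxx:

```python
import math

def chol2(a, b, c):
    """Upper Cholesky factor U of the 2x2 SPD matrix [[a, b], [b, c]]."""
    u11 = math.sqrt(a)
    u12 = b / u11
    u22 = math.sqrt(c - u12 * u12)
    return u11, u12, u22

def leverage_via_cholesky(xs):
    """Leverages for X = [1 x], computed like rr above: row sums of (X inv(U)).^2."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(v * v for v in xs)
    u11, u12, u22 = chol2(n, sx, sxx)  # X'X = [[n, sx], [sx, sxx]]
    out = []
    for v in xs:
        # back-substitution: solve z U = [1, v], i.e. z = [1, v] inv(U)
        z1 = 1.0 / u11
        z2 = (v - z1 * u12) / u22
        out.append(z1 * z1 + z2 * z2)
    return out

def leverage_closed_form(xs):
    """Textbook formula 1/n + (x_i - xbar)^2 / Sxx (Sxx centered)."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((v - xbar) ** 2 for v in xs)
    return [1.0 / n + (v - xbar) ** 2 / sxx for v in xs]

xs = [2.0, 3.0, 5.0, 7.0, 11.0]
for a, b in zip(leverage_via_cholesky(xs), leverage_closed_form(xs)):
    assert abs(a - b) < 1e-12
```

The identity holds because X'X = U'U, so x_i' (X'X)^{-1} x_i = ||x_i' inv(U)||², which is exactly the row sum of rr above.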
This is exactly the same result I get when defining lrm3 differently.
lrm3 = lm(@formula(Weight ~ 1 + Horsepower + Wheelbase ),data,dropcollinear=false)
predict(lrm3,data,interval=:confidence)
#=
93×3 DataFrame
Row │ prediction lower upper
│ Float64? Float64? Float64?
─────┼────────────────────────────────
1 │ 2942.12 2899.56 2984.69
=#
I'm not sure precisely why the Cholesky decomposition computed by this code in the predict() function
chol = cholesky!(mm.pp)
# get the R matrix from the QR factorization
R = chol isa CholeskyPivoted ? chol.U[chol.p, chol.p] : chol.U
gives such a wrong answer, but the matrix it produces is completely different from the one calculated the naive way -- and only in these examples, where the confidence intervals fail to correspond with reasonable expectations.
This looks like a duplicate of https://github.com/JuliaStats/GLM.jl/issues/426 to me ("Estimated coefficients are different when dropcollinear=true"). AFAICT the fact that prediction intervals are different is a consequence of this. So the root of the issue is that the problem is ill-conditioned and pivoted Cholesky isn't very good at handling that.
/edit - I should clarify that githubtomtom's example does look like #426, but the supporting example I provided does not show the same behavior with the coefficient estimates at all -- it could be either a distinct issue or a related one.
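On the ill-conditioning point, a minimal pure-Python sketch (toy numbers assumed for illustration, unrelated to Cars93) shows how nearly collinear columns collapse the trailing Cholesky pivot of X'X, so anything downstream of the factorization, such as the back-substitution used for confidence intervals, amplifies small errors enormously:

```python
import math

def trailing_pivot(eps):
    """Trailing Cholesky pivot u22 for the Gram matrix of unit column x and
    column x + eps*y (y a unit vector orthogonal to x):
    X'X = [[1, 1], [1, 1 + eps^2]], whose upper Cholesky factor has u22 = eps."""
    a, b, c = 1.0, 1.0, 1.0 + eps ** 2
    u11 = math.sqrt(a)
    u12 = b / u11
    return math.sqrt(c - u12 * u12)

assert trailing_pivot(0.5) == 0.5   # well-separated columns: comfortably nonsingular
assert trailing_pivot(1e-9) == 0.0  # 1 + eps**2 rounds back to 1.0: numerically singular
```

Dividing by a pivot this close to zero is what turns tiny rounding differences into the enormous interval widths seen above.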
There is definitely something interesting going on in #426, but this is a different issue from what I can tell. If by 'estimated coefficients' you mean the linear coefficients for the features, then a different causal hypothesis may be needed to track this problem down. An addition to my example above:
lrm3 = lm(@formula(Weight ~ 1 + Horsepower + Wheelbase ),data)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}
Weight ~ 1 + Horsepower + Wheelbase
Coefficients:
────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────────
(Intercept) -3630.58 334.259 -10.86 <1e-17 -4294.64 -2966.51
Horsepower 4.63989 0.45144 10.28 <1e-16 3.74303 5.53676
Wheelbase 58.0698 3.46701 16.75 <1e-28 51.1819 64.9576
────────────────────────────────────────────────────────────────────────────────
lrm3_2 = lm(@formula(Weight ~ 1 + Horsepower + Wheelbase ),data, dropcollinear=false)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}
Weight ~ 1 + Horsepower + Wheelbase
Coefficients:
────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────────
(Intercept) -3630.58 334.259 -10.86 <1e-17 -4294.64 -2966.51
Horsepower 4.63989 0.45144 10.28 <1e-16 3.74303 5.53676
Wheelbase 58.0698 3.46701 16.75 <1e-28 51.1819 64.9576
────────────────────────────────────────────────────────────────────────────────
As far as I can tell, they are extremely close; from the table, they are identical in every field.
coef(lrm3) - coef(lrm3_2)
3-element Vector{Float64}:
-4.3155523599125445e-10
-2.1405099914773018e-13
4.433786671143025e-12
It does not seem like such a difference would even be easily detectable.
And just as a sanity check:
predict(lrm3) - predict(lrm3_2)
93-element Vector{Float64}:
-9.549694368615746e-12
3.501554601825774e-11
-1.6370904631912708e-11
1.8189894035458565e-12
⋮
-3.956301952712238e-11
5.002220859751105e-12
-1.8189894035458565e-12
(extremely close again)
And we still have this discrepancy in the confidence interval calculations:
predict(lrm3,data,interval=:confidence)
93×3 DataFrame
Row │ prediction lower upper
│ Float64? Float64? Float64?
─────┼────────────────────────────────────────────
1 │ 2942.12 -90050.4 95934.7
2 │ 3975.42 -1.28865e5 1.36816e5
⋮ │ ⋮ ⋮ ⋮
93 │ 3246.25 -108341.0 1.14833e5
predict(lrm3_2,data,interval=:confidence)
93×3 DataFrame
Row │ prediction lower upper
│ Float64? Float64? Float64?
─────┼────────────────────────────────
1 │ 2942.12 2899.56 2984.69
2 │ 3975.42 3896.27 4054.58
⋮ │ ⋮ ⋮ ⋮
93 │ 3246.25 3201.14 3291.36
If the linear coefficients of the features were slightly changed, e.g. by fitting to a different sub-sample of the data, then I would expect a corresponding slight change in the confidence intervals. Or if some fundamental numerical instability were causing my coefficients to be drastically different (with drastically different p-values), I might expect nonsense confidence intervals to go along with it.
What I would not expect at all is for tiny, tiny changes in the linear coefficients to bring on dramatic changes in the confidence intervals. The problem can be traced precisely to lines 250-252 in lm.jl (the current master version), where the Cholesky decomposition is calculated in a way I found difficult to follow. If I reproduce the code in the predict() function line by line (as I did above) and replace that Cholesky calculation with the naive calculation above, I get exactly the same answer as with the dropcollinear=false option.
I regret I do not have the time to research what the issue is to its root, but I hope this is at least helpful.
A pairwise correlation plot of these particular features, just in case it is informative about collinearity.
Thanks. I've found the problem. The previous fix https://github.com/JuliaStats/GLM.jl/pull/410, which introduced these lines, incorrectly took the permutation instead of the inverse permutation. That works for simple cases, as the two are the same there, so we didn't catch it immediately. See https://github.com/JuliaStats/GLM.jl/pull/436. It fixes your example and makes the results for the original example closer to dropcollinear=false (they are not completely identical, but given the size of the confidence intervals around the coefficients I think that's OK).
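For anyone following along, the confusion is easy to reproduce abstractly: a permutation and its inverse coincide exactly when the permutation is an involution (e.g. a single swap), which is why simple cases never exposed the bug. A pure-Python sketch (helper names are mine, mirroring Julia's Base.invperm):

```python
def invperm(p):
    """Inverse permutation: inv[p[i]] = i, so applying inv undoes applying p."""
    inv = [0] * len(p)
    for i, j in enumerate(p):
        inv[j] = i
    return inv

def apply(p, xs):
    """Reorder xs by permutation p: result[i] = xs[p[i]]."""
    return [xs[i] for i in p]

xs = ["a", "b", "c"]

swap = [1, 0, 2]   # a single swap is its own inverse: the bug is invisible here
assert apply(swap, xs) == apply(invperm(swap), xs)

cycle = [1, 2, 0]  # a 3-cycle is not: p and invperm(p) reorder differently
assert apply(cycle, xs) != apply(invperm(cycle), xs)
```

So once the pivot order is a genuine cycle, indexing with chol.p where the inverse permutation is needed selects a genuinely different matrix, consistent with only some of the models above going wrong.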
The following example shows that predict() is wrong when dropcollinear = true: :confidence and :prediction give the same predictions.