mousum-github opened this issue 5 months ago (Open)
Moreover,
using CSV, DataFrames, GLM, LinearAlgebra

df_filip = CSV.read("filip.csv", DataFrame);
model_filip_qr = lm(@formula(y~1+x+x^2+x^3+x^4+x^5+x^6+x^7+x^8+x^9+x^10.0), df_filip; method=:qr);
X = modelmatrix(model_filip_qr);
y = response(model_filip_qr);
rank(X)
# 10
rank(X'X)
# 5
It might be due to the limits of double-precision LAPACK computations. Now let's try the same with BigFloat:
using GenericLinearAlgebra
X = big.(modelmatrix(model_filip_qr))
rank(X)
# 11
rank(X'X)
# 11
and finally,
inv(X'X)*X'y
11-element Vector{BigFloat}:
-1467.4896101225341941
-2772.1795838225322676
-2316.3710745083318604
-1127.9739373552426084
-354.47823250481932410
-75.1242014719792427
-10.8753179947227183
-1.0622149816811447
-0.0670191151786945
-0.0024678107718217
-4.0296252319e-05
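One way to see why the Float64 rank of X'X collapses while BigFloat recovers it: forming the normal equations squares the 2-norm condition number. The following is a minimal self-contained sketch using synthetic data on an x-range similar to Filip's (roughly -9 to -3) — it is illustrative only, not the actual dataset:

```julia
using LinearAlgebra

# Synthetic stand-in for the Filip design matrix: a degree-10 monomial
# basis evaluated on an x-range similar to the dataset's.
x = range(-9, -3; length = 82)
X = [xi^j for xi in x, j in 0:10]

# The singular values of X'X are the squares of those of X, so
# cond(X'X) == cond(X)^2 in exact arithmetic. Once cond(X)^2 exceeds
# 1/eps(Float64) ≈ 4.5e15, the numerical rank of X'X collapses even
# though X itself retains more numerical rank.
@show cond(X)
@show rank(X)
@show rank(X'X)
```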
I don't think it's surprising that a coefficient is dropped, since the default is dropcollinear = true,
and
julia> cond(X)
1.7679681910558162e15
It's surprising, though, that we error out when setting dropcollinear = false
julia> lm(@formula(y~1+x+x^2+x^3+x^4+x^5+x^6+x^7+x^8+x^9+x^10.0), df_filip; method=:qr, dropcollinear=false)
ERROR: RankDeficientException(10)
Stacktrace:
[1] delbeta!(p::GLM.DensePredQR{Float64, LinearAlgebra.QRCompactWY}, r::Vector{Float64})
@ GLM ~/.julia/dev/GLM/src/linpred.jl:66
[2] fit!(obj::LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredQR{Float64, LinearAlgebra.QRCompactWY}})
@ GLM ~/.julia/dev/GLM/src/lm.jl:97
[3] fit(::Type{…}, f::FormulaTerm{…}, data::DataFrame, allowrankdeficient_dep::Nothing; wts::Nothing, dropcollinear::Bool, method::Symbol, contrasts::Dict{…})
@ GLM ~/.julia/dev/GLM/src/lm.jl:0
[4] fit
@ ~/.julia/dev/GLM/src/lm.jl:149 [inlined]
[5] lm
@ ~/.julia/dev/GLM/src/lm.jl:179 [inlined]
[6] top-level scope
@ REPL[42]:1
Some type information was truncated. Use `show(err)` to see complete types.
The problem is the check in https://github.com/JuliaStats/GLM.jl/blob/e2f6c980443fc34bf31c947efec254b863df64c3/src/linpred.jl#L66, which I don't think is a good idea. If the user has set dropcollinear = false, then we shouldn't try to outsmart her by computing a numerical rank. We should just check that there isn't a zero element on the diagonal of R.
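A sketch of what that check could look like (solve_unpivoted is a hypothetical helper for illustration, not GLM's actual code; it only throws when the triangular solve would genuinely divide by zero):

```julia
using LinearAlgebra

# Hypothetical sketch of the proposed behaviour for dropcollinear = false:
# instead of comparing a computed numerical rank to the number of columns,
# only throw when R has an exact zero on its diagonal.
function solve_unpivoted(X::AbstractMatrix, y::AbstractVector)
    F = qr(X)
    d = diag(F.R)
    if any(iszero, d)
        throw(LinearAlgebra.RankDeficientException(count(!iszero, d)))
    end
    return F \ y
end

# Usage: a full-rank least-squares fit of a line through (1,1), (2,2), (3,3).
X = [1.0 1.0; 1.0 2.0; 1.0 3.0]
y = [1.0, 2.0, 3.0]
solve_unpivoted(X, y)   # ≈ [0.0, 1.0] (intercept, slope)
```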
However, to my surprise, it looks like fixing this still isn't sufficient to match the reference estimates. It looks like LAPACK's pivoted QR loses some precision relative to the standard QR.
julia> qr(X, ColumnNorm()) \ df_filip.y
11-element Vector{Float64}:
9.01342654713381
1.6525459423443027
-5.767606630529379
-3.8636659613827122
-0.6703658653069542
0.1806043161686472
0.10552342922775357
0.0214449396217032
0.0022774832947497475
0.00012622643173822303
2.8896433329375396e-6
julia> qr(X) \ df_filip.y
11-element Vector{Float64}:
-1467.4896018652016
-2772.179569812206
-2316.3710641177045
-1127.9739329347135
-354.4782313151358
-75.12420126161706
-10.875317970213466
-1.0622149798557083
-0.06701911509851412
-0.002467810770121917
-4.029625231108671e-5
Q,R = qr(X);
diag(R)
11-element Vector{Float64}:
-9.055385138137417
-13.532654650686624
-21.825229067114073
30.328064942426202
44.4823844079297
61.77383426163982
90.2629854477562
-127.0559597329256
186.65576017620813
253.04777394199888
-373.39818100600036
rank(R)
10
I was also a bit surprised when I checked the diagonal elements of R and rank(R).
In fact, the inv function generates the inverse of R:
Rinv = inv(R)
Rinv * Rinv' * X'y
11-element Vector{Float64}:
-1467.489589088171
-2772.1795409834594
-2316.371035859374
-1127.973917018297
-354.4782255940765
-75.12419988749059
-10.875317746475703
-1.0622149554340137
-0.06701911338591168
-0.0024678107003603396
-4.029625105615225e-5
which is again a little bit different from what is generated by qr(X)\y.
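The algebra behind Rinv * Rinv' * X'y: with the thin QR factorization X = QR and Q having orthonormal columns, X'X = R'R, so inv(X'X) = inv(R) * inv(R'). A small self-contained check on a synthetic matrix (not the Filip data):

```julia
using LinearAlgebra

# With X = QR and Q'Q = I, the Gram matrix is X'X = R'R, hence
# inv(X'X) = inv(R) * inv(R') = Rinv * Rinv', and the normal-equations
# solution can be formed from the triangular factor alone.
X = [1.0 2.0; 3.0 5.0; 4.0 9.0]
y = [1.0, 2.0, 3.0]
F = qr(X)
Rinv = inv(F.R)

β_normal = inv(X'X) * (X' * y)
β_via_R  = Rinv * Rinv' * (X' * y)
@show β_normal
@show β_via_R
```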
The QR based least squares solution is
julia> F = qr(X);
julia> F.R\(F.Q'*df_filip.y)[1:size(F, 2)]
11-element Vector{Float64}:
-1467.4896018652016
-2772.179569812206
-2316.3710641177045
-1127.9739329347135
-354.4782313151358
-75.12420126161706
-10.875317970213466
-1.0622149798557083
-0.06701911509851412
-0.002467810770121917
-4.029625231108671e-5
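For reference, this recipe (truncating Q'y to the first n = size(X, 2) entries before the triangular solve) agrees with the generic least-squares backslash — a small self-contained check on synthetic data:

```julia
using LinearAlgebra

# Unpivoted-QR least squares: with X = QR (m ≥ n), only the first n
# entries of Q'y enter the triangular solve; the remaining m - n entries
# make up the residual norm.
X = [1.0 0.0; 1.0 1.0; 1.0 2.0; 1.0 3.0]
y = [0.9, 2.1, 2.9, 4.1]
F = qr(X)
n = size(X, 2)
β_qr = F.R \ (F.Q' * y)[1:n]
β_bs = X \ y
@show β_qr
@show β_bs
```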
Oh. Now I realize what is going on in the pivoted case. There is a rank determination by default, so the relevant comparison is
julia> ldiv!(qr(X, ColumnNorm()), df_filip.y[:,:], 0.0)[1][1:size(X, 2)]
11-element Vector{Float64}:
-1467.4896118403967
-2772.1795879957995
-2316.371078886713
-1127.9739399822226
-354.4782335067247
-75.12420172645771
-10.875318038405625
-1.062214986693476
-0.06701911554715413
-0.002467810787511543
-4.029625261329208e-5
where the third argument to ldiv! is the tolerance used when determining the numerical rank. To conclude, I think we just need to get rid of/adjust the check in https://github.com/JuliaStats/GLM.jl/blob/e2f6c980443fc34bf31c947efec254b863df64c3/src/linpred.jl#L66 and then we'd be able to match the reference estimates when setting dropcollinear = false.
filip.csv
In response to industrial concerns about the numerical accuracy of computations from statistical software, the Statistical Engineering and Mathematical and Computational Sciences Divisions of NIST's Information Technology Laboratory are providing datasets with certified values for various statistical methods. One of them for linear regression is the Filip dataset: https://www.itl.nist.gov/div898/strd/lls/data/Filip.shtml

For this dataset, the lm function with the QR decomposition method fails to estimate all of the parameters, whereas the lm function with the Cholesky decomposition method fails to estimate 6 of them. The following table shows that the estimates from the lm function in R are close to the certified values.