JuliaStats / GLM.jl

Generalized linear models in Julia

GLM fails to match certified values for the Filip dataset #558

Open mousum-github opened 5 months ago

mousum-github commented 5 months ago

filip.csv

In response to industrial concerns about the numerical accuracy of computations from statistical software, the Statistical Engineering and Mathematical and Computational Sciences Divisions of NIST's Information Technology Laboratory provide datasets with certified values for various statistical methods. One of the linear regression datasets is Filip: https://www.itl.nist.gov/div898/strd/lls/data/Filip.shtml

For this dataset, the lm function with the QR decomposition method fails to estimate one of the parameters (the coefficient on x is dropped):
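A minimal sketch of the call that produces this output (the same as the reproduction code further down, assuming filip.csv has columns y and x):

using CSV, DataFrames, GLM
df_filip = CSV.read("filip.csv", DataFrame)
model_filip_qr = lm(@formula(y~1+x+x^2+x^3+x^4+x^5+x^6+x^7+x^8+x^9+x^10.0), df_filip; method=:qr)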

LinearModel

y ~ 1 + x + :(x ^ 2) + :(x ^ 3) + :(x ^ 4) + :(x ^ 5) + :(x ^ 6) + :(x ^ 7) + :(x ^ 8) + :(x ^ 9) + :(x ^ 10)

Coefficients:
────────────────────────────────────────────────────────────────────────────────────────
                    Coef.     Std. Error       t  Pr(>|t|)      Lower 95%      Upper 95%
────────────────────────────────────────────────────────────────────────────────────────
(Intercept)   8.13378        9.54815        0.85    0.3971  -10.9001        27.1677
x             0.0          NaN            NaN       NaN     NaN            NaN
x ^ 2        -7.14418       14.9533        -0.48    0.6343  -36.953         22.6647
x ^ 3        -4.53337       14.5308        -0.31    0.7560  -33.5           24.4333
x ^ 4        -0.881151       6.84421       -0.13    0.8979  -14.5248        12.7625
x ^ 5         0.135741       1.93585        0.07    0.9443   -3.7233         3.99479
x ^ 6         0.0989815      0.351359       0.28    0.7790   -0.601441       0.799403
x ^ 7         0.0207993      0.0413984      0.50    0.6169   -0.0617269      0.103326
x ^ 8         0.0022362      0.00307053     0.73    0.4688   -0.0038848      0.0083572
x ^ 9         0.000124681    0.000130509    0.96    0.3426   -0.000135484    0.000384846
x ^ 10.0      2.86391e-6     2.42701e-6     1.18    0.2419   -1.97424e-6     7.70207e-6
────────────────────────────────────────────────────────────────────────────────────────

whereas the lm function with the Cholesky decomposition method fails to estimate six of the parameters:
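For completeness, a sketch of the corresponding call, reusing df_filip from above:

model_filip_chol = lm(@formula(y~1+x+x^2+x^3+x^4+x^5+x^6+x^7+x^8+x^9+x^10.0), df_filip; method=:cholesky)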

LinearModel

y ~ 1 + x + :(x ^ 2) + :(x ^ 3) + :(x ^ 4) + :(x ^ 5) + :(x ^ 6) + :(x ^ 7) + :(x ^ 8) + :(x ^ 9) + :(x ^ 10)

Coefficients:
───────────────────────────────────────────────────────────────────────────────────────
                   Coef.     Std. Error       t  Pr(>|t|)      Lower 95%      Upper 95%
───────────────────────────────────────────────────────────────────────────────────────
(Intercept)  0.0          NaN            NaN       NaN     NaN            NaN
x            0.0          NaN            NaN       NaN     NaN            NaN
x ^ 2        0.0          NaN            NaN       NaN     NaN            NaN
x ^ 3        0.0          NaN            NaN       NaN     NaN            NaN
x ^ 4        0.0          NaN            NaN       NaN     NaN            NaN
x ^ 5        0.0          NaN            NaN       NaN     NaN            NaN
x ^ 6        0.00368644     0.000219028   16.83    <1e-26    0.0032503      0.00412258
x ^ 7        0.00191731     0.000127438   15.05    <1e-23    0.00166355     0.00217107
x ^ 8        0.000375885    2.75293e-5    13.65    <1e-21    0.000321067    0.000430703
x ^ 9        3.28191e-5     2.61667e-6    12.54    <1e-19    2.76087e-5     3.80296e-5
x ^ 10       1.07467e-6     9.23624e-8    11.64    <1e-17    8.90753e-7     1.25859e-6
───────────────────────────────────────────────────────────────────────────────────────

The following table shows that the estimates from the lm function in R are close to the certified values, whereas neither Julia method comes close.

           Certified Values           R Values                Julia (QR)             Julia (Cholesky)
Beta_0   -1467.4896142298000000   -1467.4895242210000000      8.1337808176521600     0.0000000000000000
Beta_1   -2772.1795919334200000   -2772.1794222850000000      0.0000000000000000     0.0000000000000000
Beta_2   -2316.3710816089300000   -2316.3709402080000000     -7.1441779010897300     0.0000000000000000
Beta_3   -1127.9739409837200000   -1127.9738723330000000     -4.5333687533072200     0.0000000000000000
Beta_4    -354.4782337033490000    -354.4782121956000000     -0.8811512968692300     0.0000000000000000
Beta_5     -75.1242017393757000     -75.1241971938900000      0.1357405515617050     0.0000000000000000
Beta_6     -10.8753180355343000     -10.8753173788800000      0.0989814591132414     0.0036864417075102
Beta_7      -1.0622149858894700      -1.0622149218240000      0.0207993361386816     0.0019173118731864
Beta_8      -0.0670191154593408      -0.0670191114167300      0.0022361990171227     0.0003758850326059
Beta_9      -0.0024678107827548      -0.0024678106336760      0.0001246810033160     0.0000328191326413
Beta_10     -0.0000402962525080      -0.0000402962500667      0.0000028639148280     0.0000010746699638
mousum-github commented 5 months ago

Moreover,

using CSV, DataFrames, GLM, LinearAlgebra

df_filip = CSV.read("filip.csv", DataFrame);
model_filip_qr = lm(@formula(y~1+x+x^2+x^3+x^4+x^5+x^6+x^7+x^8+x^9+x^10.0), df_filip; method=:qr);
X = modelmatrix(model_filip_qr);
y = response(model_filip_qr);

rank(X)
# 10
rank(X'X)
# 5

It might be a limitation of the double-precision LAPACK routines; forming X'X roughly squares the condition number of X, which explains why even more rank is lost there. Now let's try the same with BigFloat:

using GenericLinearAlgebra  # needed so svd-based rank works for BigFloat matrices
X = big.(modelmatrix(model_filip_qr))
rank(X)
# 11
rank(X'X)
# 11

and finally, solving the normal equations in extended precision recovers estimates close to the certified values:

inv(X'X)*X'y

11-element Vector{BigFloat}:
 -1467.4896101225341941
 -2772.1795838225322676
 -2316.3710745083318604
 -1127.9739373552426084
  -354.47823250481932410
   -75.1242014719792427
   -10.8753179947227183
    -1.0622149816811447
    -0.0670191151786945
    -0.0024678107718217
    -4.0296252319e-05
andreasnoack commented 5 months ago

I don't think it's surprising that a coefficient is dropped since the default is dropcollinear = true and the condition number of X is essentially at the limit of what double precision can resolve (1/eps(Float64) ≈ 4.5e15):

julia> cond(X)
1.7679681910558162e15

It's surprising, though, that we error out when setting dropcollinear = false

julia> lm(@formula(y~1+x+x^2+x^3+x^4+x^5+x^6+x^7+x^8+x^9+x^10.0), df_filip; method=:qr, dropcollinear=false)
ERROR: RankDeficientException(10)
Stacktrace:
 [1] delbeta!(p::GLM.DensePredQR{Float64, LinearAlgebra.QRCompactWY}, r::Vector{Float64})
   @ GLM ~/.julia/dev/GLM/src/linpred.jl:66
 [2] fit!(obj::LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredQR{Float64, LinearAlgebra.QRCompactWY}})
   @ GLM ~/.julia/dev/GLM/src/lm.jl:97
 [3] fit(::Type{…}, f::FormulaTerm{…}, data::DataFrame, allowrankdeficient_dep::Nothing; wts::Nothing, dropcollinear::Bool, method::Symbol, contrasts::Dict{…})
   @ GLM ~/.julia/dev/GLM/src/lm.jl:0
 [4] fit
   @ ~/.julia/dev/GLM/src/lm.jl:149 [inlined]
 [5] lm
   @ ~/.julia/dev/GLM/src/lm.jl:179 [inlined]
 [6] top-level scope
   @ REPL[42]:1
Some type information was truncated. Use `show(err)` to see complete types.

The problem is the check in https://github.com/JuliaStats/GLM.jl/blob/e2f6c980443fc34bf31c947efec254b863df64c3/src/linpred.jl#L66 which I don't think is a good idea. If the user has set dropcollinear = false then we shouldn't try to outsmart her by computing a numerical rank. We should just check that there isn't a zero element in the diagonal of R.
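Something along these lines is what I have in mind (a sketch only, not the actual patch; it assumes delbeta!'s predictor p stores the factorization in p.qr, as in DensePredQR):

# Sketch: with dropcollinear = false, only reject an exactly singular R
# rather than comparing a numerical rank estimate against size(R, 2).
R = p.qr.R
if any(iszero, diag(R))
    throw(RankDeficientException(1))
end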

However, to my surprise, fixing this still isn't sufficient to match the reference estimates: LAPACK's pivoted QR appears to lose some precision relative to the standard QR.

julia> qr(X, ColumnNorm()) \ df_filip.y
11-element Vector{Float64}:
  9.01342654713381
  1.6525459423443027
 -5.767606630529379
 -3.8636659613827122
 -0.6703658653069542
  0.1806043161686472
  0.10552342922775357
  0.0214449396217032
  0.0022774832947497475
  0.00012622643173822303
  2.8896433329375396e-6

julia> qr(X) \ df_filip.y
11-element Vector{Float64}:
 -1467.4896018652016
 -2772.179569812206
 -2316.3710641177045
 -1127.9739329347135
  -354.4782313151358
   -75.12420126161706
   -10.875317970213466
    -1.0622149798557083
    -0.06701911509851412
    -0.002467810770121917
    -4.029625231108671e-5
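For reference, comparing that unpivoted solution against the NIST certified values (transcribed from the table above, reusing X and df_filip) shows agreement to roughly eight significant digits:

certified = [-1467.4896142298, -2772.17959193342, -2316.37108160893,
             -1127.97394098372, -354.478233703349, -75.1242017393757,
             -10.8753180355343, -1.06221498588947, -0.0670191154593408,
             -0.0024678107827548, -4.0296252508e-5]
maximum(abs.((qr(X) \ df_filip.y .- certified) ./ certified))
# ≈ 1e-8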
mousum-github commented 5 months ago

Q, R = qr(X);
diag(R)
11-element Vector{Float64}:
   -9.055385138137417
  -13.532654650686624
  -21.825229067114073
   30.328064942426202
   44.4823844079297
   61.77383426163982
   90.2629854477562
 -127.0559597329256
  186.65576017620813
  253.04777394199888
 -373.39818100600036

rank(R)
# 10

I was also a bit surprised when I checked the diagonal elements of R: none of them is zero, yet rank(R) returns 10. Even so, the inv function happily computes an inverse of R:

Rinv = inv(R)
Rinv * Rinv' * X'y
11-element Vector{Float64}:
 -1467.489589088171
 -2772.1795409834594
 -2316.371035859374
 -1127.973917018297
  -354.4782255940765
   -75.12419988749059
   -10.875317746475703
    -1.0622149554340137
    -0.06701911338591168
    -0.0024678107003603396
    -4.029625105615225e-5

which is again slightly different from what qr(X)\y produces. In exact arithmetic the two agree, since X'X = R'R implies R^-1 R^-T X'y = R^-1 Q'y; the discrepancy is pure rounding error from forming X'X and inverting R explicitly.

andreasnoack commented 5 months ago

The QR-based least squares solution is

julia> F = qr(X);

julia> F.R\(F.Q'*df_filip.y)[1:size(F, 2)]
11-element Vector{Float64}:
 -1467.4896018652016
 -2772.179569812206
 -2316.3710641177045
 -1127.9739329347135
  -354.4782313151358
   -75.12420126161706
   -10.875317970213466
    -1.0622149798557083
    -0.06701911509851412
    -0.002467810770121917
    -4.029625231108671e-5
andreasnoack commented 5 months ago

Oh, now I realize what is going on in the pivoted case: a numerical rank determination happens by default, so the relevant comparison sets the tolerance to zero:

julia> ldiv!(qr(X, ColumnNorm()), df_filip.y[:,:], 0.0)[1][1:size(X, 2)]
11-element Vector{Float64}:
 -1467.4896118403967
 -2772.1795879957995
 -2316.371078886713
 -1127.9739399822226
  -354.4782335067247
   -75.12420172645771
   -10.875318038405625
    -1.062214986693476
    -0.06701911554715413
    -0.002467810787511543
    -4.029625261329208e-5

where the third argument to ldiv! is the tolerance used when determining the numerical rank. To conclude, I think we just need to remove or adjust the check in https://github.com/JuliaStats/GLM.jl/blob/e2f6c980443fc34bf31c947efec254b863df64c3/src/linpred.jl#L66, and then we'd be able to match the reference estimates when setting dropcollinear = false.
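In other words, with the check relaxed, a call like this (hypothetical until the fix lands) should reproduce the unpivoted-QR coefficients above instead of throwing RankDeficientException:

lm(@formula(y~1+x+x^2+x^3+x^4+x^5+x^6+x^7+x^8+x^9+x^10.0), df_filip;
   method=:qr, dropcollinear=false)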