JuliaStats / GLM.jl

Generalized linear models in Julia
Other
595 stars 114 forks source link

predict gives incorrect CIs when collinear columns have been dropped #413

Open nalimilan opened 3 years ago

nalimilan commented 3 years ago

As noted at https://github.com/JuliaStats/GLM.jl/pull/410, predict appears to give incorrect results when some predictors are dropped because of collinearity. We should really do something about this.

julia> x = [1.0 1.0 2.0
            1.0 2.0 3.0
            1.0 -1.0 0.0];

julia> y = [1.0, 3.0, -2.0];

julia> m1 = lm(x, y, dropcollinear=true)
LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,CholeskyPivoted{Float64,Array{Float64,2}}}}:

Coefficients:
─────────────────────────────────────────────────────────────────
        Coef.  Std. Error       t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────
x1   0.0       NaN         NaN       NaN     NaN        NaN
x2   2.07143     0.257539    8.04    0.0787   -1.20092    5.34378
x3  -0.428571    0.174964   -2.45    0.2468   -2.65169    1.79455
─────────────────────────────────────────────────────────────────

julia> m2 = lm(x[:, 1:2], y, dropcollinear=false)
LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}}:

Coefficients:
────────────────────────────────────────────────────────────────
        Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────
x1  -0.428571    0.174964  -2.45    0.2468  -2.65169     1.79455
x2   1.64286     0.123718  13.28    0.0479   0.070872    3.21484
────────────────────────────────────────────────────────────────

julia> p1 = predict(m1, x, interval=:confidence);

julia> p2 = predict(m2, x[:, 1:2], interval=:confidence);

julia> @test p1.prediction ≈ p2.prediction
Test Passed

julia> @test p1.upper ≈ p2.upper
Test Failed at REPL[254]:1
  Expression: p1.upper ≈ p2.upper
   Evaluated: [3.354342064529162; 5.728437323463455; 1.6152034019232366] ≈ [3.2437098232940285; 5.727181955909349; 1.200919478058661]
ERROR: There was an error during testing

julia> @test p1.lower ≈ p2.lower
Test Failed at REPL[255]:1
  Expression: p1.lower ≈ p2.lower
   Evaluated: [-0.925770635957734; -0.014151609177741165; -5.758060544780381] ≈ [-0.8151383947225999; -0.012896241623634452; -5.343776620915804]
ERROR: There was an error during testing

In the worst case we could throw an error when calling predict in such cases until we have a fix.

(BTW I'm not sure why the intercept is dropped in this example rather than the last column.)

Cc: @piever @pdeffebach @mkborregaard @dmbates @andreasnoack @Nosferican @DilumAluthge

pdeffebach commented 3 years ago

Thanks for highlighting this.

I can confirm that R does not have this behavior.

The error stems from this line where we work with the pivoted cholesky instead of the original matrix from the original regression.

I haven't tracked down exactly how R is handling it. But there is an interesting warning here

    p <- object$rank
    p1 <- seq_len(p)
    piv <- if(p) qr.lm(object)$pivot[p1]
    if(p < ncol(X) && !(missing(newdata) || is.null(newdata)))
    warning("prediction from a rank-deficient fit may be misleading")

I can trigger the warning


r$> predict(m1, as.data.frame(x), interval = "confidence")                 
        fit         lwr      upr
1  1.214286 -0.81513839 3.243710
2  2.857143 -0.01289624 5.727182
3 -2.071429 -5.34377662 1.200919
Warning message:
In predict.lm(m1, as.data.frame(x), interval = "confidence") :
  prediction from a rank-deficient fit may be misleading
nalimilan commented 3 years ago

Thanks. So apparently R doesn't know how to handle it either, so we should probably throw an error (I don't like warnings that tell you we return invalid results).