JuliaStats / GLM.jl

Generalized linear models in Julia

`QR` decomposition method in GLM. #507

Closed · mousum-github closed this 1 year ago

mousum-github commented 1 year ago

It is well known that the Cholesky decomposition is fast but numerically unstable for ill-conditioned design/model matrices. Also, some users prefer the QR decomposition method for linear (and generalized linear) models.

In this PR, we have added a keyword argument `method` to `lm`, `glm`, and related functions to control which decomposition method is used. The `method` argument takes two values: `:cholesky` (the default, which uses the Cholesky decomposition) and `:qr` (which uses the QR decomposition).

Using QR decomposition in `lm` (issue #426): `lm(model, data; method=:qr)`

Coefficients:

Loglikelihood: -917.7449640609644, r2: 0.46431799718947153, adjr2: 0.454016420212346

Issue #375: `lm(@formula(y ~ x), df; method=:qr, dropcollinear=false)`

Coefficients:

Summary of additions/changes:

To check and compare the performance of GLM with the QR method, we performed a logistic regression on a dataset with 3,481,280 rows × 24 columns (Intel i7 11th generation, 16GB RAM, Windows 11).
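A comparison of this kind can be reproduced along the following lines. This is a rough sketch on synthetic data of a similar shape (not the dataset or timings referenced above), and it assumes the matrix methods of `glm` also accept the new `method` keyword:

```julia
using GLM, Distributions, Random

Random.seed!(1)
n, p = 1_000_000, 24
X = hcat(ones(n), randn(n, p - 1))                          # design matrix with an intercept column
y = Float64.(rand(n) .< 1 ./ (1 .+ exp.(-(X * randn(p)))))  # Bernoulli responses from a logistic model

# Fit the same logistic regression with both decomposition methods and compare timings.
@time glm(X, y, Binomial(), LogitLink(); method=:cholesky)
@time glm(X, y, Binomial(), LogitLink(); method=:qr)
```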

Finally, @harsharora21 is one of the main contributors to this PR.

palday commented 1 year ago

@mousum-github for the method kwarg, I would be explicit and use :qr or :cholesky (instead of :stable and :fast) and explain the difference in the docs. QR and Cholesky are after all not the only methods for solving the least squares problem. :smile:

codecov-commenter commented 1 year ago

Codecov Report

Patch coverage: 97.43% and project coverage change: +1.62% :tada:

Comparison is base (b728917) 88.75% compared to head (7ebdacd) 90.37%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #507      +/-   ##
==========================================
+ Coverage   88.75%   90.37%   +1.62%
==========================================
  Files           8        8
  Lines        1040     1132      +92
==========================================
+ Hits          923     1023     +100
+ Misses        117      109       -8
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/negbinfit.jl | `82.66% <ø> (ø)` | |
| src/GLM.jl | `50.00% <50.00%> (ø)` | |
| src/glmfit.jl | `81.91% <90.00%> (+0.11%)` | :arrow_up: |
| src/lm.jl | `93.78% <92.85%> (-0.48%)` | :arrow_down: |
| src/linpred.jl | `98.38% <100.00%> (+7.21%)` | :arrow_up: |


ayushpatnaikgit commented 1 year ago

> @mousum-github for the method kwarg, I would be explicit and use :qr or :cholesky (instead of :stable and :fast) and explain the difference in the docs. QR and Cholesky are after all not the only methods for solving the least squares problem. :smile:

Shall we do a poll?

❤️ for `:stable` / `:fast`, 👍 for `:qr` / `:cholesky`

We will add QR decomposition for glm after this, so it's best to get it right once and for all.

ViralBShah commented 1 year ago

@palday Just a note that it might be useful to use LinearSolve.jl for the linear solver APIs, especially once we have optional dependencies support.

ayushpatnaikgit commented 1 year ago

> > @mousum-github for the method kwarg, I would be explicit and use :qr or :cholesky (instead of :stable and :fast) and explain the difference in the docs. QR and Cholesky are after all not the only methods for solving the least squares problem. :smile:
>
> Shall we do a poll?
>
> ❤️ for `:stable` / `:fast`, 👍 for `:qr` / `:cholesky`
>
> We will add QR decomposition for glm after this, so it's best to get it right once and for all.

@mousum-github it looks like we should go ahead with :qr/:cholesky.

mousum-github commented 1 year ago

> > > @mousum-github for the method kwarg, I would be explicit and use :qr or :cholesky (instead of :stable and :fast) and explain the difference in the docs. QR and Cholesky are after all not the only methods for solving the least squares problem. :smile:
> >
> > Shall we do a poll? ❤️ for `:stable` / `:fast`, 👍 for `:qr` / `:cholesky`. We will add QR decomposition for glm after this, so it's best to get it right once and for all.
>
> @mousum-github it looks like we should go ahead with :qr/:cholesky.

Okay, we will change accordingly.

mousum-github commented 1 year ago

Hi @nalimilan, @bkamins and others - looking for your suggestions.

bkamins commented 1 year ago

Can you please check why tests fail on nightly?

mousum-github commented 1 year ago

It is failing on the following test: `@test_broken glm(@formula(y ~ x1 + x2), df, Normal(), IdentityLink())` in the "Saturated linear model" test case.

mousum-github commented 1 year ago

Initially, we planned to raise two PRs to incorporate the QR decomposition method in lm and glm. While preparing the QR method for glm after the QR method for lm, we found that the two share almost 80% of the new code, hence a single PR.

ViralBShah commented 1 year ago

@mousum-github Why are the tests failing on master?

ayushpatnaikgit commented 1 year ago

> @mousum-github Why are the tests failing on master?

@ViralBShah do you mean why the tests are failing on the nightly builds?

mousum-github commented 1 year ago

> @mousum-github Why are the tests failing on master?

The following is the test that is failing on the Julia nightly build:

julia> using DataFrames, GLM, Test

julia> df = DataFrame(x1=["a", "b", "c"], x2=["a", "b", "c"], y=[1, 2, 3]);

julia> glm(@formula(y ~ x1 + x2), df, Normal(), IdentityLink())
GeneralizedLinearModel

y ~ 1 + x1 + x2

Coefficients:
──────────────────────────────────────────────────────────────────────
             Coef.  Std. Error       z  Pr(>|z|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────
(Intercept)    1.0         Inf    0.00    1.0000       -Inf        Inf
x1: b          1.0         Inf    0.00    1.0000       -Inf        Inf
x1: c          2.0         Inf    0.00    1.0000       -Inf        Inf
x2: b          0.0        NaN   NaN       NaN          NaN        NaN
x2: c          0.0        NaN   NaN       NaN          NaN        NaN
──────────────────────────────────────────────────────────────────────

julia> @test_broken glm(@formula(y ~ x1 + x2), df, Normal(), IdentityLink())
Error During Test at REPL[20]:1
  Expression evaluated to non-Boolean
  Expression: glm(#= REPL[20]:1 =# @formula(y ~ x1 + x2), df, Normal(), IdentityLink())
       Value: GeneralizedLinearModel

y ~ 1 + x1 + x2

Coefficients:
──────────────────────────────────────────────────────────────────────
             Coef.  Std. Error       z  Pr(>|z|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────
(Intercept)    1.0         Inf    0.00    1.0000       -Inf        Inf
x1: b          1.0         Inf    0.00    1.0000       -Inf        Inf
x1: c          2.0         Inf    0.00    1.0000       -Inf        Inf
x2: b          0.0        NaN   NaN       NaN          NaN        NaN
x2: c          0.0        NaN   NaN       NaN          NaN        NaN
──────────────────────────────────────────────────────────────────────

And the following is the output on the stable build:

julia> using DataFrames, GLM, Test

julia> df = DataFrame(x1=["a", "b", "c"], x2=["a", "b", "c"], y=[1, 2, 3]);

julia> glm(@formula(y ~ x1 + x2), df, Normal(), IdentityLink())
GeneralizedLinearModel

y ~ 1 + x1 + x2

Coefficients:
──────────────────────────────────────────────────────────────────────
             Coef.  Std. Error       z  Pr(>|z|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────
(Intercept)    1.0         Inf    0.00    1.0000       -Inf        Inf
x1: b          1.0         Inf    0.00    1.0000       -Inf        Inf
x1: c          2.0         Inf    0.00    1.0000       -Inf        Inf
x2: b          0.0        NaN   NaN       NaN          NaN        NaN
x2: c          0.0        NaN   NaN       NaN          NaN        NaN
──────────────────────────────────────────────────────────────────────

julia> @test_broken glm(@formula(y ~ x1 + x2), df, Normal(), IdentityLink())
Test Broken
  Expression: glm(#= REPL[10]:1 =# @formula(y ~ x1 + x2), df, Normal(), IdentityLink())

The test is performed under the "Saturated linear model with Cholesky" test case in runtests.jl.
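For what it's worth, the nightly error appears to just be Test becoming stricter about `@test_broken` expressions that don't evaluate to a Boolean. One possible adjustment (a sketch, not necessarily what will be committed) would be to test the Boolean property that is actually broken, e.g. that all standard errors are finite:

```julia
using DataFrames, GLM, Test

df = DataFrame(x1=["a", "b", "c"], x2=["a", "b", "c"], y=[1, 2, 3])

# The saturated model fits, but the redundant columns get NaN/Inf statistics,
# so mark the finiteness of the standard errors as the broken property.
m = glm(@formula(y ~ x1 + x2), df, Normal(), IdentityLink())
@test_broken all(isfinite, stderror(m))
```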

bkamins commented 1 year ago

Section 6-3b from Wooldridge 7e is a good example where standard estimation fails. Maybe we could include it in the tests to make sure it is resolved:

lm(@formula(rdintens ~ sales + sales^2), rdchem)

(dataset is rdchem)

mousum-github commented 1 year ago

Thanks, @bkamins. The following are the outputs from R and Julia:

Output from R:

Call:
lm(formula = rdintens ~ sales + I(sales^2), data = rdchem)

Residuals:
            Min              1Q          Median              3Q             Max 
-2.141830034005 -1.362962168805 -0.225670654047  1.068750711730  5.580801329956 

Coefficients:
                       Estimate          Std. Error  t value  Pr(>|t|)    
(Intercept)  2.612512084995e+00  4.294417551587e-01  6.08351 1.267e-06 ***
sales        3.005713005396e-04  1.392952806164e-04  2.15780  0.039362 *  
I(sales^2)  -6.945939423917e-09  3.726138613681e-09 -1.86411  0.072458 .  

Outputs from Julia:

julia> mdl = lm(@formula(rdintens ~ sales + sales^2), rdchem; method=:cholesky)
LinearModel

rdintens ~ 1 + sales + :(sales ^ 2)

Coefficients:
───────────────────────────────────────────────────────────────────────────────────────
                    Coef.     Std. Error       t  Pr(>|t|)      Lower 95%     Upper 95%
───────────────────────────────────────────────────────────────────────────────────────
(Intercept)   0.0          NaN            NaN       NaN     NaN            NaN
sales         0.000852509    0.000156784    5.44    <1e-05    0.000532313    0.00117271
sales ^ 2    -1.97385e-8     4.56287e-9    -4.33    0.0002   -2.90571e-8    -1.04199e-8
───────────────────────────────────────────────────────────────────────────────────────

and

julia> mdl = lm(@formula(rdintens ~ sales + sales^2), rdchem; method=:qr)
LinearModel

rdintens ~ 1 + sales + :(sales ^ 2)

Coefficients:
─────────────────────────────────────────────────────────────────────────────────
                    Coef.   Std. Error      t  Pr(>|t|)    Lower 95%    Upper 95%
─────────────────────────────────────────────────────────────────────────────────
(Intercept)   2.61251      0.429442      6.08    <1e-05   1.73421     3.49082
sales         0.000300571  0.000139295   2.16    0.0394   1.56805e-5  0.000585462
sales ^ 2    -6.94594e-9   3.72614e-9   -1.86    0.0725  -1.45667e-8  6.7487e-10
─────────────────────────────────────────────────────────────────────────────────
mousum-github commented 1 year ago

Moreover:

julia> coef(mdl) ≈ [2.612512084994711e+00,  3.005713005395818e-04, -6.945939423916543e-09]
true

julia> stderror(mdl) ≈ [4.294417551587120e-01, 1.392952806163750e-04, 3.726138613680845e-09]
true

I will add a test case.

bkamins commented 1 year ago

Nice, thank you!

bkamins commented 1 year ago

In fact, I was thinking the following:

If we use the default decomposition and the output contains any value that is !isfinite, while all values in the input data are isfinite, maybe we should automatically try :qr as a second step? The rationale is that the !isfinite check should be quite cheap (or maybe there is some better way to detect that the default decomposition was not accurate enough?).
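A minimal sketch of that idea in a plain least-squares setting (the helper name is hypothetical; this only illustrates the detection step, not GLM.jl's internals):

```julia
using LinearAlgebra

# Hypothetical helper: cheap Cholesky solve first, fall back to QR only if the
# result contains non-finite values even though all inputs are finite.
function solve_ls_with_fallback(X::AbstractMatrix, y::AbstractVector)
    F = cholesky(Symmetric(X' * X); check=false)     # don't throw on (near-)singular X'X
    β = issuccess(F) ? F \ (X' * y) : fill(NaN, size(X, 2))
    if all(isfinite, X) && all(isfinite, y) && !all(isfinite, β)
        β = qr(X, ColumnNorm()) \ y                  # pivoted QR: slower but more stable
    end
    return β
end
```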

mousum-github commented 1 year ago

In that case, I would prefer to have something like the following in lm (and glm):

    if method === :auto
        decomobj = cholpred(X, dropcollinear)
        if rank(decomobj.chol) == size(X, 2)
            fit!(LinearModel(LmResp(y, wts), decomobj, nothing))
        else
            fit!(LinearModel(LmResp(y, wts), qrpred(X, dropcollinear), nothing))
        end
    elseif method === :cholesky
        fit!(LinearModel(LmResp(y, wts), cholpred(X, dropcollinear), nothing))
    elseif method === :qr
        fit!(LinearModel(LmResp(y, wts), qrpred(X, dropcollinear), nothing))
    else
        throw(ArgumentError("The only supported values for keyword argument `method` are `:auto`, `:cholesky` and `:qr`."))
    end
bkamins commented 1 year ago

Yes - makes sense.

ViralBShah commented 1 year ago

Would be nice to see this one get merged.

mousum-github commented 1 year ago

Thanks, @bkamins - I will go through this today.

bkamins commented 1 year ago

@andreasnoack - could you please summarize here the conclusions from your investigation in #518? What conclusions should we draw for the design (I understand we have a tension between the two kinds of errors we can make)?

Do you think we should default to :cholesky and have :qr as opt-in (without :auto as the default, since that would be breaking)? Then maybe we should add a warning when :cholesky detects multicollinearity, so the user knows they might want to check :qr? I am not 100% sure what is best.

ViralBShah commented 1 year ago

I think keeping Cholesky as the default and warning when falling back to QR makes sense.

If we do change the default, we should announce it as upcoming in the README for a whole release cycle (and a few months) and change it in the next release.

andreasnoack commented 1 year ago

> could you please summarize here the conclusions from your investigation in https://github.com/JuliaStats/GLM.jl/pull/518?

Let me try. The mechanism for excluding variables when there is numerical collinearity, introduced in https://github.com/JuliaStats/GLM.jl/pull/207, depends on a numerical rank determination (currently) based on the pivoted Cholesky decomposition. However, because of the famous squaring-of-the-condition-number effect of X'X, there isn't much space between reduced rank because of bad scaling and reduced rank because of perfect multicollinearity. That is why the Wooldridge example can't currently be reproduced with GLM, as discussed in https://github.com/JuliaStats/GLM.jl/pull/507#discussion_r1145045405 and https://github.com/JuliaStats/GLM.jl/pull/507#discussion_r1147246862. So there are the following options:

  1. Continue to use Cholesky and compute the numerical rank based on tol=-1. The pros are that the current multicollinearity-detection examples will continue to work, and it is fast and memory efficient. The con is that many examples that work in other software will drop variables in GLM.
  2. Continue to use Cholesky but set tol=0. This will be fast and memory efficient and will only drop variables when strictly required, i.e. when X'X becomes singular or slightly indefinite. Problems with multicollinearity will show up as very large confidence bounds and therefore require manual removal of explanatory variables. I like this approach but others might not.
  3. Use pivoted QR when dropcollinear=true but otherwise use Cholesky or pivoted Cholesky with tol=0. That would ensure that Wooldridge-like examples work while also detecting perfect multicollinearity. The cons here would be the extra cost of computing the QR, in both flops and memory, and the added complexity of switching algorithms based on the dropcollinear argument.
  4. Always use QR, and use the pivoted version if dropcollinear=true. This has the same benefit as 3. regarding the detection of multicollinear variables, but you pay the price of slower computations.
  5. Try/catch on the Cholesky and fall back to QR. However, I think we know when things might fail, so the options above are preferable.

As I argued in the thread above, I think the numerical error is often dominated by the statistical error, and I therefore tend to prefer the Cholesky because it is so much faster. I hadn't realized the effect of the numerical rank determination that @bkamins mentioned, though.
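For readers less familiar with the squaring effect mentioned above, here is a minimal illustration (not taken from GLM.jl): a design matrix X with two nearly collinear columns is only modestly ill-conditioned, but X'X has roughly the squared condition number, so any tolerance applied to X'X has very little headroom.

```julia
using LinearAlgebra

X = [1.0 1.0;
     1.0 1.0 + 1e-7;
     1.0 1.0 + 2e-7]

cond(X)        # on the order of 1e7
cond(X' * X)   # roughly cond(X)^2, i.e. on the order of 1e14
```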

bkamins commented 1 year ago

Thank you for a fantastic summary. I have now checked the docstring of dropcollinear:

> dropcollinear controls whether or not lm accepts a model matrix which is less-than-full rank. If true (the default),
> only the first of each set of linearly-dependent columns is used. The coefficient for redundant linearly dependent
> columns is 0.0 and all associated statistics are set to NaN.

If I read it correctly, we should drop columns when rank(X) < size(X, 2). The problem is that with tol=-1 in Cholesky this is not ensured. Do you know whether setting tol=0 gives us that equivalence?
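For reference, here is roughly what the two tolerances do at the LinearAlgebra level on a badly scaled design (a sketch using the pivoted Cholesky directly; GLM.jl's `cholpred` internals may differ):

```julia
using LinearAlgebra

# Near-collinear design: the second column is tiny relative to the first.
X = [1.0 1e-12; 1.0 2e-12; 1.0 3e-12; 1.0 4e-12; 1.0 5e-12]
A = Symmetric(X' * X)

F_default = cholesky(A, RowMaximum(); tol=-1.0, check=false)  # tol=-1: machine-precision-based threshold
F_strict  = cholesky(A, RowMaximum(); tol=0.0,  check=false)  # tol=0: stop only on a non-positive pivot

rank(F_default), rank(F_strict)   # (1, 2): only the strict tolerance keeps the badly scaled column
```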

nalimilan commented 1 year ago

How about merging the PR without changing the default or adding :auto, and discussing that issue later? We don't need to make a decision about the default in order to add support for QR as an option.

bkamins commented 1 year ago

> We don't need to make a decision about the default to add support for QR as an option.

I think we can do this, but we would need to document how QR differs from Cholesky. The point is that in the examples we have, it does differ, because a different rule is used for detecting multicollinearity (i.e. Cholesky works without a problem, but uses tol=-1, which makes it decide that near-multicollinearity means dropping columns).

What I want to avoid is leaving the impression that the difference is between QR and Cholesky as such, since if we added a tol=0 option to Cholesky we would also get the same results.

For these reasons I asked @mousum-github whether we have an example where QR does indeed produce a qualitatively different result than Cholesky at the numerical level (not at the rank-determination level). I think there will be such cases, but I just wanted to make sure we understand clearly what the options are, what they mean, and how to interpret them.

For example in the case from the changed docs we get:

julia> using GLM

julia> y = [1, 2, 3, 4, 5];

julia> x = [1.0E-12, 2.0E-12, 3.0E-12, 4.0E-12, 5.0E-12];

julia> mdl1 = lm(@formula(y ~ x), (; x, y); dropcollinear=false)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}

y ~ 1 + x

Coefficients:
────────────────────────────────────────────────────────────────────────────────────────────────
                    Coef.   Std. Error                    t  Pr(>|t|)     Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────────────────────────
(Intercept)  -7.94411e-16  4.30465e-16                -1.85    0.1622  -2.16434e-15  5.75521e-16
x             1.0e12       0.00012979   7704748539664896.00    <1e-99   1.0e12       1.0e12
────────────────────────────────────────────────────────────────────────────────────────────────

and the result is different, but I could not say that it is worse than the one for QR reported in the docs.

nalimilan commented 1 year ago

I see. The most consistent approach would probably be to use QR by default (which implies tol=0 IIUC), use tol=0 by default for Cholesky when method=:cholesky is passed, and add an argument to set a different tolerance (or maybe that value can be passed to dropcollinear?). Or support two :cholesky variants, one which sets tol=0 and one which sets tol=-1. Not sure that's user-friendly though.

bkamins commented 1 year ago

I was also thinking of having dropcollinear support some third value (like "aggressive drop collinear").

mousum-github commented 1 year ago

An example where the QR method performs better than Cholesky in estimation:

julia> Random.seed!(12345);

julia> X = rand(10, 2);

julia> Q,R = qr(X);

julia> R[1,1] = 1
1

julia> R[2,2] = 1.0E-7
1.0e-7

julia> X = Q*R;

julia> beta = [5, 10]
2-element Vector{Int64}:
  5
 10

julia> y = X*beta;

julia> md11 = lm(X, y; method=:qr, dropcollinear=false)
LinearModel

Coefficients:
───────────────────────────────────────────────────────────────────
    Coef.  Std. Error             t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────
x1    5.0  2.53054e-8  197586428.62    <1e-63        5.0        5.0
x2   10.0  1.59395e-8  627370993.89    <1e-99       10.0       10.0
───────────────────────────────────────────────────────────────────

julia> md21 = lm(X, y; method=:cholesky, dropcollinear=false)
LinearModel

Coefficients:
────────────────────────────────────────────────────────────────
       Coef.  Std. Error       t  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────
x1   5.28865   0.103248    51.22    <1e-10    5.05056    5.52674
x2  10.1818    0.0650348  156.56    <1e-14   10.0318    10.3318
────────────────────────────────────────────────────────────────

Whenever we set `R[2,2] = 1.0e-4`, both methods generate the same estimates. Whenever we set `R[2,2] = 1.0e-8`, Cholesky throws a PosDefException. Whenever we set `R[2,2] = 1.0e-16`, multicollinearity is detected by QR as well.

bkamins commented 1 year ago

OK. Thank you! So - @mousum-github - what do you feel would be a good option given the discussion we had?

ayushpatnaikgit commented 1 year ago

@bkamins I have sent you an invitation for write access to xKDR/GLM.jl, in case you want to introduce changes to the PR directly. There was no other way, as the PR is opened from an organisation account.

mousum-github commented 1 year ago

Hi @bkamins, in my opinion let's keep :cholesky as the default, since the Cholesky decomposition method is not as bad as we usually think. When I started with Julia GLM, I created hundreds of datasets from different distributions and link functions to test and get familiar with this module, and I have tested with very large datasets, especially for speed. Most of the time Cholesky and R's QR gave the same results; in the very few cases where the results differed, it was because of singularity detection. Most of the issues raised in GLM about the estimates are because of singularity detection. Additionally, as we discussed, we can provide (now or later) an :auto option to switch from :cholesky to :qr automatically whenever a singularity is detected by the Cholesky decomposition.

bkamins commented 1 year ago

I think it is OK. We just need to document that :cholesky eagerly drops near-collinear columns, so if a user has such a case and does not want them to be dropped, :qr should be used.
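Something along these lines could go into the docs (a sketch reusing the tiny-x example from earlier in the thread; exactly when each method flags collinearity depends on its tolerance):

```julia
using GLM

y = [1, 2, 3, 4, 5];
x = [1.0e-12, 2.0e-12, 3.0e-12, 4.0e-12, 5.0e-12];

# With the default method=:cholesky (and dropcollinear=true), the badly scaled
# column may be flagged as collinear and dropped (coefficient 0.0, NaN statistics).
lm(@formula(y ~ x), (; x, y))

# Opting into method=:qr keeps the column, as in the QR output shown in the updated docs.
lm(@formula(y ~ x), (; x, y); method=:qr)
```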

bkamins commented 1 year ago

@mousum-github - I can review this PR, but please note that I am not a maintainer of GLM.jl, so someone from the team working with this package should approve and merge this PR.

droodman commented 8 months ago

As a user who is getting wrong results where R and Stata get it right, I think this is a fantastic addition. One suggestion: rename the method option to something less generic like invmethod (as in inversion method). Consider that FixedEffectModels.jl has a method option, also maybe too generically named, to control whether the CPU or GPU is used. I think that FixedEffectModels.jl also should gain the ability to request the QR; what should that option be called?

ararslan commented 8 months ago

For whatever it's worth, I think the use of method to choose the solution method is more appropriate than using method to choose where a computation is performed. The method keyword here in GLM.jl is the same as in R's lm. I don't know whether that was the motivator behind the name but it is pleasantly discoverable for folks coming from R.