Closed: mousum-github closed this pull request 1 year ago.
@mousum-github for the method kwarg, I would be explicit and use :qr or :cholesky (instead of :stable and :fast) and explain the difference in the docs. QR and Cholesky are after all not the only methods for solving the least squares problem. :smile:
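For concreteness, the kind of usage being proposed would look like this (a sketch only; at this point in the thread the PR still used :stable/:fast, and the data here are made up):

using DataFrames, GLM

df = DataFrame(x = 1:5, y = [1.1, 1.9, 3.2, 3.9, 5.1])

# Pick the factorization used to solve the least squares problem explicitly.
lm(@formula(y ~ x), df; method = :cholesky)   # fast; forms X'X
lm(@formula(y ~ x), df; method = :qr)         # slower, but more robust for ill-conditioned X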
Patch coverage: 97.43% and project coverage change: +1.62% :tada:
Comparison is base (b728917) 88.75% compared to head (7ebdacd) 90.37%.
:umbrella: View full report in Codecov by Sentry.
> @mousum-github for the method kwarg, I would be explicit and use :qr or :cholesky (instead of :stable and :fast) and explain the difference in the docs. QR and Cholesky are after all not the only methods for solving the least squares problem. :smile:
Shall we do a poll?
❤️ for :stable / :fast
👍 for :qr / :cholesky
We will add QR decomposition for glm after this, so it's best to get it right once and for all.
@palday Just a note that it might be useful to use LinearSolve.jl for the linear solver APIs, especially once we have optional dependencies support.
> Shall we do a poll?
> ❤️ for :stable / :fast
> 👍 for :qr / :cholesky
> We will add QR decomposition for glm after this, so it's best to get it right once and for all.
@mousum-github it looks like we should go ahead with :qr / :cholesky.
> @mousum-github it looks like we should go ahead with :qr / :cholesky.
okay - we will change accordingly
Hi @nalimilan, @bkamins and others - looking for your suggestions.
Can you please check why tests fail on nightly?
It is failing the following test in the Saturated linear model test case:
@test_broken glm(@formula(y ~ x1 + x2), df, Normal(), IdentityLink())
Initially, we planned to raise two PRs to incorporate the QR decomposition method in lm and glm. While preparing the QR method in glm after the one in lm, we found that both share almost 80% of the new code, hence a single PR.
@mousum-github Why are the tests failing on master?
> @mousum-github Why are the tests failing on master?

@ViralBShah do you mean why the tests are failing on the nightly builds?
> @mousum-github Why are the tests failing on master?
julia> using DataFrames, GLM, Test
julia> df = DataFrame(x1=["a", "b", "c"], x2=["a", "b", "c"], y=[1, 2, 3]);
julia> glm(@formula(y ~ x1 + x2), df, Normal(), IdentityLink())
GeneralizedLinearModel
y ~ 1 + x1 + x2
Coefficients:
──────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────
(Intercept) 1.0 Inf 0.00 1.0000 -Inf Inf
x1: b 1.0 Inf 0.00 1.0000 -Inf Inf
x1: c 2.0 Inf 0.00 1.0000 -Inf Inf
x2: b 0.0 NaN NaN NaN NaN NaN
x2: c 0.0 NaN NaN NaN NaN NaN
──────────────────────────────────────────────────────────────────────
julia> @test_broken glm(@formula(y ~ x1 + x2), df, Normal(), IdentityLink())
Error During Test at REPL[20]:1
Expression evaluated to non-Boolean
Expression: glm(#= REPL[20]:1 =# @formula(y ~ x1 + x2), df, Normal(), IdentityLink())
Value: GeneralizedLinearModel
y ~ 1 + x1 + x2
Coefficients:
──────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────
(Intercept) 1.0 Inf 0.00 1.0000 -Inf Inf
x1: b 1.0 Inf 0.00 1.0000 -Inf Inf
x1: c 2.0 Inf 0.00 1.0000 -Inf Inf
x2: b 0.0 NaN NaN NaN NaN NaN
x2: c 0.0 NaN NaN NaN NaN NaN
──────────────────────────────────────────────────────────────────────
julia> using DataFrames, GLM, Test
julia> df = DataFrame(x1=["a", "b", "c"], x2=["a", "b", "c"], y=[1, 2, 3]);
julia> glm(@formula(y ~ x1 + x2), df, Normal(), IdentityLink())
GeneralizedLinearModel
y ~ 1 + x1 + x2
Coefficients:
──────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────
(Intercept) 1.0 Inf 0.00 1.0000 -Inf Inf
x1: b 1.0 Inf 0.00 1.0000 -Inf Inf
x1: c 2.0 Inf 0.00 1.0000 -Inf Inf
x2: b 0.0 NaN NaN NaN NaN NaN
x2: c 0.0 NaN NaN NaN NaN NaN
──────────────────────────────────────────────────────────────────────
julia> @test_broken glm(@formula(y ~ x1 + x2), df, Normal(), IdentityLink())
Test Broken
Expression: glm(#= REPL[10]:1 =# @formula(y ~ x1 + x2), df, Normal(), IdentityLink())
The test is performed under the Saturated linear model with Cholesky test case in runtests.jl.
Section 6-3b from Wooldridge 7e is a good example of when standard estimation fails. Maybe we could include it in the tests to make sure it is resolved:
lm(@formula(rdintens ~ sales + sales^2), rdchem)
(the dataset is rdchem)
Thanks, @bkamins. The following are the outputs from R and Julia:
Output from R:
Call:
lm(formula = rdintens ~ sales + I(sales^2), data = rdchem)
Residuals:
Min 1Q Median 3Q Max
-2.141830034005 -1.362962168805 -0.225670654047 1.068750711730 5.580801329956
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.612512084995e+00 4.294417551587e-01 6.08351 1.267e-06 ***
sales 3.005713005396e-04 1.392952806164e-04 2.15780 0.039362 *
I(sales^2) -6.945939423917e-09 3.726138613681e-09 -1.86411 0.072458 .
Outputs from Julia:
julia> mdl = lm(@formula(rdintens ~ sales + sales^2), rdchem; method=:cholesky)
LinearModel
rdintens ~ 1 + sales + :(sales ^ 2)
Coefficients:
───────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────────────────────
(Intercept) 0.0 NaN NaN NaN NaN NaN
sales 0.000852509 0.000156784 5.44 <1e-05 0.000532313 0.00117271
sales ^ 2 -1.97385e-8 4.56287e-9 -4.33 0.0002 -2.90571e-8 -1.04199e-8
───────────────────────────────────────────────────────────────────────────────────────
and
julia> mdl = lm(@formula(rdintens ~ sales + sales^2), rdchem; method=:qr)
LinearModel
rdintens ~ 1 + sales + :(sales ^ 2)
Coefficients:
─────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────────────
(Intercept) 2.61251 0.429442 6.08 <1e-05 1.73421 3.49082
sales 0.000300571 0.000139295 2.16 0.0394 1.56805e-5 0.000585462
sales ^ 2 -6.94594e-9 3.72614e-9 -1.86 0.0725 -1.45667e-8 6.7487e-10
─────────────────────────────────────────────────────────────────────────────────
Moreover:
julia> coef(mdl) ≈ [2.612512084994711e+00, 3.005713005395818e-04, -6.945939423916543e-09]
true
julia> stderror(mdl) ≈ [4.294417551587120e-01, 1.392952806163750e-04, 3.726138613680845e-09]
true
I will add a test case.
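A sketch of what such a test could look like, reusing the reference values quoted above (how rdchem is loaded is assumed to follow the existing test suite's conventions):

using GLM, Test

@testset "Wooldridge 6-3b: rdintens ~ sales + sales^2" begin
    mdl = lm(@formula(rdintens ~ sales + sales^2), rdchem; method=:qr)
    @test coef(mdl) ≈ [2.612512084994711, 3.005713005395818e-4, -6.945939423916543e-9]
    @test stderror(mdl) ≈ [4.294417551587120e-1, 1.392952806163750e-4, 3.726138613680845e-9]
end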
Nice, thank you!
In fact, I was thinking the following: if we use the default decomposition and the output contains any value that is !isfinite while all values in the input data are isfinite, maybe we should automatically retry with :qr as a second step? The rationale is that the !isfinite check should be quite cheap (or maybe there is some other, better way to detect that the default decomposition was not accurate enough?).
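To make the idea concrete, a rough sketch of such a retry wrapper (purely illustrative, not proposed GLM internals; lm_with_fallback is a made-up helper name):

using GLM

function lm_with_fallback(f, data; kwargs...)
    m = lm(f, data; method=:cholesky, kwargs...)
    inputs_finite = all(isfinite, modelmatrix(m)) && all(isfinite, response(m))
    # Retry with QR only if the data are fine but the fit produced Inf/NaN statistics.
    if inputs_finite && (any(!isfinite, coef(m)) || any(!isfinite, stderror(m)))
        m = lm(f, data; method=:qr, kwargs...)
    end
    return m
end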
In that case, I would prefer to have something like the following in lm (and glm):
if method === :auto
    decomobj = cholpred(X, dropcollinear)
    if rank(decomobj.chol) == size(X, 2)
        fit!(LinearModel(LmResp(y, wts), decomobj, nothing))
    else
        fit!(LinearModel(LmResp(y, wts), qrpred(X, dropcollinear), nothing))
    end
elseif method === :cholesky
    fit!(LinearModel(LmResp(y, wts), cholpred(X, dropcollinear), nothing))
elseif method === :qr
    fit!(LinearModel(LmResp(y, wts), qrpred(X, dropcollinear), nothing))
else
    throw(ArgumentError("The only supported values for keyword argument `method` are `:auto`, `:cholesky` and `:qr`."))
end
Yes - makes sense.
Would be nice to see this one get merged.
Thanks, @bkamins - I will go through this today.
@andreasnoack - could you please summarize here the conclusions from your investigation in #518? What conclusions should we draw for the design (I understand we have a tension between two kinds of errors we can make)?
Do you think we should default to :cholesky and have :qr as opt-in (without :auto as the default), since changing the default would be breaking? Then maybe we should add a warning when :cholesky detects multicollinearity, so the user knows they might want to check :qr? I am not 100% sure what is best.
I think keeping cholesky as the default and warning when falling back to qr makes sense.
As for changing the default, we should announce it as upcoming in the README for a whole release cycle (and a few months) and change it in the next release.
> could you please summarize here the conclusions from your investigation in https://github.com/JuliaStats/GLM.jl/pull/518?
Let me try. The mechanism for excluding variables when there is numerical collinearity, introduced in https://github.com/JuliaStats/GLM.jl/pull/207, depends on a numerical rank determination (currently) based on the pivoted Cholesky decomposition. However, because of the famous squaring-of-the-condition-number effect of X'X, there isn't much space between reduced rank because of bad scaling and reduced rank because of perfect multicollinearity. That is why the Wooldridge example can't currently be reproduced with GLM, as discussed in https://github.com/JuliaStats/GLM.jl/pull/507#discussion_r1145045405 and https://github.com/JuliaStats/GLM.jl/pull/507#discussion_r1147246862. So there are the following options:

1. Keep the current pivoted Cholesky with tol=-1. The pros are that the current multicollinearity detection examples will continue to work, and it will be fast and memory efficient. The contra point is that many examples that work in other software will drop variables in GLM.
2. Pivoted Cholesky with tol=0. This will be fast and memory efficient and only drop variables when strictly required, i.e. when X'X goes singular or slightly indefinite. Problems with multicollinearity will show up as very large confidence bounds and therefore require manual removal of explanatory variables. I like this approach but others might not.
3. Use QR for the rank determination when dropcollinear=true but otherwise use Cholesky or PivotedCholesky with tol=0. That would ensure that the Wooldridge-like examples work while also detecting perfect multicollinearity. The contra points here would be the extra cost of computing the QR both in flops and memory, and the added complexity in switching algorithms based on the dropcollinear argument.
4. Use QR throughout when dropcollinear=true. This will have the same benefit as 3 regarding the detection of multicollinear variables, but you'd pay the price of slower computations.

As I argued in the thread above, I think the numerical error is often dominated by the statistical error and therefore tend to prefer Cholesky because it is so much faster. I hadn't realized the effect of the numerical rank determination that @bkamins mentioned, though.
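The squaring effect is easy to see numerically (a tiny illustration, not code from the PR):

using LinearAlgebra

x = collect(1.0:10.0)
X = [ones(10) 1e-6 .* x]   # second column badly scaled relative to the intercept

cond(X)        # what QR effectively works with
cond(X' * X)   # ≈ cond(X)^2: forming the normal equations squares the condition number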
Thank you for a fantastic summary. I have now checked the docstring of dropcollinear:
dropcollinear controls whether or not lm accepts a model matrix which is less-than-full rank. If true (the default),
only the first of each set of linearly-dependent columns is used. The coefficient for redundant linearly dependent
columns is 0.0 and all associated statistics are set to NaN.
If I read it correctly, we should drop columns when rank(X) < size(X, 2). The problem is that with tol=-1 in Cholesky this is not ensured. Do you know whether setting tol=0 gives us that equivalence?
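One way to check that empirically (a throwaway sketch; cholesky_ranks is not GLM code, and tol = -1 mirrors what GLM currently passes, if I read cholpred correctly):

using LinearAlgebra

function cholesky_ranks(X)
    XtX = Symmetric(X' * X)
    c_default = cholesky(XtX, RowMaximum(); tol = -1.0, check = false)  # LAPACK's default tolerance
    c_strict  = cholesky(XtX, RowMaximum(); tol = 0.0,  check = false)  # drop only at numerical singularity
    return (svd_rank = rank(X), rank_tol_minus1 = c_default.rank, rank_tol_zero = c_strict.rank)
end

Running it on the Wooldridge design above would show whether the tol = 0 rank agrees with rank(X) in that case.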
How about merging the PR without changing the default or adding :auto, and discussing that issue later? We don't need to make a decision about the default to add support for QR as an option.
> We don't need to make a decision about the default to add support for QR as an option.
I think we can do this, but we would need to document how QR differs from Cholesky. The point is that in the examples we have it does differ, because a different rule for detecting multicollinearity is used (i.e. Cholesky works without a problem, but it has tol=-1 set, which makes it decide that near-multicollinearity means dropping columns).
What I would want to avoid is leaving the impression that the difference is inherently between QR and Cholesky: if we added a tol=0 option to Cholesky, we would also get the same results.
For these reasons I asked @mousum-github whether we have an example where QR indeed produces a qualitatively different result than Cholesky at the numerical level (not at the rank-determination level). I think there will be such cases, but I just wanted to make sure we understand clearly what the options are, what they mean and how to interpret them.
For example in the case from the changed docs we get:
julia> using GLM
julia> y = [1, 2, 3, 4, 5];
julia> x = [1.0E-12, 2.0E-12, 3.0E-12, 4.0E-12, 5.0E-12];
julia> mdl1 = lm(@formula(y ~ x), (; x, y); dropcollinear=false)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}
y ~ 1 + x
Coefficients:
────────────────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────────────────────────
(Intercept) -7.94411e-16 4.30465e-16 -1.85 0.1622 -2.16434e-15 5.75521e-16
x 1.0e12 0.00012979 7704748539664896.00 <1e-99 1.0e12 1.0e12
────────────────────────────────────────────────────────────────────────────────────────────────
and the result is different, but I could not say it is worse than the one for QR reported in the docs.
I see. The most consistent approach would probably be to use QR by default (which implies tol=0 IIUC), use tol=0 by default for Cholesky when method=:cholesky is passed, and add an argument to set a different tolerance (or maybe that value can be passed to dropcollinear?). Or support two :cholesky variants, one which sets tol=0 and one which sets tol=-1. Not sure that's user-friendly though.
I was also thinking that dropcollinear could support some third value (like "aggressive drop collinear").
An example where the QR method performs better than Cholesky in estimation:
julia> using GLM, LinearAlgebra, Random
julia> Random.seed!(12345);
julia> X = rand(10, 2);
julia> Q,R = qr(X);
julia> R[1,1] = 1
1
julia> R[2,2] = 1.0E-7
1.0e-7
julia> X = Q*R;
julia> beta = [5, 10]
2-element Vector{Int64}:
5
10
julia> y = X*beta;
julia> md11 = lm(X, y; method=:qr, dropcollinear=false)
LinearModel
Coefficients:
───────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────
x1 5.0 2.53054e-8 197586428.62 <1e-63 5.0 5.0
x2 10.0 1.59395e-8 627370993.89 <1e-99 10.0 10.0
───────────────────────────────────────────────────────────────────
julia> md21 = lm(X, y; method=:cholesky, dropcollinear=false)
LinearModel
Coefficients:
────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────
x1 5.28865 0.103248 51.22 <1e-10 5.05056 5.52674
x2 10.1818 0.0650348 156.56 <1e-14 10.0318 10.3318
────────────────────────────────────────────────────────────────
Whenever we set R[2,2] = 1.0E-04, both methods generate the same estimates. Whenever we set R[2,2] = 1.0E-08, Cholesky throws a PosDefException. Whenever we set R[2,2] = 1.0E-16, multicollinearity is detected by QR as well.
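That pattern lines up with the squared-condition-number argument above: cond(X) here is roughly 1/R[2,2], while Cholesky effectively sees cond(X)^2. A quick back-of-the-envelope check (an illustration only, using a fixed off-diagonal instead of the random one above):

using LinearAlgebra

for r22 in (1e-4, 1e-7, 1e-8, 1e-16)
    X = Matrix(qr(rand(10, 2)).Q) * [1.0 0.5; 0.0 r22]
    # QR works with cond(X); the normal equations work with cond(X'X) ≈ cond(X)^2,
    # which crosses 1/eps(Float64) (≈ 4.5e15) much sooner.
    println(r22, ":  cond(X) ≈ ", cond(X), ",  cond(X'X) ≈ ", cond(X' * X))
end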
OK. Thank you! So - @mousum-github - what do you feel would be a good option given the discussion we had?
@bkamins I have sent you an invitation for write access to xKDR/GLM.jl in case you want to introduce changes to the PR directly. There was no other way, as the PR is made by an organisation.
Hi @bkamins,
In my opinion, let's have :cholesky as the default, since the Cholesky decomposition method is not as bad as we usually think. When I started with Julia GLM, I created hundreds of datasets from different distributions and link functions to test and become familiar with this module. I have also tested with very large datasets, especially for speed. Most of the time Cholesky and R's QR gave the same results; in the very few cases where the results differed, it was because of singularity detection. Most of the issues raised in GLM about the estimates are because of singularity detection.
Additionally, as we discussed, we can provide (now or later) the :auto option to switch from :cholesky to :qr automatically whenever a singularity is detected by the Cholesky decomposition.
I think it is OK. We just need to document that :cholesky eagerly drops near-collinear columns, so if a user has such a case and does not want them to be dropped, :qr should be used.
@mousum-github - I can review this PR, but please note that I am not a maintainer of GLM.jl, so someone from the team working with this package should approve and merge this PR.
As a user who is getting wrong results where R and Stata get it right, I think this is a fantastic addition.
One suggestion: rename the method option to something less generic like invmethod (as in inversion method). Consider that FixedEffectModels.jl has a method option, also maybe too generically named, to control whether the CPU or GPU is used. I think that FixedEffectModels.jl should also gain the ability to request QR; what should that option be called?
For whatever it's worth, I think using method to choose the solution method is more appropriate than using method to choose where a computation is performed. The method keyword here in GLM.jl is the same as in R's lm. I don't know whether that was the motivation behind the name, but it is pleasantly discoverable for folks coming from R.
It is well known that the Cholesky decomposition is fast but numerically unstable for an ill-conditioned design/model matrix. Also, some users prefer the QR decomposition method in the linear (or generalized linear) model.

In this PR, we have added a keyword argument method to lm, glm and related functions to control which decomposition method will be used. The method argument takes two values: :cholesky (the default value; it uses Cholesky) and :qr (it uses QR).

Using QR decomposition in lm:

Issue #426:
lm(model, data; method=:qr)
Coefficients:
Loglikelihood: -917.7449640609644, r2: 0.46431799718947153, adjr2: 0.454016420212346

Issue #375:
lm(@formula(y ~ x), df; method=:qr, dropcollinear=false)
Coefficients:

Summary of additions/changes:
- DensePredQR - now supports both pivoted and unpivoted QR decomposition
- invqr - computes the inverse of X'WX from the QR decomposition
- inverse - a wrapper of invqr and invchol, used inside the vcov function
- delbeta!(p::DensePredQR{T,<:QRCompactWY}, r::Vector{T}, wt::Vector{T})
- delbeta!(p::DensePredQR{T,<:QRPivoted}, r::Vector{T})
- delbeta!(p::DensePredQR{T,<:QRPivoted}, r::Vector{T}, wt::Vector{T})
- dropcollinear and method keyword arguments in the negbin function
- Cholesky and QR decompositions
- Cholesky and QR decompositions

To check and compare the performance of GLM with the QR method, we performed a logistic regression with 3,481,280 rows × 24 columns. Logistic regression took approximately (Intel i7 11th generation with 16 GB RAM, Windows 11 OS). A rough sketch for reproducing this kind of comparison is given after this description.

Finally, @harsharora21 is one of the main contributors to this PR.
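A rough way to reproduce that kind of timing comparison on synthetic data (a sketch only: the size is scaled down, the data are simulated, and timings will obviously depend on the machine):

using GLM, Random

Random.seed!(1)
n, p = 100_000, 24                                    # scaled down from the 3,481,280 × 24 run above
X = [ones(n) randn(n, p - 1)]
β = 0.1 .* randn(p)
y = Float64.(rand(n) .< 1 ./ (1 .+ exp.(-X * β)))     # simulated binary response

@time glm(X, y, Binomial(), LogitLink(); method = :cholesky)
@time glm(X, y, Binomial(), LogitLink(); method = :qr)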