JuliaStats / GLM.jl

Generalized linear models in Julia

Incorrect linear regression results #426

Open xianwenchen opened 3 years ago

xianwenchen commented 3 years ago

Car-Training.csv

This is a bit urgent.

In an ongoing home exam, students need to estimate models using the attached dataset.

Here is the Julia code:

using GLM
using DataFrames
using CSV

data = CSV.read( "Car-Training.csv", DataFrame )
model = @formula( Price ~ Year + Mileage )
results = lm( model, data )

The output is the following:

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.CholeskyPivoted{Float64,Array{Float64,2}}}},Array{Float64,2}}

Price ~ 1 + Year + Mileage

Coefficients:
─────────────────────────────────────────────────────────────────────────────────
                  Coef.    Std. Error       t  Pr(>|t|)    Lower 95%    Upper 95%
─────────────────────────────────────────────────────────────────────────────────
(Intercept)   0.0        NaN           NaN       NaN     NaN          NaN
Year          8.17971      0.167978     48.70    <1e-73    7.84664      8.51278
Mileage      -0.0580528    0.00949846   -6.11    <1e-7    -0.0768865   -0.0392191
─────────────────────────────────────────────────────────────────────────────────

The intercept is not estimated. The other two coefficients are not correct.

I tried R, SPSS, and Excel. All gave the same results, which differ from Julia's. I post the results from R below:

R> data <- read.csv("Car-Training.csv")
R> lm(Price ~ Year + Mileage, data = data )

Call:
lm(formula = Price ~ Year + Mileage, data = data)

R> summary( lm(Price ~ Year + Mileage, data = data ) )

Call:
lm(formula = Price ~ Year + Mileage, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
 -2168   -835    -49    567   5402 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.19e+06   3.52e+05   -6.21  1.1e-08
Year         1.10e+03   1.75e+02    6.25  9.0e-09
Mileage     -2.38e-02   9.84e-03   -2.42    0.017

Residual standard error: 1300 on 104 degrees of freedom
Multiple R-squared:  0.464,     Adjusted R-squared:  0.454 
F-statistic: 45.1 on 2 and 104 DF,  p-value: 8e-15

Did I do something wrong here? Or is there an issue with GLM? If there is an issue with GLM, can a new version be published as soon as possible, so that I can notify the students who are taking this exam right now?

xianwenchen commented 3 years ago

I forgot to say that I am using Julia 1.5.3 on Linux and 1.5.4 on Windows. Both produced the same results.

xianwenchen commented 3 years ago

@nalimilan, could you have a quick look at the issue and confirm or reject that it is a problem of GLM?

xianwenchen commented 3 years ago

I also started a discussion on Discourse: https://discourse.julialang.org/t/can-someone-replicate-this-glm-problem-with-linear-regression-on-your-computer/60098

rafael.guerra found out that by subtracting 2000 from Year, the regression was estimated correctly. He wonders if the problem was due to a normalization error.

I now think it is more likely that the problem lies within GLM.jl.

I’m teaching a course that uses basic regression and conducts simple forecasting. Subtracting 2000 will solve the problem. However, it also means that one needs to account for this when forecasting, which unnecessarily complicates the issue.

And this is part of a take-home exam, so students will not welcome the extra complexity.

I hope that GLM’s maintainer, or someone with better Julia-fu, can fix the issue and publish a newer version of GLM, so that I can simply ask students to update the GLM package. That would be a better and simpler solution, I think.

pdeffebach commented 3 years ago

As mentioned on Discourse, this stems from dropcollinear=true in lm. For some reason it decides that the intercept and the other variables are collinear, so it doesn't estimate the intercept term.

Oddly, it doesn't fail with dropcollinear=false, but rather gives the same (correct) results as in R.

julia> data = CSV.File("Car-Training.csv");

julia> model = @formula( Price ~ Year + Mileage);

julia> lm(model, data, dropcollinear=true)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

Price ~ 1 + Year + Mileage

Coefficients:
─────────────────────────────────────────────────────────────────────────────────
                  Coef.    Std. Error       t  Pr(>|t|)    Lower 95%    Upper 95%
─────────────────────────────────────────────────────────────────────────────────
(Intercept)   0.0        NaN           NaN       NaN     NaN          NaN
Year          8.17971      0.167978     48.70    <1e-73    7.84664      8.51278
Mileage      -0.0580528    0.00949846   -6.11    <1e-07   -0.0768865   -0.0392191
─────────────────────────────────────────────────────────────────────────────────

julia> lm(model, data, dropcollinear=false)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}

Price ~ 1 + Year + Mileage

Coefficients:
────────────────────────────────────────────────────────────────────────────────────
                    Coef.    Std. Error      t  Pr(>|t|)    Lower 95%      Upper 95%
────────────────────────────────────────────────────────────────────────────────────
(Intercept)    -2.18759e6    3.52442e5   -6.21    <1e-07   -2.8865e6     -1.48868e6
Year         1096.15       175.282        6.25    <1e-08  748.558      1443.74
Mileage        -0.0238368    0.00984145  -2.42    0.0172   -0.0433527    -0.00432081
────────────────────────────────────────────────────────────────────────────────────
xianwenchen commented 3 years ago

Thank you. This indeed solved the issue. I will prepare a note for students so that they are aware.

As I have commented on Discourse, I’m not sure it is sane to set dropcollinear=true as the default. I have used a number of statistical packages, and this is the first time I have seen such an option used as the default.

xianwenchen commented 3 years ago

As you have mentioned on Discourse, Stata implements dropcollinear=true. This means that my previous comment was invalid.

pdeffebach commented 3 years ago

The cause is this line: https://github.com/JuliaStats/GLM.jl/blob/16268be9eee54769b72ac36d17e5fd38337c0aae/src/linpred.jl#L107

I don't understand Cholesky factorization well enough. It's odd that cholesky! with Val(true) doesn't return roughly the same output as cholesky! when the input matrix is of full rank. Should we check the rank ourselves and fall back to cholesky!(F) even when dropcollinear is true?

cc @andreasnoack

xianwenchen commented 3 years ago

Just curious. How do you debug in Julia? Is there a track function? Or do you just print a lot of intermediate results to see where it went wrong?

pdeffebach commented 3 years ago

In this instance I had a feeling and knew the library well enough to know what functions were being called.

But for debugging I generally use Exfiltrator.jl and then lots of printing and "break points", i.e. an asdf dropped in the code somewhere to throw an error. I'm not a very sophisticated debugger.

xianwenchen commented 3 years ago

Thank you!

xianwenchen commented 3 years ago

I'd like to try to help. I don't fully understand the code. Could you rewrite F = pivot ? cholesky!(F, Val(true), tol = -one(T), check = false) : cholesky!(F) into several lines here?

cholesky!(F, Val(true), tol = -one(T), check = false) seems to be a function called cholesky() that takes four parameters.

A ! is added after cholesky. I don't remember what ! does in Julia. However, I think I can find an answer myself to this one.

What does the :cholesky!(F) do? Especially, what does : mean here?

And what does the pivot ? do, right after the = sign?

pdeffebach commented 3 years ago

There are two distinct misunderstandings here.

The first is Julia syntax. The syntax

x = a ? b : c

is the same as

x = begin 
    if a == true 
        b
    else
        c
    end
end

so in this case pivot is a boolean.
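
Applied to the line you asked about, the equivalent multi-line form is (a sketch with the same behavior; pivot, F, and T are as defined in linpred.jl):

F = if pivot
    # pivoted Cholesky, in place; check = false suppresses the RankDeficientException
    cholesky!(F, Val(true), tol = -one(T), check = false)
else
    # plain (unpivoted) Cholesky, also in place
    cholesky!(F)
end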

The ! in cholesky! is not syntax, but rather a naming convention. We name functions ending in ! if they modify their input. cholesky!(F, ...) means the function overwrites F.

The next misunderstanding is what's going on mathematically. To better understand why we call cholesky, check out this short explainer.

nalimilan commented 3 years ago

It seems that the default tolerance (tol=-1) is too large so the decomposition fails:

julia> X = results.mm.m;

julia> F = Hermitian(float(X'X));

julia> cholesky(F, Val(true), tol=1e-5, check=true)
CholeskyPivoted{Float64, Matrix{Float64}}
U factor with rank 3:
3×3 UpperTriangular{Float64, Matrix{Float64}}:
 3.6764e5  18721.7   9.31674
  ⋅         9036.63  4.49425
  ⋅             ⋅    0.00369674
permutation:
3-element Vector{Int64}:
 3
 2
 1

julia> cholesky(F, Val(true), tol=1e-4, check=true)
ERROR: RankDeficientException(1)
[...]

julia> cholesky(F, Val(true), tol=-1, check=true)
ERROR: RankDeficientException(1)
[...]

The default tolerance is defined like this:

          TOL is DOUBLE PRECISION
          User defined tolerance. If TOL < 0, then N*U*MAX( A(K,K) )
          will be used. The algorithm terminates at the (K-1)st step
          if the pivot <= TOL.
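
Translated to this example, that default can be computed directly (a rough sketch, reusing the F = Hermitian(float(X'X)) defined above; eps(Float64) stands in for U):

julia> using LinearAlgebra

julia> n = size(F, 1);    # N in the LAPACK formula

julia> n * eps(Float64) * maximum(diag(F));    # roughly N*U*MAX(A(K,K))
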
pdeffebach commented 3 years ago

R uses tol = 1e-7. See here.

But they aren't calling LAPACK. Rather, they have their own decomposition code defined in lm.c.

nalimilan commented 3 years ago

That code seems to use the QR decomposition, so do you think we can compare the tolerances?

pdeffebach commented 3 years ago

That's a good point. Let's try fixest, which uses Cholesky.

They define their own cholesky factorization here.

But their tol keyword argument works differently than Julia's.

EDIT: I did this wrong. 

There's some sort of scaling that goes on such that the keyword argument passed to feols isn't the same as the one passed to cpp_cholesky.

I wonder if it's worth pinging Laurent on this. He's been saying he's doing more things in Julia.

xianwenchen commented 3 years ago

Thank you.

While I am not able to fully follow the technical discussion, I am grateful to see the progress.

I think part of the issue stems from the fact that the Year variable in my exam dataset is quite invariant. The values are very close to the mean and do not vary by more than around 1% of the mean.

This raises the question of how large tol should be.

Below I create a new variable that follows a normal distribution, with a larger mean than Year and an even smaller standard deviation relative to the mean.

using GLM
using DataFrames
using Distributions
using CSV

data = CSV.read( "Car-Training.csv", DataFrame )

sample_size = length( data.Year )

data.NewVar = rand( Normal( 100000, 1 ), sample_size )

model = @formula( Price ~ NewVar + Mileage )
results = lm( model, data )
results = lm( model, data, dropcollinear = false )

Could you test whether an even smaller tol is required to correctly estimate the model?

Is tol a parameter to the lm() function? If specifying tol is required to make the algorithm work in special cases, it may be a good idea to let users specify tol in the lm() function.

What is the practical implication of temporarily changing the tol value from -1 to 1e-5 or 1e-7?

I am wondering if it is possible to release a new version of GLM.jl with tol set to, for example, 1e-7. If that's possible, I can ask students to simply update their GLM.jl for the take-home exam they are working on, which uses the dataset I posted here. The benefits would be that the disturbance to the exam is small, and that a more proper value of tol, or a better way of handling this issue, could be addressed in a future release of GLM.jl.

pdeffebach commented 3 years ago

While changing the tol option to a smaller default or allowing it as a keyword argument are likely solutions to this problem, it's not going to happen in a release in time for your students' exam.

I would suggest creating a variable of years since 2008, i.e. df.Year = df.Year .- minimum(df.Year).

This is a bummer of a bug, though. It's unfortunate that we are uncovering this behavior, but thank you for bringing it to our attention.

pdeffebach commented 3 years ago

Digging into the git blame on this, it looks like @dmbates originally chose the tolerance for cholesky! and set it equal to -1. So he is also a good person to ask about that number in particular.

palday commented 3 years ago

@pdeffebach That threshold stems from the days when that was the threshold in LinearAlgebra for the QR and Cholesky decompositions (cf this discussion).

Changing the tolerance won't always fix the problem here though because if you're that ill conditioned, then you may or may not hit the threshold on different architectures. (Floating point math isn't associative so e.g. using AVX and other SIMD instructions can change the result. This led to intermittent test failures in MixedModels depending on whether or not the test was running on a machine with AVX512 support, which is what motivated the discussion that I linked.) As mentioned in #375, the extra precision from the QR decomposition can sometimes help, albeit at a slight performance penalty.

But the best choice here is an affine transformation of Year so that all the variables are closer in scale and the coefficients are more interpretable (e.g. the intercept is not a measurement of "year 0", which is well before there were cars). Understanding that aspect -- interpretability of model coefficients as well as the existence of numeric issues in practice -- is something that I think is also very important for an applied statistics course.
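
For the dataset from the original post, a minimal sketch of such a transformation (using a hypothetical centered column YearC; data as loaded at the top of the thread) is:

using DataFrames, GLM, Statistics

data.YearC = data.Year .- mean(data.Year)     # center Year at its sample mean
lm(@formula(Price ~ YearC + Mileage), data)   # the intercept now refers to a mean-year car with zero mileage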

palday commented 3 years ago

All that said, I think we should probably

andreasnoack commented 3 years ago

I agree with @palday here. In particular, I believe that age is a more meaningful variable here. It doesn't change the general issue though. Numerical rank determination is not clear-cut.

I just want to mention that the QR wouldn't make a difference for this particular problem

julia> cond(cholesky(Symmetric(X'X), Val(true)).U)
9.957869521461289e7

julia> cond(qr(X, Val(true)).R)
9.957869522989982e7

Notice that both correspond to reduced rank with respect to the R tolerance 1e-7. The main difference is the pivoting strategy (and rank determination) used by the custom pivoted QR algorithm in R. I'm not aware of a numerical analysis of the algorithm used by R. It would be interesting to see such an analysis.

pdeffebach commented 3 years ago

The numerical concerns are important, but I'm not super happy with a situation where lm fails in Julia when it doesn't fail in R. I'm sure there are edge cases where it fails in R as well, and we should aim for our failures to be a subset of their failures.

Could we change the tolerance to 1e-5? There is an instability-robustness tradeoff, and I'm not sure at exactly which parameter value it really kicks in.

This problem was caused by me, making dropcollinear = true the default in #386. I think it's important to note that this problem does not occur when dropcollinear=false. This happens because cholesky (the method called when dropcollinear=false) calls cholesky!(H, Val(true), tol = 0.0, check=false). We were already enforcing a very strict tolerance before that change and things were working okay.

Is there some other heuristic we can use to determine rank deficiency outside of Cholesky?

Of course, we can always revert the dropcollinear=true default. The goal of that PR was to make GLM easier for newcomers, but explaining rank deficiency is easier than explaining numerical errors in matrix decompositions.

palday commented 3 years ago

@pdeffebach The numerical issues won't go away. Poorly conditioned models will often have misleading standard errors and other pathologies. I'm pretty strongly opposed to making things "too easy" because there will always be cases where these things come up to the surface. Model diagnostics (does my model seem to actually fit my data? did something go wrong in fitting the model?) and centering and scaling covariates should be applied stats 101. Users don't have to know how our internal numerics work, but they should understand that numerical issues can and do arise in real-world statistical problems and that there are ways to check for false convergence (diagnostics) and ways to help the computer (centering and scaling).

Note that in your historical counterexample there's a problem: check=false. That setting disables the error checking and just assumes that the decomposition worked at the given tolerance.

palday commented 3 years ago

Is there some other heuristic we can use to determine rank deficiency outside of Cholesky?

QR with Pivot, already discussed above.

Singular Value Decomposition, but this is expensive.

We could look at the condition number to guess whether or not a given calculation is likely to be ill-conditioned, but that's also an extra non-trivial operation, and it doesn't tell us what to do, just that whatever we do, there might be problems.

We have a whole page explaining how numerical rank deficiency crops up in mixed models and how rank deficiency is well defined in theory but really, really challenging in practice: https://juliastats.org/MixedModels.jl/stable/rankdeficiency/
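
For illustration, a rough sketch of what a condition-number check could look like (a hypothetical helper, not part of GLM.jl; the threshold is arbitrary):

using LinearAlgebra

function check_conditioning(X::AbstractMatrix; threshold = 1e8)
    κ = cond(X)        # non-trivial extra work, and only a warning sign, not a fix
    κ > threshold && @warn "model matrix appears ill-conditioned; consider centering/scaling predictors" κ
    return κ
end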

dmbates commented 3 years ago

It might be a good idea to benchmark the singular value decomposition. As an iterative method it is definitely more expensive than direct methods like Cholesky and QR but some of the recent SVD methods are surprisingly fast. The point is that the condition number is a property of the singular values. @andreasnoack may be able to offer more informed commentary on the SVD methods that are available. I haven't really kept up.
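
For reference, a minimal sketch of SVD-based rank determination (a hypothetical helper; choosing the relative tolerance is still the hard, problem-dependent part):

using LinearAlgebra

function svd_rank(X::AbstractMatrix; rtol = sqrt(eps(Float64)))
    s = svdvals(X)                    # singular values, in descending order
    count(>=(rtol * first(s)), s)     # rank = number of singular values above the relative threshold
end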

pdeffebach commented 3 years ago

@andreasnoack I'm not equipped to do a full numerical instability analysis with R's algorithm, but here is a full MWE for you.

R (from fixest) here

library(tidyverse)
library(fixest)
cpp_cholesky = utils::getFromNamespace("cpp_cholesky", "fixest"); 
df = read_csv("https://github.com/JuliaStats/GLM.jl/files/6384056/Car-Training.csv")
m = feols(Price ~ Year + Mileage, df)
X = model.matrix(m)
XtX = t(X) %*% X 
cpp_cholesky(XtX, tol=80.0) # full rank
cpp_cholesky(XtX, tol=81.0) # not full rank

and Julia

using CSV, HTTP, GLM, LinearAlgebra

url = "https://github.com/JuliaStats/GLM.jl/files/6384056/Car-Training.csv";
df = CSV.File(HTTP.get(url).body);
m = lm(@formula(Price ~ Year + Mileage), df);
X = m.mm.m;
XtX = Hermitian(X' * X);
cholesky(XtX, Val(true), tol = 1e-5, check = true) # full rank
cholesky(XtX, Val(true), tol = 1e-4, check = true) # not full rank
lapack_tol = size(XtX, 1) * eps(Float64) * maximum(diag(XtX))  # N * U * MAX(A(K,K)), per the LAPACK doc

I don't understand what fixest's tol is doing. It's odd that the default tolerance is 1e-10 but the invertibility change happens at a value near 80.

dmbates commented 3 years ago

@pdeffebach It appears that your R code is not using R's own Cholesky factorization but a specific method called cpp_cholesky from the fixest package, which, judging from the name, I expect is coded in C++.

pdeffebach commented 3 years ago

yeah, I know. It felt like a decent benchmark because fixest uses Cholesky by default, unlike base R.

dmbates commented 3 years ago

The code for that method starts around line 130 of https://github.com/lrberge/fixest/blob/master/src/lm_related.cpp

The linear algebra code we are using in Julia is LAPACK/OpenBLAS or MKL - code that has been developed by the foremost leaders in the field of numerical linear algebra and tested for over 40 years. I think when someone codes up their own C++ implementation it falls more on them to reconcile differences in results.

pdeffebach commented 3 years ago

True, but LAPACK is not designed specifically with the context of GLM in mind. Maybe their choices better reflect a trade-off between numeric stability and flexibility that is specific to linear regression.

nalimilan commented 3 years ago

I'm also a bit worried that we fail where R succeeds. As this report shows, people are going to blame us because of this. Do we have any idea whether this particular dataset is very special or whether we can expect frequent failures like this?

Maybe when dropcollinear=true fails we should automatically try with dropcollinear=false? Or by default try a non-pivoted decomposition first, and only use the pivoted one if the former fails? In this case this would succeed and everything would be right.

dmbates commented 3 years ago

I hadn't read the whole thread of this issue. I think fixest is a distraction and we should concentrate on the methods in base R if we are going to compare to the GLM results.

There is a difference between using Cholesky and QR to solve a least squares problem because the Cholesky involves creating X'X and QR works directly on X, as does SVD. The condition number of X'X is the square of the condition number of X so there is immediately a disadvantage in using Cholesky and it will affect the estimate of the rank. Having said that, I will point out that there is no reliable method of determining the rank of an arbitrary matrix numerically. In floating point arithmetic you simply can't do it because arbitrarily small perturbations can change the answer.

The best you can do is to come up with a scheme for handling numerical rank deficiencies and document it.

This is an ill-conditioned problem, which means that different systems will give different answers regarding the number of estimable coefficients. (Standard techniques such as mean-centering the regressor variables will help.) The condition number of the X matrix is about 10^8 and the condition number of X'X is about 10^16, which means it is computationally singular.

julia> using CSV, DataFrames, LinearAlgebra, Statistics

julia> data = CSV.read("/home/bates/Downloads/Car-Training.csv", DataFrame);

julia> X = hcat(ones(nrow(data)), data.Year, data.Mileage);

julia> cond(X)
9.957869522988558e7

julia> cond(Symmetric(X'X))
9.915916546575758e15

By comparison, if you mean-center the Year and Mileage columns you have much more reasonable conditioning:

julia> Xcen = hcat(ones(nrow(data)), data.Year .- mean(data.Year), data.Mileage .- mean(data.Mileage));

julia> cond(Xcen)
21489.22367695712

julia> cond(Symmetric(Xcen'Xcen))
4.617867342370313e8

I'm also a bit worried that we fail where R succeeds. As this report shows, people are going to blame us because of this. Do we have any idea whether this particular dataset is very special or whether we can expect frequent failures like this?

It is unhelpful to use expressions like "succeed" or "fail" in these circumstances. This is an ill-conditioned problem and there is no "correct" answer.

dmbates commented 3 years ago

If it is felt that we should emphasize numerical accuracy over speed then I would be willing to work on a rewrite of the package to use a QR factorization by default. We have various methods in MixedModels to try to detect and handle rank deficiency but none of them are foolproof.

nalimilan commented 3 years ago

It is unhelpful to use expressions like "succeed" or "fail" in these circumstances. This is an ill-conditioned problem and there is no "correct" answer.

Right. Let me phrase it differently: people are less likely to complain if estimation returns a finite coefficient for each variable when two variables are near-collinear (however meaningless coefficients may be). They are used to checking for collinearity issues using various indicators. And more importantly they can drop one of the problematic columns and estimate another model. But if we silently drop a variable like we do here, we are more likely to annoy them, and the solution is less trivial than dropping one of the predictors. So I'd rather err on the side of assuming the matrix is full rank.

If it is felt that we should emphasize numerical accuracy over speed then I would be willing to work on a rewrite of the package to use a QR factorization by default. We have various methods in MixedModels to try to detect and handle rank deficiency but none of them are foolproof.

Favouring accuracy by default indeed sounds like a reasonable approach to me, and it doesn't prevent making it easy to use faster methods. Though I can't really comment on the details TBH.

pdeffebach commented 3 years ago

It's worth noting that lm, feols and Stata (I just checked) all give the same results for this particular problem. So if there is a numerical instability in the decomposition of X'X, it's not manifesting itself across 3 different methods, indicating we may be being overly defensive against giving the user wrong results.

This may be a naive suggestion, but is there a reason we can't calculate what LAPACK's precision would be, and then just divide that by 10?

lrberge commented 3 years ago

For the record, in fixest I wanted the following features: i) the possibility of user interrupts (without having to quit the R session), ii) parallel processing, and iii) collinear variable removal. Since a low-level function with these features couldn't be accessed directly in R, I had to implement my own Cholesky.

That said, in case of an ill conditioned model, the variables removed do differ between feols and lm (although, in general, they tend to be similar).

Just my 2 cents: I stand with the general opinion that having the same variables removed to avoid collinearity is not possible across software. Different algorithms imply different solutions (QR vs Cholesky, plus different pivoting strategies and tolerance levels), so the only way to have the same results would be to apply the same algorithms across software, which is not desirable (since the freedom of implementation would be lost). And anyway, it's always the user's responsibility to provide a proper model.

ericqu commented 3 years ago

In SAS, you get a warning when using this type of data with proc reg:

proc reg data=work.cars ;
  model Price = Year Mileage;
run;

WARNING: The range of variable Year is so small relative to its mean that there may be loss of accuracy in the computations. You may need to rescale the variable to have a larger value of RANGE/abs(MEAN), for example, by using PROC STANDARD M=0;

I like that it is only a warning. The suggestion is not too prescriptive as other options could be good depending on the problem.

With proc glm, you get a warning in the results:

proc glm data=work.cars;
    class Year;
    model Price = Year Mileage / solution;
run;

[image: SAS PROC GLM output showing the collinearity warning]

My view is that we should transparently indicate that the problem is ill-conditioned when that is the case (as here) and provide some diagnostics if we have them.

xianwenchen commented 3 years ago

Thank you all for the discussion.

I have not been able to make time to look into the technical details.

I agree that offering diagnostics or hints will be very helpful, from the perspective of a user.

kescobo commented 2 years ago

I think I just got bitten by this issue as well. I'm not sure if this adds to the discussion at all (it's a bit above my head), but my issue seems to happen with moderately large numbers (on the order of 1e7). If I just scale the larger column by 10^6, the issue goes away, which I wouldn't think has anything to do with collinearity (though dropcollinear=false also solves it). Here's a MWE:

using DataFrames
using DataFrames.PrettyTables
using GLM
using RCall
using Distributions

df = DataFrame(y = rand(Normal(), 200), x = rand(Normal(), 200), c1 = rand(Normal(1e7, 1e6), 200))
df.c2 = df.c1 ./ 1e6 

modjl = DataFrame(coeftable(lm(@formula(y ~ x + c1), df)))
modjl2 = DataFrame(coeftable(lm(@formula(y ~ x + c1), df; dropcollinear=false)))
modjl3 = DataFrame(coeftable(lm(@formula(y ~ x + c2), df)))

@rput df
R"modR <- lm(y ~ x + c1, df)"
R"modR2 <- lm(y ~ x + c2, df)"

pretty_table(select(modjl, 1:5))
pretty_table(select(modjl2, 1:5))
pretty_table(select(modjl3, 1:5))
R"summary(modR)"
R"summary(modR2)"

modjl (the problem):

┌─────────────┬─────────────┬────────────┬────────────┬──────────┐
│        Name │       Coef. │ Std. Error │          t │ Pr(>|t|) │
│      String │     Float64 │    Float64 │    Float64 │  Float64 │
├─────────────┼─────────────┼────────────┼────────────┼──────────┤
│ (Intercept) │         0.0 │        NaN │        NaN │      NaN │
│           x │  -0.0242829 │  0.0739415 │  -0.328406 │ 0.742951 │
│          c1 │ 2.92387e-11 │ 7.65602e-9 │ 0.00381905 │ 0.996957 │
└─────────────┴─────────────┴────────────┴────────────┴──────────┘

modjl2 (with dropcollinear=false):

┌─────────────┬────────────┬────────────┬───────────┬──────────┐
│        Name │      Coef. │ Std. Error │         t │ Pr(>|t|) │
│      String │    Float64 │    Float64 │   Float64 │  Float64 │
├─────────────┼────────────┼────────────┼───────────┼──────────┤
│ (Intercept) │  -0.719064 │   0.740016 │ -0.971686 │ 0.332398 │
│           x │ -0.0309587 │  0.0742704 │ -0.416837 │ 0.677251 │
│          c1 │ 7.03214e-8 │ 7.27445e-8 │   0.96669 │ 0.334884 │
└─────────────┴────────────┴────────────┴───────────┴──────────┘

modjl3 scaled down 1/1e6, dropcollinear=true

┌─────────────┬────────────┬────────────┬───────────┬──────────┐
│        Name │      Coef. │ Std. Error │         t │ Pr(>|t|) │
│      String │    Float64 │    Float64 │   Float64 │  Float64 │
├─────────────┼────────────┼────────────┼───────────┼──────────┤
│ (Intercept) │  -0.719064 │   0.740016 │ -0.971686 │ 0.332398 │
│           x │ -0.0309587 │  0.0742704 │ -0.416837 │ 0.677251 │
│          c2 │  0.0703214 │  0.0727445 │   0.96669 │ 0.334884 │
└─────────────┴────────────┴────────────┴───────────┴──────────┘

The MWE includes code to run the same models in R (via RCall) for completeness, but the results are identical to modjl2 and modjl3.

droodman commented 9 months ago

I'm having the same problem, in a regression with ~100 regressors ranging between 0 and 1, and ~9 million observations. That is, I have a linear regression that is running fine in R and Stata, but not lm(). I understand that the data matrix may be nearly degenerate. But the regression is also meaningful, what is called an event study for a difference-in-differences design. So for me this problem is a deterrent to using GLM at all. I wonder if Cholesky vs QR decomposition is the key difference in my case?

andreasnoack commented 9 months ago

Would you be able to share the data or some artificial data that causes the same issue? Otherwise, it's hard to debug. Also, could you elaborate on what "incorrect" and "running fine" mean here? If the matrix is nearly singular then it's hard to conclude anything just from the coefficients. Are the predicted values off?

droodman commented 9 months ago

I'm sorry to have been vague. I shouldn't post the data publicly, and it's a big file, but I could email it to someone. I am reasonably confident that the issue is the imprecision of the Cholesky decomposition--the design matrix is extremely ill-conditioned--and the case that this is an issue seems to have already been made convincingly above.

Moreover, after I posted this, I learned that a commit has been made to add a method=:qr option. But it is not yet in a release?

droodman commented 9 months ago

Results from the same regression with lm() and with regress in Stata. The key point is the discontinuities in the coefficients on the birth dummies with lm(), which are not realistic. Note that the dummy for 1961 is missing, but there is data with birth year = 1961:

using CSV, DataFrames, GLM, Plots
df = CSV.read("c:/users/drood/Downloads/FEbug/FEbug.csv", DataFrame)
m = lm(@formula(part ~ _Ibirthyr_1907+_Ibirthyr_1908+_Ibirthyr_1909+_Ibirthyr_1910+_Ibirthyr_1911+_Ibirthyr_1912+_Ibirthyr_1913+_Ibirthyr_1914+
     _Ibirthyr_1915+_Ibirthyr_1916+_Ibirthyr_1917+_Ibirthyr_1918+_Ibirthyr_1919+_Ibirthyr_1920+_Ibirthyr_1921+_Ibirthyr_1922+_Ibirthyr_1923+_Ibirthyr_1924+
     _Ibirthyr_1925+_Ibirthyr_1926+_Ibirthyr_1927+_Ibirthyr_1928+_Ibirthyr_1929+_Ibirthyr_1930+_Ibirthyr_1931+_Ibirthyr_1932+_Ibirthyr_1933+_Ibirthyr_1934+
     _Ibirthyr_1935+_Ibirthyr_1936+_Ibirthyr_1937+_Ibirthyr_1938+_Ibirthyr_1939+_Ibirthyr_1940+_Ibirthyr_1941+_Ibirthyr_1942+_Ibirthyr_1943+_Ibirthyr_1944+
     _Ibirthyr_1945+_Ibirthyr_1946+_Ibirthyr_1947+_Ibirthyr_1948+_Ibirthyr_1949+_Ibirthyr_1950+_Ibirthyr_1951+_Ibirthyr_1952+_Ibirthyr_1953+_Ibirthyr_1954+
     _Ibirthyr_1955+_Ibirthyr_1956+_Ibirthyr_1957+_Ibirthyr_1958+_Ibirthyr_1959+_Ibirthyr_1960 +_Ibirthyr_1962+_Ibirthyr_1963+_Ibirthyr_1964+
     _Ibirthyr_1965+_Ibirthyr_1966+_Ibirthyr_1967+_Ibirthyr_1968+_Ibirthyr_1969+_Ibirthyr_1970+_Ibirthyr_1971+_Ibirthyr_1972+_Ibirthyr_1973+_Ibirthyr_1974+
     _Ibirthyr_1975+_Ibirthyr_1976+_Ibirthyr_1977+_Ibirthyr_1978+_Ibirthyr_1979+_Ibirthyr_1980+_Ibirthyr_1981+_Ibirthyr_1982+_Ibirthyr_1983+_Ibirthyr_1984+
     _Ibirthyr_1985+_Ibirthyr_1986+_Ibirthyr_1987+_Ibirthyr_1988+_Ibirthyr_1989+_Ibirthyr_1990+_Ibirthyr_1991+_Ibirthyr_1992+_Ibirthyr_1993+_Ibirthyr_1994+
     _Ibirthyr_1995+_Ibirthyr_1996+_Ibirthyr_1997+_Ibirthyr_1998+_Ibirthyr_1999+_Ibirthyr_2000 )    , df)
plot(coef(m))

Coefficients:
─────────────────────────────────────────────────────────────────────────────────
                     Coef.   Std. Error       t  Pr(>|t|)    Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────────────
_Ibirthyr_1907  0.0133333   0.0400579      0.33    0.7392  -0.0651786   0.0918453
_Ibirthyr_1908  0.0201342   0.020096       1.00    0.3164  -0.0192533   0.0595217
_Ibirthyr_1909  0.00325733  0.0197993      0.16    0.8693  -0.0355485   0.0420632
_Ibirthyr_1910  0.0125174   0.0129376      0.97    0.3333  -0.0128399   0.0378746
_Ibirthyr_1911  0.00793651  0.0138213      0.57    0.5658  -0.0191527   0.0350257
_Ibirthyr_1912  0.0101266   0.0123425      0.82    0.4120  -0.0140644   0.0343175
_Ibirthyr_1913  0.013834    0.0109051      1.27    0.2046  -0.00753954  0.0352075
_Ibirthyr_1914  0.0112      0.00981213     1.14    0.2537  -0.00803143  0.0304314
_Ibirthyr_1915  0.00783929  0.00767886     1.02    0.3073  -0.007211    0.0228896
_Ibirthyr_1916  0.013901    0.00723046     1.92    0.0545  -0.00027048  0.0280724
_Ibirthyr_1917  0.0166099   0.00715928     2.32    0.0203   0.00257795  0.0306418
_Ibirthyr_1918  0.013571    0.005659       2.40    0.0165   0.0024796   0.0246625
_Ibirthyr_1919  0.0164619   0.00543777     3.03    0.0025   0.00580407  0.0271198
_Ibirthyr_1920  0.0185091   0.00452062     4.09    <1e-04   0.00964884  0.0273693
_Ibirthyr_1921  0.019617    0.00432862     4.53    <1e-05   0.0111331   0.0281009
_Ibirthyr_1922  0.0222837   0.00442437     5.04    <1e-06   0.0136121   0.0309553
_Ibirthyr_1923  0.0212606   0.00387956     5.48    <1e-07   0.0136568   0.0288644
_Ibirthyr_1924  0.0251841   0.00378106     6.66    <1e-10   0.0177734   0.0325949
_Ibirthyr_1925  0.0228956   0.0032005      7.15    <1e-12   0.0166227   0.0291684
_Ibirthyr_1926  0.0256301   0.00319588     8.02    <1e-14   0.0193663   0.031894
_Ibirthyr_1927  0.0242324   0.00300946     8.05    <1e-15   0.018334    0.0301308
_Ibirthyr_1928  0.0266674   0.00263851    10.11    <1e-23   0.0214961   0.0318388
_Ibirthyr_1929  0.0271973   0.00257928    10.54    <1e-25   0.022142    0.0322527
_Ibirthyr_1930  0.028593    0.00218768    13.07    <1e-38   0.0243052   0.0328808
_Ibirthyr_1931  0.0292786   0.00219551    13.34    <1e-39   0.0249755   0.0335818
_Ibirthyr_1932  0.0289725   0.00212247    13.65    <1e-41   0.0248125   0.0331324
_Ibirthyr_1933  0.0296184   0.00192592    15.38    <1e-52   0.0258437   0.0333932
_Ibirthyr_1934  0.0313768   0.00188122    16.68    <1e-61   0.0276897   0.0350639
_Ibirthyr_1935  0.0326611   0.00169076    19.32    <1e-82   0.0293473   0.0359749
_Ibirthyr_1936  0.0362075   0.00167561    21.61    <1e-99   0.0329234   0.0394917
_Ibirthyr_1937  0.0356623   0.00166778    21.38    <1e-99   0.0323935   0.0389311
_Ibirthyr_1938  0.0379183   0.00154288    24.58    <1e-99   0.0348944   0.0409423
_Ibirthyr_1939  0.0400083   0.00150811    26.53    <1e-99   0.0370525   0.0429642
_Ibirthyr_1940  0.0397725   0.00134043    29.67    <1e-99   0.0371453   0.0423997
_Ibirthyr_1941  0.044141    0.001397      31.60    <1e-99   0.041403    0.0468791
_Ibirthyr_1942  0.0449556   0.00134247    33.49    <1e-99   0.0423244   0.0475868
_Ibirthyr_1943  0.0466286   0.00130049    35.85    <1e-99   0.0440797   0.0491775
_Ibirthyr_1944  0.0492167   0.0013432     36.64    <1e-99   0.0465841   0.0518493
_Ibirthyr_1945  0.0504391   0.00118552    42.55    <1e-99   0.0481155   0.0527627
_Ibirthyr_1946  0.124416    0.00121649   102.27    <1e-99   0.122032    0.1268
_Ibirthyr_1947  0.124994    0.00123887   100.89    <1e-99   0.122565    0.127422
_Ibirthyr_1948  0.126589    0.00116365   108.79    <1e-99   0.124309    0.12887
_Ibirthyr_1949  0.127993    0.00115564   110.76    <1e-99   0.125728    0.130258
_Ibirthyr_1950  0.132723    0.00103157   128.66    <1e-99   0.130701    0.134744
_Ibirthyr_1951  0.134958    0.00106753   126.42    <1e-99   0.132866    0.137051
_Ibirthyr_1952  0.134466    0.00105115   127.92    <1e-99   0.132405    0.136526
_Ibirthyr_1953  0.135625    0.000993632  136.49    <1e-99   0.133677    0.137572
_Ibirthyr_1954  0.137397    0.000973637  141.12    <1e-99   0.135489    0.139306
_Ibirthyr_1955  0.0866432   0.000915904   94.60    <1e-99   0.084848    0.0884383
_Ibirthyr_1956  0.0928981   0.000912533  101.80    <1e-99   0.0911095   0.0946866
_Ibirthyr_1957  0.0974438   0.000906342  107.51    <1e-99   0.0956674   0.0992202
_Ibirthyr_1958  0.102582    0.0008524    120.34    <1e-99   0.100911    0.104252
_Ibirthyr_1959  0.113558    0.000846193  134.20    <1e-99   0.111899    0.115216
_Ibirthyr_1960  0.115621    0.000770211  150.12    <1e-99   0.114111    0.11713
_Ibirthyr_1962  0.137555    0.00080227   171.46    <1e-99   0.135982    0.139127
_Ibirthyr_1963  0.141234    0.000766621  184.23    <1e-99   0.139731    0.142736
_Ibirthyr_1964  0.14889     0.000773525  192.48    <1e-99   0.147374    0.150406
_Ibirthyr_1965  0.149612    0.000727619  205.62    <1e-99   0.148186    0.151038
_Ibirthyr_1966  0.151283    0.000754572  200.49    <1e-99   0.149804    0.152762
_Ibirthyr_1967  0.150603    0.000776434  193.97    <1e-99   0.149081    0.152125
_Ibirthyr_1968  0.150456    0.000748029  201.14    <1e-99   0.14899     0.151922
_Ibirthyr_1969  0.151651    0.000731701  207.26    <1e-99   0.150217    0.153085
_Ibirthyr_1970  0.149829    0.0007037    212.92    <1e-99   0.148449    0.151208
_Ibirthyr_1971  0.150174    0.00073155   205.28    <1e-99   0.14874     0.151608
_Ibirthyr_1972  0.151469    0.000720704  210.17    <1e-99   0.150056    0.152881
_Ibirthyr_1973  0.148427    0.000722602  205.41    <1e-99   0.147011    0.149844
_Ibirthyr_1974  0.157152    0.000740608  212.19    <1e-99   0.1557      0.158603
_Ibirthyr_1975  0.15763     0.00072148   218.48    <1e-99   0.156216    0.159044
_Ibirthyr_1976  0.163673    0.000746944  219.12    <1e-99   0.162209    0.165137
_Ibirthyr_1977  0.169546    0.000769795  220.25    <1e-99   0.168038    0.171055
_Ibirthyr_1978  0.174651    0.000788054  221.62    <1e-99   0.173106    0.176195
_Ibirthyr_1979  0.178854    0.00080135   223.19    <1e-99   0.177283    0.180424
_Ibirthyr_1980  0.178202    0.000764457  233.11    <1e-99   0.176704    0.1797
_Ibirthyr_1981  0.190883    0.000807391  236.42    <1e-99   0.1893      0.192465
_Ibirthyr_1982  0.192958    0.000789277  244.47    <1e-99   0.191411    0.194505
_Ibirthyr_1983  0.153268    0.000808058  189.67    <1e-99   0.151684    0.154852
_Ibirthyr_1984  0.153286    0.000839944  182.50    <1e-99   0.15164     0.154932
_Ibirthyr_1985  0.157197    0.000868201  181.06    <1e-99   0.155495    0.158898
_Ibirthyr_1986  0.162169    0.000918852  176.49    <1e-99   0.160368    0.16397
_Ibirthyr_1987  0.166885    0.000982972  169.78    <1e-99   0.164958    0.168811
_Ibirthyr_1988  0.168767    0.00103866   162.49    <1e-99   0.166732    0.170803
_Ibirthyr_1989  0.169901    0.00112845   150.56    <1e-99   0.167689    0.172113
_Ibirthyr_1990  0.165285    0.00120854   136.76    <1e-99   0.162916    0.167654
_Ibirthyr_1991  0.167873    0.00136576   122.92    <1e-99   0.165196    0.17055
_Ibirthyr_1992  0.260221    0.0015163    171.62    <1e-99   0.257249    0.263193
_Ibirthyr_1993  0.263708    0.00162497   162.28    <1e-99   0.260523    0.266892
_Ibirthyr_1994  0.27971     0.00168289   166.21    <1e-99   0.276412    0.283008
_Ibirthyr_1995  0.287836    0.00175984   163.56    <1e-99   0.284386    0.291285
_Ibirthyr_1996  0.29295     0.00187822   155.97    <1e-99   0.289269    0.296632
_Ibirthyr_1997  0.290477    0.00212673   136.58    <1e-99   0.286308    0.294645
_Ibirthyr_1998  0.283585    0.00260688   108.78    <1e-99   0.278475    0.288694
_Ibirthyr_1999  0.271408    0.00355811    76.28    <1e-99   0.264434    0.278381
_Ibirthyr_2000  0.285458    0.00848653    33.64    <1e-99   0.268825    0.302091
─────────────────────────────────────────────────────────────────────────────────

[image: plot of the lm() coefficients on the birth-year dummies]

. #delimit ;
delimiter now ;
. reg part _Ibirthyr_1907 _Ibirthyr_1908 _Ibirthyr_1909 _Ibirthyr_1910 _Ibirthyr_1911 _Ibirthyr_1912 _Ibirthyr_1913 _Ibirthyr_1914 
>      _Ibirthyr_1915 _Ibirthyr_1916 _Ibirthyr_1917 _Ibirthyr_1918 _Ibirthyr_1919 _Ibirthyr_1920 _Ibirthyr_1921 _Ibirthyr_1922 _Ibirthyr_1923 _Ibirthyr_1924 
>      _Ibirthyr_1925 _Ibirthyr_1926 _Ibirthyr_1927 _Ibirthyr_1928 _Ibirthyr_1929 _Ibirthyr_1930 _Ibirthyr_1931 _Ibirthyr_1932 _Ibirthyr_1933 _Ibirthyr_1934 
>      _Ibirthyr_1935 _Ibirthyr_1936 _Ibirthyr_1937 _Ibirthyr_1938 _Ibirthyr_1939 _Ibirthyr_1940 _Ibirthyr_1941 _Ibirthyr_1942 _Ibirthyr_1943 _Ibirthyr_1944 
>      _Ibirthyr_1945 _Ibirthyr_1946 _Ibirthyr_1947 _Ibirthyr_1948 _Ibirthyr_1949 _Ibirthyr_1950 _Ibirthyr_1951 _Ibirthyr_1952 _Ibirthyr_1953 _Ibirthyr_1954 
>      _Ibirthyr_1955 _Ibirthyr_1956 _Ibirthyr_1957 _Ibirthyr_1958 _Ibirthyr_1959 _Ibirthyr_1960  _Ibirthyr_1962 _Ibirthyr_1963 _Ibirthyr_1964 
>      _Ibirthyr_1965 _Ibirthyr_1966 _Ibirthyr_1967 _Ibirthyr_1968 _Ibirthyr_1969 _Ibirthyr_1970 _Ibirthyr_1971 _Ibirthyr_1972 _Ibirthyr_1973 _Ibirthyr_1974 
>      _Ibirthyr_1975 _Ibirthyr_1976 _Ibirthyr_1977 _Ibirthyr_1978 _Ibirthyr_1979 _Ibirthyr_1980 _Ibirthyr_1981 _Ibirthyr_1982 _Ibirthyr_1983 _Ibirthyr_1984 
>      _Ibirthyr_1985 _Ibirthyr_1986 _Ibirthyr_1987 _Ibirthyr_1988 _Ibirthyr_1989 _Ibirthyr_1990 _Ibirthyr_1991 _Ibirthyr_1992 _Ibirthyr_1993 _Ibirthyr_1994 
>      _Ibirthyr_1995 _Ibirthyr_1996 _Ibirthyr_1997 _Ibirthyr_1998 _Ibirthyr_1999 _Ibirthyr_2000;

coefplot, vertical;

      Source |       SS           df       MS      Number of obs   = 8,830,997
-------------+----------------------------------   F(93, 8830903)  =   2891.59
       Model |  32017.6199        93  344.275483   Prob > F        =    0.0000
    Residual |  1051414.41 8,830,903  .119060804   R-squared       =    0.0296
-------------+----------------------------------   Adj R-squared   =    0.0295
       Total |  1083432.03 8,830,996  .122685146   Root MSE        =    .34505

--------------------------------------------------------------------------------
          part | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
---------------+----------------------------------------------------------------
_Ibirthyr_1907 |  -.1142327   .0398512    -2.87   0.004    -.1923397   -.0361257
_Ibirthyr_1908 |  -.1074318   .0200044    -5.37   0.000    -.1466397   -.0682239
_Ibirthyr_1909 |  -.1243087   .0197095    -6.31   0.000    -.1629386   -.0856789
_Ibirthyr_1910 |  -.1150487   .0128932    -8.92   0.000    -.1403189   -.0897784
_Ibirthyr_1911 |  -.1196295   .0137706    -8.69   0.000    -.1466193   -.0926397
_Ibirthyr_1912 |  -.1174395   .0123026    -9.55   0.000     -.141552   -.0933269
_Ibirthyr_1913 |   -.113732   .0108762   -10.46   0.000     -.135049   -.0924151
_Ibirthyr_1914 |   -.116366   .0097924   -11.88   0.000    -.1355588   -.0971732
_Ibirthyr_1915 |  -.1197267   .0076797   -15.59   0.000    -.1347787   -.1046748
_Ibirthyr_1916 |  -.1136651   .0072363   -15.71   0.000    -.1278479   -.0994823
_Ibirthyr_1917 |  -.1109562   .0071659   -15.48   0.000    -.1250011   -.0969112
_Ibirthyr_1918 |   -.113995   .0056855   -20.05   0.000    -.1251384   -.1028516
_Ibirthyr_1919 |  -.1111041   .0054677   -20.32   0.000    -.1218207   -.1003875
_Ibirthyr_1920 |   -.109057   .0045673   -23.88   0.000    -.1180088   -.1001051
_Ibirthyr_1921 |   -.107949   .0043795   -24.65   0.000    -.1165326   -.0993655
_Ibirthyr_1922 |  -.1052824   .0044731   -23.54   0.000    -.1140495   -.0965152
_Ibirthyr_1923 |  -.1063054   .0039412   -26.97   0.000      -.11403   -.0985808
_Ibirthyr_1924 |  -.1023819   .0038453   -26.62   0.000    -.1099186   -.0948452
_Ibirthyr_1925 |  -.1046705   .0032828   -31.88   0.000    -.1111046   -.0982363
_Ibirthyr_1926 |  -.1019359   .0032783   -31.09   0.000    -.1083613   -.0955105
_Ibirthyr_1927 |  -.1033336   .0030989   -33.35   0.000    -.1094073     -.09726
_Ibirthyr_1928 |  -.1008986   .0027441   -36.77   0.000     -.106277   -.0955202
_Ibirthyr_1929 |  -.1003687   .0026879   -37.34   0.000    -.1056368   -.0951006
_Ibirthyr_1930 |   -.098973    .002319   -42.68   0.000    -.1035182   -.0944278
_Ibirthyr_1931 |  -.0982874   .0023263   -42.25   0.000    -.1028469   -.0937279
_Ibirthyr_1932 |  -.0985936   .0022583   -43.66   0.000    -.1030197   -.0941674
_Ibirthyr_1933 |  -.0979476   .0020767   -47.17   0.000    -.1020178   -.0938774
_Ibirthyr_1934 |  -.0961892   .0020357   -47.25   0.000    -.1001792   -.0921993
_Ibirthyr_1935 |  -.0949049   .0018631   -50.94   0.000    -.0985565   -.0912533
_Ibirthyr_1936 |  -.0913585   .0018495   -49.40   0.000    -.0949835   -.0877335
_Ibirthyr_1937 |  -.0919038   .0018425   -49.88   0.000     -.095515   -.0882925
_Ibirthyr_1938 |  -.0896477   .0017315   -51.77   0.000    -.0930413    -.086254
_Ibirthyr_1939 |  -.0875577   .0017009   -51.48   0.000    -.0908914    -.084224
_Ibirthyr_1940 |  -.0877936   .0015558   -56.43   0.000    -.0908429   -.0847442
_Ibirthyr_1941 |   -.083425   .0016043   -52.00   0.000    -.0865694   -.0802807
_Ibirthyr_1942 |  -.0826104   .0015576   -53.04   0.000    -.0856632   -.0795577
_Ibirthyr_1943 |  -.0809374   .0015219   -53.18   0.000    -.0839203   -.0779545
_Ibirthyr_1944 |  -.0783493   .0015582   -50.28   0.000    -.0814033   -.0752954
_Ibirthyr_1945 |  -.0771269    .001426   -54.09   0.000    -.0799218    -.074332
_Ibirthyr_1946 |  -.0727114   .0014516   -50.09   0.000    -.0755564   -.0698664
_Ibirthyr_1947 |  -.0709548   .0014702   -48.26   0.000    -.0738363   -.0680733
_Ibirthyr_1948 |   -.065841   .0014081   -46.76   0.000    -.0686007   -.0630812
_Ibirthyr_1949 |  -.0628813   .0014015   -44.87   0.000    -.0656282   -.0601344
_Ibirthyr_1950 |   -.063654   .0013022   -48.88   0.000    -.0662063   -.0611017
_Ibirthyr_1951 |   -.054803   .0013306   -41.19   0.000    -.0574109   -.0521951
_Ibirthyr_1952 |  -.0521713   .0013176   -39.60   0.000    -.0547537   -.0495888
_Ibirthyr_1953 |  -.0479237   .0012727   -37.66   0.000    -.0504182   -.0454293
_Ibirthyr_1954 |   -.042377   .0012573   -33.70   0.000    -.0448413   -.0399127
_Ibirthyr_1955 |  -.0409229   .0012136   -33.72   0.000    -.0433016   -.0385442
_Ibirthyr_1956 |   -.034668   .0012111   -28.62   0.000    -.0370417   -.0322942
_Ibirthyr_1957 |  -.0301223   .0012065   -24.97   0.000     -.032487   -.0277575
_Ibirthyr_1958 |  -.0249844    .001167   -21.41   0.000    -.0272717   -.0226972
_Ibirthyr_1959 |  -.0140082   .0011625   -12.05   0.000    -.0162867   -.0117298
_Ibirthyr_1960 |  -.0119452    .001109   -10.77   0.000    -.0141188   -.0097716
_Ibirthyr_1962 |   .0099888   .0011313     8.83   0.000     .0077715     .012206
_Ibirthyr_1963 |   .0136677   .0011065    12.35   0.000     .0114989    .0158365
_Ibirthyr_1964 |    .021324   .0011113    19.19   0.000     .0191459    .0235021
_Ibirthyr_1965 |   .0182448   .0010802    16.89   0.000     .0161277     .020362
_Ibirthyr_1966 |   .0262808   .0010983    23.93   0.000     .0241282    .0284335
_Ibirthyr_1967 |   .0273999   .0011133    24.61   0.000     .0252179    .0295819
_Ibirthyr_1968 |   .0292593   .0010939    26.75   0.000     .0271154    .0314033
_Ibirthyr_1969 |   .0313718   .0010829    28.97   0.000     .0292493    .0334943
_Ibirthyr_1970 |   .0274142   .0010644    25.76   0.000      .025328    .0295004
_Ibirthyr_1971 |   .0323644   .0010828    29.89   0.000     .0302422    .0344867
_Ibirthyr_1972 |   .0297249   .0010756    27.64   0.000     .0276168     .031833
_Ibirthyr_1973 |   .0254691   .0010768    23.65   0.000     .0233585    .0275797
_Ibirthyr_1974 |   .0295858   .0010889    27.17   0.000     .0274516      .03172
_Ibirthyr_1975 |   .0300637   .0010761    27.94   0.000     .0279546    .0321728
_Ibirthyr_1976 |   .0361066   .0010932    33.03   0.000      .033964    .0382491
_Ibirthyr_1977 |   .0419803   .0011087    37.86   0.000     .0398073    .0441534
_Ibirthyr_1978 |   .0470845   .0011213    41.99   0.000     .0448867    .0492823
_Ibirthyr_1979 |   .0512878   .0011306    45.36   0.000     .0490718    .0535038
_Ibirthyr_1980 |   .0506358   .0011051    45.82   0.000     .0484699    .0528017
_Ibirthyr_1981 |   .0633167   .0011349    55.79   0.000     .0610924     .065541
_Ibirthyr_1982 |   .0653921   .0011222    58.27   0.000     .0631926    .0675915
_Ibirthyr_1983 |   .0718306   .0011353    63.27   0.000     .0696054    .0740559
_Ibirthyr_1984 |    .078832    .001158    68.08   0.000     .0765624    .0811017
_Ibirthyr_1985 |    .086207   .0011784    73.15   0.000     .0838973    .0885167
_Ibirthyr_1986 |   .0991375   .0012158    81.54   0.000     .0967545    .1015205
_Ibirthyr_1987 |   .1079642   .0012645    85.38   0.000     .1054859    .1104426
_Ibirthyr_1988 |   .1152806   .0013078    88.15   0.000     .1127174    .1178438
_Ibirthyr_1989 |    .116699   .0013794    84.60   0.000     .1139954    .1194026
_Ibirthyr_1990 |   .1190085    .001445    82.36   0.000     .1161764    .1218406
_Ibirthyr_1991 |   .1272271   .0015775    80.65   0.000     .1241354    .1303189
_Ibirthyr_1992 |   .1326548   .0017081    77.66   0.000      .129307    .1360026
_Ibirthyr_1993 |   .1361415   .0018043    75.46   0.000     .1326053    .1396778
_Ibirthyr_1994 |    .152144    .001856    81.97   0.000     .1485063    .1557818
_Ibirthyr_1995 |   .1602695   .0019253    83.24   0.000     .1564959    .1640431
_Ibirthyr_1996 |   .1653843    .002033    81.35   0.000     .1613997    .1693688
_Ibirthyr_1997 |   .1629105   .0022622    72.01   0.000     .1584766    .1673444
_Ibirthyr_1998 |   .1560186   .0027141    57.49   0.000     .1506991    .1613381
_Ibirthyr_1999 |   .1438415   .0036287    39.64   0.000     .1367293    .1509537
_Ibirthyr_2000 |   .1578918    .008479    18.62   0.000     .1412731    .1745104
         _cons |    .127566   .0008019   159.08   0.000     .1259944    .1291377
--------------------------------------------------------------------------------

[image: Stata coefplot of the coefficients on the birth-year dummies]

JamesMCannon commented 1 month ago

Failed_Fit.csv

I have now also encountered this issue and put together a minimal working example, as it seems the conversation has died down without the underlying issue being fixed. I've attached a CSV with data; in my case, I'm working with chunks of 300 data points at a time.

    using CSV
    using GLM
    using StatsModels
    using DataFrames
    using Plots

    data = CSV.read("Failed_Fit.csv",DataFrame)
    ols_fail = lm(@formula(Y ~ X), data)

    println(ols_fail)
    coefs_fail = coef(ols_fail)

    failed_fit=[]

    for t in data.X 
        sample = coefs_fail[1] + t*coefs_fail[2]
        append!(failed_fit,sample)
    end

    ols_success = lm(@formula(Y ~ X), data,dropcollinear=false)

    println(ols_success)
    coefs_success = coef(ols_success)

    successful_fit=[]

    for t in data.X 
        sample = coefs_success[1] + t*coefs_success[2]
        append!(successful_fit,sample)
    end

    Plots.plot(data.X,data.Y,seriestype=:scatter,markerstrokewidth=0,ms=5,label="Data")
    Plots.plot!(data.X,successful_fit,linewidth=2,label="Successful Fit")
    Plots.plot!(data.X,failed_fit,linewidth=2,label="Failed Fit")

    Plots.savefig("Fixed_vs_failed.png")

This produced the following graph.

Fixed_vs_failed

While the work-around of setting dropcollinear=false works, I'm a little frustrated to find that my issue is something that has been known about for over three years. Why hasn't this been resolved yet?

dmbates commented 1 month ago

The model matrix is poorly conditioned, as your X values are all between 76383 and 76682. You are trying to estimate an intercept that is very far away from the observed range of the X values, which leads to collinearity - just as it did three years ago.

julia> df = CSV.read("/Users/dmbates/Downloads/Failed_Fit.csv", DataFrame);

julia> X = hcat(ones(nrow(df)), df.X);

julia> cond(qr(X).R)
6.763385429589838e7

What do you want to do with the model that you would fit and can it be accomplished by centering the X values?

JamesMCannon commented 1 month ago

I primarily want to extract the slope from the fit, as the rate of change is the relevant parameter for other analysis I'm doing (I'm intentionally being a little vague here), so I'd expect centering the X values to work. Let me give it a shot and I'll report back.

JamesMCannon commented 1 month ago

Indeed, centering the data also provides the proper slope coefficient. With two possible workarounds (dropcollinear=false or centering the data at/near 0), is one approach better than the other?

palday commented 1 month ago

I would tend to prefer centering as long as it still gives you coefficients with the desired interpretation.
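
For the slope-extraction use case above, a minimal sketch (assuming the Failed_Fit.csv columns from the earlier MWE) would be:

using CSV, DataFrames, GLM, Statistics

data = CSV.read("Failed_Fit.csv", DataFrame)
data.Xc = data.X .- mean(data.X)     # shift X so the intercept falls inside the data range
m = lm(@formula(Y ~ Xc), data)
slope = coef(m)[2]                   # the slope is unchanged by centering; only the intercept shifts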

We've been a bit slow wrapping up a few loose ends, but I'll try to allocate some time in the next few weeks to get a release with the new QR-based solver out -- it's a bit slower but more numerically stable.
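
Once that release is out, usage should look something like the following (the method keyword name is taken from the commit mentioned earlier in this thread; check the release notes for the final API):

lm(@formula(Y ~ X), data; method = :qr)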