xianwenchen opened this issue 3 years ago
I forgot to say that I am using Julia 1.5.3 on Linux and 1.5.4 on Windows. Both produced the same results.
@nalimilan, could you have a quick look at the issue and confirm or reject that it is a problem of GLM?
I also started a discussion on Discourse: https://discourse.julialang.org/t/can-someone-replicate-this-glm-problem-with-linear-regression-on-your-computer/60098
rafael.guerra found out that by subtracting 2000 from Year, the regression was estimated correctly. He wonders if the problem was due to a normalization error.
I think now it becomes more likely that the problem was within GLM.jl.
I’m teaching a course that uses basic regression for simple forecasting. Subtracting 2000 will solve the problem. However, it also means that when forecasting one needs to pay attention to this as well, which unnecessarily complicates the issue.
And this is part of a take-home exam, so this extra complexity will not be very welcome to students.
I hope that GLM’s maintainer, or someone with better Julia-fu, can fix the issue and publish a newer version of GLM, so that I can simply ask students to update the GLM package. That would be a better and simpler solution, I think.
As mentioned on Discourse, this stems from `dropcollinear=true` in `lm`. For some reason, it's treating the intercept and other variables as collinear, so it doesn't estimate the intercept term. Oddly, it doesn't fail with `dropcollinear=false`, but rather gives the same (correct) results as in R.
julia> data = CSV.File("Car-Training.csv");
julia> model = @formula( Price ~ Year + Mileage);
julia> lm(model, data, dropcollinear=true)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}
Price ~ 1 + Year + Mileage
Coefficients:
─────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────────────
(Intercept) 0.0 NaN NaN NaN NaN NaN
Year 8.17971 0.167978 48.70 <1e-73 7.84664 8.51278
Mileage -0.0580528 0.00949846 -6.11 <1e-07 -0.0768865 -0.0392191
─────────────────────────────────────────────────────────────────────────────────
julia> lm(model, data, dropcollinear=false)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}
Price ~ 1 + Year + Mileage
Coefficients:
────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────────────
(Intercept) -2.18759e6 3.52442e5 -6.21 <1e-07 -2.8865e6 -1.48868e6
Year 1096.15 175.282 6.25 <1e-08 748.558 1443.74
Mileage -0.0238368 0.00984145 -2.42 0.0172 -0.0433527 -0.00432081
────────────────────────────────────────────────────────────────────────────────────
Thank you. This indeed solved the issue. I will prepare a note for students so that they are aware.
As I have commented on Discourse, I'm not sure it is a sane choice to set `dropcollinear=true` as the default. I have used a number of statistical software packages, and this is the first time I have seen such an option as the default.
As you have mentioned on Discourse, Stata implements `dropcollinear=true`. This means that my previous comment was invalid.
The cause is this line: https://github.com/JuliaStats/GLM.jl/blob/16268be9eee54769b72ac36d17e5fd38337c0aae/src/linpred.jl#L107
I don't understand Cholesky factorization well enough. It's odd that `cholesky!` with `Val(true)` doesn't return roughly the same output as plain `cholesky!` when the input matrix is of full rank. Should we check the rank ourselves and choose `cholesky!(F)` even when `dropcollinear` is `true`?
cc @andreasnoack
Just curious: how do you debug in Julia? Is there a trace function? Or do you just print a lot of intermediate results to see where things went wrong?
In this instance I had a feeling and knew the library well enough to know what functions were being called.
But for debugging I generally use Exfiltrator.jl and then lots of printing and "break points", i.e. an `asdf` in the code somewhere. I'm not a very sophisticated debugger.
Thank you!
I'd like to try to help, but I don't fully understand the code. Could you rewrite `F = pivot ? cholesky!(F, Val(true), tol = -one(T), check = false) : cholesky!(F)` into several lines here?
`cholesky!(F, Val(true), tol = -one(T), check = false)` seems to be a call to a function called `cholesky` that takes four parameters. A `!` is added after `cholesky`. I don't remember what `!` does in Julia, but I think I can find the answer to that one myself.
What does the `: cholesky!(F)` do? Especially, what does `:` mean here? And what does the `pivot ?` do, right after the `=` sign?
There are two distinct misunderstandings here.
The first is Julia syntax. The syntax
x = a ? b : c
is the same as
x = begin
if a == true
b
else
c
end
end
so in this case `pivot` is a boolean.
The `!` in `cholesky!` is not syntax, but rather a naming convention: we name a function ending in `!` if it modifies its input. `cholesky!(F, ...)` means the function overwrites `F`.
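For instance, Base follows the same convention with `sort` and `sort!` (a small standard-library illustration, not GLM.jl code):

```julia
v = [3, 1, 2]

s = sort(v)    # returns a new sorted vector; v is left unchanged
sort!(v)       # sorts v in place, overwriting its contents

# After this, both s and v are [1, 2, 3].
```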
The next misunderstanding is what's going on mathematically. To better understand why we call `cholesky`, check out this short explainer.
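In short, and as a minimal standard-library sketch of my own (not GLM.jl's code): least squares solves the normal equations `X'X β = X'y`, and Cholesky is the standard factorization for the symmetric positive definite matrix `X'X`. QR instead factors `X` directly:

```julia
using LinearAlgebra

# Toy design matrix: an intercept column plus a trend, and a response.
X = [ones(5) collect(1.0:5.0)]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

# Cholesky route: factor X'X = U'U, then solve the normal equations.
F = cholesky(Symmetric(X' * X))
β_chol = F \ (X' * y)

# QR route: factor X itself, avoiding the squared condition number of X'X.
β_qr = qr(X) \ y
```

For this well-conditioned toy problem both routes agree; the thread below is about what happens when `X'X` is nearly singular.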
It seems that the default tolerance (`tol = -1`) is too large, so the decomposition fails:
julia> X = results.mm.m;
julia> F = Hermitian(float(X'X));
julia> cholesky(F, Val(true), tol=1e-5, check=true)
CholeskyPivoted{Float64, Matrix{Float64}}
U factor with rank 3:
3×3 UpperTriangular{Float64, Matrix{Float64}}:
3.6764e5 18721.7 9.31674
⋅ 9036.63 4.49425
⋅ ⋅ 0.00369674
permutation:
3-element Vector{Int64}:
3
2
1
julia> cholesky(F, Val(true), tol=1e-4, check=true)
ERROR: RankDeficientException(1)
[...]
julia> cholesky(F, Val(true), tol=-1, check=true)
ERROR: RankDeficientException(1)
[...]
The default tolerance is defined like this:
TOL is DOUBLE PRECISION
User defined tolerance. If TOL < 0, then N*U*MAX( A(K,K) )
will be used. The algorithm terminates at the (K-1)st step
if the pivot <= TOL.
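For reference, that default can be computed by hand for an n×n matrix. This is a sketch based on the LAPACK documentation quoted above; `default_pstrf_tol` is my own helper name:

```julia
using LinearAlgebra

# LAPACK's fallback tolerance when TOL < 0 is passed to the pivoted
# Cholesky (dpstrf): n * eps * max(A[k,k]).
default_pstrf_tol(A) = size(A, 1) * eps(eltype(A)) * maximum(diag(A))

A = [4.0 2.0; 2.0 3.0]
tol = default_pstrf_tol(A)   # 2 * eps(Float64) * 4.0 ≈ 1.8e-15
```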
That code seems to use the QR decomposition, so do you think we can compare the tolerance?
That's a good point. Let's try `fixest`, which uses Cholesky. They define their own Cholesky factorization here. But their `tol` keyword argument works differently than Julia's.
EDIT: I did this wrong. There's some sort of scaling going on such that the keyword argument passed to `feols` isn't the same as the one passed to `cpp_cholesky`.
I wonder if it's worth pinging Laurent on this. He's been saying he's doing more things in Julia.
Thank you.
While I am not able to fully follow the technical discussion, I am grateful to see the progress.
I think part of the issue is that the Year variable in my exam dataset is nearly invariant: the values are very close to the mean and do not vary by more than around 1% of it.
This raises the question of how small `tol` should be.
Below I create a new variable that follows a normal distribution, with a larger mean than Year and a smaller deviation relative to that mean.
using GLM
using DataFrames
using Distributions
using CSV
data = CSV.read( "Car-Training.csv", DataFrame )
sample_size = length( data.Year )
data.NewVar = rand( Normal( 100000, 1 ), sample_size )
model = @formula( Price ~ NewVar + Mileage )
results = lm( model, data )
results = lm( model, data, dropcollinear = false )
Could you test whether an even smaller `tol` is required to correctly estimate this model?
Is `tol` a parameter of the `lm()` function? If specifying `tol` turns out to be required to make the algorithm work in special cases, it may be a good idea to let users specify `tol` in the `lm()` function.
What is the practical implication of temporarily changing the `tol` value from `-1` to `1e-5` or `1e-7`?
I am wondering if it is possible to release a new version of GLM.jl with `tol` set to, for example, `1e-7`. If that's possible, I can ask students to simply update their GLM.jl for the take-home exam that they are working on, which uses the dataset I posted here. The benefits would be that the disturbance to the exam would be small, and that a more proper value of `tol`, or a more proper way of handling this issue, could be addressed in a future release of GLM.jl.
While changing the `tol` option to a smaller default, or exposing it as a keyword argument, are likely solutions to this problem, neither is going to happen in a release in time for your students' exam.
I would suggest creating a variable of years since 2008, i.e. `df.Year = df.Year .- minimum(df.Year)`.
This is a bummer of a bug, though. It's unfortunate that we are uncovering this behavior, but thank you for bringing it to our attention.
Digging into the `git blame` on this, it looks like @dmbates originally decided the tolerance for `cholesky!` and set it equal to `-1`. So he is also a good person to ask about that number in particular.
@pdeffebach That threshold stems from the days when it was the threshold in `LinearAlgebra` for the QR and Cholesky decompositions (cf. this discussion).
Changing the tolerance won't always fix the problem here though because if you're that ill conditioned, then you may or may not hit the threshold on different architectures. (Floating point math isn't associative so e.g. using AVX and other SIMD instructions can change the result. This led to intermittent test failures in MixedModels depending on whether or not the test was running on a machine with AVX512 support, which is what motivated the discussion that I linked.) As mentioned in #375, the extra precision from the QR decomposition can sometimes help, albeit at a slight performance penalty.
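The non-associativity is easy to demonstrate (a generic illustration of my own, unrelated to this dataset):

```julia
a, b, c = 1e16, -1e16, 1.0

# Summing left-to-right cancels a and b first, so c survives...
left = (a + b) + c    # 1.0

# ...but reordering absorbs c into b first (1.0 is below the spacing
# between adjacent Float64 values near 1e16), so the result differs.
right = a + (b + c)   # 0.0
```

SIMD reductions effectively reorder such sums, which is why the same ill-conditioned factorization can land on different sides of a rank threshold on different hardware.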
But the best choice here is an affine transformation of `Year` so that all the variables are closer in scale and the coefficients are more interpretable (e.g. the intercept is not a measurement of "year 0", which is well before there were cars). Understanding that aspect -- the interpretability of model coefficients as well as the existence of numeric issues in practice -- is something that I think is also very important for an applied statistics course.
All that said, I think we should probably
I agree with @palday here. In particular, I believe that age is a more meaningful variable here. It doesn't change the general issue though. Numerical rank determination is not clear-cut.
I just want to mention that the QR wouldn't make a difference for this particular problem
julia> cond(cholesky(Symmetric(X'X), Val(true)).U)
9.957869521461289e7
julia> cond(qr(X, Val(true)).R)
9.957869522989982e7
Notice that both correspond to reduced rank with respect to the R tolerance `1e-7`. The main difference is the pivoting strategy (and rank determination) used by the custom pivoted QR algorithm in R. I'm not aware of a numerical analysis of the algorithm used by R; it would be interesting to see such an analysis.
The numerical concerns are important, but I'm not super happy with a situation where `lm` fails in Julia when it doesn't fail in R. I'm sure there are edge cases where it fails in R as well, and we should aim for our failures to be a subset of their failures.
Could we change the tolerance to `1e-5`? There is an instability-robustness tradeoff, and I'm not sure at exactly which parameter value it really kicks in.
This problem was caused by me, making `dropcollinear = true` the default in #386. I think it's important that this problem does not occur when `dropcollinear=false`. This happens because `cholesky` (the method called when `dropcollinear=false`) calls `cholesky!(H, Val(true), tol = 0.0, check = false)`. We were already enforcing a very strict tolerance before that change and things were working okay.
Is there some other heuristic we can use to determine rank deficiency outside of Cholesky?
Of course, we can always revert `dropcollinear=true`. The goal of that PR was to make `GLM` easier for newcomers, but explaining rank deficiency is easier than explaining numerical errors in matrix decompositions.
@pdeffebach The numerical issues won't go away. Poorly conditioned models will often have misleading standard errors and other pathologies. I'm pretty strongly opposed to making things "too easy" because there will always be cases where these things come up the surface. Model diagnostics (does my model seem to actually fit my data? did something go wrong in fitting the model?) and centering and scaling covariates should be applied stats 101. Users don't have to know how our internal numerics work, but they should understand that numerical issues can and do arise in real-world statistical problems and that there are ways to check for false convergence (diagnostics) and ways to help the computer (centering and scaling).
Note that in your historical counterexample there's a problem: `check=false`. That keyword disables the error checking and just assumes that the decomposition worked at a certain tolerance.
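A small illustration of what `check = false` does (standard-library behavior, not GLM.jl code): the factorization object is returned even on failure, and `issuccess` must be consulted explicitly:

```julia
using LinearAlgebra

A = Symmetric([1.0 2.0; 2.0 1.0])   # indefinite, so Cholesky must fail

F = cholesky(A; check = false)      # no error is thrown...
issuccess(F)                        # ...but this reports false

# cholesky(A) with the default check = true would throw a PosDefException.
```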
> Is there some other heuristic we can use to determine rank deficiency outside of Cholesky?
- Pivoted QR, already discussed above.
- Singular value decomposition, but this is expensive.
- We could look at the condition number to guess whether or not a given calculation is likely to be ill-conditioned, but that's also an extra non-trivial operation, and it doesn't tell us what to do, just that whatever we do, there might be problems.
We have a whole page explaining how numerical rank deficiency crops up in mixed models and how rank deficiency is well defined in theory but really, really challenging in practice: https://juliastats.org/MixedModels.jl/stable/rankdeficiency/
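The condition-number heuristic mentioned above could look something like this (a sketch with a rule-of-thumb threshold of my own choosing, not GLM.jl's actual check):

```julia
using LinearAlgebra

# Flag a design matrix whose condition number suggests that forming X'X
# will be numerically singular in Float64. Since cond(X'X) ≈ cond(X)^2,
# cond(X) near 1/sqrt(eps) ≈ 6.7e7 is already dangerous.
is_ill_conditioned(X; threshold = 1 / sqrt(eps(Float64))) = cond(X) > threshold

X_ok  = [ones(4) [1.0, 2.0, 3.0, 4.0]]
X_bad = [ones(4) (1.0e8 .+ [1.0, 2.0, 3.0, 4.0])]   # huge mean, tiny spread
```

`X_bad` mimics the Year column here: a large mean with variation that is tiny relative to it.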
It might be a good idea to benchmark the singular value decomposition. As an iterative method it is definitely more expensive than direct methods like Cholesky and QR but some of the recent SVD methods are surprisingly fast. The point is that the condition number is a property of the singular values. @andreasnoack may be able to offer more informed commentary on the SVD methods that are available. I haven't really kept up.
@andreasnoack I'm not equipped to do a full numerical instability analysis of R's algorithm, but here is a full MWE for you.
R (from `fixest`) here:
library(tidyverse)
library(fixest)
cpp_cholesky = utils::getFromNamespace("cpp_cholesky", "fixest");
df = read_csv("https://github.com/JuliaStats/GLM.jl/files/6384056/Car-Training.csv")
m = feols(Price ~ Year + Mileage, df)
X = model.matrix(m)
XtX = t(X) %*% X
cpp_cholesky(XtX, tol=80.0) # full rank
cpp_cholesky(XtX, tol=81.0) # not full rank
and Julia
using CSV, HTTP, GLM, LinearAlgebra
url = "https://github.com/JuliaStats/GLM.jl/files/6384056/Car-Training.csv";
df = CSV.File(HTTP.get(url).body);
m = lm(@formula(Price ~ Year + Mileage), df);
X = m.mm.m;
XtX = Hermitian(X' * X);
cholesky(XtX, Val(true), tol = 1e-5, check = true) # full rank
cholesky(XtX, Val(true), tol = 1e-4, check = true) # not full rank
lapack_tol = size(XtX, 1) * eps(Float64) * maximum(diag(XtX))
I don't understand what `fixest`'s `tol` is doing. It's odd that the default tolerance is `1e-10` but the invertibility change happens at a value near `80`.
@pdeffebach It appears that your R code is not using the Cholesky factorization in R but a specific method called `cpp_cholesky` in the `fixest` package, which from the name I expect is coded in C++.
Yeah, I know. It felt like a decent benchmark because `fixest` uses Cholesky by default, unlike base R.
The code for that method starts around line 130 of https://github.com/lrberge/fixest/blob/master/src/lm_related.cpp
The linear algebra code we are using in Julia is LAPACK/OpenBLAS or MKL - code that has been developed by the foremost leaders in the field of numerical linear algebra and tested for over 40 years. I think when someone codes up their own C++ implementation it falls more on them to reconcile differences in results.
True, but LAPACK is not designed specifically with the context of GLM in mind. Maybe their choices better reflect a trade-off between numeric stability and flexibility that is specific to linear regression.
I'm also a bit worried that we fail where R succeeds. As this report shows, people are going to blame us because of this. Do we have any idea whether this particular dataset is very special or whether we can expect frequent failures like this?
Maybe when `dropcollinear=true` fails we should automatically try with `dropcollinear=false`? Or by default try a non-pivoted decomposition first, and only use the pivoted one if the former fails? In this case this would succeed and everything would be right.
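That fallback idea could be sketched like this (my own illustration of the suggestion, not GLM.jl's implementation; `RowMaximum()` is the Julia ≥ 1.8 spelling of the `Val(true)` pivot used elsewhere in this thread):

```julia
using LinearAlgebra

# Try the plain Cholesky first; only fall back to the pivoted,
# rank-revealing factorization when the unpivoted one fails.
function cholesky_with_fallback(A::Symmetric)
    F = cholesky(A; check = false)
    issuccess(F) && return F
    return cholesky(A, RowMaximum(); tol = -1.0, check = false)
end

wellposed   = cholesky_with_fallback(Symmetric([4.0 2.0; 2.0 3.0]))  # Cholesky
singularish = cholesky_with_fallback(Symmetric([1.0 1.0; 1.0 1.0]))  # CholeskyPivoted, rank 1
```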
I hadn't read the whole thread of this issue. I think fixest is a distraction and we should concentrate on the methods in base R if we are going to compare to the GLM results.
There is a difference between using Cholesky and QR to solve a least squares problem because the Cholesky involves creating X'X and QR works directly on X, as does SVD. The condition number of X'X is the square of the condition number of X so there is immediately a disadvantage in using Cholesky and it will affect the estimate of the rank. Having said that, I will point out that there is no reliable method of determining the rank of an arbitrary matrix numerically. In floating point arithmetic you simply can't do it because arbitrarily small perturbations can change the answer.
The best you can do is to come up with a scheme for handling numerical rank deficiencies and document it.
This is an ill-conditioned problem, which means that different systems will give different answers regarding the number of estimable coefficients. (Standard techniques such as mean-centering the regressor variables will help.) The condition number of the X matrix is about 10^8 and the condition number of X'X is about 10^16, which means it is computationally singular.
julia> using CSV, DataFrames, LinearAlgebra, Statistics
julia> data = CSV.read("/home/bates/Downloads/Car-Training.csv", DataFrame);
julia> X = hcat(ones(nrow(data)), data.Year, data.Mileage);
julia> cond(X)
9.957869522988558e7
julia> cond(Symmetric(X'X))
9.915916546575758e15
By comparison, if you mean-center the `Year` and `Mileage` columns you have much more reasonable conditioning:
julia> Xcen = hcat(ones(nrow(data)), data.Year .- mean(data.Year), data.Mileage .- mean(data.Mileage));
julia> cond(Xcen)
21489.22367695712
julia> cond(Symmetric(Xcen'Xcen))
4.617867342370313e8
> I'm also a bit worried that we fail where R succeeds. As this report shows, people are going to blame us because of this. Do we have any idea whether this particular dataset is very special or whether we can expect frequent failures like this?
It is unhelpful to use expressions like "succeed" or "fail" in these circumstances. This is an ill-conditioned problem and there is no "correct" answer.
If it is felt that we should emphasize numerical accuracy over speed then I would be willing to work on a rewrite of the package to use a QR factorization by default. We have various methods in MixedModels to try to detect and handle rank deficiency but none of them are foolproof.
> It is unhelpful to use expressions like "succeed" or "fail" in these circumstances. This is an ill-conditioned problem and there is no "correct" answer.
Right. Let me phrase it differently: people are less likely to complain if estimation returns a finite coefficient for each variable when two variables are near-collinear (however meaningless coefficients may be). They are used to checking for collinearity issues using various indicators. And more importantly they can drop one of the problematic columns and estimate another model. But if we silently drop a variable like we do here, we are more likely to annoy them, and the solution is less trivial than dropping one of the predictors. So I'd rather err on the side of assuming the matrix is full rank.
> If it is felt that we should emphasize numerical accuracy over speed then I would be willing to work on a rewrite of the package to use a QR factorization by default. We have various methods in MixedModels to try to detect and handle rank deficiency but none of them are foolproof.
Favouring accuracy by default indeed sounds like a reasonable approach to me, and it doesn't prevent making it easy to use faster methods. Though I can't really comment on the details TBH.
It's worth noting that `lm`, `feols`, and Stata (I just checked) all give the same results for this particular problem. So if there is a numerical instability in the decomposition of `X'X`, it's not manifesting itself across three different methods, indicating we may be being overly defensive against giving the user wrong results.
This may be a naive suggestion, but is there a reason we can't calculate what LAPACK's precision would be, and then just divide that by 10?
For the record, in `fixest` I wanted the following features: i) the possibility of user interrupts (without having to quit the R session), ii) parallel processing, and iii) collinear-variable removal. Since a low-level function with these features couldn't be directly accessed in R, I had to implement my own Cholesky.
That said, in case of an ill-conditioned model, the variables removed do differ between `feols` and `lm` (although, in general, they tend to be similar).
Just my 2 cents: I stand by the general opinion that having the same variables removed to avoid collinearity is not possible across software. Different algorithms imply different solutions (QR vs. Cholesky, plus different pivoting schemes and tolerance levels), so the only way to have the same results would be to apply the same algorithm across software, which is not desirable (since the freedom of implementation would be lost). And anyway, it's always the user's responsibility to provide a proper model.
In SAS, you get a warning when using this type of data with `proc reg`:
proc reg data=work.cars ;
model Price = Year Mileage;
run;
WARNING: The range of variable Year is so small relative to its mean that there may be loss of accuracy in the computations. You may need to rescale the variable to have a larger value of RANGE/abs(MEAN), for example, by using PROC STANDARD M=0;
I like that it is only a warning. The suggestion is not too prescriptive as other options could be good depending on the problem.
With `proc glm`, you get a warning in the results:
proc glm data=work.cars;
class Year;
model Price = Year Mileage / solution;
run;
My view is that we should transparently indicate that the problem is ill-conditioned when that is the case (like here) and provide some diagnostics if we have any.
Thank you all for the discussion.
I have not been able to make time to look into the technical details.
I agree that offering diagnostics or hints will be very helpful, from the perspective of a user.
I think I just got bitten by this issue as well. Not sure if this adds to the discussion at all, it's a bit above my head, but my issue seems to happen with moderately large numbers (on the order of `1e7`). If I just scale the larger column by 10^6, the issue goes away, which I wouldn't think has anything to do with collinearity (though `dropcollinear=false` also solves it). Here's a MWE:
using DataFrames
using DataFrames.PrettyTables
using GLM
using RCall
using Distributions
df = DataFrame(y = rand(Normal(), 200), x = rand(Normal(), 200), c1 = rand(Normal(1e7, 1e6), 200))
df.c2 = df.c1 ./ 1e6
modjl = DataFrame(coeftable(lm(@formula(y ~ x + c1), df)))
modjl2 = DataFrame(coeftable(lm(@formula(y ~ x + c1), df; dropcollinear=false)))
modjl3 = DataFrame(coeftable(lm(@formula(y ~ x + c2), df)))
@rput df
R"modR <- lm(y ~ x + c1, df)"
R"modR2 <- lm(y ~ x + c2, df)"
pretty_table(select(modjl, 1:5))
pretty_table(select(modjl2, 1:5))
pretty_table(select(modjl3, 1:5))
R"summary(modR)"
R"summary(modR2)"
`modjl` (the problem):
┌─────────────┬─────────────┬────────────┬────────────┬──────────┐
│ Name │ Coef. │ Std. Error │ t │ Pr(>|t|) │
│ String │ Float64 │ Float64 │ Float64 │ Float64 │
├─────────────┼─────────────┼────────────┼────────────┼──────────┤
│ (Intercept) │ 0.0 │ NaN │ NaN │ NaN │
│ x │ -0.0242829 │ 0.0739415 │ -0.328406 │ 0.742951 │
│ c1 │ 2.92387e-11 │ 7.65602e-9 │ 0.00381905 │ 0.996957 │
└─────────────┴─────────────┴────────────┴────────────┴──────────┘
`modjl2` (with `dropcollinear=false`):
┌─────────────┬────────────┬────────────┬───────────┬──────────┐
│ Name │ Coef. │ Std. Error │ t │ Pr(>|t|) │
│ String │ Float64 │ Float64 │ Float64 │ Float64 │
├─────────────┼────────────┼────────────┼───────────┼──────────┤
│ (Intercept) │ -0.719064 │ 0.740016 │ -0.971686 │ 0.332398 │
│ x │ -0.0309587 │ 0.0742704 │ -0.416837 │ 0.677251 │
│ c1 │ 7.03214e-8 │ 7.27445e-8 │ 0.96669 │ 0.334884 │
└─────────────┴────────────┴────────────┴───────────┴──────────┘
`modjl3` (scaled down by 1/1e6, with `dropcollinear=true`):
┌─────────────┬────────────┬────────────┬───────────┬──────────┐
│ Name │ Coef. │ Std. Error │ t │ Pr(>|t|) │
│ String │ Float64 │ Float64 │ Float64 │ Float64 │
├─────────────┼────────────┼────────────┼───────────┼──────────┤
│ (Intercept) │ -0.719064 │ 0.740016 │ -0.971686 │ 0.332398 │
│ x │ -0.0309587 │ 0.0742704 │ -0.416837 │ 0.677251 │
│ c2 │ 0.0703214 │ 0.0727445 │ 0.96669 │ 0.334884 │
└─────────────┴────────────┴────────────┴───────────┴──────────┘
The MWE has code to run in `lme4` for completeness, but the results are identical to `modjl2` and `modjl3`.
I'm having the same problem, in a regression with ~100 regressors ranging between 0 and 1, and ~9 million observations. That is, I have a linear regression that runs fine in R and Stata, but not in `lm()`. I understand that the data matrix may be nearly degenerate. But the regression is also meaningful, what is called an event study for a difference-in-differences design. So for me this problem is a deterrent to using GLM at all. I wonder if Cholesky vs. QR decomposition is the key difference in my case?
Would you be able to share the data or some artificial data that causes the same issue? Otherwise, it's hard to debug. Also, could you elaborate on what "incorrect" and "running fine" mean here? If the matrix is nearly singular then it's hard to conclude anything just from the coefficients. Are the predicted values off?
I'm sorry to have been vague. I shouldn't post the data publicly, and it's a big file, but I could email it to someone. I am reasonably confident that the issue is the imprecision of the Cholesky decomposition--the design matrix is extremely ill-conditioned--and the case that this is an issue seems to have already been made convincingly above.
Moreover, after I posted this, I learned that a commit has been made to add a `method=:qr` option. But it is not yet in a release?
Results from the same regression with `lm()` and with `regress` in Stata. The key point is the discontinuities in the coefficients on the birth dummies with `lm()`, which are not realistic. Note that the dummy for 1961 is missing, even though there is data with birth year = 1961:
using CSV, DataFrames, GLM, Plots
df = CSV.read("c:/users/drood/Downloads/FEbug/FEbug.csv", DataFrame)
m = lm(@formula(part ~ _Ibirthyr_1907+_Ibirthyr_1908+_Ibirthyr_1909+_Ibirthyr_1910+_Ibirthyr_1911+_Ibirthyr_1912+_Ibirthyr_1913+_Ibirthyr_1914+
_Ibirthyr_1915+_Ibirthyr_1916+_Ibirthyr_1917+_Ibirthyr_1918+_Ibirthyr_1919+_Ibirthyr_1920+_Ibirthyr_1921+_Ibirthyr_1922+_Ibirthyr_1923+_Ibirthyr_1924+
_Ibirthyr_1925+_Ibirthyr_1926+_Ibirthyr_1927+_Ibirthyr_1928+_Ibirthyr_1929+_Ibirthyr_1930+_Ibirthyr_1931+_Ibirthyr_1932+_Ibirthyr_1933+_Ibirthyr_1934+
_Ibirthyr_1935+_Ibirthyr_1936+_Ibirthyr_1937+_Ibirthyr_1938+_Ibirthyr_1939+_Ibirthyr_1940+_Ibirthyr_1941+_Ibirthyr_1942+_Ibirthyr_1943+_Ibirthyr_1944+
_Ibirthyr_1945+_Ibirthyr_1946+_Ibirthyr_1947+_Ibirthyr_1948+_Ibirthyr_1949+_Ibirthyr_1950+_Ibirthyr_1951+_Ibirthyr_1952+_Ibirthyr_1953+_Ibirthyr_1954+
_Ibirthyr_1955+_Ibirthyr_1956+_Ibirthyr_1957+_Ibirthyr_1958+_Ibirthyr_1959+_Ibirthyr_1960 +_Ibirthyr_1962+_Ibirthyr_1963+_Ibirthyr_1964+
_Ibirthyr_1965+_Ibirthyr_1966+_Ibirthyr_1967+_Ibirthyr_1968+_Ibirthyr_1969+_Ibirthyr_1970+_Ibirthyr_1971+_Ibirthyr_1972+_Ibirthyr_1973+_Ibirthyr_1974+
_Ibirthyr_1975+_Ibirthyr_1976+_Ibirthyr_1977+_Ibirthyr_1978+_Ibirthyr_1979+_Ibirthyr_1980+_Ibirthyr_1981+_Ibirthyr_1982+_Ibirthyr_1983+_Ibirthyr_1984+
_Ibirthyr_1985+_Ibirthyr_1986+_Ibirthyr_1987+_Ibirthyr_1988+_Ibirthyr_1989+_Ibirthyr_1990+_Ibirthyr_1991+_Ibirthyr_1992+_Ibirthyr_1993+_Ibirthyr_1994+
_Ibirthyr_1995+_Ibirthyr_1996+_Ibirthyr_1997+_Ibirthyr_1998+_Ibirthyr_1999+_Ibirthyr_2000 ) , df)
plot(coef(m))
Coefficients:
─────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────────────
_Ibirthyr_1907 0.0133333 0.0400579 0.33 0.7392 -0.0651786 0.0918453
_Ibirthyr_1908 0.0201342 0.020096 1.00 0.3164 -0.0192533 0.0595217
_Ibirthyr_1909 0.00325733 0.0197993 0.16 0.8693 -0.0355485 0.0420632
_Ibirthyr_1910 0.0125174 0.0129376 0.97 0.3333 -0.0128399 0.0378746
_Ibirthyr_1911 0.00793651 0.0138213 0.57 0.5658 -0.0191527 0.0350257
_Ibirthyr_1912 0.0101266 0.0123425 0.82 0.4120 -0.0140644 0.0343175
_Ibirthyr_1913 0.013834 0.0109051 1.27 0.2046 -0.00753954 0.0352075
_Ibirthyr_1914 0.0112 0.00981213 1.14 0.2537 -0.00803143 0.0304314
_Ibirthyr_1915 0.00783929 0.00767886 1.02 0.3073 -0.007211 0.0228896
_Ibirthyr_1916 0.013901 0.00723046 1.92 0.0545 -0.00027048 0.0280724
_Ibirthyr_1917 0.0166099 0.00715928 2.32 0.0203 0.00257795 0.0306418
_Ibirthyr_1918 0.013571 0.005659 2.40 0.0165 0.0024796 0.0246625
_Ibirthyr_1919 0.0164619 0.00543777 3.03 0.0025 0.00580407 0.0271198
_Ibirthyr_1920 0.0185091 0.00452062 4.09 <1e-04 0.00964884 0.0273693
_Ibirthyr_1921 0.019617 0.00432862 4.53 <1e-05 0.0111331 0.0281009
_Ibirthyr_1922 0.0222837 0.00442437 5.04 <1e-06 0.0136121 0.0309553
_Ibirthyr_1923 0.0212606 0.00387956 5.48 <1e-07 0.0136568 0.0288644
_Ibirthyr_1924 0.0251841 0.00378106 6.66 <1e-10 0.0177734 0.0325949
_Ibirthyr_1925 0.0228956 0.0032005 7.15 <1e-12 0.0166227 0.0291684
_Ibirthyr_1926 0.0256301 0.00319588 8.02 <1e-14 0.0193663 0.031894
_Ibirthyr_1927 0.0242324 0.00300946 8.05 <1e-15 0.018334 0.0301308
_Ibirthyr_1928 0.0266674 0.00263851 10.11 <1e-23 0.0214961 0.0318388
_Ibirthyr_1929 0.0271973 0.00257928 10.54 <1e-25 0.022142 0.0322527
_Ibirthyr_1930 0.028593 0.00218768 13.07 <1e-38 0.0243052 0.0328808
_Ibirthyr_1931 0.0292786 0.00219551 13.34 <1e-39 0.0249755 0.0335818
_Ibirthyr_1932 0.0289725 0.00212247 13.65 <1e-41 0.0248125 0.0331324
_Ibirthyr_1933 0.0296184 0.00192592 15.38 <1e-52 0.0258437 0.0333932
_Ibirthyr_1934 0.0313768 0.00188122 16.68 <1e-61 0.0276897 0.0350639
_Ibirthyr_1935 0.0326611 0.00169076 19.32 <1e-82 0.0293473 0.0359749
_Ibirthyr_1936 0.0362075 0.00167561 21.61 <1e-99 0.0329234 0.0394917
_Ibirthyr_1937 0.0356623 0.00166778 21.38 <1e-99 0.0323935 0.0389311
_Ibirthyr_1938 0.0379183 0.00154288 24.58 <1e-99 0.0348944 0.0409423
_Ibirthyr_1939 0.0400083 0.00150811 26.53 <1e-99 0.0370525 0.0429642
_Ibirthyr_1940 0.0397725 0.00134043 29.67 <1e-99 0.0371453 0.0423997
_Ibirthyr_1941 0.044141 0.001397 31.60 <1e-99 0.041403 0.0468791
_Ibirthyr_1942 0.0449556 0.00134247 33.49 <1e-99 0.0423244 0.0475868
_Ibirthyr_1943 0.0466286 0.00130049 35.85 <1e-99 0.0440797 0.0491775
_Ibirthyr_1944 0.0492167 0.0013432 36.64 <1e-99 0.0465841 0.0518493
_Ibirthyr_1945 0.0504391 0.00118552 42.55 <1e-99 0.0481155 0.0527627
_Ibirthyr_1946 0.124416 0.00121649 102.27 <1e-99 0.122032 0.1268
_Ibirthyr_1947 0.124994 0.00123887 100.89 <1e-99 0.122565 0.127422
_Ibirthyr_1948 0.126589 0.00116365 108.79 <1e-99 0.124309 0.12887
_Ibirthyr_1949 0.127993 0.00115564 110.76 <1e-99 0.125728 0.130258
_Ibirthyr_1950 0.132723 0.00103157 128.66 <1e-99 0.130701 0.134744
_Ibirthyr_1951 0.134958 0.00106753 126.42 <1e-99 0.132866 0.137051
_Ibirthyr_1952 0.134466 0.00105115 127.92 <1e-99 0.132405 0.136526
_Ibirthyr_1953 0.135625 0.000993632 136.49 <1e-99 0.133677 0.137572
_Ibirthyr_1954 0.137397 0.000973637 141.12 <1e-99 0.135489 0.139306
_Ibirthyr_1955 0.0866432 0.000915904 94.60 <1e-99 0.084848 0.0884383
_Ibirthyr_1956 0.0928981 0.000912533 101.80 <1e-99 0.0911095 0.0946866
_Ibirthyr_1957 0.0974438 0.000906342 107.51 <1e-99 0.0956674 0.0992202
_Ibirthyr_1958 0.102582 0.0008524 120.34 <1e-99 0.100911 0.104252
_Ibirthyr_1959 0.113558 0.000846193 134.20 <1e-99 0.111899 0.115216
_Ibirthyr_1960 0.115621 0.000770211 150.12 <1e-99 0.114111 0.11713
_Ibirthyr_1962 0.137555 0.00080227 171.46 <1e-99 0.135982 0.139127
_Ibirthyr_1963 0.141234 0.000766621 184.23 <1e-99 0.139731 0.142736
_Ibirthyr_1964 0.14889 0.000773525 192.48 <1e-99 0.147374 0.150406
_Ibirthyr_1965 0.149612 0.000727619 205.62 <1e-99 0.148186 0.151038
_Ibirthyr_1966 0.151283 0.000754572 200.49 <1e-99 0.149804 0.152762
_Ibirthyr_1967 0.150603 0.000776434 193.97 <1e-99 0.149081 0.152125
_Ibirthyr_1968 0.150456 0.000748029 201.14 <1e-99 0.14899 0.151922
_Ibirthyr_1969 0.151651 0.000731701 207.26 <1e-99 0.150217 0.153085
_Ibirthyr_1970 0.149829 0.0007037 212.92 <1e-99 0.148449 0.151208
_Ibirthyr_1971 0.150174 0.00073155 205.28 <1e-99 0.14874 0.151608
_Ibirthyr_1972 0.151469 0.000720704 210.17 <1e-99 0.150056 0.152881
_Ibirthyr_1973 0.148427 0.000722602 205.41 <1e-99 0.147011 0.149844
_Ibirthyr_1974 0.157152 0.000740608 212.19 <1e-99 0.1557 0.158603
_Ibirthyr_1975 0.15763 0.00072148 218.48 <1e-99 0.156216 0.159044
_Ibirthyr_1976 0.163673 0.000746944 219.12 <1e-99 0.162209 0.165137
_Ibirthyr_1977 0.169546 0.000769795 220.25 <1e-99 0.168038 0.171055
_Ibirthyr_1978 0.174651 0.000788054 221.62 <1e-99 0.173106 0.176195
_Ibirthyr_1979 0.178854 0.00080135 223.19 <1e-99 0.177283 0.180424
_Ibirthyr_1980 0.178202 0.000764457 233.11 <1e-99 0.176704 0.1797
_Ibirthyr_1981 0.190883 0.000807391 236.42 <1e-99 0.1893 0.192465
_Ibirthyr_1982 0.192958 0.000789277 244.47 <1e-99 0.191411 0.194505
_Ibirthyr_1983 0.153268 0.000808058 189.67 <1e-99 0.151684 0.154852
_Ibirthyr_1984 0.153286 0.000839944 182.50 <1e-99 0.15164 0.154932
_Ibirthyr_1985 0.157197 0.000868201 181.06 <1e-99 0.155495 0.158898
_Ibirthyr_1986 0.162169 0.000918852 176.49 <1e-99 0.160368 0.16397
_Ibirthyr_1987 0.166885 0.000982972 169.78 <1e-99 0.164958 0.168811
_Ibirthyr_1988 0.168767 0.00103866 162.49 <1e-99 0.166732 0.170803
_Ibirthyr_1989 0.169901 0.00112845 150.56 <1e-99 0.167689 0.172113
_Ibirthyr_1990 0.165285 0.00120854 136.76 <1e-99 0.162916 0.167654
_Ibirthyr_1991 0.167873 0.00136576 122.92 <1e-99 0.165196 0.17055
_Ibirthyr_1992 0.260221 0.0015163 171.62 <1e-99 0.257249 0.263193
_Ibirthyr_1993 0.263708 0.00162497 162.28 <1e-99 0.260523 0.266892
_Ibirthyr_1994 0.27971 0.00168289 166.21 <1e-99 0.276412 0.283008
_Ibirthyr_1995 0.287836 0.00175984 163.56 <1e-99 0.284386 0.291285
_Ibirthyr_1996 0.29295 0.00187822 155.97 <1e-99 0.289269 0.296632
_Ibirthyr_1997 0.290477 0.00212673 136.58 <1e-99 0.286308 0.294645
_Ibirthyr_1998 0.283585 0.00260688 108.78 <1e-99 0.278475 0.288694
_Ibirthyr_1999 0.271408 0.00355811 76.28 <1e-99 0.264434 0.278381
_Ibirthyr_2000 0.285458 0.00848653 33.64 <1e-99 0.268825 0.302091
─────────────────────────────────────────────────────────────────────────────────
. #delimit ;
delimiter now ;
. reg part _Ibirthyr_1907 _Ibirthyr_1908 _Ibirthyr_1909 _Ibirthyr_1910 _Ibirthyr_1911 _Ibirthyr_1912 _Ibirthyr_1913 _Ibirthyr_1914
> _Ibirthyr_1915 _Ibirthyr_1916 _Ibirthyr_1917 _Ibirthyr_1918 _Ibirthyr_1919 _Ibirthyr_1920 _Ibirthyr_1921 _Ibirthyr_1922 _Ibirthyr_1923 _Ibirthyr_1924
> _Ibirthyr_1925 _Ibirthyr_1926 _Ibirthyr_1927 _Ibirthyr_1928 _Ibirthyr_1929 _Ibirthyr_1930 _Ibirthyr_1931 _Ibirthyr_1932 _Ibirthyr_1933 _Ibirthyr_1934
> _Ibirthyr_1935 _Ibirthyr_1936 _Ibirthyr_1937 _Ibirthyr_1938 _Ibirthyr_1939 _Ibirthyr_1940 _Ibirthyr_1941 _Ibirthyr_1942 _Ibirthyr_1943 _Ibirthyr_1944
> _Ibirthyr_1945 _Ibirthyr_1946 _Ibirthyr_1947 _Ibirthyr_1948 _Ibirthyr_1949 _Ibirthyr_1950 _Ibirthyr_1951 _Ibirthyr_1952 _Ibirthyr_1953 _Ibirthyr_1954
> _Ibirthyr_1955 _Ibirthyr_1956 _Ibirthyr_1957 _Ibirthyr_1958 _Ibirthyr_1959 _Ibirthyr_1960 _Ibirthyr_1962 _Ibirthyr_1963 _Ibirthyr_1964
> _Ibirthyr_1965 _Ibirthyr_1966 _Ibirthyr_1967 _Ibirthyr_1968 _Ibirthyr_1969 _Ibirthyr_1970 _Ibirthyr_1971 _Ibirthyr_1972 _Ibirthyr_1973 _Ibirthyr_1974
> _Ibirthyr_1975 _Ibirthyr_1976 _Ibirthyr_1977 _Ibirthyr_1978 _Ibirthyr_1979 _Ibirthyr_1980 _Ibirthyr_1981 _Ibirthyr_1982 _Ibirthyr_1983 _Ibirthyr_1984
> _Ibirthyr_1985 _Ibirthyr_1986 _Ibirthyr_1987 _Ibirthyr_1988 _Ibirthyr_1989 _Ibirthyr_1990 _Ibirthyr_1991 _Ibirthyr_1992 _Ibirthyr_1993 _Ibirthyr_1994
> _Ibirthyr_1995 _Ibirthyr_1996 _Ibirthyr_1997 _Ibirthyr_1998 _Ibirthyr_1999 _Ibirthyr_2000;
coefplot, vertical;
Source | SS df MS Number of obs = 8,830,997
-------------+---------------------------------- F(93, 8830903) = 2891.59
Model | 32017.6199 93 344.275483 Prob > F = 0.0000
Residual | 1051414.41 8,830,903 .119060804 R-squared = 0.0296
-------------+---------------------------------- Adj R-squared = 0.0295
Total | 1083432.03 8,830,996 .122685146 Root MSE = .34505
--------------------------------------------------------------------------------
part | Coefficient Std. err. t P>|t| [95% conf. interval]
---------------+----------------------------------------------------------------
_Ibirthyr_1907 | -.1142327 .0398512 -2.87 0.004 -.1923397 -.0361257
_Ibirthyr_1908 | -.1074318 .0200044 -5.37 0.000 -.1466397 -.0682239
_Ibirthyr_1909 | -.1243087 .0197095 -6.31 0.000 -.1629386 -.0856789
_Ibirthyr_1910 | -.1150487 .0128932 -8.92 0.000 -.1403189 -.0897784
_Ibirthyr_1911 | -.1196295 .0137706 -8.69 0.000 -.1466193 -.0926397
_Ibirthyr_1912 | -.1174395 .0123026 -9.55 0.000 -.141552 -.0933269
_Ibirthyr_1913 | -.113732 .0108762 -10.46 0.000 -.135049 -.0924151
_Ibirthyr_1914 | -.116366 .0097924 -11.88 0.000 -.1355588 -.0971732
_Ibirthyr_1915 | -.1197267 .0076797 -15.59 0.000 -.1347787 -.1046748
_Ibirthyr_1916 | -.1136651 .0072363 -15.71 0.000 -.1278479 -.0994823
_Ibirthyr_1917 | -.1109562 .0071659 -15.48 0.000 -.1250011 -.0969112
_Ibirthyr_1918 | -.113995 .0056855 -20.05 0.000 -.1251384 -.1028516
_Ibirthyr_1919 | -.1111041 .0054677 -20.32 0.000 -.1218207 -.1003875
_Ibirthyr_1920 | -.109057 .0045673 -23.88 0.000 -.1180088 -.1001051
_Ibirthyr_1921 | -.107949 .0043795 -24.65 0.000 -.1165326 -.0993655
_Ibirthyr_1922 | -.1052824 .0044731 -23.54 0.000 -.1140495 -.0965152
_Ibirthyr_1923 | -.1063054 .0039412 -26.97 0.000 -.11403 -.0985808
_Ibirthyr_1924 | -.1023819 .0038453 -26.62 0.000 -.1099186 -.0948452
_Ibirthyr_1925 | -.1046705 .0032828 -31.88 0.000 -.1111046 -.0982363
_Ibirthyr_1926 | -.1019359 .0032783 -31.09 0.000 -.1083613 -.0955105
_Ibirthyr_1927 | -.1033336 .0030989 -33.35 0.000 -.1094073 -.09726
_Ibirthyr_1928 | -.1008986 .0027441 -36.77 0.000 -.106277 -.0955202
_Ibirthyr_1929 | -.1003687 .0026879 -37.34 0.000 -.1056368 -.0951006
_Ibirthyr_1930 | -.098973 .002319 -42.68 0.000 -.1035182 -.0944278
_Ibirthyr_1931 | -.0982874 .0023263 -42.25 0.000 -.1028469 -.0937279
_Ibirthyr_1932 | -.0985936 .0022583 -43.66 0.000 -.1030197 -.0941674
_Ibirthyr_1933 | -.0979476 .0020767 -47.17 0.000 -.1020178 -.0938774
_Ibirthyr_1934 | -.0961892 .0020357 -47.25 0.000 -.1001792 -.0921993
_Ibirthyr_1935 | -.0949049 .0018631 -50.94 0.000 -.0985565 -.0912533
_Ibirthyr_1936 | -.0913585 .0018495 -49.40 0.000 -.0949835 -.0877335
_Ibirthyr_1937 | -.0919038 .0018425 -49.88 0.000 -.095515 -.0882925
_Ibirthyr_1938 | -.0896477 .0017315 -51.77 0.000 -.0930413 -.086254
_Ibirthyr_1939 | -.0875577 .0017009 -51.48 0.000 -.0908914 -.084224
_Ibirthyr_1940 | -.0877936 .0015558 -56.43 0.000 -.0908429 -.0847442
_Ibirthyr_1941 | -.083425 .0016043 -52.00 0.000 -.0865694 -.0802807
_Ibirthyr_1942 | -.0826104 .0015576 -53.04 0.000 -.0856632 -.0795577
_Ibirthyr_1943 | -.0809374 .0015219 -53.18 0.000 -.0839203 -.0779545
_Ibirthyr_1944 | -.0783493 .0015582 -50.28 0.000 -.0814033 -.0752954
_Ibirthyr_1945 | -.0771269 .001426 -54.09 0.000 -.0799218 -.074332
_Ibirthyr_1946 | -.0727114 .0014516 -50.09 0.000 -.0755564 -.0698664
_Ibirthyr_1947 | -.0709548 .0014702 -48.26 0.000 -.0738363 -.0680733
_Ibirthyr_1948 | -.065841 .0014081 -46.76 0.000 -.0686007 -.0630812
_Ibirthyr_1949 | -.0628813 .0014015 -44.87 0.000 -.0656282 -.0601344
_Ibirthyr_1950 | -.063654 .0013022 -48.88 0.000 -.0662063 -.0611017
_Ibirthyr_1951 | -.054803 .0013306 -41.19 0.000 -.0574109 -.0521951
_Ibirthyr_1952 | -.0521713 .0013176 -39.60 0.000 -.0547537 -.0495888
_Ibirthyr_1953 | -.0479237 .0012727 -37.66 0.000 -.0504182 -.0454293
_Ibirthyr_1954 | -.042377 .0012573 -33.70 0.000 -.0448413 -.0399127
_Ibirthyr_1955 | -.0409229 .0012136 -33.72 0.000 -.0433016 -.0385442
_Ibirthyr_1956 | -.034668 .0012111 -28.62 0.000 -.0370417 -.0322942
_Ibirthyr_1957 | -.0301223 .0012065 -24.97 0.000 -.032487 -.0277575
_Ibirthyr_1958 | -.0249844 .001167 -21.41 0.000 -.0272717 -.0226972
_Ibirthyr_1959 | -.0140082 .0011625 -12.05 0.000 -.0162867 -.0117298
_Ibirthyr_1960 | -.0119452 .001109 -10.77 0.000 -.0141188 -.0097716
_Ibirthyr_1962 | .0099888 .0011313 8.83 0.000 .0077715 .012206
_Ibirthyr_1963 | .0136677 .0011065 12.35 0.000 .0114989 .0158365
_Ibirthyr_1964 | .021324 .0011113 19.19 0.000 .0191459 .0235021
_Ibirthyr_1965 | .0182448 .0010802 16.89 0.000 .0161277 .020362
_Ibirthyr_1966 | .0262808 .0010983 23.93 0.000 .0241282 .0284335
_Ibirthyr_1967 | .0273999 .0011133 24.61 0.000 .0252179 .0295819
_Ibirthyr_1968 | .0292593 .0010939 26.75 0.000 .0271154 .0314033
_Ibirthyr_1969 | .0313718 .0010829 28.97 0.000 .0292493 .0334943
_Ibirthyr_1970 | .0274142 .0010644 25.76 0.000 .025328 .0295004
_Ibirthyr_1971 | .0323644 .0010828 29.89 0.000 .0302422 .0344867
_Ibirthyr_1972 | .0297249 .0010756 27.64 0.000 .0276168 .031833
_Ibirthyr_1973 | .0254691 .0010768 23.65 0.000 .0233585 .0275797
_Ibirthyr_1974 | .0295858 .0010889 27.17 0.000 .0274516 .03172
_Ibirthyr_1975 | .0300637 .0010761 27.94 0.000 .0279546 .0321728
_Ibirthyr_1976 | .0361066 .0010932 33.03 0.000 .033964 .0382491
_Ibirthyr_1977 | .0419803 .0011087 37.86 0.000 .0398073 .0441534
_Ibirthyr_1978 | .0470845 .0011213 41.99 0.000 .0448867 .0492823
_Ibirthyr_1979 | .0512878 .0011306 45.36 0.000 .0490718 .0535038
_Ibirthyr_1980 | .0506358 .0011051 45.82 0.000 .0484699 .0528017
_Ibirthyr_1981 | .0633167 .0011349 55.79 0.000 .0610924 .065541
_Ibirthyr_1982 | .0653921 .0011222 58.27 0.000 .0631926 .0675915
_Ibirthyr_1983 | .0718306 .0011353 63.27 0.000 .0696054 .0740559
_Ibirthyr_1984 | .078832 .001158 68.08 0.000 .0765624 .0811017
_Ibirthyr_1985 | .086207 .0011784 73.15 0.000 .0838973 .0885167
_Ibirthyr_1986 | .0991375 .0012158 81.54 0.000 .0967545 .1015205
_Ibirthyr_1987 | .1079642 .0012645 85.38 0.000 .1054859 .1104426
_Ibirthyr_1988 | .1152806 .0013078 88.15 0.000 .1127174 .1178438
_Ibirthyr_1989 | .116699 .0013794 84.60 0.000 .1139954 .1194026
_Ibirthyr_1990 | .1190085 .001445 82.36 0.000 .1161764 .1218406
_Ibirthyr_1991 | .1272271 .0015775 80.65 0.000 .1241354 .1303189
_Ibirthyr_1992 | .1326548 .0017081 77.66 0.000 .129307 .1360026
_Ibirthyr_1993 | .1361415 .0018043 75.46 0.000 .1326053 .1396778
_Ibirthyr_1994 | .152144 .001856 81.97 0.000 .1485063 .1557818
_Ibirthyr_1995 | .1602695 .0019253 83.24 0.000 .1564959 .1640431
_Ibirthyr_1996 | .1653843 .002033 81.35 0.000 .1613997 .1693688
_Ibirthyr_1997 | .1629105 .0022622 72.01 0.000 .1584766 .1673444
_Ibirthyr_1998 | .1560186 .0027141 57.49 0.000 .1506991 .1613381
_Ibirthyr_1999 | .1438415 .0036287 39.64 0.000 .1367293 .1509537
_Ibirthyr_2000 | .1578918 .008479 18.62 0.000 .1412731 .1745104
_cons | .127566 .0008019 159.08 0.000 .1259944 .1291377
--------------------------------------------------------------------------------
I have now also encountered this issue and put together a minimal working example, since the conversation seems to have died down without the underlying problem being fixed. I've attached a CSV with the data; in my case, I'm working with chunks of 300 data points at a time.
using CSV
using GLM
using StatsModels
using DataFrames
using Plots

data = CSV.read("Failed_Fit.csv", DataFrame)

# Default fit: dropcollinear=true silently zeroes out the intercept
ols_fail = lm(@formula(Y ~ X), data)
println(ols_fail)
coefs_fail = coef(ols_fail)
failed_fit = [coefs_fail[1] + t * coefs_fail[2] for t in data.X]

# Same model with dropcollinear=false: matches R
ols_success = lm(@formula(Y ~ X), data, dropcollinear=false)
println(ols_success)
coefs_success = coef(ols_success)
successful_fit = [coefs_success[1] + t * coefs_success[2] for t in data.X]

Plots.plot(data.X, data.Y, seriestype=:scatter, markerstrokewidth=0, ms=5, label="Data")
Plots.plot!(data.X, successful_fit, linewidth=2, label="Successful Fit")
Plots.plot!(data.X, failed_fit, linewidth=2, label="Failed Fit")
Plots.savefig("Fixed_vs_failed.png")
This produced the following graph.
While the work-around of setting dropcollinear=false works, I'm a little frustrated to find that my issue is something that has been known about for over three years. Why hasn't this been resolved yet?
The model matrix is poorly conditioned, as your X values are all between 76383 and 76682. You are trying to estimate an intercept very far from the observed range of the X values, which makes the intercept column nearly collinear with X - just as it did three years ago.
julia> df = CSV.read("/Users/dmbates/Downloads/Failed_Fit.csv", DataFrame);
julia> X = hcat(ones(nrow(df)), df.X);
julia> cond(qr(X).R)
6.763385429589838e7
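For comparison, centering collapses the condition number. A sketch using synthetic X values in the reported range (76383 to 76682), so the attached CSV isn't needed to see the effect:

```julia
using LinearAlgebra, Statistics

# Synthetic X values spanning the same range as the attached data
x = collect(76383.0:10.0:76682.0)
X_raw = hcat(ones(length(x)), x)             # intercept + raw X
X_ctr = hcat(ones(length(x)), x .- mean(x))  # intercept + centered X

cond(qr(X_raw).R)  # on the order of 1e7, as with the real data
cond(qr(X_ctr).R)  # on the order of 1e2: well conditioned
```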
What do you want to do with the model that you would fit and can it be accomplished by centering the X values?
I primarily want to extract the slope from the fit, as the rate of change is the relevant parameter for other analysis I'm doing (I'm intentionally being a little vague here), so I'd expect centering the X values to work. Let me give it a shot and I'll report back.
Indeed, centering the data also recovers the proper slope coefficient. With two possible work-arounds (dropcollinear=false, or centering the data at or near 0), is one approach better than the other?
I would tend to prefer centering as long as it still gives you coefficients with the desired interpretation.
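To illustrate why centering is usually safe here, a sketch with plain least squares via `\` on synthetic, noise-free data (no GLM involved): centering shifts only the intercept, while the slope is untouched.

```julia
using Statistics

# Noise-free toy data with known slope 2 in the problematic X range
x = collect(76383.0:10.0:76682.0)
y = 2.0 .* x .+ 5.0

# Fit intercept + slope on centered X via ordinary least squares
xc = x .- mean(x)
b = hcat(ones(length(xc)), xc) \ y

# b[2] is still the slope of y against x; b[1] is now mean(y),
# not the (extrapolated) intercept at x = 0
```

If the intercept at x = 0 is itself needed, it can be recovered afterwards as b[1] - b[2] * mean(x).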
We've been a bit slow wrapping up a few loose ends, but I'll try to allocate some time in the next few weeks to get a release with the new QR-based solver out -- it's a bit slower but more numerically stable.
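For context on why QR helps, a sketch with the same synthetic range as above (base linear algebra, not GLM's internal API): a Cholesky-based solver factors the cross-product matrix X'X, which squares the condition number, while QR factors X directly.

```julia
using LinearAlgebra

x = collect(76383.0:10.0:76682.0)
X = hcat(ones(length(x)), x)

cond(X)       # on the order of 1e7: hard, but representable in Float64
cond(X' * X)  # squared, on the order of 1e15 or worse: at the limit of
              # Float64 precision, which is why a pivoted Cholesky can
              # flag the intercept column as collinear
```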
Car-Training.csv
This is a bit urgent.
In a take-home exam that is ongoing, I have the attached dataset, on which students need to estimate models.
Here are the Julia codes:
The output is the following:
The intercept is not estimated. The other two coefficients are not correct.
I tried R, SPSS, and Excel. All gave the same results, which differ from Julia's. I post the results from R below:
Did I do something wrong here? Or is there an issue with GLM? If there is an issue with GLM, can a new version be published as soon as possible, so that I can notify the students who are taking this exam right now?