DeclareDesign / estimatr

estimatr: Fast Estimators for Design-Based Inference
https://declaredesign.org/r/estimatr

Why can lm_robust perform Cholesky decomposition with clustered standard errors when other libraries cannot? #279

Closed. DeFilippis closed this issue 5 years ago.

DeFilippis commented 5 years ago

I am running a model that fails in felm, gee and lme4 due to matrix singularities. I only discovered this today, because I've been successfully running my models with lm_robust for the last several months. I discovered in the documentation that lm_robust deals with rank deficiencies in the following way: "when X is rank deficient, there are certain conditions under which the QR factorization algorithm we use, from the Eigen C++ library, drops different coefficients from the output than the default lm function."

Is there any way to print which coefficients are linearly dependent, and which are dropped in the analysis?

nfultz commented 5 years ago

IIRC, estimatr will drop the left-most duplicates in the design matrix, whereas lm uses a non-standard implementation that drops the right-most.

I think that the dropped coefficients should show up as NA, though.
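
For example, here is a minimal sketch (a toy data frame with a perfectly collinear regressor, not your data) of how the dropped column shows up:

library(estimatr)

set.seed(1)
toy <- data.frame(y = rnorm(20), x1 = rnorm(20))
toy$x2 <- 2 * toy$x1   # x2 is an exact linear combination of x1

# lm() drops one of the collinear columns and reports it as NA
coef(lm(y ~ x1 + x2, data = toy))

# lm_robust() also reports a dropped column as NA, though the Eigen QR
# routine it uses may choose a different column to drop than lm() does
lm_robust(y ~ x1 + x2, data = toy)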

DeFilippis commented 5 years ago

It appears you're right: the model runs even with try_cholesky = TRUE, implying lm_robust notices no singularity issues.

This is strange, because when I run analogous models with other libraries, they throw errors during the Cholesky decomposition. See the following:

GEE library:

gee(outgroup_feelings_diff ~ treatment, id = teamID, data = data)

Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
 running glm to get initial regression estimate
 (Intercept) treatmentDD treatmentDR treatmentRD 
   0.7220000  -0.1572857   5.4177500   6.0612500 
 Error in gee(outgroup_feelings_diff ~ treatment, id = teamID  : 
   NA/NaN/Inf in foreign function call (arg 3)
 In addition: Warning message:
 In gee(outgroup_feelings_diff ~ treatment, id = teamID,  :
   NAs introduced by coercion

FELM library:

felm(outgroup_feelings_diff ~ treatment | teamId | 0 | 0, data = data)

 treatmentDD treatmentDR treatmentRD 
         NaN     -0.6435         NaN 
 Warning message:
 In chol.default(mat, pivot = TRUE, tol = tol) :
   the matrix is either rank-deficient or indefinite

lm_robust:

lm_robust(outgroup_feelings_diff ~ treatment, 
          cluster = teamID, 
          try_cholesky = TRUE, 
          data = data)
              Estimate Std. Error    t value    Pr(>|t|)   CI Lower  CI Upper       DF
 (Intercept)  0.7220000  0.8047668  0.8971543 0.375141611 -0.9057945  2.349795 39.00000
 treatmentDD -0.1572857  1.2454367 -0.1262896 0.899856236 -2.6402053  2.325634 71.68429
 treatmentDR  5.4177500  1.7384225  3.1164749 0.002651691  1.9507494  8.884751 70.20000
 treatmentRD  6.0612500  2.6218889  2.3117875 0.023728531  0.8323194 11.290181 70.20000

What is it about lm_robust that allows it to do the Cholesky decomposition without dropping any columns, when the other libraries cannot?

@nfultz

nfultz commented 5 years ago

I think that may be because try_cholesky falls back to QR when the Cholesky decomposition fails:

https://github.com/DeclareDesign/estimatr/blob/c889505d9ab0181a9502e35bc0688d9fc4ce6d72/src/lm_robust_helper.cpp#L181-L183
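
Conceptually (just an R sketch of the control flow, not the actual C++ code linked above), it does something like:

# Try a Cholesky solve of X'X; if that errors (e.g. X'X is rank
# deficient or indefinite), fall back to a QR solve of X.
solve_ols <- function(X, y) {
  tryCatch({
    R <- chol(crossprod(X))                  # fails when X'X is not positive definite
    backsolve(R, forwardsolve(t(R), crossprod(X, y)))
  },
  error = function(e) qr.coef(qr(X), y)      # QR reports dropped columns as NA
  )
}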

DeFilippis commented 5 years ago

Ah. Good point. Doesn't it seem like the regression should report that the Cholesky decomposition failed before falling back to QR, so that the user is aware they have a singular matrix? I would never have known if I hadn't tried two other libraries.

In any case, shouldn't QR still drop one of the columns (the left-most)? In the output above you can see that no columns are dropped. How can it produce standard errors for each of the estimates if the matrix is singular?

lukesonnet commented 5 years ago

Thanks for bringing this example to our attention. I'd sooner remove try_cholesky than have it warn, to be honest. It's not something I really think we should support. As the argument name suggests, it just tries to speed things up with a Cholesky decomposition.

However, as for the QR decomposition yielding estimates when other methods do not, it's hard for me to diagnose without a reproducible example. Are you able to share the data, or some simulated data or a subsample that replicates the problem?

Edit: I should add that in many cases the QR decomposition from Eigen that we rely on does drop linearly dependent columns, both in toy examples and in some less trivial examples I've encountered in data analysis. The dropped columns show up as a message and a row of NAs in the resulting output, as in lm(). So, to be clear, the answer to your initial question is: lm_robust does print the dropped, linearly dependent columns of the design matrix.

And I'm asking for your data to diagnose why in your case the QR decomposition is not dropping any columns.

Just wanted to clarify.
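
In the meantime, one quick check you could run on your end (a sketch, assuming the same data object as in your calls above) is to compare the rank of the design matrix lm_robust sees against its number of columns:

X <- model.matrix(~ treatment, data = data)
qr(X)$rank   # if this is smaller than ncol(X), the design matrix is rank deficient
ncol(X)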

DeFilippis commented 5 years ago

I'm trying to figure out how to simulate the problem, but the issue is that I'm not sure what is actually causing it! I'll continue to dig around and submit something soon.

lukesonnet commented 5 years ago

I'm going to close this unless there's a workable example that can be posted, at which time I'll reopen. I think it's buyer beware with the non-standard decomposition, and we aren't returning "incorrect" results, just perhaps unexpected results.