DeclareDesign / estimatr

estimatr: Fast Estimators for Design-Based Inference
https://declaredesign.org/r/estimatr

issue when using !is.na(var) in formula for lm_lin #283

Closed graemeblair closed 5 years ago

graemeblair commented 5 years ago
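library(fabricatr)  # provides fabricate()
library(estimatr)   # provides lm_lin()

df <- fabricate(
  N = 100,
  Z = rbinom(N, 1, .5),
  cov = rbinom(N, 1, .4),
  missing_cov = ifelse(cov, 1, NA),  # NA wherever cov == 0
  Y = Z * .2 + cov
)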

> lm_lin(Y ~ Z, covariates = ~ !is.na(missing_cov), data = df)
1 coefficient  not defined because the design matrix is rank deficient

                                   Estimate   Std. Error      t value     Pr(>|t|)  CI Lower  CI Upper DF
(Intercept)                       0.1500000 3.468057e-10 4.325189e+08 0.000000e+00 0.1500000 0.1500000 97
Z                                 0.5846154 6.812422e-02 8.581608e+00 1.535766e-13 0.4494077 0.7198231 97
(!is.na(missing_cov) + ZTRUE)_c   1.0000000 9.352653e-10 1.069215e+09 0.000000e+00 1.0000000 1.0000000 97
Z:(!is.na(missing_cov) + ZTRUE)_c        NA           NA           NA           NA        NA        NA NA

Is this expected? It looks like it might be interacting with how we modify the formula for the Lin estimator: the term name !is.na(missing_cov) + ZTRUE is odd, since it fuses the covariate expression with Z.

lukesonnet commented 5 years ago

It is not. I'm still not sure how this happened, but it is the result of duplicating the response in the full_formula. That was unnecessary and I've now fixed it in #290. All tests still pass, and a test was added to catch this problem.

As an aside, when you are entering functions to be evaluated in a formula, I strongly suggest you use I(). For example, this bug doesn't exist if you do:

lm_lin(Y ~ Z, covariates = ~ I(!is.na(missing_cov)), data = df)
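
The reason for the I() suggestion can be seen with base R alone. This is a minimal sketch, not the estimatr internals: the data frame d and the variables x and z are hypothetical stand-ins. In R, unary ! has lower precedence than +, so an unprotected !is.na(x) followed by another term is parsed as one expression, while I() keeps it as its own term.

```r
# Base R only. d, x and z are hypothetical illustration names,
# not objects from the issue above.
d <- data.frame(x = c(1, NA, 3, NA), z = c(0, 1, 1, 0))

# Unary `!` binds more loosely than `+`, so the right-hand side is
# parsed as !(is.na(x) + z) and becomes a single model term.
unprotected <- attr(terms(~ !is.na(x) + z), "term.labels")

# Wrapping the call in I() protects it, leaving two separate terms.
protected <- attr(terms(~ I(!is.na(x)) + z), "term.labels")

print(unprotected)
print(protected)

# Column names of the design matrix built from the protected formula.
print(colnames(model.matrix(~ I(!is.na(x)) + z, data = d)))
```

The unprotected formula yields one term and the protected one yields two. Whether this parsing behavior played any part in the bug above is not stated in the thread, but it is a general reason to protect such calls with I().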