DeclareDesign / estimatr

estimatr: Fast Estimators for Design-Based Inference
https://declaredesign.org/r/estimatr

NA estimation with lm_robust() but not with lm() #338

Closed orlando-sabogal closed 4 years ago

orlando-sabogal commented 4 years ago

I am fitting a linear model that includes a categorical variable with more than 150 categories. With lm_robust() I get the message "27 coefficients not defined because the design matrix is rank deficient", and some parameters are estimated as NA. This happens whether or not the standard errors are clustered.

I know that I do not have multicollinearity problems. Moreover, I do not get a similar message when using the base lm() function, and all the variables are estimated (none are NA).

Any hints on what might be happening? Is this related to #305?

nfultz commented 4 years ago

Can you please post an example and your session info? Are you using the high-cardinality categorical variable in the model formula, or absorbing it with FE?
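To make the two options in that question concrete, here is a minimal sketch on made-up data. The data frame `toy` and the variables `g`, `x`, and `y` are hypothetical and are not the reporter's data; the sketch assumes estimatr is installed and uses the `fixed_effects` argument of `lm_robust()`.

```r
library(estimatr)

# Hypothetical toy data: 'g' stands in for a high-cardinality categorical variable.
set.seed(42)
n <- 600
toy <- data.frame(
  g = factor(sample(1:50, n, replace = TRUE)),
  x = rnorm(n)
)
toy$y <- 2 * toy$x + as.integer(toy$g) / 10 + rnorm(n)

# (a) Categorical variable in the formula: one dummy coefficient per level.
fit_dummies <- lm_robust(y ~ x + g, data = toy, se_type = "HC1")

# (b) Same variable absorbed through the fixed_effects argument:
# only the coefficient on x is reported.
fit_absorbed <- lm_robust(y ~ x, data = toy, fixed_effects = ~ g, se_type = "HC1")

# The two point estimates for x should agree up to numerical tolerance.
c(dummies  = unname(fit_dummies$coefficients["x"]),
  absorbed = unname(fit_absorbed$coefficients["x"]))
```

Which form is in use matters for debugging, because with (a) every level of the factor appears as a column of the design matrix and can show up as an NA coefficient.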

orlando-sabogal commented 4 years ago

I have requested permission to share a sample of the data (it is part of ongoing research, and we are not yet sharing data or results) so that you can reproduce the issue.

In the meantime:

The dataset has 12264 rows and 8 columns.

The regression equation is:

Equation <- V8 ~ V1 + V2 + V3 + V4 + V5 + V6 + V7

where V5 is a categorical variable with 159 categories. V1 is a binary categorical variable. V2 is a categorical variable with 19 categories (not an individual fixed effect but a higher-level one: the individuals are cities and V2 is the country). V2 is a categorical variable with 27 categories (for seasonality).

When I do:

Model <- lm_robust(Equation, data = DataTest, clusters = V2, se_type = "stata")

Some parameters have NA values. The same happens if I do not cluster the errors. As the output is too long, an easy way to see the message is:

ModelResutl <- broom::tidy(Model)

But this works:

Model_lm <- lm(Equation, data = DataTest)
ModelResutl <- broom::tidy(Model_lm)

The session:

`R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils
[5] datasets methods base

other attached packages:
[1] estimatr_0.20.0 clubSandwich_0.4.0
[3] sandwich_2.5-1 jtools_2.0.1
[5] broom_0.5.2 kableExtra_1.1.0
[7] knitr_1.25 magrittr_1.5
[9] forcats_0.4.0 stringr_1.4.0
[11] dplyr_0.8.3 purrr_0.3.3
[13] readr_1.3.1 tidyr_1.0.0
[15] tibble_2.1.3 ggplot2_3.2.1
[17] tidyverse_1.3.0

loaded via a namespace (and not attached):
[1] zoo_1.8-6 tidyselect_0.2.5
[3] xfun_0.10 pander_0.6.3
[5] haven_2.2.0 lattice_0.20-38
[7] colorspace_1.4-1 vctrs_0.2.0
[9] generics_0.0.2 viridisLite_0.3.0
[11] htmltools_0.4.0 rlang_0.4.1
[13] pillar_1.4.2 glue_1.3.1
[15] withr_2.1.2 DBI_1.0.0
[17] dbplyr_1.4.2 modelr_0.1.5
[19] readxl_1.3.1 lifecycle_0.1.0
[21] munsell_0.5.0 gtable_0.3.0
[23] cellranger_1.1.0 rvest_0.3.5
[25] evaluate_0.14 Rcpp_1.0.2
[27] scales_1.0.0 backports_1.1.5
[29] webshot_0.5.2 jsonlite_1.6
[31] fs_1.3.1 hms_0.5.2
[33] digest_0.6.22 stringi_1.4.3
[35] grid_3.6.1 cli_1.1.0
[37] tools_3.6.1 lazyeval_0.2.2
[39] Formula_1.2-3 crayon_1.3.4
[41] pkgconfig_2.0.3 zeallot_0.1.0
[43] xml2_1.2.2 reprex_0.3.0
[45] lubridate_1.7.4 assertthat_0.2.1
[47] rmarkdown_1.16 httr_1.4.1
[49] rstudioapi_0.10 R6_2.4.0
[51] nlme_3.1-140 compiler_3.6.1`

orlando-sabogal commented 4 years ago

I am sorry to have bothered you with this. I actually have one variable for month and another for semester, so I do have collinearity.
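For anyone who lands here with the same symptom, the following is a small base-R sketch of how a month variable nested in a semester variable makes the design matrix rank deficient. The data frame `toy` and its columns are invented for illustration and are not the data from this issue. It also shows one plausible reason the lm() route looked clean: `summary()` on an lm fit leaves aliased coefficients out of its coefficient table, so a tidier built on that table would not show them as NA.

```r
# Hypothetical toy data: 'month' is nested in 'semester', so the semester
# dummy is an exact linear combination of the intercept and month dummies.
set.seed(1)
n <- 240
month <- factor(rep(1:12, length.out = n))
semester <- factor(ifelse(as.integer(month) <= 6, "S1", "S2"))
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
toy <- data.frame(y, x, month, semester)

# Compare the number of columns in the design matrix with its rank.
X <- model.matrix(y ~ x + month + semester, data = toy)
rank_X <- qr(X)$rank
ncol(X) - rank_X                    # number of redundant columns

# lm() still fits, but marks the redundant coefficient as NA.
fit <- lm(y ~ x + month + semester, data = toy)
names(coef(fit))[is.na(coef(fit))]  # aliased coefficient(s)

# alias() reports which columns the aliased term depends on.
alias(fit)$Complete

# The summary table omits aliased rows, so it has fewer rows than coef(fit).
length(coef(fit))
nrow(coef(summary(fit)))
```

Running the `model.matrix()` and `qr()` check on the real formula and data is a quick way to confirm or rule out collinearity before comparing estimators.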