Closed orlando-sabogal closed 4 years ago
Can you please post an example and your session info? Are you using the high-cardinality categorical variable in the model formula, or absorbing it with FE?
I have requested permission to share a sample of the data so you can reproduce the issue (it is part of a research project and we are not yet sharing data or results).
In the meantime:
The dataset has 12264 rows and 8 columns.
The regression equation is:
Equation <- V8 ~ V1 + V2 + V3 + V4 + V5 + V6 + V7
where V5 is a categorical variable with 159 categories. V1 is a binary categorical variable. V2 is a categorical variable with 19 categories (not an individual fixed effect but a higher-aggregation fixed effect: the individuals are cities and V2 is countries). V2 is a categorical variable with 27 categories (for seasonality).
When I do:
Model <- lm_robust(Equation, data = DataTest, clusters = V2, se_type = "stata")
Some parameters have NA values. The same happens if I do not cluster the errors. As the output is too long, an easy way to get the message is through:
ModelResutl <- broom::tidy(Model)
But this works:
Model_lm <- lm(Equation, data = DataTest)
ModelResutl <- broom::tidy(Model_lm)
The session info:
`R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils
[5] datasets methods base
other attached packages:
[1] estimatr_0.20.0 clubSandwich_0.4.0
[3] sandwich_2.5-1 jtools_2.0.1
[5] broom_0.5.2 kableExtra_1.1.0
[7] knitr_1.25 magrittr_1.5
[9] forcats_0.4.0 stringr_1.4.0
[11] dplyr_0.8.3 purrr_0.3.3
[13] readr_1.3.1 tidyr_1.0.0
[15] tibble_2.1.3 ggplot2_3.2.1
[17] tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] zoo_1.8-6 tidyselect_0.2.5
[3] xfun_0.10 pander_0.6.3
[5] haven_2.2.0 lattice_0.20-38
[7] colorspace_1.4-1 vctrs_0.2.0
[9] generics_0.0.2 viridisLite_0.3.0
[11] htmltools_0.4.0 rlang_0.4.1
[13] pillar_1.4.2 glue_1.3.1
[15] withr_2.1.2 DBI_1.0.0
[17] dbplyr_1.4.2 modelr_0.1.5
[19] readxl_1.3.1 lifecycle_0.1.0
[21] munsell_0.5.0 gtable_0.3.0
[23] cellranger_1.1.0 rvest_0.3.5
[25] evaluate_0.14 Rcpp_1.0.2
[27] scales_1.0.0 backports_1.1.5
[29] webshot_0.5.2 jsonlite_1.6
[31] fs_1.3.1 hms_0.5.2
[33] digest_0.6.22 stringi_1.4.3
[35] grid_3.6.1 cli_1.1.0
[37] tools_3.6.1 lazyeval_0.2.2
[39] Formula_1.2-3 crayon_1.3.4
[41] pkgconfig_2.0.3 zeallot_0.1.0
[43] xml2_1.2.2 reprex_0.3.0
[45] lubridate_1.7.4 assertthat_0.2.1
[47] rmarkdown_1.16 httr_1.4.1
[49] rstudioapi_0.10 R6_2.4.0
[51] nlme_3.1-140 compiler_3.6.1 `
I am sorry to bother you with this. I actually have one variable for month and another for semester, so I do have collinearity.
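For anyone landing here with the same message, the month/semester situation can be reproduced with a minimal sketch in base R. The data and the names `dat`, `month`, `semester`, `x`, and `y` are made up for illustration and are not the data from this issue. The sketch compares the rank of the design matrix with its number of columns, then shows that `coef()` on an `lm()` fit still reports the aliased term as NA, while the coefficient table from `summary()` omits it. That omission may be why output built from the summary table looks as if nothing is NA.

```r
# Hypothetical simulated data: 'semester' is fully determined by 'month',
# so the two factors are perfectly collinear.
set.seed(1)
n <- 240
month <- factor(rep(1:12, length.out = n))
semester <- factor(ifelse(as.integer(month) <= 6, "S1", "S2"))
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
dat <- data.frame(y, x, month, semester)

# Compare the number of columns of the design matrix with its rank.
X <- model.matrix(y ~ x + month + semester, data = dat)
rank_X <- qr(X)$rank
n_aliased <- ncol(X) - rank_X  # number of coefficients that cannot be estimated

# lm() also cannot estimate the aliased term: coef() reports it as NA ...
fit <- lm(y ~ x + month + semester, data = dat)
na_terms <- names(coef(fit))[is.na(coef(fit))]

# ... but the coefficient table from summary() leaves aliased terms out.
n_summary_rows <- nrow(coef(summary(fit)))

print(n_aliased)
print(na_terms)
print(c(all_coefs = length(coef(fit)), summary_rows = n_summary_rows))
```

If `n_aliased` is greater than zero on the real data, `alias(fit)` on the `lm()` fit lists which terms are linear combinations of the others.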
I am fitting a linear model with a categorical variable that has more than 150 categories. Using lm_robust I get the following message: "27 coefficients not defined because the design matrix is rank deficient", and some parameters are returned as NA. This happens whether or not the standard errors are clustered.
I know that I do not have multicollinearity problems. Moreover, I do not get similar messages when using the base lm() function, and all the variables are estimated (none are NA).
Any hints on what might be happening? Is this related to #305?