Understanding the "Not enough control units for group"

MatthieuStigler commented 3 years ago

Hi

Thanks a lot for the great package, this is very useful!

We keep running into problems when calling did::att_gt with a discrete covariate, which produces warnings like: Not enough control units for group 1936 in time period 1917 to run specified regression (see reprex below). Could you help us build intuition for what the specific problem is? It seems to be more about having too many covariate levels than about having too few units.

My understanding is that, in the period-by-period diff-in-diff, this happens whenever the levels of a discrete variable are unbalanced between the treatment and control groups, meaning a level might occur only in the treatment group. I infer this from seeing that you run rcond(t(control_covs)%*%control_covs) on the control group, which would return 0 when a level is found in the treatment group (and hence expanded into a column by model.matrix()) but not in the control group (where that column is then all zeros). Intuitively, I thought this would not be an issue for the period-by-period estimation, though I realize I don't fully understand how you transform the panel model with time-varying covariates into a model with pre-treatment covariates to be used by DRDID::reg_did_panel. Could you kindly elaborate on this?

Thanks!

Code:

# version as of March 30, 2021
devtools::install_github("bcallaway11/did", ref="c175ee0e3e022445c94a785ba3e880bf649b47f8")
#> Skipping install of 'did' from a github remote, the SHA1 (c175ee0e) has not changed since last install.
#>   Use `force = TRUE` to force installation
packageVersion("did")
#> [1] '2.0.1.903'

library(did)

G <- 20
N_g <- 10
t <- 50
time <- 1900:(1900+t-1)
group <- 1:G
treat <- c(1930:(1930+G-10), rep(0, 9))
ind_id <- as.vector(outer(letters, letters, FUN = paste, sep="_"))[1:(N_g*G)]

set.seed(1814)
data <- tibble::tibble(period= rep(time, G*N_g),
                       id = rep(1:(N_g*G), each=t),
                       treat = rep(treat, each=t*N_g),
                       x = rnorm(G*t*N_g),
                       x2 = sample(letters, size=G*t*N_g, replace=TRUE),
                       Y = 1.1+0.8*x+0.2*treat+rnorm(G*t*N_g))

data
#> # A tibble: 10,000 x 6
#>    period    id treat      x x2        Y
#>     <int> <int> <dbl>  <dbl> <chr> <dbl>
#>  1   1900     1  1930 -0.876 a      386.
#>  2   1901     1  1930  0.975 p      388.
#>  3   1902     1  1930 -0.874 s      386.
#>  4   1903     1  1930  0.433 s      389.
#>  5   1904     1  1930  0.159 a      387.
#>  6   1905     1  1930  0.510 m      388.
#>  7   1906     1  1930  1.23  m      388.
#>  8   1907     1  1930 -1.92  k      386.
#>  9   1908     1  1930 -0.403 w      387.
#> 10   1909     1  1930 -0.512 z      388.
#> # … with 9,990 more rows

##
example_attgt <- did::att_gt(yname = "Y",
                        tname = "period",
                        idname = "id",
                        gname = "treat",
                        xformla = ~x+x2,
                        data = data,
                        est_method = "reg",
                        bstrap=FALSE, cband=FALSE)
#> Warning in compute.att_gt(dp): Not enough control units for group 1930 in time
#> period 1906 to run specified regression
#> Warning in compute.att_gt(dp): Not enough control units for group 1930 in time
#> period 1909 to run specified regression
#> Warning in compute.att_gt(dp): Not enough control units for group 1930 in time
#> period 1915 to run specified regression
#> Warning in compute.att_gt(dp): Not enough control units for group 1930 in time
##----- skip warnings
#> "treat", : Not returning pre-test Wald statistic due to singular covariance
#> matrix

head(broom::tidy(example_attgt))
#>             term group time   estimate std.error   conf.low conf.high
#> 1 ATT(1930,1901)  1930 1901  0.5576056 0.5633557 -0.5465512 1.6617624
#> 2 ATT(1930,1902)  1930 1902 -0.1429263 0.7481639 -1.6093007 1.3234481
#> 3 ATT(1930,1903)  1930 1903 -0.2518773 0.6243255 -1.4755328 0.9717783
#> 4 ATT(1930,1904)  1930 1904 -0.1720505 0.5367777 -1.2241156 0.8800145
#> 5 ATT(1930,1905)  1930 1905 -0.6301744 0.6633861 -1.9303873 0.6700385
#> 6 ATT(1930,1906)  1930 1906         NA        NA         NA        NA
#>   point.conf.low point.conf.high
#> 1     -0.5465512       1.6617624
#> 2     -1.6093007       1.3234481
#> 3     -1.4755328       0.9717783
#> 4     -1.2241156       0.8800145
#> 5     -1.9303873       0.6700385
#> 6             NA              NA

Created on 2021-03-30 by the reprex package (v1.0.0)

bcallaway11 commented 3 years ago

Hi Matthieu,

Thanks for the message. I think your intuition is correct that, in your case, the warning messages are due to "too many covariates". What is happening here is a violation of the overlap condition, which requires that for every treated unit we can find "matching" untreated units with the same characteristics. For us, this often shows up when group sizes are small (hence the wording of the warning you got), but your case seems to be a different form of the same violation.
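
To make this concrete, here is a minimal sketch on toy data. It assumes, as described in the question above, that the check is based on rcond() of the control-group cross-product matrix; the data frame `toy` and its column names are hypothetical and are not taken from the package.

```r
# Toy data (hypothetical): level "c" of x2 appears only among treated units
toy <- data.frame(
  treated = c(1, 1, 1, 0, 0, 0, 0),
  x2      = factor(c("a", "b", "c", "a", "b", "a", "b"))
)

# Expand the factor into dummy columns using all units,
# then keep only the rows of the control units
covs         <- model.matrix(~ x2, data = toy)
control_covs <- covs[toy$treated == 0, , drop = FALSE]

# The dummy column for level "c" is all zeros among the controls ...
colSums(control_covs)

# ... so the cross-product matrix is singular and its reciprocal
# condition number is numerically zero
rc <- rcond(crossprod(control_covs))
rc < .Machine$double.eps
```

In this sketch there are plenty of control units, yet the regression on the control group still cannot be run, because one covariate column carries no information there.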

For the part about dealing with time-varying covariates, our code takes the value of the covariate in the "baseline period". In post-treatment time periods, the baseline period is the one right before the group becomes treated. In pre-treatment periods, it is the period immediately before the current period.
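
As a rough illustration of that rule (not the package's actual code), the hypothetical helper below returns the period whose covariate values would be used for group g in period t. It assumes consecutive integer periods and that g is the first treated period.

```r
# Hypothetical helper illustrating the baseline-period rule described above.
# g: first period in which the group is treated; t: current period.
baseline_period <- function(g, t) {
  if (t >= g) {
    g - 1  # post-treatment: period right before the group becomes treated
  } else {
    t - 1  # pre-treatment: period immediately before the current one
  }
}

baseline_period(1930, 1935)  # 1929
baseline_period(1930, 1906)  # 1905
```

Under this reading, the failing ATT(1930, 1906) in the reprex would use the covariate values observed in 1905.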

Hope this helps!

Brant

MatthieuStigler commented 3 years ago

Great, thanks for the explanation! To be sure, this "overlap violation" should in theory happen only with discrete/factor covariates, is that correct?

pedrohcgs commented 3 years ago

Not really. What is going on here is that sometimes we have "too many covariates", so that one cannot invert the design matrix. This can happen with discrete or continuous covariates. In other cases, the covariates perfectly predict treatment, which is also problematic.
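
A small sketch of the first case with purely continuous covariates, using made-up numbers (5 control units, 6 covariates plus an intercept); the object names are hypothetical:

```r
set.seed(1)
n_control <- 5

# Intercept plus 6 continuous covariates: 7 columns but only 5 rows
X <- cbind(1, matrix(rnorm(n_control * 6), nrow = n_control))

# The rank cannot exceed the number of rows, so X'X (7 x 7) is singular
x_rank <- qr(X)$rank
x_rank < ncol(X)

# Reciprocal condition number of X'X is numerically zero
rc_cont <- rcond(crossprod(X))
rc_cont
```

No factor is involved here; the design matrix simply has more columns than there are control units to identify them.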

Hope this clarifies. Pedro

bcallaway11 / did

Understanding the "Not enough control units for group" #43