bcallaway11 / did

Difference in Differences with Multiple Periods, website: https://bcallaway11.github.io/did

Error in mboot when clustering at two variables #175

Open cmjoyce opened 1 year ago

cmjoyce commented 1 year ago

Hi there,

I'm using the did package and need to account for clustering at the district level, which is different from my idname (individuals residing in these clusters). Based on the existing documentation, I've accounted for individual and district level clustering. The code and error message are as follows:

att_gt(yname = "outcome",
       tname = "year",
       gname = "g",
       idname = "id",
       xformla = ~ 1,
       data = df,
       panel = FALSE,
       weightsname = "weight_adj",
       clustervars = c("id", "dist_id"),
       control_group = "notyettreated",
       print_details = TRUE,
       bstrap=TRUE, cband=FALSE
)

Error in mboot(inffunc, DIDparams = dp, pl = pl, cores = cores) : 
  can't handle that many cluster variables

I've tried making a vector of these variables and using that as my clustervars, but that just errors out.

Is there a way to get around this error and account for both clustering variables?

Thanks very much, Caroline

pedrohcgs commented 1 year ago

Hi Caroline, does the error remain when you use just dist_id as the cluster variable?

Thanks

--

Pedro H. C. Sant'Anna https://psantanna.com

cmjoyce commented 1 year ago

Hi, thanks for the quick response!

There's no error when I use just dist_id as my clustering variable, though I end up with some very large (confusingly so) standard errors for some treatment groups, especially when I include individual-level covariates. But if clustering only on district gives correctly calculated standard errors, I will assume the issue is on my end.

Caroline

bcallaway11 commented 10 months ago

@cmjoyce, sorry for the delayed response. I am surprised that you got an error with the first version that you sent. I am marking that as a bug as I think it should work.

That being said, by default, we already cluster at the unit level (in your case "id"), so clustering on both ends up being redundant. This is not a fix for the large standard errors, but they are the ones that I think you were trying to get from the beginning.
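A minimal sketch of the single-cluster call discussed above, assuming the same `df` and column names as in the original post. The wrapper name `fit_district_clustered` is hypothetical and is there only so the call can be defined without running it; every argument is taken from the original call, with `clustervars` reduced to the district variable.

```r
# Sketch only: assumes the did package is installed and that `df` has the
# columns named in the original post. The wrapper name is hypothetical.
fit_district_clustered <- function(df) {
  did::att_gt(
    yname = "outcome",
    tname = "year",
    gname = "g",
    idname = "id",
    xformla = ~ 1,
    data = df,
    panel = FALSE,
    weightsname = "weight_adj",
    clustervars = "dist_id",  # district only; "id" is not listed
    control_group = "notyettreated",
    print_details = TRUE,
    bstrap = TRUE,
    cband = FALSE
  )
}
```

Calling `fit_district_clustered(df)` corresponds to the variant reported above as running without the mboot error.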

cmjoyce commented 10 months ago

Yes, I think my clustering on two variables was redundant -- I tweaked some things and got it working. I limited my clustering to one variable to avoid the error message. Thanks for the awesome package!

bcallaway11 commented 9 months ago

Ok, great!

Note to self: I am going to leave this open as I think this could be confusing for users. Need to think about what the behavior should be if the user includes "id" among the clustering variables.
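One possible behavior, sketched as a pre-check a user could run before calling att_gt. This is not the package's implementation: `check_clustervars` is a hypothetical helper written in base R. It assumes that listing the unit id is redundant and can be dropped, that at most one further cluster variable is allowed, and that the variable must not change within a unit. The last two conditions mirror the two error messages quoted in this thread.

```r
# Hypothetical helper, not part of the did package.
check_clustervars <- function(data, idname, clustervars) {
  # Drop the unit id: clustering at the unit level is already the default.
  extra <- setdiff(clustervars, idname)

  if (length(extra) > 1) {
    stop("more than one cluster variable besides '", idname, "': ",
         paste(extra, collapse = ", "))
  }

  if (length(extra) == 1) {
    # Number of distinct cluster values observed for each unit.
    n_values <- tapply(data[[extra]], data[[idname]],
                       function(x) length(unique(x)))
    if (any(n_values > 1)) {
      stop("cluster variable '", extra, "' varies within '", idname, "'")
    }
  }

  extra
}

# Toy data (made up for illustration): three units observed in two years.
toy <- data.frame(
  id      = c(1, 1, 2, 2, 3, 3),
  year    = c(2001, 2002, 2001, 2002, 2001, 2002),
  dist_id = c("a", "a", "a", "a", "b", "b")
)

check_clustervars(toy, "id", c("id", "dist_id"))  # returns "dist_id"
```

With `clustervars = c("id", "year")` the same helper stops, because year takes two values within every unit.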

kdjiffa commented 3 months ago

Hi, I am having a similar issue related to this post. I have balanced panel data where I want to cluster at group and time level. I am using the individual id variable in clustervars instead of the group variable as per the documentation. I have 3 time periods (years), 3,000 observations per period and 1,000 per group, which amounts to 9,000 observations in total. Below are my code and the error:

csdid_out <- att_gt(yname = "Y2it",
                    tname = "year",
                    gname = "first.treat",
                    idname = "id",
                    est_method = "reg",
                    data = data,
                    panel = TRUE,
                    clustervars = c("id", "year"),
                    control_group = "notyettreated",
                    bstrap = TRUE,
                    cband = FALSE
)

Error in mboot(inffunc, DIDparams = dp, pl = pl, cores = cores) : 
  can't handle time-varying cluster variables

I will appreciate any help on this.

pedrohcgs commented 3 months ago

Time should not be used as a cluster variable in a DiD procedure with fixed T.

You can't do inference with 3 observations…
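To make the point concrete, here is a small base R illustration on a toy panel with the dimensions described in the question (3,000 units, 3 years). The year values 2021 to 2023 are an assumption; the post does not state them.

```r
# Toy balanced panel: 3,000 units observed in 3 years (year values assumed).
panel <- expand.grid(id = 1:3000, year = 2021:2023)

nrow(panel)                   # total observations
length(unique(panel$year))    # clusters available when clustering on year
length(unique(panel$id))      # clusters available when clustering on id

# year takes several values within each unit, so it is time-varying.
years_per_unit <- tapply(panel$year, panel$id, function(x) length(unique(x)))
all(years_per_unit > 1)
```

Clustering on year would leave only as many clusters as there are periods, while clustering on id keeps one cluster per unit.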


kdjiffa commented 3 months ago

Thanks for your quick feedback. In fact, what I meant is clustering at the group*period (intersection) level. What is the best way to cluster at that level? Thanks

pedrohcgs commented 3 months ago

Just use the id.

Thanks
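A sketch of what that suggestion might look like for the call posted above, keeping every other argument unchanged. The wrapper name `fit_unit_clustered` is hypothetical. Since unit-level clustering was described earlier in this thread as the default, omitting `clustervars` altogether is presumably equivalent.

```r
# Sketch only: assumes the did package is installed and that `data` has the
# columns named in the post above. The wrapper name is hypothetical.
fit_unit_clustered <- function(data) {
  did::att_gt(
    yname = "Y2it",
    tname = "year",
    gname = "first.treat",
    idname = "id",
    est_method = "reg",
    data = data,
    panel = TRUE,
    clustervars = "id",  # unit id only; "year" is dropped
    control_group = "notyettreated",
    bstrap = TRUE,
    cband = FALSE
  )
}
```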


kdjiffa commented 3 months ago

Thanks