NickCH-K / did

A Stata package that acts as a wrapper for Callaway and Sant'Anna's R did package

Group is a string variable #14

Closed. terkelse closed this issue 3 years ago

terkelse commented 3 years ago

Attempting att_gt (haven't gotten it to work yet) on an ever-simplifying version of my model. My most recent recurring error is: "Group is a string variable" in mkmat. My gvar is an int var, so I don't believe the error is what it suggests.

A possibly relevant detail is I kept running into an "Inf not found" error with call_return.ado before I added a line in the ado to bypass it (https://github.com/haghish/rcall/pull/17/files/1a97cbb7aa01c1b803a967952149655a6b4e554c#). However, a basic rcall regression is returning results.

Why might this Group string error arise?

NickCH-K commented 3 years ago

Can you post minimal code/data that will produce the errors? It will be very difficult for me to track the bug down otherwise.

My guess is that the regression output includes NAs in the results, and rcall is having trouble sending those results back to Stata. Is this correct? Or are you not getting any results to screen at all?

NickCH-K commented 3 years ago

Also, if your group var has a value label on it, try removing it and see if that works.

terkelse commented 3 years ago

Sample code: att_gt y time group x_1 x_2 x_3, idname(id) anticipation(2) biters(10) clustervars(id)

id, group, time are integers. y, x_1, x_2, x_3 are double. None contains missing values. None has value labels.

I suspect the same as you -- the output is NAs and stata isn't recognizing it. I just don't know why that would be or how to check if that is the case.

I am not getting any results, just the error: string variables not allowed in varlist; Group is a string variable r(109);

Other comments: I'm hitting rcall clear after each attempt since it tells me to redo didsetup otherwise, and I run into issues when I redo didsetup so I've learned to avoid it or just uninstall/reinstall everything when I can't.

terkelse commented 3 years ago

And in my blue R screen it tells me that there are 50 or more warnings before it crashes into the Group is a string variable error.

NickCH-K commented 3 years ago

When I run your code with the following DGP I get no error. That's telling me there's something going on in your data that doesn't fit.

clear

set obs 500

* Outcome and covariates
g y = rnormal()
g x_1 = rnormal()
g x_2 = rnormal()
g x_3 = rnormal()

* ID and time indicators
g id = floor((_n-1)/5)
g time = mod(_n-1, 5)

* First period treated should be the same for all obs in id
g group = floor(runiform()*5)
sort id time
replace group = group[_n-1] if id == id[_n-1]
* Untreated groups
replace group = 0 if id < 25

att_gt y time group x_1 x_2 x_3, idname(id) anticipation(2) biters(10) clustervars(id)

Some notes:

  1. Try rcall: summary(CS_Model). If you see the output, that means the model ran correctly but there was an error in the process of sending it back to Stata
  2. If it DOES work, try rcall: table[['Group']] to see the variable it seems to be having trouble with
  3. rcall has "sticky" errors sometimes where once an error pops up it will continue to pop up for everything you do even if there's not an actual error. rcall clear won't fix it, you have to restart Stata.
  4. Check the gvar to make sure every value is a valid time period or 0.
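
A minimal sketch of the check in note 4, using only built-in Stata commands. The toy data and the macro names `periods`, `cohorts`, and `bad` are illustrative assumptions, not part of the did package; swap in your own id, time, and group variables.

```stata
* Toy panel (hypothetical): unit 3 has group = 7, which is not an observed period
clear
input id time group
1 1 0
1 2 0
2 1 2
2 2 2
3 1 7
3 2 7
end

* group must be numeric; detach any value label just in case
confirm numeric variable group
label values group .

* group must be constant within id
bysort id (time): assert group == group[1]

* every nonzero group value should be one of the observed time periods
levelsof time, local(periods)
levelsof group if group != 0, local(cohorts)
local bad : list cohorts - periods
display "group values that are not observed time periods: `bad'"
```

An empty list after the colon means every nonzero group value matches a period in the data.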

terkelse commented 3 years ago

This is promising. rcall: summary(CS_Model) returned a table with the Group and Time columns complete. Most groups have ATTs and se's, but the singular-N groups are straight NAs (makes sense). However, none have confidence bands.

NickCH-K commented 3 years ago

Great, so the model is running. You might want to drop those singletons, that might be the only thing gumming it up.

NickCH-K commented 3 years ago

I also just now updated to handle models with lots of group/times a little better. Try reinstalling did and see if that helps.

terkelse commented 3 years ago

Still hitting the "Group is a string" error, even after the update. The CS_Model now contains confidence bands for those groups that have estimates. There are rows of NAs for those without estimates, sometimes for groups as large as 5 (though some singleton groups are still getting estimates). I'm upping my sample to see if I'm out of the woods with the large sample, but it's taking some time.

NickCH-K commented 3 years ago

Good luck! I don't think I can help any more without being able to reproduce the error myself. If you can share the data (or a subset) that produces the error I can look again.

terkelse commented 3 years ago

This last attempt gave me NAs for a group of N_g = 285. I think this is coming down to covariate selection, so I dropped all but one and was able to get results for all groups. Now I'm just hitting the matsize limit, which I think is fine. I'll upload a dummy dataset for you to troubleshoot.

Thanks for all your help, by the way. This code is awesome, and way faster than my attempts at running C&S through rcall.

terkelse commented 3 years ago

Here's a sample of N=2241, G=36, T=72. One group (44) is a singleton, but many will give NAs depending on which X's are included. Could be an overlap issue for some groups? If this is unavoidable, then the error in this case ("Group is a string") could be replaced with something along the lines of "choose better X's".

I'm going to pause on this for now.

samp.TXT

NickCH-K commented 3 years ago

It was an issue with the results table itself being too big to return, that's fixed now.

Fundamentally, Stata's just not going to like it whenever there's a results matrix with missing values, as there is in this data. So you'll get the results back, but they won't be in e(b) or e(V). Any follow-up aggte is likely to be a little shaky too.

terkelse commented 3 years ago

Some good news (?) is I'm returning to the Inf error again, but I haven't traced the error yet to see what I need to fix (likely in the call_return ado). And I am attempting to migrate the code to my school's hpcc to use MP, but the default rcall settings aren't working and I'm having to alter the rcall ado setting to find R first. Is there a reason why didsetup reinstalls rcall every time instead of checking if it's installed first?

terkelse commented 3 years ago

Ohp, my bad! The autoinstall confusion is a function of me using the "go" option. Disregard that last question

NickCH-K commented 3 years ago

> Some good news (?) is I'm returning to the Inf error again, but I haven't traced the error yet to see what I need to fix (likely in the call_return ado). And I am attempting to migrate the code to my school's hpcc to use MP, but the default rcall settings aren't working and I'm having to alter the rcall ado setting to find R first. Is there a reason why didsetup reinstalls rcall every time instead of checking if it's installed first?

I wasn't getting the Inf error when I ran your data with the newest version, so there may be something else going on there. Do you see results with rcall: summary(CS_Model)? If so, that's about as good as you'll get anyway; Stata's not going to let you do a lot of postestimation if your results have improper values. You can get access to the results table by writing it to file with rcall: write.csv(table, 'filename.csv')

NickCH-K commented 3 years ago

Also, none of the calculations are actually done in Stata, they're all in R. So switching to MP might help you open a bigger data set or something, but it won't make any of the did estimations any faster.

terkelse commented 3 years ago

I resolved the Inf issue (it was my fault, a typo). Got everything loaded in the hpcc, and it's nicely spitting out the temp_table_toobig csv, but not storing in e() as you said. I am re-running and attempting to gen dynamic effects w/ aggte. Should the na_rm option take care of the aggte "shaky"ness?

NickCH-K commented 3 years ago

You'll definitely want to add na_rm if you have blank entries in your results. I'm not sure if it will fully handle the shakiness - I don't think in general the did estimator is really made to handle huuuge numbers of group/time combinations. If it looks like it ran but you get nothing back, you can again do rcall: write.csv(table, 'filename.csv')

terkelse commented 3 years ago

Is there any way of getting Ns reported with the table?

NickCH-K commented 3 years ago

If the estimation works properly and everything gets returned to Stata, N can be found in e(N). If not, rcall: CS_Model[['n']] will give the number of unique group IDs, but it doesn't store the actual number of observations.

terkelse commented 3 years ago

Not sure how I'm going to get aggte working, and I can't calculate on my own without the N. Right now I'm trying to drop when time - group > max_e before I run att_gt. If that doesn't work, I'll try dropping groups where N_g < # of controls + 5. And if that doesn't work, I'll just learn R.
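
A rough sketch of those two pruning rules in plain Stata, run on a toy panel. The locals `max_e` and `n_controls` and the helper variables `unit_tag` and `n_units` are hypothetical names chosen for illustration; nothing here comes from the did package itself, and "controls" is read as the number of covariates.

```stata
* Toy panel (hypothetical): 30 units observed over 6 periods
clear
set obs 180
gen id   = ceil(_n/6)
gen time = mod(_n-1, 6) + 1
* ids 1-10 never treated, 11-28 first treated in period 3, 29-30 in period 5
gen group = cond(id <= 10, 0, cond(id <= 28, 3, 5))

local max_e      2   // longest event time to keep (assumed value)
local n_controls 3   // number of covariates in the model (assumed value)

* Rule 1: drop treated observations more than max_e periods after first treatment
drop if group != 0 & time - group > `max_e'

* Rule 2: drop treated groups with fewer than n_controls + 5 distinct units
egen byte unit_tag = tag(group id)
bysort group: egen n_units = total(unit_tag)
drop if group != 0 & n_units < `n_controls' + 5

tabulate group
```

Note that Rule 1 leaves the treated units with fewer periods than the never-treated ones, so the panel is unbalanced afterwards.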

terkelse commented 3 years ago

Nothing I've tried has avoided "Missing values in estimation results; results have not been delivered to e() matrices." from att_gt and, consequently (?), [1] "R failed to produce estimates, or rcall failed to return it to Stata." followed by "ATT not found" from aggte. I am, however, getting an att_gt table automatically written to temp_table_toobig.csv.

NickCH-K commented 3 years ago

Yeah I'm not sure. Missing values in the estimation results will screw up aggte, and can't be delivered to the e() matrices. This seems like an issue of the estimation/function not being able to produce estimates for the data/model, not a problem with the results being passed back and forth from Stata to R. If you ran the model in R directly you'd be running into the same problems.

terkelse commented 3 years ago

I've pruned the data to avoid NAs in the output, but I'm still hitting the same errors. I hadn't noticed yet that my confidence bands are astronomical despite modest standard errors. Have you encountered this before?

For example:

  Group Time    ATTgt       SE SimultCI95_Bot SimultCI95_Top
1    14    2  -0.0071 0.004897       -5.3E+09       5.29E+09
2    14    3  0.02191 0.005858       -6.3E+09       6.33E+09
3    14    4 -0.00131 0.004725       -5.1E+09        5.1E+09
4    14    5 -0.00604 0.002862       -3.1E+09       3.09E+09
5    14    6 -0.02046 0.003583       -3.9E+09       3.87E+09
6    14    7 0.012281 0.003488       -3.8E+09       3.77E+09
7    14    8  -0.0067 0.004291       -4.6E+09       4.64E+09

pedrohcgs commented 3 years ago

This happens because the simultaneous confidence bands are based on sup-t test statistics. Dividing by something very close to zero blows up the sup-t statistics. I've tried to control some of this in the package, but with tiny groups (or very little variation within the group), there is not much the package can do.
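
As a sketch of the mechanism in generic notation (the symbols are assumptions for illustration and may not match the package internals exactly), a sup-t band applies one common critical value to every (g,t) cell, taken from the bootstrap distribution of the largest absolute t-statistic:

```latex
\[
\widehat{ATT}(g,t) \;\pm\; \hat{c}_{1-\alpha}\,\widehat{se}(g,t),
\qquad
\hat{c}_{1-\alpha} = (1-\alpha)\text{-quantile of }
\max_{(g,t)} \left| \frac{\widehat{ATT}^{*}(g,t) - \widehat{ATT}(g,t)}{\widehat{se}(g,t)} \right| .
\]
% Implied critical value from the first row of the table in the previous comment:
\[
\hat{c}_{1-\alpha} \approx \frac{5.29 \times 10^{9}}{0.004897} \approx 1.08 \times 10^{12}
\]
```

The second row of that table gives about the same ratio (6.33E+09 / 0.005858 is roughly 1.08E+12), which fits a single exploded critical value rather than a problem with any individual standard error. One cell with a near-zero denominator is enough to inflate the maximum, and therefore every band.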

terkelse commented 3 years ago

Ah okay, so the zero-heavy y is the culprit. Not a ton I can do about that. I suspect the group issue has to do with little variation rather than size since the three I kicked out to avoid NAs had 2,619, 1,654, and 272 individual units. Would running again with the cband_no option help aggte find att_gt?

pedrohcgs commented 3 years ago

With this many relatively small groups, the ATT(g,t)’s should not be reliable. Using the aggte function should avoid several of these issues, I am assuming.

Setting cband_no will produce pointwise inference, which will “hide” the problem by using what I believe to be a much inferior inference procedure. You can do it, but I would personally never do that...

--

Pedro H. C. Sant'Anna Department of Economics Vanderbilt University 615-875-8448 (phone) @.*** https://pedrohcgs.github.io

terkelse commented 3 years ago

Oh wow, okay I was thinking these groups were not that small since they are large relative to other groups that do yield estimates for every t. For example, my smallest group contains 17 units and yields ATT_gts for all t's.

The lesser cbands at least appeared reasonable next to the standard errors, but I still can't get aggte to find my ATTs.

pedrohcgs commented 3 years ago

17 units in a group: we can’t really hope for a CLT to kick in in this setup, right?! How many covariates are you using, too?! All these issues you are getting smell like lack of overlap and small groups...

I like the fact that did gives these very wide bands, but it is a good reminder that we can’t do too much in these challenging setups.

terkelse commented 3 years ago

It does smell like lack of overlap, but that's why it is puzzling to get estimates for a group of 17 but not for a group of 2,619.

terkelse commented 3 years ago

An update: Increasing biters solved the confidence interval issue, so simple fix.

However, I still can't get aggte to work. Even using the example data and code, I get from att_gt:

             Length Class     Mode   
group          12   -none-    numeric
t              12   -none-    numeric
att            12   -none-    numeric
V_analytical    1   dgCMatrix S4     
se             12   -none-    numeric
c               1   -none-    numeric
inffunc      6000   dgCMatrix S4     
n               1   -none-    numeric
W               1   -none-    numeric
Wpval           1   -none-    numeric
aggte           0   -none-    NULL   
alp             1   -none-    numeric
DIDparams      26   DIDparams list   

Is aggte supposed to be NULL here? Because I then get [1] "R failed to produce estimates, or rcall failed to return it to Stata." ATT not found. Is it possibly an issue with Linux compatibility?

NickCH-K commented 3 years ago

If you're getting errors even with the example code, it's possible that rcall doesn't work properly on Linux - I haven't had a chance to test it myself.