JMSLab / xtevent

Stata package -xtevent-
MIT License
43 stars 12 forks source link

Repeated use restricting to e(sample) leads to different results #203

Closed jorpppp closed 3 months ago

jorpppp commented 3 months ago

Consider this example

clear all
webuse nlswork
by idcode (year): gen time=_n
xtset idcode time
by idcode (time): gen union2 = sum(union)
replace union2 = 1 if union2 > 1
xtevent ln_w age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure ///
, pol(union2) w(3) cluster(idcode) impute(nuchange)

gen s = e(sample)

xtevent ln_w age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure if s ///
, pol(union2) w(3) cluster(idcode) impute(nuchange)

The second run over the estimation sample from the first run leads to different results. This probably has to do with the impute option.

I am going to label this as a bug for now because it looks like it.

jorpppp commented 3 months ago

Note that repeated saving of imputed variables with different suffixes is not working:

clear all
webuse nlswork
by idcode (year): gen time=_n
xtset idcode time
by idcode (time): gen union2 = sum(union)
replace union2 = 1 if union2 > 1
xtevent ln_w age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure ///
, pol(union2) w(3) cluster(idcode) impute(nuchange, saveimp)

ren union2_imputed union2_imputed1
gen s = e(sample)

xtevent ln_w age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure if s ///
, pol(union2) w(3) cluster(idcode) impute(nuchange, saveimp)

returns an error message.

This is because we are confirming vars with unab, so I'm going to fix that first.

jorpppp commented 3 months ago

@Constantino-Carreto-Romero note the issue above https://github.com/JMSLab/xtevent/issues/203#issuecomment-2111160274 and the fix in https://github.com/JMSLab/xtevent/commit/548b1a6119e73ed0977ecb99928c0d8854816d2b. I think this is also an issue for #177 that we just closed, because it means that when xtevent drops variables to replace them, it may be dropping more variables that it should.

Can you please open a new issue to check all confirms to make sure this doesn't happen? Note that for the fix in #177 you may need to move the check for replacing to a later place in the code, when the code already knows the new kvars and can confirm them individually.

jorpppp commented 3 months ago

After https://github.com/JMSLab/xtevent/commit/548b1a6119e73ed0977ecb99928c0d8854816d2b the above code works: https://github.com/JMSLab/xtevent/issues/203#issuecomment-2111160274

and it shows that the imputed variable is different across runs, so the issue has to do with the imputation.

jorpppp commented 3 months ago

Actually the issue may be deeper since it shows up without any imputation:

clear all
webuse nlswork
by idcode (year): gen time=_n
xtset idcode time
by idcode (time): gen union2 = sum(union)
replace union2 = 1 if union2 > 1
xtevent ln_w age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure ///
, pol(union2) w(3) cluster(idcode)
est sto m1

xtevent ln_w age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure if s ///
, pol(union2) w(3) cluster(idcode) 
est sto m2

est tab m1 m2
jorpppp commented 3 months ago

This is not an imputation issue, and I don't think this is a bug. This code checks the estimation samples for the two runs:

clear all
webuse nlswork
by idcode (year): gen time=_n
xtset idcode time
by idcode (time): gen union2 = sum(union)
replace union2 = 1 if union2 > 1
xtevent ln_w age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure ///
, pol(union2) w(3) cluster(idcode) savek(a)
est sto m1

* ren union2_imputed union2_imputed1
gen s1 = e(sample)

xtevent ln_w age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure if s1 ///
, pol(union2) w(3) cluster(idcode) 
est sto m2

gen s2 = e(sample)

* ren union2_imputed union2_imputed2

* assert union2_imputed1 == union2_imputed2

est tab m1 m2

tab s1 s2
tab idcode if s1 != s2

This shows that the samples differ between runs and it lists some ids for where this happens. These ids are the same that are excluded in the second run because of ambiguous event times.

image

After that if we check the sample for missing values for one of the ids:

li idcode time union2 tenure age ttl_exp s1 s2 if idcode == 13

we get

image

Note that tenure is missing precisely when the policy variable, union2, changes from zero to one. So in the first run, this observation is excluded because of missing tenure. Now, the second run that is restricted to the estimation sample of the first run. Since that observation is not in the sample for the second run, the observations for id=13 have ambiguous event times so the entire unit is excluded. The same happens for the other ids in the sample.

The same happens with the other ids with the sample.

I'll now check the imputation case.

jorpppp commented 3 months ago

It's a good idea to return the list of excluded units due to ambiguous event times so they can be checked. I added this in https://github.com/JMSLab/xtevent/commit/3a84461121c8f2723d551ab3138cd9205779b3ea and https://github.com/JMSLab/xtevent/commit/7afc04c2c63ff17cfdf53950e0e2f3ac68708815

After doing this I checked if the excluded units due to ambiguous event time when using impute(nuchange) (a larger set than without imputation) had the same issue of missing tenure when the policy activates, and they do. So the issue is the same overall.

In summary I think this is not a bug.

@Constantino-Carreto-Romero could you read the posts on this issue and let me know if you agree? If so we can start a PR to add these changes.

Constantino-Carreto-Romero commented 3 months ago

It's a good idea to return the list of excluded units due to ambiguous event times so they can be checked. I added this in 3a84461 and 7afc04c

After doing this I checked if the excluded units due to ambiguous event time when using impute(nuchange) (a larger set than without imputation) had the same issue of missing tenure when the policy activates, and they do. So the issue is the same overall.

In summary I think this is not a bug.

@Constantino-Carreto-Romero could you read the posts on this issue and let me know if you agree? If so we can start a PR to add these changes.

@jorpppp I replicated the examples throughout the issue and I agree it should not be considered as a bug. In the first run, to create the event-time dummies, the program focuses on a larger sample, but then estimates restricting to non-missing observations in the varlist. If the user wants to reestimate using e(sample) he/she might not be able to run the same estimation because the program won’t be able to define event time for some units that lost key observations in the previous run. Maybe we could add a second sample function, one that focuses on the sample used when creating the event-time dummies.

jorpppp commented 3 months ago

@Constantino-Carreto-Romero Yes, this makes me think that we just need to set the sample to the one used when creating the event-time dummies. That way the results will replicate using e(sample), and at the end of the day, that is the sample that xtevent is using as a whole, even if _eventols or _eventiv use smaller samples. I'll implement.

jorpppp commented 3 months ago

Summary: In this issue we addressed an issue with xtevent giving different results when ran twice, using the e(sample) from the first run in the second run. Thread continues in #205