JMSLab / xtevent

Stata package -xtevent-
MIT License
43 stars 12 forks source link

Excluding units despite event time is defined #126

Closed Constantino-Carreto-Romero closed 1 year ago

Constantino-Carreto-Romero commented 1 year ago

The program excludes units if owing to missing values in the policyvar, it cannot determine the event-time. Nontheless, this is also happening even though the policyvar has no missing values and the event time is well defined.

Constantino-Carreto-Romero commented 1 year ago

In https://github.com/JMSLab/xtevent/commit/878a7af487fb780ae0dbf9393da9189649f17d79, I uploaded some code and its log file to help confirm the bug.

The bug arises because the program is not identifying treatment time because the dependent variable has missing values in the treatment period: the local touse, which marks the sample in _eventgenvars, should not be considered when creating the event-time dummies because due to missing values in the dependent variable some observations are excluded, but it turns out those observations, in the case of some units, are the periods when the event-time is defined (e.g., policyvar changes from 0 to 1).

Constantino-Carreto-Romero commented 1 year ago

@jorpppp In https://github.com/JMSLab/xtevent/commit/dd4e2c2a26107d9be37a3567bacf3e9837d8b9f6 I removed the sample marker touse. The inclusion of touse as a conditioner throughtout _eventgenvars was restricting the sample to non-missing observations in the variables in the saved local varlist, which in the example in https://github.com/JMSLab/xtevent/commit/878a7af487fb780ae0dbf9393da9189649f17d79 only included the dependent variable lnrealincome. This was hapenning despite of policyvar not having missing observations. Restricting in this way, was causing that for some units, the observations where their event-time is defined, were excluded. In consequence, the program couldn't determine their event-time period, and those observations were excluded for the estimation.

In https://github.com/JMSLab/xtevent/commit/3b317a699fc1dc3e9d07c1a4f0e339e62a57bf9d I replicated the example in https://github.com/JMSLab/xtevent/commit/878a7af487fb780ae0dbf9393da9189649f17d79 after the correction in https://github.com/JMSLab/xtevent/commit/dd4e2c2a26107d9be37a3567bacf3e9837d8b9f6. Now, no unit is excluded and as noticed in the table at the of the log file, in the units that previously were excluded, only their missing observations are excluded but not the whole unit.

Constantino-Carreto-Romero commented 1 year ago

@jorpppp I ran the test file with/without marksample in _eventgenvars and with missing values in policyvar (and eta, the dependent variable). This is a table to show execution times in each case:

missing values in policyvar missing values in eta marksample execution time
yes yes no 178.57
yes 172.55
yes no no 177.06
yes 176.92

In the case of missing values in eta, it is expected that the inclusion of marksample reduces time execution because it restrics the sample to non-missing observations in both the depedent and independent variables. The program is excluding more units in the case of missing values in eta because the event-time of some units is lost due to missings in eta. These are the log files: with marksample , without marksample.

When only considering missing values in the policyvar , execution times are similar because the inclusion/exclusion of marksample in _eventgenvars doesn't alter the sample because eta has no missing values. These are the log files: with marksample , without marksample. The log files show identical outputs in this case. I had to upload separate files because github didn't show the comparison window due to files' size.

jorpppp commented 1 year ago

@Constantino-Carreto-Romero Can you please remind me what is your recommended fix here? It seems like your fix excluding the marksample works, but I believe we were worried about speed without marksample. It doesn't seem like there are major differences in speed with and without marksample though. Do you agree?

Constantino-Carreto-Romero commented 1 year ago

@jorpppp in the call previous to the analysis in https://github.com/JMSLab/xtevent/issues/126#issuecomment-1358157589, we had agreed that removing the sample marker in _eventgenvars seemed the solution, but we were worried that the sample marker could also be affected by missing values in the policyvar, and not only by missing values in the variables included in the local varlist (the dependent variable and the independent variables). Thus, it was also necessary to compare the scenario where the dependent variable eta doesn't have missing values, but the policyvar does. The time executions in https://github.com/JMSLab/xtevent/issues/126#issuecomment-1358157589 confirm that the sample marker doesn't inspect for missing values in the policyvar. When we are in _eventgenvars we focus on only the policyvar, and not on the other variables, and since the sample marker doesn't inspect for missing values in the policyvar it seems right to remove the sample marker from _eventgenvars.

jorpppp commented 1 year ago

@Constantino-Carreto-Romero Got it, thanks, let me start a PR then.

jorpppp commented 1 year ago

@Constantino-Carreto-Romero Got it, thanks, let me start a PR then.

On second thought, it seems like the changes here are only in the issue folder, so let me test the changes in the issue folder first, then implement in the main code, then PR.

jorpppp commented 1 year ago

@Constantino-Carreto-Romero I thought about this and I don't think completely excluding marksample from _eventgenvars is the way to go. Besides opening up a path for bugs (not using marksample and touse is unusual), if we were to do this then the if in restrictions from xtevent would not be passed down to _eventgenvars.

My approach in https://github.com/JMSLab/xtevent/commit/a9fdee7246929d42b7bb61b2efb8d2b1ebbfd2e3 is to mark a sample for _eventgenvars that considers the if/in restrictions, but does not consider missings in varlist.

This is not entirely satisfactory though. I am thinking of an scenario where staggered adoption may not hold in the full sample, but it may hold in the sample where varlist is not missing. But in that case the user should probably choose the staggered adoption sample using if/in in xtevent.

Constantino-Carreto-Romero commented 1 year ago

@jorpppp I have explored the documentation of marksample and it is possible to specify marksample with the option novarlist. With this option, marksample will preserve the if and in restrictions, but will ignore the missing values in the variables contained in varlist. Should I implement marksample with the option novarlist and test it?

jorpppp commented 1 year ago

@Constantino-Carreto-Romero Would this approach have any advantages over the one I already implemented with mark ?

Constantino-Carreto-Romero commented 1 year ago

@jorpppp I think the approach with mark in https://github.com/JMSLab/xtevent/commit/a9fdee7246929d42b7bb61b2efb8d2b1ebbfd2e3 preserves the if and in restrictions within _eventgenvars, but missing values in varlist are still affecting the sample used within _eventgenvars because mark, as with marksample, is also affected by missing values in varlist, according to remark 5 in the documentation. The advantage of using marksample is that it admits the option novarlist.

jorpppp commented 1 year ago

Ok, let's go ahead and test the approach you suggest, and compare it to the one I implemented using mark

jorpppp commented 1 year ago

@Constantino-Carreto-Romero I don't think mark carries over the missings in varlist. See this example:

clear

sysuse auto
replace price =. in 1

cap program drop test1
program define test1
    syntax varlist [if] [in]
    marksample touse
    gen test1= `touse'
end

cap program drop test2
program define test2
    syntax varlist [if] [in]
    mark test2 `if' `in'    
end

test1 price if !foreign
test2 price if !foreign

li price foreign test* in 1/2

This returns:

image

So `mark' includes the observation where price is missing.

I'm going to fix a tempvar generation and then move this to a PR for testing, we can discuss there if any additional issues arise.

jorpppp commented 1 year ago

Summary: In this issue we fixed a bug where event-time variables were not generated if variables (not related to event-time) were missing. Thread continues in #139