zhizhongpu opened 4 months ago
@santiagohermo I have a question: currently we don't seem to enforce uniqueness of the keys (i.e. unitid interacted with timeid).
For example, in this code:
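(What follows is a hypothetical reconstruction of the snippet; the `EventStudy()` signature and every argument value below are assumptions, not the original code.)

```r
# Hypothetical reconstruction; all argument names/values are assumed.
reg <- EventStudy(estimator = "OLS", data = df_test_EventStudyOLS,
                  outcomevar = "y", policyvar = "z",
                  idvar = "id", timevar = "t",
                  controls = "x_r",
                  FE = TRUE, TFE = TRUE,
                  post = 2, pre = 2)
reg
```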
which yields estimates of:

```
> reg
              Estimate Std. Error    t value   Pr(>|t|)     CI Lower    CI Upper DF
z_lead3     0.16414209 0.14246519  1.1521557 0.25484743 -0.122152433 0.450436614 49
z_fd_lead3  0.11574829 0.16199814  0.7145038 0.47830594 -0.209799169 0.441295749 49
z_fd_lead2  0.07705731 0.16604537  0.4640738 0.64465008 -0.256623351 0.410737966 49
z_fd        0.07743044 0.12909231  0.5998067 0.55139884 -0.181990282 0.336851153 49
z_fd_lag1   0.31672930 0.17591476  1.8004703 0.07794326 -0.036784651 0.670243250 49
z_fd_lag2   0.29940208 0.15292494  1.9578368 0.05595634 -0.007912103 0.606716254 49
z_lag3      0.21211653 0.13096851  1.6195994 0.11173748 -0.051074556 0.475307608 49
x_r        -0.05229013 0.02915997 -1.7932163 0.07911138 -0.110889281 0.006309021 49
```
If we instead use the code below to add an extra row of (i,t) = (1,4):
```r
library(data.table)

cluster <- TRUE
setDT(df_test_EventStudyOLS)

# duplicate row 4, i.e. the observation with (i, t) = (1, 4)
extra_row <- df_test_EventStudyOLS[4]
df_test_EventStudyOLS <- rbind(df_test_EventStudyOLS, extra_row)

if (FE & TFE & cluster) {
  # ... the same estimation call as above runs here, unchanged ...
}
```
then we'd get different estimates:

```
> reg
              Estimate Std. Error    t value   Pr(>|t|)     CI Lower    CI Upper DF
z_lead3     0.16428300 0.14243445  1.1533937 0.25434404 -0.121949744 0.450515749 49
z_fd_lead3  0.11584767 0.16201551  0.7150406 0.47797716 -0.209734692 0.441430025 49
z_fd_lead2  0.07686159 0.16602474  0.4629526 0.64544789 -0.256777611 0.410500789 49
z_fd        0.07735582 0.12909922  0.5991967 0.55180222 -0.182078763 0.336790409 49
z_fd_lag1   0.31693785 0.17585949  1.8022221 0.07766335 -0.036465023 0.670340722 49
z_fd_lag2   0.29965314 0.15288305  1.9600155 0.05569375 -0.007576851 0.606883136 49
z_lag3      0.21271065 0.13087191  1.6253347 0.11050686 -0.050286298 0.475707605 49
x_r        -0.05256495 0.02905853 -1.8089335 0.07659875 -0.110960261 0.005830354 49
```
Note that the code still ran and no warning was issued, and that this latter result is hard to interpret. So I'm wondering whether the decision not to ban or warn about duplicated ids was intentional. Asking because in #51 I have a chance to enforce key uniqueness.
Thanks @zhizhongpu!
You are correct. When pre-cleaning the dataset passed in, we check whether it is unbalanced and whether holes exist in the time dimension, but we never enforce uniqueness in `idvar` and `timevar`. I agree, we should at a minimum throw a warning here.
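A minimal sketch of what such a check could look like inside the pre-cleaning step (hypothetical code, not the package's actual implementation; `data`, `idvar`, and `timevar` follow the names used above):

```r
# Hypothetical uniqueness check; assumes data is a data.table and
# idvar/timevar are character scalars naming the id and time columns.
if (anyDuplicated(data, by = c(idvar, timevar)) > 0) {
  warning("Duplicated (idvar, timevar) pairs detected; ",
          "estimates based on duplicated observations may be hard to interpret.")
}
```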
Some additional thoughts:
- Adding weights to a `fixest` call is super easy, we should definitely extend `EventStudy` to allow for this (a minimal sketch follows below).
- If we allow for weights we could ban datasets with duplicated rows. I think I would lean towards simply issuing a warning though, since some users may want to do this on purpose.

thanks @santiagohermo!

> Adding weights to a fixest call is super easy, we should definitely extend `EventStudy` to allow for this.

Yes - I can open a new issue once we merge #42
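For reference, attaching weights to a `fixest` estimation is indeed a one-argument change (a minimal sketch with assumed data and variable names, not package code):

```r
library(fixest)

# feols() accepts weights as a one-sided formula (or a numeric vector).
# df, y, x, id, t, and w are placeholder names for this illustration.
est <- feols(y ~ x | id + t,
             data    = df,
             weights = ~w,      # observation weights
             cluster = ~id)
```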
done in 58ee55453c26d61b7c103cfc2581e2efb5e15bef
There's no blocker on my end, so I will move forward.
Thanks @zhizhongpu! Sounds great, let me know if I can be of any more help. Happy to review after you are done!
@santiagohermo thank you. One fix would be to construct the data.table inside the `EventStudy()` function, instead of operating on the input data by reference. This could look like:

```r
data = data.table::as.data.table(data_raw, key = c(idvar, timevar))
```

instead of what's currently there:

```r
data.table::setDT(data, key = c(idvar, timevar))
```
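The difference matters because `setDT()` converts the caller's object in place, so any columns the package adds afterwards also appear in the user's original data, while `as.data.table()` makes a copy. A toy illustration (not package code):

```r
library(data.table)

df <- data.frame(id = 1:3, t = 1:3)

setDT(df)                # converts df by reference; the caller's df is now a data.table
df[, new_col := 1]       # a column added here shows up in the caller's object

dt <- as.data.table(df)  # copies instead
dt[, only_in_dt := 1]
"only_in_dt" %in% names(df)  # FALSE: the original is untouched
```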
Additional edit: in a24198519be87ee1800b53e149b4820118099e07 I benchmarked a few solutions. Overall the shallow-copy solution seems best, in that it both resolves the bug and avoids the significant overhead seen with deep copying. The disadvantage is some stability concern (see, for a start, https://github.com/Rdatatable/data.table/issues/3665): there seem to be plans to adjust data.table's copying functions and to support a more formal `shallow()` copy function, which one day might supersede the hotfix adopted here (`copy = data[T, ]`). However, I'm not sure how much of a threat this is, given that there has been no progress or sign of change in the past 5 years. If we adopt this solution, we might just need to make a small adjustment one day when `data.table` finally formalizes a solution.
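To illustrate why the hotfix works (a toy sketch, not the package's code): subsetting with `data[T, ]` returns a new data.table, so columns added to the result do not propagate back to the input.

```r
library(data.table)

dt <- data.table(id = 1:3, t = 1:3)

copy_dt <- dt[T, ]      # subset over all rows -> a new data.table
copy_dt[, extra := 1]   # adding a column here...
"extra" %in% names(dt)  # ...leaves the original untouched: FALSE
```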
Goal: revise the package so that it uses data.table as opposed to data.frame throughout, thereby increasing consistency, readability, and speed, and addressing the following issue: a `*_fd` column gets appended to the input data.table, and a warning is raised upon passing the data.table into `EventStudy` a second time (or more):
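A sketch of that failure mode (hypothetical call; the argument names and values are assumptions, and the actual warning text is not reproduced here):

```r
# First call: runs fine, but setDT() lets EventStudy() modify df in place,
# so df now carries the generated *_fd columns.
est1 <- EventStudy(estimator = "OLS", data = df,
                   outcomevar = "y", policyvar = "z",
                   idvar = "id", timevar = "t",
                   post = 2, pre = 2)

# Second call on the same object: the *_fd columns already exist,
# so a warning is raised.
est2 <- EventStudy(estimator = "OLS", data = df,
                   outcomevar = "y", policyvar = "z",
                   idvar = "id", timevar = "t",
                   post = 2, pre = 2)
```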