JMSLab / eventstudyr

Switch to data.table throughout #51

Open zhizhongpu opened 4 months ago

zhizhongpu commented 4 months ago

Goal: revise the package so that it uses data.table rather than data.frame throughout, thereby improving consistency, readability, and speed, and addressing the issue described below.

zhizhongpu commented 1 week ago

@santiagohermo I have a question: currently we don't seem to enforce uniqueness of the keys (i.e. unitid interacted with timeid).

For example, in this code:

https://github.com/JMSLab/eventstudyr/blob/fd4fccb1cbf8f1d12f9b6d3605ba26b96eaece4f/tests/testthat/test-EventStudyOLS.R#L19-L23

which yields estimates of

> reg
              Estimate Std. Error    t value   Pr(>|t|)     CI Lower    CI Upper DF
z_lead3     0.16414209 0.14246519  1.1521557 0.25484743 -0.122152433 0.450436614 49
z_fd_lead3  0.11574829 0.16199814  0.7145038 0.47830594 -0.209799169 0.441295749 49
z_fd_lead2  0.07705731 0.16604537  0.4640738 0.64465008 -0.256623351 0.410737966 49
z_fd        0.07743044 0.12909231  0.5998067 0.55139884 -0.181990282 0.336851153 49
z_fd_lag1   0.31672930 0.17591476  1.8004703 0.07794326 -0.036784651 0.670243250 49
z_fd_lag2   0.29940208 0.15292494  1.9578368 0.05595634 -0.007912103 0.606716254 49
z_lag3      0.21211653 0.13096851  1.6195994 0.11173748 -0.051074556 0.475307608 49
x_r        -0.05229013 0.02915997 -1.7932163 0.07911138 -0.110889281 0.006309021 49

If we instead use the code below to add an extra row of (i,t) = (1,4):

    cluster <- TRUE

    library(data.table)  # needed for setDT()

    setDT(df_test_EventStudyOLS)
    extra_row <- df_test_EventStudyOLS[4]  # row 4 carries (i, t) = (1, 4)
    df_test_EventStudyOLS <- rbind(df_test_EventStudyOLS, extra_row)

    if (FE & TFE & cluster) {  # the test then continues as in test-EventStudyOLS.R

then we'd get different estimates

> reg
              Estimate Std. Error    t value   Pr(>|t|)     CI Lower    CI Upper DF
z_lead3     0.16428300 0.14243445  1.1533937 0.25434404 -0.121949744 0.450515749 49
z_fd_lead3  0.11584767 0.16201551  0.7150406 0.47797716 -0.209734692 0.441430025 49
z_fd_lead2  0.07686159 0.16602474  0.4629526 0.64544789 -0.256777611 0.410500789 49
z_fd        0.07735582 0.12909922  0.5991967 0.55180222 -0.182078763 0.336790409 49
z_fd_lag1   0.31693785 0.17585949  1.8022221 0.07766335 -0.036465023 0.670340722 49
z_fd_lag2   0.29965314 0.15288305  1.9600155 0.05569375 -0.007576851 0.606883136 49
z_lag3      0.21271065 0.13087191  1.6253347 0.11050686 -0.050286298 0.475707605 49
x_r        -0.05256495 0.02905853 -1.8089335 0.07659875 -0.110960261 0.005830354 49

Note that the code still ran and no warning was issued. Also note that this latter result is hard to interpret. So I'm wondering whether the decision not to ban or warn about duplicated ids was intentional. I'm asking because #51 gives me a chance to enforce key uniqueness.

santiagohermo commented 6 days ago

Thanks @zhizhongpu !

You are correct: when pre-cleaning the passed dataset we check whether it is unbalanced and whether there are holes in the time dimension, but we never enforce uniqueness of idvar and timevar. I agree that we should at a minimum throw a warning here.
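
A minimal sketch of such a check, assuming the input has already been converted to a data.table and that idvar and timevar hold the column names passed by the user (this is not the package's actual implementation):

    library(data.table)

    # anyDuplicated() has a data.table method with a by= argument of column names;
    # it returns 0 when no (idvar, timevar) pair repeats
    if (anyDuplicated(data, by = c(idvar, timevar)) > 0) {
        warning("Duplicate (idvar, timevar) pairs found; ",
                "estimates on such data may be hard to interpret.")
    }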

Some additional thoughts:

zhizhongpu commented 3 days ago

Thanks @santiagohermo!

  • Adding weights to a fixest call is super easy, we should definitely extend EventStudy to allow for this.

Yes - I can open a new issue once we merge #42
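
For reference, a minimal sketch of what the fixest side looks like (all variable names here are placeholders, not eventstudyr's interface):

    library(fixest)

    # toy panel; id/t/x/w/y are made-up names for illustration
    df <- data.frame(id = rep(1:20, each = 5), t = rep(1:5, times = 20),
                     x = rnorm(100), w = runif(100))
    df$y <- 0.5 * df$x + rnorm(100)

    # weights= accepts a one-sided formula naming a column of `data`
    est <- feols(y ~ x | id + t, data = df, weights = ~w)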

  • If we allow for weights we could ban datasets with duplicated rows. I think I would lean towards simply issuing a warning though, since some users may want to do this on purpose.

Done in 58ee55453c26d61b7c103cfc2581e2efb5e15bef

There's no blocker on my end, so I will move forward.

santiagohermo commented 3 days ago

Thanks @zhizhongpu! Sounds great, let me know if I can be of any more help now. Happy to review after you are done!

zhizhongpu commented 3 days ago

@santiagohermo thank you.

  1. I raised some questions in a054ed428178c7b3bf48f01204c3d930ac85bd7a (nothing serious, just for my own understanding of the package)
  2. The current state should be ready for your review, pending point 3
    • Note that all 264 tests have passed
    • Feel free to open a PR if the timing looks right (or I'll do it)
  3. I noted this bug in the issue description https://github.com/JMSLab/eventstudyr/issues/51#issue-2380808732 (I also built a minimal reproducible example in c4e20e51258c32fb24e36d00b0081290c2f8ace6). I'd like to know if you have a preference on whether we fix it here vs. in a new issue.
    • In case your decision depends on the solution we'd adopt:
    • Right now the most straightforward solution I can come up with is to make an internal copy of the input data early in the EventStudy() function, instead of operating on the input data directly. This could look like:
      data <- data.table::as.data.table(data_raw, key = c(idvar, timevar))

      instead of what's currently there

      data.table::setDT(data, key = c(idvar, timevar))
    • Making a copy could mean some extra time for large datasets, but it has the benefit of insulating the input data from ANY unwanted changes (see the illustration after this list).
    • As I wait for your input, I'll look for more elegant solutions.
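
For context on why operating on the input is dangerous: setDT() converts its argument by reference (and with key= it also sorts the rows by reference), so the user's object is silently modified, whereas as.data.table() returns a new object. A minimal illustration:

    library(data.table)

    df <- data.frame(id = c(2, 1), t = c(1, 1))
    setDT(df, key = "id")      # converts *and sorts* df in place, by reference
    class(df)                  # "data.table" "data.frame": the caller's object changed

    df2 <- data.frame(id = c(2, 1), t = c(1, 1))
    dt  <- as.data.table(df2)  # allocates a new object
    class(df2)                 # still "data.frame": the input is untouched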

Additional edit: in a24198519be87ee1800b53e149b4820118099e07 I benchmarked a few solutions (a sketch of a comparable setup follows the list):

  1. Original code, which takes in a data.frame object
  2. New code, but making a deep copy to address the bug
  3. New code, but making a shallow copy to address the bug
  4. New code
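
A sketch of how such a comparison can be set up with the microbenchmark package (this is not the actual benchmark in the commit, and the data here are made up):

    library(data.table)
    library(microbenchmark)

    df_raw <- data.frame(id = rep(1:1000, each = 50),
                         t  = rep(1:50, times = 1000),
                         y  = rnorm(50000))

    microbenchmark(
        deep    = copy(as.data.table(df_raw)),    # solution 2: deep copy
        shallow = as.data.table(df_raw)[TRUE, ],  # solution 3: the data[T, ] hotfix
        convert = as.data.table(df_raw),          # conversion alone, for reference
        times = 100
    )
    # setDT(df_raw) is left out on purpose: it would convert df_raw by reference
    # on the first run and contaminate the remaining iterations.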

Overall the shallow-copy solution seems best: it both resolves the bug and avoids the significant overhead seen in deep copying. The disadvantage is some stability concern (see, for starters, https://github.com/Rdatatable/data.table/issues/3665): there seem to be plans to adjust data.table's copying functions and to support a more formal shallow() copy function, which one day might supersede the hotfix adopted here (copy = data[T,]). However, I'm not sure how much of a threat this is, given that there's been no progress or sign of change in the past 5 years. If we adopt this solution, we may just need to make a small adjustment one day when data.table finally formalizes a solution.
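
If we go this way, the pattern inside EventStudy() would look roughly like this (a sketch, not the committed code):

    # `data` is the user-supplied data.frame or data.table; idvar/timevar as in EventStudy()
    data <- data.table::as.data.table(data)[TRUE, ]  # the data[T, ] hotfix: detach from the caller's object
    data.table::setkeyv(data, c(idvar, timevar))     # now safe to key/sort by reference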