JMSLab / eventstudyr

Switch to data.table throughout #51

Open zhizhongpu opened 4 months ago

zhizhongpu commented 4 months ago

Goal: revise the package so that it uses data.table rather than data.frame throughout, thereby improving consistency, readability, and speed, and addressing the issue described below.

zhizhongpu commented 1 week ago

@santiagohermo I have a question: currently we don't seem to enforce uniqueness of the keys (i.e. unitid interacted with timeid).

For example, in this code:

https://github.com/JMSLab/eventstudyr/blob/fd4fccb1cbf8f1d12f9b6d3605ba26b96eaece4f/tests/testthat/test-EventStudyOLS.R#L19-L23

which yields estimates of

> reg
              Estimate Std. Error    t value   Pr(>|t|)     CI Lower    CI Upper DF
z_lead3     0.16414209 0.14246519  1.1521557 0.25484743 -0.122152433 0.450436614 49
z_fd_lead3  0.11574829 0.16199814  0.7145038 0.47830594 -0.209799169 0.441295749 49
z_fd_lead2  0.07705731 0.16604537  0.4640738 0.64465008 -0.256623351 0.410737966 49
z_fd        0.07743044 0.12909231  0.5998067 0.55139884 -0.181990282 0.336851153 49
z_fd_lag1   0.31672930 0.17591476  1.8004703 0.07794326 -0.036784651 0.670243250 49
z_fd_lag2   0.29940208 0.15292494  1.9578368 0.05595634 -0.007912103 0.606716254 49
z_lag3      0.21211653 0.13096851  1.6195994 0.11173748 -0.051074556 0.475307608 49
x_r        -0.05229013 0.02915997 -1.7932163 0.07911138 -0.110889281 0.006309021 49

If we instead use the code below to add an extra row of (i,t) = (1,4):

    cluster <- TRUE

    library(data.table)  # needed for setDT()

    setDT(df_test_EventStudyOLS)
    extra_row <- df_test_EventStudyOLS[4]  # row 4 carries (i, t) = (1, 4)
    df_test_EventStudyOLS <- rbind(df_test_EventStudyOLS, extra_row)

    if (FE & TFE & cluster) {  # the test then continues as in test-EventStudyOLS.R

then we'd get different estimates

> reg
              Estimate Std. Error    t value   Pr(>|t|)     CI Lower    CI Upper DF
z_lead3     0.16428300 0.14243445  1.1533937 0.25434404 -0.121949744 0.450515749 49
z_fd_lead3  0.11584767 0.16201551  0.7150406 0.47797716 -0.209734692 0.441430025 49
z_fd_lead2  0.07686159 0.16602474  0.4629526 0.64544789 -0.256777611 0.410500789 49
z_fd        0.07735582 0.12909922  0.5991967 0.55180222 -0.182078763 0.336790409 49
z_fd_lag1   0.31693785 0.17585949  1.8022221 0.07766335 -0.036465023 0.670340722 49
z_fd_lag2   0.29965314 0.15288305  1.9600155 0.05569375 -0.007576851 0.606883136 49
z_lag3      0.21271065 0.13087191  1.6253347 0.11050686 -0.050286298 0.475707605 49
x_r        -0.05256495 0.02905853 -1.8089335 0.07659875 -0.110960261 0.005830354 49

Note that the code still ran and no warning was issued. Also note that this latter result is hard to interpret. So I'm wondering whether the decision not to ban or warn about duplicated ids was intentional. I'm asking because #51 gives me a chance to enforce key uniqueness.

santiagohermo commented 6 days ago

Thanks @zhizhongpu !

You are correct: when pre-cleaning the passed dataset we check whether it is unbalanced and whether there are holes in the time dimension, but we never enforce uniqueness of idvar and timevar. I agree that we should at a minimum throw a warning here.
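
A minimal sketch of such a check, assuming the input has already been converted to a data.table and that idvar and timevar hold the column names passed by the user (this is not the package's actual implementation):

    library(data.table)

    # anyDuplicated() has a data.table method with a by= argument of column names;
    # it returns 0 when no (idvar, timevar) pair repeats
    if (anyDuplicated(data, by = c(idvar, timevar)) > 0) {
        warning("Duplicate (idvar, timevar) pairs found; ",
                "estimates on such data may be hard to interpret.")
    }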

Some additional thoughts:

zhizhongpu commented 3 days ago

Thanks @santiagohermo!

  • Adding weights to a fixest call is super easy, we should definitely extend EventStudy to allow for this.

Yes - I can open a new issue once we merge #42
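
For reference, a minimal sketch of what the fixest side looks like (all variable names here are placeholders, not eventstudyr's interface):

    library(fixest)

    # toy panel; id/t/x/w/y are made-up names for illustration
    df <- data.frame(id = rep(1:20, each = 5), t = rep(1:5, times = 20),
                     x = rnorm(100), w = runif(100))
    df$y <- 0.5 * df$x + rnorm(100)

    # weights= accepts a one-sided formula naming a column of `data`
    est <- feols(y ~ x | id + t, data = df, weights = ~w)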

  • If we allow for weights we could ban datasets with duplicated rows. I think I would lean towards simply issuing a warning though, since some users may want to do this on purpose.

Done in 58ee55453c26d61b7c103cfc2581e2efb5e15bef

There's no blocker on my end, so I will move forward.

santiagohermo commented 3 days ago

Thanks @zhizhongpu! Sounds great, let me know if I can be of any more help now. Happy to review after you are done!

zhizhongpu commented 3 days ago

@santiagohermo thank you.

  1. I raised some questions in a054ed428178c7b3bf48f01204c3d930ac85bd7a (nothing serious, just for my own understanding of the package)
  2. The current state should be ready for your review, pending point 3
    • Note that all 264 tests have passed
    • Feel free to open a PR if the timing looks right (or I'll do it)
  3. I noted this bug in the issue description https://github.com/JMSLab/eventstudyr/issues/51#issue-2380808732 (I also built a minimal reproducible example in c4e20e51258c32fb24e36d00b0081290c2f8ace6). I'd like to know if you have a preference on whether we fix it here vs. in a new issue.
    • In case your decision depends on the solution we'd adopt:
    • Right now the most straightforward solution I can come up with is to make an internal copy of the input data early in the EventStudy() function, instead of operating on the input data directly. This could look like:
      data <- data.table::as.data.table(data_raw, key = c(idvar, timevar))

      instead of what's currently there

      data.table::setDT(data, key = c(idvar, timevar))
    • Making a copy could mean some extra time for large datasets, but it has the benefit of insulating the input data from ANY unwanted changes (see the illustration after this list).
    • As I wait for your input, I'll look for more elegant solutions.
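
For context on why operating on the input is dangerous: setDT() converts its argument by reference (and with key= it also sorts the rows by reference), so the user's object is silently modified, whereas as.data.table() returns a new object. A minimal illustration:

    library(data.table)

    df <- data.frame(id = c(2, 1), t = c(1, 1))
    setDT(df, key = "id")      # converts *and sorts* df in place, by reference
    class(df)                  # "data.table" "data.frame": the caller's object changed

    df2 <- data.frame(id = c(2, 1), t = c(1, 1))
    dt  <- as.data.table(df2)  # allocates a new object
    class(df2)                 # still "data.frame": the input is untouched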

Additional edit: in a24198519be87ee1800b53e149b4820118099e07 I benchmarked a few solutions (a sketch of a comparable setup follows the list):

  1. Original code, which takes in a data.frame object
  2. New code, but making a deep copy to address the bug
  3. New code, but making a shallow copy to address the bug
  4. New code
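
A sketch of how such a comparison can be set up with the microbenchmark package (this is not the actual benchmark in the commit, and the data here are made up):

    library(data.table)
    library(microbenchmark)

    df_raw <- data.frame(id = rep(1:1000, each = 50),
                         t  = rep(1:50, times = 1000),
                         y  = rnorm(50000))

    microbenchmark(
        deep    = copy(as.data.table(df_raw)),    # solution 2: deep copy
        shallow = as.data.table(df_raw)[TRUE, ],  # solution 3: the data[T, ] hotfix
        convert = as.data.table(df_raw),          # conversion alone, for reference
        times = 100
    )
    # setDT(df_raw) is left out on purpose: it would convert df_raw by reference
    # on the first run and contaminate the remaining iterations.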

Overall the shallow-copy solution seems best: it both resolves the bug and avoids the significant overhead seen in deep copying. The disadvantage is some stability concern (see, for starters, https://github.com/Rdatatable/data.table/issues/3665): there seem to be plans to adjust data.table's copying functions and to support a more formal shallow() copy function, which one day might supersede the hotfix adopted here (copy = data[T,]). However, I'm not sure how much of a threat this is, given that there's been no progress or sign of change in the past 5 years. If we adopt this solution, we may just need to make a small adjustment one day when data.table finally formalizes a solution.
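
If we go this way, the pattern inside EventStudy() would look roughly like this (a sketch, not the committed code):

    # `data` is the user-supplied data.frame or data.table; idvar/timevar as in EventStudy()
    data <- data.table::as.data.table(data)[TRUE, ]  # the data[T, ] hotfix: detach from the caller's object
    data.table::setkeyv(data, c(idvar, timevar))     # now safe to key/sort by reference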