Merck / simtrial

Clinical trial simulation for time-to-event endpoints
https://merck.github.io/simtrial/
GNU General Public License v3.0

Would chaining data.table expressions make it faster? #184

Closed · nanxstats closed this issue 9 months ago

nanxstats commented 9 months ago

Looks like chaining data.table operations could make things slower in our case:

library(simtrial)
library(data.table)

x <- sim_pw_surv(
  n = 200,
  stratum = data.frame(
    stratum = c("Positive", "Negative"),
    p = c(.5, .5)
  ),
  fail_rate = data.frame(
    stratum = rep(c("Positive", "Negative"), 2),
    period = rep(1, 4),
    treatment = c(rep("control", 2), rep("experimental", 2)),
    duration = rep(1, 4),
    rate = log(2) / c(6, 9, 9, 12)
  ),
  dropout_rate = data.frame(
    stratum = rep(c("Positive", "Negative"), 2),
    period = rep(1, 4),
    treatment = c(rep("control", 2), rep("experimental", 2)),
    duration = rep(1, 4),
    rate = rep(.001, 4)
  )
) |> dplyr::filter(stratum == "Positive")

get_cut_date_by_event_ <- function(x, event) {
  as.data.table(x)[
    fail == 1 # keep observed events only
  ][
    , .(cte) # keep the calendar time of each event
  ][
    order(cte) # sort by event time
  ][
    , eventCount := frankv(cte, ties.method = "first") # rank events, ties broken by order
  ][
    eventCount <= event # keep the first `event` events
  ][
    , last(cte) # cut date = calendar time of the `event`-th event
  ]
}

microbenchmark::microbenchmark(
  get_cut_date_by_event(x, event = 50),
  get_cut_date_by_event_(x, event = 50),
  times = 5000
)
#> Unit: microseconds
#>                                   expr     min      lq     mean   median      uq      max
#>   get_cut_date_by_event(x, event = 50) 720.903 762.190 848.5596 777.6470 798.926 43777.46
#>  get_cut_date_by_event_(x, event = 50) 802.534 848.659 967.0310 866.9245 890.725 82661.25
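For reference, the same logic without chaining runs each data.table call as its own statement. This is just a sketch of what the non-chained counterpart would look like, assuming the steps mirror the chained version above; it is not necessarily the exact source of simtrial's get_cut_date_by_event():

# Hypothetical non-chained counterpart; assumes library(data.table) is attached.
get_cut_date_by_event_steps <- function(x, event) {
  dt <- as.data.table(x)
  dt <- dt[fail == 1]                                    # keep observed events only
  dt <- dt[, .(cte)]                                     # keep the calendar time of each event
  dt <- dt[order(cte)]                                   # sort by event time
  dt[, eventCount := frankv(cte, ties.method = "first")] # rank events in place
  dt <- dt[eventCount <= event]                          # keep the first `event` events
  dt[, last(cte)]                                        # calendar time of the `event`-th event
}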
jdblischak commented 9 months ago

I'm surprised by this. I would have guessed that chaining versus not-chaining would have a negligible effect on runtime.

I thought the difference might be due to the fact that get_cut_date_by_event_() is defined in the global environment. First I tried byte-compiling it manually, but it was still slower:

get_cut_date_by_event_byte <- compiler::cmpfun(get_cut_date_by_event_)

microbenchmark::microbenchmark(
  get_cut_date_by_event(x, event = 50),
  get_cut_date_by_event_(x, event = 50),
  get_cut_date_by_event_byte(x, event = 50),
  times = 5000
)
## Unit: milliseconds
##                                       expr    min      lq     mean  median      uq      max neval
##       get_cut_date_by_event(x, event = 50) 1.8973 2.08960 2.773146 2.28710 2.71710  30.3287  5000
##      get_cut_date_by_event_(x, event = 50) 2.1048 2.31790 3.169816 2.52965 2.97875 227.4689  5000
##  get_cut_date_by_event_byte(x, event = 50) 2.1124 2.32055 3.063390 2.54195 3.01070  50.5464  5000

Then I copied the function into the package and re-installed, but again, that didn't help:
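The four-way comparison was presumably run with a call along these lines (a sketch; the only assumption is that the package copy is the internal simtrial:::get_cut_date_by_event_pkg named in the results):

microbenchmark::microbenchmark(
  get_cut_date_by_event(x, event = 50),
  get_cut_date_by_event_(x, event = 50),
  get_cut_date_by_event_byte(x, event = 50),
  simtrial:::get_cut_date_by_event_pkg(x, event = 50),
  times = 5000
)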

## Unit: milliseconds
##                                                 expr    min      lq     mean  median     uq      max neval
##                 get_cut_date_by_event(x, event = 50) 1.8870 2.10420 2.845605 2.30050 2.7207 226.5604  5000
##                get_cut_date_by_event_(x, event = 50) 2.1079 2.34270 3.156965 2.56320 3.0143  46.9350  5000
##            get_cut_date_by_event_byte(x, event = 50) 2.1010 2.34115 3.124938 2.55915 3.0302 152.0483  5000
##  simtrial:::get_cut_date_by_event_pkg(x, event = 50) 2.1662 2.42465 3.248398 2.65425 3.1637  97.4980  5000
nanxstats commented 9 months ago

> I'm surprised by this. I would have guessed that chaining versus not-chaining would have a negligible effect on runtime.
>
> I thought the difference might be due to the fact that get_cut_date_by_event_() is defined in the global environment. First I tried byte-compiling it manually, but it was still slower

Very nice comparisons! I think you basically ruled out everything, obvious or even a bit obscure, that could have played a role. We might have to accept that chaining is indeed slower in our case 😂 So I'm closing this for now.