bcallaway11 / did

Difference in Differences with Multiple Periods, website: https://bcallaway11.github.io/did

Parallel processing not working after #132 #133

Closed. zachwarner closed this issue 2 years ago.

zachwarner commented 2 years ago

Hi. Thanks for the neat package -- just digging into it over the past few days. I installed the development branch via devtools::install_github("bcallaway11/did") after #132 yesterday.

I've got a tibble with 2.9m observations, so I've been drawing samples to examine speed (1%, 2%, 5%, even 100%). However, no matter the size of the sample, I never get more than one R process running in Activity Monitor. To be clear: I imagine there's some overhead to initialize the parallel bootstrapping, but I never see more than one process open up, even while monitoring continuously.
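
In case it matters, the subsampling keeps each unit's full panel, something like the sketch below (where id matches the idname I pass to att_gt):

sample_units <- function(df, p) {
  # draw a p-fraction of unit ids and keep all of their time periods
  ids <- unique(df$id)
  keep <- sample(ids, size = ceiling(p * length(ids)))
  df[df$id %in% keep, ]
}
df_small <- sample_units(df, 0.01)  # e.g., the 1% sample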

I'm on a 2019 Mac Pro running Big Sur 11.6.6, with a 3.2 GHz 16-core processor and 192 GB of RAM.

Here's the whole script:

library(did); library(sf); library(tidyverse)
setwd("path/to/wd")
set.seed(8675309) # hey jenny
df <- read_rds("/data/df.rds")
out <- att_gt(yname = "y", tname = "time", idname = "id",
              gname = "first_treated", data = df, pl = TRUE, cores = 14)

Here's the output from sessionInfo().

R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6.6

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.9 purrr_0.3.4
[5] readr_2.1.2 tidyr_1.2.0 tibble_3.1.7 ggplot2_3.3.6
[9] tidyverse_1.3.1 sf_1.0-3 did_2.2.0.901

loaded via a namespace (and not attached):
[1] tidyselect_1.1.2 haven_2.4.3 carData_3.0-5 colorspace_2.0-3
[5] vctrs_0.4.1 generics_0.1.2 utf8_1.2.2 rlang_1.0.2
[9] e1071_1.7-9 ggpubr_0.4.0 pillar_1.7.0 withr_2.5.0
[13] glue_1.6.2 DBI_1.1.1 dbplyr_2.1.1 readxl_1.4.0
[17] modelr_0.1.8 lifecycle_1.0.1 cellranger_1.1.0 munsell_0.5.0
[21] ggsignif_0.6.3 gtable_0.3.0 rvest_1.0.1 tzdb_0.3.0
[25] class_7.3-19 fansi_1.0.3 broom_0.8.0 Rcpp_1.0.8.3
[29] KernSmooth_2.23-20 backports_1.4.1 scales_1.2.0 classInt_0.4-3
[33] BMisc_1.4.4 jsonlite_1.8.0 abind_1.4-5 fs_1.5.0
[37] hms_1.1.1 stringi_1.7.6 rstatix_0.7.0 grid_4.1.0
[41] cli_3.3.0 tools_4.1.0 magrittr_2.0.3 proxy_0.4-26
[45] crayon_1.5.1 car_3.0-13 pkgconfig_2.0.3 ellipsis_0.3.2
[49] xml2_1.3.3 data.table_1.14.2 reprex_2.0.1 lubridate_1.8.0
[53] rstudioapi_0.13 assertthat_0.2.1 httr_1.4.2 R6_2.5.1
[57] units_0.7-2 compiler_4.1.0

Please let me know if/how I can provide more info to help troubleshoot.

zachwarner commented 2 years ago

Sorry to draw you in as a bystander, @kylebutts, but I wonder if you know what might be going on?

bcallaway11 commented 2 years ago

I'm not sure what should happen in terms of the processes you can see in Activity Monitor. Are you noticing changes in computation time? Let's see if Kyle has any ideas; otherwise, I'll try to do some digging on this as soon as I have some time.

Brant

zachwarner commented 2 years ago

Typically, I would do something like:

library(doSNOW); library(foreach); library(parallel)
cl <- makeCluster(8)
registerDoSNOW(cl)
out <- foreach(i = 1:nrow(df), .combine = rbind) %dopar% {
  some_code(...)
}
stopCluster(cl)

This approach would produce one process in Activity Monitor until some_code(...) starts, at which point seven new processes would open up, for a total of eight. The speedup depends on what some_code(...) is, but it's typically somewhere on the order of 25-400%.

In this case, I see no additional processes open, so it appears that the parallelism never spins up. There's no speed difference between

out <- att_gt(yname = "y", tname = "time", idname = "id",
              gname = "first_treated", data = df, pl = TRUE, cores = 14)

and

out <- att_gt(yname = "y", tname = "time", idname = "id",
              gname = "first_treated", data = df)

for any of the smaller samples. (Computation time seems to explode for the full sample, so I haven't actually been able to complete a successful run on the full 2.9m observations.)
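
To be concrete, the comparison is just wall-clock time on a subsample, roughly like this (df_small is one of the samples described above):

system.time(att_gt(yname = "y", tname = "time", idname = "id",
                   gname = "first_treated", data = df_small,
                   pl = TRUE, cores = 14))
system.time(att_gt(yname = "y", tname = "time", idname = "id",
                   gname = "first_treated", data = df_small))

The elapsed times come out essentially identical.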

Final thought: my coauthor just tested it on their Windows machine and got the same issue. I believe they'll post their session information sometime soon, but in any case, I don't think it's localized to my Mac.

kylebutts commented 2 years ago

Sorry, I messed up and wasn't passing cores through to the mc.cores argument. You can set a global option for mc.cores in the parallel package, which is why I didn't catch it before.

Note that for quite small sample sizes, I don't dispatch to parallel, because it's actually slower there, as you note.
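
To illustrate the failure mode, here's a simplified sketch (not the actual did internals): parallel::mclapply falls back to the global mc.cores option when the argument isn't passed explicitly, so having options(mc.cores = ...) set in my environment masked the missing argument.

library(parallel)

boot_iteration <- function(i) Sys.getpid()  # stand-in for one bootstrap draw

run_boot <- function(n, cores) {
  # bug: `cores` is accepted but never forwarded, so mclapply silently
  # uses getOption("mc.cores", 2L) instead of erroring
  mclapply(seq_len(n), boot_iteration)
}

# number of distinct worker processes actually used:
length(unique(unlist(run_boot(100, cores = 14))))
# equals getOption("mc.cores", 2L), not 14 -- so with the global option
# set on my machine, everything looked fine when I tested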

zachwarner commented 2 years ago

Oh, haha, that'll do it! So to confirm: should I call library(parallel) and set the global mc.cores option, or should I just hold off on making any changes until there's a new merge?

Thanks both for taking a look so quickly!

kylebutts commented 2 years ago

You can do either! I submitted a fix: https://github.com/bcallaway11/did/pull/134

zachwarner commented 2 years ago

Great, thanks. Maybe I'm being daft, but

library(parallel)
options(mc.cores = 6)
mc.cores <- 6
out <- att_gt(yname = "y", tname = "time", idname = "id",
              gname = "first_treated", data = df, pl = TRUE, cores = mc.cores)

doesn't seem to change anything. I think I'll just wait until #134 is merged. @bcallaway11 please feel free to close this once you merge it.
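
For anyone hitting this later, a quick generic check that forking itself works on a given machine (independent of did): distinct PIDs mean workers really spun up.

library(parallel)
# if forking works, this returns up to 4 distinct process ids
unique(unlist(mclapply(1:8, function(i) Sys.getpid(), mc.cores = 4)))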

bcallaway11 commented 2 years ago

Should be fixed now!