Closed zachwarner closed 2 years ago
Sorry to draw you in as a bystander, but @kylebutts, do you have any idea what might be going on?
I’m not sure what should happen in terms of processes that you can view on Activity Monitor. Are you noticing changes in computation time? Let’s see if Kyle has any ideas; otherwise, I’ll try to do some digging on these as soon as I have some time.
Brant
Typically, I would do something like:
library(doSNOW); library(foreach); library(parallel)
cl <- makeCluster(8)
registerDoSNOW(cl)
out <- foreach(i = 1:nrow(df), .combine = rbind) %dopar% {
some_code(...)
}
stopCluster(cl)
This approach would produce 1 process in Activity Monitor until some_code(...) starts, at which point 7 new processes would open up, for a total of 8. Speedup would depend on what some_code(...) is, but somewhere on the order of 25%-400%.
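For anyone who wants to check for worker processes from inside R rather than by watching Activity Monitor, here is a minimal sketch that asks each worker for its process ID. It uses only the parallel package; the cluster size of 2 is arbitrary, and the variable names are just for illustration.

```r
library(parallel)

# Start a small PSOCK cluster; each worker is a separate R process.
cl <- makeCluster(2)

# Ask every worker for its process ID and compare with the main session.
worker_pids <- unlist(clusterCall(cl, Sys.getpid))
main_pid <- Sys.getpid()

print(main_pid)
print(worker_pids)

# Always shut the workers down when finished.
stopCluster(cl)
```

If the worker PIDs differ from the main session's PID, the extra processes really were created, whatever the process monitor shows.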
In this case, I see no additional processes open, so it appears that the parallelism never spins up. There's no speed difference between
out <- att_gt(yname = "y", tname = "time", idname = "id", gname = "first_treated", data = df, pl = T, cores = 14)
and
out <- att_gt(yname = "y", tname = "time", idname = "id", gname = "first_treated", data = df)
for any of the smaller samples. (Computation time seems to explode for the full sample, so I haven't actually been able to complete a successful run on the full 2.9m observations.)
Final thought: my coauthor just tested it on their Windows machine and got the same issue. I believe they'll post their session information sometime soon, but in any case, I don't think it's localized to my Mac.
Sorry, I messed up and wasn't passing cores to the mc.cores argument. You can set a global option for mc.cores in the parallel package, which is why I didn't catch it before.
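To illustrate how this kind of bug can hide, here is a rough sketch with two hypothetical wrappers (buggy_wrapper and fixed_wrapper are made-up names, not the package's actual code). It relies on the fact that parallel::mclapply() defaults mc.cores to getOption("mc.cores", 2L), so a wrapper that forgets to forward its cores argument silently follows the global option instead. Forking with mc.cores > 1 is not supported on Windows, so that part is guarded.

```r
library(parallel)

# Hypothetical wrapper: accepts `cores` but never forwards it, so mclapply()
# falls back to its default, getOption("mc.cores", 2L).
buggy_wrapper <- function(x, cores = 1) {
  mclapply(x, function(i) Sys.getpid())
}

# Hypothetical fix: forward `cores` explicitly.
fixed_wrapper <- function(x, cores = 1) {
  mclapply(x, function(i) Sys.getpid(), mc.cores = cores)
}

# Count the distinct processes that handled the work.
n_procs <- function(res) length(unique(unlist(res)))

# With the global option at 1, the buggy wrapper stays serial even though
# the caller asked for 2 cores.
options(mc.cores = 1)
n_buggy_before <- n_procs(buggy_wrapper(1:4, cores = 2))

if (.Platform$OS.type == "unix") {
  # The fixed wrapper honors `cores` regardless of the global option.
  n_fixed <- n_procs(fixed_wrapper(1:4, cores = 2))

  # Once the global option is raised, the buggy wrapper goes parallel too,
  # which is how the missing argument can go unnoticed.
  options(mc.cores = 2)
  n_buggy_after <- n_procs(buggy_wrapper(1:4, cores = 2))
}
```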
Note that for quite small sample sizes, I don't dispatch to parallel because it's actually slower there, as you note.
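The small-sample behavior can be sketched roughly as follows. This is only an illustration of the idea, not the package's code: apply_maybe_parallel is a hypothetical helper, and min_parallel_n is an arbitrary threshold chosen for the example.

```r
library(parallel)

# Hypothetical helper: run serially for small inputs, in parallel otherwise.
# `min_parallel_n` is an illustrative threshold, not a value from the did package.
apply_maybe_parallel <- function(x, fun, cores = 2, min_parallel_n = 1000) {
  if (cores < 2 || length(x) < min_parallel_n) {
    # Small job: starting workers would cost more than it saves.
    return(lapply(x, fun))
  }
  cl <- makeCluster(cores)
  on.exit(stopCluster(cl), add = TRUE)  # clean up workers even on error
  parLapply(cl, x, fun)
}

small <- apply_maybe_parallel(1:10, function(i) i^2)    # takes the serial path
large <- apply_maybe_parallel(1:2000, function(i) i^2)  # takes the parallel path
```

Both paths return the same results; only the way the work is scheduled differs, which is why no extra processes appear for small samples.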
Oh, haha, that'll do it! So to confirm, I should call library(parallel) and set the global mc.cores option? Or should I just hold off making any changes for now until there's a new merge?
Thanks both for taking a look so quickly!
You can do either! I submitted a fix: https://github.com/bcallaway11/did/pull/134
Great, thanks. Maybe I'm being daft but
library(parallel)
options(mc.cores=6)
mc.cores <- 6
out <- att_gt(yname = "y", tname = "time", idname = "id",
gname = "first_treated", data = df, pl = T, cores = mc.cores)
doesn't seem to change anything. I think I'll just wait until #134 is merged. @bcallaway11 please feel free to close this once you merge it.
Should be fixed now!
Hi. Thanks for the neat package -- just digging into it over the past few days. I installed the development branch via
devtools::install_github("bcallaway11/did")
after #132 yesterday. I've got a tibble with 2.9m observations so I've been drawing samples to examine speed (1%, 2%, 5%, even 100%). However, no matter the size of the sample, I never get more than one R process running in Activity Monitor. To be clear: I imagine there might be some overhead to initialize the parallel bootstrapping, but I never see more than one process open up, even while monitoring continuously.
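For reference, one way to draw such subsamples from panel data is to sample units rather than rows, so every sampled unit keeps all of its time periods. The post doesn't say how the samples were actually drawn, so this is only a sketch under that assumption; sample_units is a hypothetical helper and the toy panel stands in for the real data.

```r
set.seed(1)

# Toy balanced panel standing in for the real data (columns are assumptions).
df <- expand.grid(id = 1:200, time = 1:5)
df$y <- rnorm(nrow(df))

# Hypothetical helper: keep a fraction of units, with all of their periods.
sample_units <- function(data, idname, frac) {
  ids <- unique(data[[idname]])
  keep <- sample(ids, size = max(1, floor(frac * length(ids))))
  data[data[[idname]] %in% keep, , drop = FALSE]
}

# A 5% sample of units; each kept unit retains its full time series.
df_5pct <- sample_units(df, "id", 0.05)
```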
I'm on a 2019 Mac Pro running Big Sur 11.6.6. I've got a 3.2GHz 16-core with 192GB of RAM.
Here's the whole script:
Here's the output from sessionInfo():
Please let me know if/how I can provide more info to help troubleshoot.