Thanks a lot for your advice! I created #12 to test your proposed changes. I replaced `multisession` with `multicore` to use forking, but the execution now seems to be sequential. Do you have an idea why that might be the case?
Are you running in RStudio? If so, see Section 'Support for forked ("multicore") processing' in https://future.futureverse.org/reference/multicore.html and follow the link to https://parallelly.futureverse.org/reference/supportsMulticore.html for how to re-enable it, e.g. temporarily just for your function. But be careful to read up on the details for why it's disabled by default.
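For reference, a minimal sketch of how to check for, and temporarily re-enable, forked processing, based on the parallelly documentation (the surrounding workflow is illustrative):

```r
library(parallelly)
library(future)

supportsMulticore()  # FALSE in the RStudio console, TRUE in a plain terminal R session on Unix

## Opt back in to forked processing, e.g. only temporarily around one call
old_opts <- options(parallelly.fork.enable = TRUE)
plan(multicore, workers = 2)
# ... run the parallelized code ...
plan(sequential)
options(old_opts)
```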
That's a great point, thanks a lot for clearing it up. I was indeed using RStudio. Running the same code from the terminal works perfectly.

I now also replaced `foreach::%dopar%` with `doRNG::%dorng%`.

Do you have any other suggestions? Otherwise I'd merge the PR.
> That's a great point, thanks a lot for clearing it up. I was indeed using RStudio. Running the same code from the terminal works perfectly.
:+1:
> I now also replaced `foreach::%dopar%` with `doRNG::%dorng%`.
Oh, great. Happy to hear it helped to catch a potential RNG issue.
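For context, a minimal sketch of the `%dorng%` pattern (a generic example, not the actual pareg code): with doRNG, each foreach iteration gets its own parallel-safe random number stream, and results become reproducible across backends.

```r
library(doRNG)
library(doFuture)   # attaches foreach and future

registerDoFuture()
plan(multisession, workers = 2)

set.seed(42)
res1 <- foreach(i = 1:4) %dorng% rnorm(1)
set.seed(42)
res2 <- foreach(i = 1:4) %dorng% rnorm(1)

identical(res1, res2)  # TRUE: same seed gives the same results, also in parallel

plan(sequential)
```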
> Do you have any other suggestions? Otherwise I'd merge the PR.
In your test script, you're now using:

```r
plan(multicore, workers = 1)
```

Note that with `workers = 1`, this will fall back to `plan(sequential)` by design. So, if you really want to test with parallelization, use:

```r
plan(multicore, workers = 2)
```
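A quick way to see that fallback (illustrative check):

```r
library(future)

plan(multicore, workers = 1)
nbrOfWorkers()  # 1: effectively the same as plan(sequential)

plan(multicore, workers = 2)
nbrOfWorkers()  # 2 on systems where forking is supported
```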
Also, only now I spotted a mistake in your old code:

https://github.com/cbg-ethz/pareg/blob/b7ce2d6e9488b1eb1fdcdd784ce8a4396b7bb97f/R/main.R#L75-L80

That `cl <- makeCluster(cv_cores, outfile = "")` is never used anywhere. You basically created a new PSOCK cluster with `cv_cores` workers each time `pareg()` was called, and that cluster was never stopped. This means, if you called `pareg(..., cv_cores = 8)` ten times, you'd have 80 parallel background workers just sitting there waiting. They would not be shut down until you quit R (and I'm not even sure they'd be shut down automatically on MS Windows).
I suspect you meant to do:
```r
if (cv_cores == 1) {
  registerDoSEQ()
} else {
  cl <- makeCluster(cv_cores, outfile = "")  # enable printing to stdout
  on.exit(stopCluster(cl))
  registerDoParallel(cl)
}
```
but I'm not sure. Note that this would run a PSOCK cluster also on Linux and macOS, just like on MS Windows. That corresponds to:

```r
plan(multisession, workers = cv_cores)
```

and not

```r
plan(multicore, workers = cv_cores)
```
So, I think you need to decide if you want to do 'multicore' (forked processing) or 'multisession' (PSOCK cluster) here. Personally, I recommend 'multisession' because forked processing is not stable on all systems, and might fail if the parallelized code runs multi-threaded code (as mentioned in the links in my previous comment).
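If you go the 'multisession' route, the explicit cluster management disappears entirely. A rough sketch, assuming doFuture is used as the foreach adapter (this is not the actual pareg code):

```r
library(doFuture)   # attaches foreach and future
registerDoFuture()  # let foreach's %dopar% / %dorng% run on the current future plan

cv_cores <- 2  # example value; in pareg this is the function argument

if (cv_cores == 1) {
  plan(sequential)
} else {
  plan(multisession, workers = cv_cores)  # background R sessions; no stopCluster() needed
}
```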
The next mental step is to drop the `cv_cores` argument altogether, and instead instruct the user to set whatever parallel backend they prefer, e.g. `plan(multisession)`, `plan(future.callr::callr)`, ... The motto of the future framework is that the developer (you) decides what code to parallelize, whereas the end-user decides how to parallelize. (As developers, we can't really know what compute resources each user has.)
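A sketch of that split; the function and data below are purely illustrative, not from pareg:

```r
## Package side: declare *what* is parallelized (no worker-count argument)
library(doFuture)   # attaches foreach and future
library(doRNG)
registerDoFuture()

run_cv <- function(folds) {
  foreach(fold = folds) %dorng% {
    mean(fold)  # stand-in for the per-fold model fit
  }
}

## User side: decide *how* to parallelize
plan(multisession, workers = 2)  # or plan(sequential), plan(future.callr::callr), ...
res <- run_cv(split(rnorm(100), rep(1:4, each = 25)))
plan(sequential)
```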
> Note that with `workers = 1`, this will fall back to `plan(sequential)` by design. So, if you really want to test with parallelization, use `plan(multicore, workers = 2)`.
So far the tests were sequential on purpose, but you're right that a parallelized test should also be included. I added one now, thanks!
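A minimal sketch of what such a test could look like (illustrative only; not the actual pareg test suite):

```r
library(testthat)
library(doFuture)  # attaches foreach and future
library(doRNG)

registerDoFuture()

test_that("the foreach loop also works with a parallel backend", {
  old_plan <- plan(multisession, workers = 2)
  on.exit(plan(old_plan), add = TRUE)

  res <- foreach(i = 1:4) %dorng% sqrt(i)
  expect_equal(unlist(res), sqrt(1:4))
})
```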
> That `cl <- makeCluster(cv_cores, outfile = "")` is never used anywhere. You basically created a new PSOCK cluster with `cv_cores` workers each time `pareg()` was called, and that cluster was never stopped. […] I suspect you meant to do: […] but I'm not sure.
Whoops, good catch. You're absolutely right!
> So, I think you need to decide if you want to do 'multicore' (forked processing) or 'multisession' (PSOCK cluster) here. Personally, I recommend 'multisession' […]
My main goal was to be able to run my parallelized cross-validation code without needing an explicitly installed `pareg` package (so that my Snakemake benchmarks can just rely on `devtools::load_all()`). It seemed like forking could achieve something like this. But I then started using my own LSF parallelization wrapper, so this approach didn't matter that much anymore.
> The next mental step is to drop the `cv_cores` argument altogether, and instead instruct the user to set whatever parallel backend they prefer, e.g. `plan(multisession)`, `plan(future.callr::callr)`, … The motto of the future framework is that the developer (you) decides what code to parallelize, whereas the end-user decides how to parallelize.
I totally agree with this mentality; I read about it multiple times when learning about futures :-)

My thinking is that I am already following this approach, because `cv_cores` is set to `NULL` by default, so the "plan paradigm" works as it should. The `cv_cores` parameter mostly remains as a convenience wrapper.
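Presumably something along these lines; a sketch of the described convenience behavior, not the actual pareg source:

```r
pareg <- function(..., cv_cores = NULL) {
  # Convenience wrapper: only touch the plan if the user explicitly asks for it;
  # otherwise respect whatever plan() the user has already set up.
  if (!is.null(cv_cores)) {
    old_plan <- future::plan(future::multisession, workers = cv_cores)
    on.exit(future::plan(old_plan), add = TRUE)
  }
  # ... cross-validation code parallelized via foreach / futures ...
}
```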
> My thinking is that I am already following this approach, because `cv_cores` is set to `NULL` by default, so the "plan paradigm" works as it should. The `cv_cores` parameter mostly remains as a convenience wrapper.
I see. That's great.
Sounds like everything is good now.
Great! Thank you so much for your comments :-)
Hi, I'm running reverse dependency checks on future and happened to spot the following comment:

https://github.com/cbg-ethz/pareg/blob/b7ce2d6e9488b1eb1fdcdd784ce8a4396b7bb97f/R/main.R#L75-L80

Note that `outfile = ""` only works on some R setups (typically only on Unix and in the terminal). It's also "false" output in the sense that it's never seen by R, e.g. you cannot use `capture.output()` to grab it.

(Disclaimer: I'm the author.) If you instead parallelize via the future framework, all your output (`cat()`, `print()`, `message()`, `warning()`, and so on) is truly relayed back to the main R session. In your case, the switch would be to something like:
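Presumably, the suggested switch would look something like the following sketch, reconstructed from the rest of the thread and assuming doFuture as the foreach adapter (not the exact original code):

```r
library(doFuture)  # attaches foreach and future
library(doRNG)

registerDoFuture()
plan(multisession, workers = 2)  # output from the workers is relayed to the main session

res <- foreach(i = 1:4) %dorng% {
  message("Processing fold ", i)  # message()/print()/warning() reach the main R session
  sqrt(i)
}

plan(sequential)
```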