Thanks a lot for your advice! I created #12 to test your proposed changes. I replaced `multisession` with `multicore` to use forking, but the execution now seems to be sequential. Do you have an idea why that might be the case?
Are you running in RStudio? If so, see Section 'Support for forked ("multicore") processing' in https://future.futureverse.org/reference/multicore.html and follow the link to https://parallelly.futureverse.org/reference/supportsMulticore.html for how to re-enable it, e.g. temporarily just for your function. But be careful to read up on the details for why it's disabled by default.
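For reference, a minimal sketch of how to check for, and temporarily re-enable, forked processing, based on the parallelly documentation (the surrounding workflow is illustrative):

```r
library(parallelly)
library(future)

supportsMulticore()  # FALSE in the RStudio console, TRUE in a plain terminal R session on Unix

## Opt back in to forked processing, e.g. only temporarily around one call
old_opts <- options(parallelly.fork.enable = TRUE)
plan(multicore, workers = 2)
# ... run the parallelized code ...
plan(sequential)
options(old_opts)
```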
That's a great point, thanks a lot for clearing it up. I was indeed using RStudio. Running the same code from the terminal works perfectly.

I now also replaced `foreach::%dopar%` with `doRNG::%dorng%`.

Do you have any other suggestions? Otherwise I'd merge the PR.
> That's a great point, thanks a lot for clearing it up. I was indeed using RStudio. Running the same code from the terminal works perfectly.
:+1:
> I now also replaced `foreach::%dopar%` with `doRNG::%dorng%`.
Oh, great. Happy to hear it helped to catch a potential RNG issue.
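For context, a minimal sketch of the `%dorng%` pattern (a generic example, not the actual pareg code): with doRNG, each foreach iteration gets its own parallel-safe random number stream, and results become reproducible across backends.

```r
library(doRNG)
library(doFuture)   # attaches foreach and future

registerDoFuture()
plan(multisession, workers = 2)

set.seed(42)
res1 <- foreach(i = 1:4) %dorng% rnorm(1)
set.seed(42)
res2 <- foreach(i = 1:4) %dorng% rnorm(1)

identical(res1, res2)  # TRUE: same seed gives the same results, also in parallel

plan(sequential)
```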
> Do you have any other suggestions? Otherwise I'd merge the PR.
In your test script, you're now using:

```r
plan(multicore, workers = 1)
```

Note that with `workers = 1`, this will fall back to `plan(sequential)` by design. So, if you really want to test with parallelization, use:

```r
plan(multicore, workers = 2)
```
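A quick way to see that fallback (illustrative check):

```r
library(future)

plan(multicore, workers = 1)
nbrOfWorkers()  # 1: effectively the same as plan(sequential)

plan(multicore, workers = 2)
nbrOfWorkers()  # 2 on systems where forking is supported
```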
Also, only now I spotted a mistake in your old code:

https://github.com/cbg-ethz/pareg/blob/b7ce2d6e9488b1eb1fdcdd784ce8a4396b7bb97f/R/main.R#L75-L80

That `cl <- makeCluster(cv_cores, outfile = "")` is never used anywhere. You basically created a new PSOCK cluster with `cv_cores` workers each time `pareg()` was called, and that cluster was never stopped. This means, if you called `pareg(..., cv_cores = 8)` ten times, you'd have 80 parallel background workers just sitting there waiting. They would not be shut down until you quit R (and I'm not even sure they'd be shut down automatically on MS Windows).
I suspect you meant to do:
```r
if (cv_cores == 1) {
  registerDoSEQ()
} else {
  cl <- makeCluster(cv_cores, outfile = "")  # enable printing to stdout
  on.exit(stopCluster(cl))
  registerDoParallel(cl)
}
```
but I'm not sure. Note that this would run a PSOCK cluster also on Linux and macOS, just like on MS Windows. That corresponds to:

```r
plan(multisession, workers = cv_cores)
```

and not

```r
plan(multicore, workers = cv_cores)
```
So, I think you need to decide if you want to do 'multicore' (forked processing) or 'multisession' (PSOCK cluster) here. Personally, I recommend 'multisession' because forked processing is not stable on all systems, and might fail if the parallelized code runs multi-threaded code (as mentioned in the links in my previous comment).
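If you go the 'multisession' route, the explicit cluster management disappears entirely. A rough sketch, assuming doFuture is used as the foreach adapter (this is not the actual pareg code):

```r
library(doFuture)   # attaches foreach and future
registerDoFuture()  # let foreach's %dopar% / %dorng% run on the current future plan

cv_cores <- 2  # example value; in pareg this is the function argument

if (cv_cores == 1) {
  plan(sequential)
} else {
  plan(multisession, workers = cv_cores)  # background R sessions; no stopCluster() needed
}
```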
The next mental step is to drop the `cv_cores` argument altogether, and instead instruct the user to set whatever parallel backend they prefer, e.g. `plan(multisession)`, `plan(future.callr::callr)`, ... The motto of the future framework is that the developer (you) decides what code to parallelize, whereas the end-user decides how to parallelize. (As developers, we can't really know what compute resources each user has.)
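A sketch of that split; the function and data below are purely illustrative, not from pareg:

```r
## Package side: declare *what* is parallelized (no worker-count argument)
library(doFuture)   # attaches foreach and future
library(doRNG)
registerDoFuture()

run_cv <- function(folds) {
  foreach(fold = folds) %dorng% {
    mean(fold)  # stand-in for the per-fold model fit
  }
}

## User side: decide *how* to parallelize
plan(multisession, workers = 2)  # or plan(sequential), plan(future.callr::callr), ...
res <- run_cv(split(rnorm(100), rep(1:4, each = 25)))
plan(sequential)
```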
> Note that with `workers = 1`, this will fall back to `plan(sequential)` by design. So, if you really want to test with parallelization, use `plan(multicore, workers = 2)`.
So far the tests were sequential on purpose, but you're right that a parallelized test should also be included. I added one now, thanks!
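A minimal sketch of what such a test could look like (illustrative only; not the actual pareg test suite):

```r
library(testthat)
library(doFuture)  # attaches foreach and future
library(doRNG)

registerDoFuture()

test_that("the foreach loop also works with a parallel backend", {
  old_plan <- plan(multisession, workers = 2)
  on.exit(plan(old_plan), add = TRUE)

  res <- foreach(i = 1:4) %dorng% sqrt(i)
  expect_equal(unlist(res), sqrt(1:4))
})
```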
> That `cl <- makeCluster(cv_cores, outfile = "")` is never used anywhere. You basically created a new PSOCK cluster with `cv_cores` workers each time `pareg()` was called, and that cluster was never stopped. […] I suspect you meant to do: […] but I'm not sure.
Whoops, good catch. You're absolutely right!
> So, I think you need to decide if you want to do 'multicore' (forked processing) or 'multisession' (PSOCK cluster) here. Personally, I recommend 'multisession' […]
My main goal was to be able to run my parallelized cross-validation code without needing an explicitly installed `pareg` package (so that my Snakemake benchmarks can just rely on `devtools::load_all()`). It seemed like forking could achieve something like this. But I then started using my own LSF parallelization wrapper, so this approach didn't matter that much anymore.
> The next mental step is to drop the `cv_cores` argument altogether, and instead instruct the user to set whatever parallel backend they prefer, e.g. `plan(multisession)`, `plan(future.callr::callr)`, … The motto of the future framework is that the developer (you) decides what code to parallelize, whereas the end-user decides how to parallelize.
I totally agree with this mentality; I read about it multiple times when learning about futures :-)

My thinking is that I am already following this approach, because `cv_cores` is set to `NULL` by default, so the "plan paradigm" works as it should. The `cv_cores` parameter mostly remains as a convenience wrapper.
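Presumably something along these lines; a sketch of the described convenience behavior, not the actual pareg source:

```r
pareg <- function(..., cv_cores = NULL) {
  # Convenience wrapper: only touch the plan if the user explicitly asks for it;
  # otherwise respect whatever plan() the user has already set up.
  if (!is.null(cv_cores)) {
    old_plan <- future::plan(future::multisession, workers = cv_cores)
    on.exit(future::plan(old_plan), add = TRUE)
  }
  # ... cross-validation code parallelized via foreach / futures ...
}
```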
> My thinking is that I am already following this approach, because `cv_cores` is set to `NULL` by default, so the "plan paradigm" works as it should. The `cv_cores` parameter mostly remains as a convenience wrapper.
I see. That's great.
Sounds like everything is good now.
Great! Thank you so much for your comments :-)
Hi, I'm running reverse dependency checks on future and happened to spot the following comment:

https://github.com/cbg-ethz/pareg/blob/b7ce2d6e9488b1eb1fdcdd784ce8a4396b7bb97f/R/main.R#L75-L80

Note that `outfile = ""` only works on some R setups (typically only on Unix and in the terminal). It's also "false" output in the sense that it's never seen by R, e.g. you cannot use `capture.output()` to grab it.

(Disclaimer: I'm the author.) If you instead parallelize via the future framework, all your output (`cat()`, `print()`, `message()`, `warning()`, and so on) is truly relayed back to the main R session. In your case, the switch would be to something like:
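Presumably, the suggested switch would look something like the following sketch, reconstructed from the rest of the thread and assuming doFuture as the foreach adapter (not the exact original code):

```r
library(doFuture)  # attaches foreach and future
library(doRNG)

registerDoFuture()
plan(multisession, workers = 2)  # output from the workers is relayed to the main session

res <- foreach(i = 1:4) %dorng% {
  message("Processing fold ", i)  # message()/print()/warning() reach the main R session
  sqrt(i)
}

plan(sequential)
```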