terrymclaughlin opened 1 year ago
@ciarag01 @jakeybob @Moohan @rmccreath
Just alerting you to this issue. My plan is to draft some guidance, as it's clear that most R code in PHS is written single-threaded and doesn't take advantage of the multiple CPUs available in a Posit Workbench session. Parallelising could deliver significant performance improvements when processing large datasets.
@CliveWG @fraserstirrat
In case you see any queries coming in requesting guidance on parallel processing, you can tell people that this is on our radar and we're developing guidance for this.
`furrr` is low-hanging fruit: it requires code to already be written to use `purrr`, but if it has been, it's a super simple switch. There is a bit of overhead in 'setting up the workers', so it's a subjective call as to when it's worth it. I guess that applies to all of the parallelisation methods, though!
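To make the switch concrete, here's a minimal sketch (the `slow_task` function, the `inputs` list, and the worker count are all made-up stand-ins, not PHS guidance):

```r
library(purrr)
library(furrr)
library(future)

# Hypothetical stand-ins for real work and data
slow_task <- function(x) { Sys.sleep(1); x^2 }
inputs <- as.list(1:8)

# Single-threaded purrr version
res <- map(inputs, slow_task)

# furrr version: same call shape, plus a plan() to set up the workers
plan(multisession, workers = 4)
res <- future_map(inputs, slow_task)

plan(sequential)  # revert to single-threaded when done
```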
I'm not sure what the best route is here. All the different available methods make it quite thorny.
I don't think `purrr` is used that widely internally at the moment. Or at least, I suspect any code that uses `purrr` heavily was probably written by a techy person who would be able to convert to `furrr` easily on their own.
And, I feel like any guidance along the lines of "here are several different ways you can do this" won't be well received.
So do we choose one way to recommend...? This would be better for consistency and support/training but a) I'm not convinced this is the best idea and b) even if it is, I don't know which method would be the best to pick...
Should probably sidestep the `foreach` and `doParallel` side of things and go with `furrr` or `multidplyr` though, I guess? They're both `tidyverse`-friendly. `multidplyr` probably slots into existing `dplyr` code blocks the easiest and has the smaller mental overhead, but 🤷🏻
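For a feel of how it slots in, here's a minimal `multidplyr` sketch (the data frame, its columns, and the worker count are invented for illustration):

```r
library(dplyr)
library(multidplyr)

cluster <- new_cluster(4)  # spin up four worker processes

# Invented example data
df <- tibble(g = rep(1:8, each = 1000), x = rnorm(8000))

df %>%
  group_by(g) %>%
  partition(cluster) %>%           # spread the groups across the workers
  summarise(mean_x = mean(x)) %>%  # ordinary dplyr from here on
  collect()                        # gather the results back into one tibble
```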
Thought this might be the best place to ask this question, and if no one knows it's just another thing to add to future guidance!
If I use `plan(multisession)`, which is the one you're led to when using RStudio, on PWB will this create new nodes? For example, if I have a session with 8 CPUs and 4GB of RAM, will this be shared among the 'sessions', or will it spawn new nodes for the new sessions, in which case what limits/specs do they have?
I suspect this will run in the current session only, and the workers spawned will be more equivalent to "background jobs" (running as independent R processes but sharing the parent session's total resources) than "workbench jobs" (starting new sessions with their own resources).
Only one way to find out for sure though – give it a punt and see what happens? 😀
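One way to give it that punt (a rough sketch to run inside a Workbench session, not a definitive test): compare the workers' process IDs and hostnames with the parent session's.

```r
library(future)
library(furrr)

plan(multisession)
nbrOfWorkers()                                   # how many workers the plan created

future_map_int(1:4, ~ Sys.getpid())              # workers' PIDs (compare with Sys.getpid() in the parent)
future_map_chr(1:4, ~ Sys.info()[["nodename"]])  # all the same host = no new nodes

plan(sequential)
```

If the workers' PIDs differ from the parent's but everything reports the same hostname, that points to independent local R processes sharing the session's resources, i.e. the "background jobs" picture above.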
For example:

- `{parallelly}` (see the sketch after this list)
- `{furrr}` package, rather than `{purrr}`
- `{multidplyr}` backend with `{dplyr}`
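On the `{parallelly}` point, a small sketch of why it earns a mention: my understanding is that it reports the cores a session is actually allowed to use (accounting for container/cgroups limits), rather than the whole host, which matters on shared Workbench servers.

```r
library(parallelly)

availableCores()         # cores this session may actually use
parallel::detectCores()  # raw hardware count; can over-report on shared servers
```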
Other useful links: