Public-Health-Scotland / technical-docs

Technical documentation, including guidance and best practice for Public Health Scotland (PHS)
https://public-health-scotland.github.io/knowledge-base/docs

REQ - Guidance on writing R code that will run in parallel #10

Open terrymclaughlin opened 1 year ago

terrymclaughlin commented 1 year ago

For example:

Other useful links:

terrymclaughlin commented 1 year ago

@ciarag01 @jakeybob @Moohan @rmccreath

Just alerting you to this issue. My plan is to draft some guidance, as it's clear that most R code in PHS is written single-threaded and doesn't take advantage of the multiple CPUs available in a Posit Workbench session. Parallelising could yield significant performance improvements when processing large datasets.

terrymclaughlin commented 1 year ago

@CliveWG @fraserstirrat

In case you see any queries coming in requesting guidance on parallel processing, you can tell people that this is on our radar and we're developing guidance for this.

Moohan commented 1 year ago

furrr is low-hanging fruit – it requires code to already be written with purrr, but if it is, it's a super simple switch. There is a bit of overhead in setting up the workers, so it's a judgement call on when it's worth it. I guess that applies to all of the parallelisation methods though!
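A minimal sketch of that switch, assuming the furrr and future packages are available (the toy `sqrt` task is made up for illustration):

```r
library(furrr)

# Sequential version (purrr):
#   results <- purrr::map_dbl(1:100, \(x) sqrt(x))

# Parallel version: set up the workers once, then swap map_dbl()
# for its future_map_dbl() equivalent.
plan(multisession, workers = 4)  # spawns 4 background R processes

results <- future_map_dbl(1:100, \(x) sqrt(x))

plan(sequential)  # shut the workers down when finished
```

For a task this cheap the worker setup overhead will outweigh any gain; the switch pays off when each iteration does substantial work.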

jakeybob commented 1 year ago

I'm not sure what the best route is here. All the different available methods make it quite thorny.

I don't think purrr is used that widely internally at the moment. Or at least, I suspect any code that uses purrr heavily was probably written by a techy person who would be able to convert to furrr easily on their own.

And, I feel like any guidance along the lines of "here are several different ways you can do this" won't be well received.

So do we choose one way to recommend...? This would be better for consistency and support/training but a) I'm not convinced this is the best idea and b) even if it is, I don't know which method would be the best to pick...

Should probably sidestep the foreach and doParallel side of things and go with furrr or multidplyr though, I guess? They're both tidyverse-friendly. multidplyr probably slots into existing dplyr code blocks most easily and has the smaller mental overhead, but 🤷🏻
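For illustration, a hedged sketch of how multidplyr might slot into an existing dplyr pipeline (using the built-in `mtcars` data as a stand-in; only `partition()` and `collect()` are added around an otherwise unchanged grouped summarise):

```r
library(dplyr)
library(multidplyr)

cluster <- new_cluster(4)  # start 4 worker processes

result <- mtcars %>%
  group_by(cyl) %>%
  partition(cluster) %>%            # send each group to a worker
  summarise(mean_mpg = mean(mpg)) %>%
  collect() %>%                     # bring results back to the main session
  arrange(cyl)
```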

Moohan commented 1 year ago

Thought this might be the best place to ask this question, and if no one knows it's just another thing to add to future guidance!

If I use plan(multisession), which is the one you're led to when using RStudio, will this create new nodes on PWB?

For example, if I have a session with 8 CPUs and 4GB of RAM, will this be shared among the 'sessions' or will it spawn new nodes for the new sessions, in which case what limits/specs do they have?

jakeybob commented 1 year ago

I suspect this will run in the current session only and the workers spawned will be more equivalent to "background jobs" (running as independent R processes but sharing the parent session total resources) than "workbench jobs" (starting new sessions with their own resources).

Only one way to find out for sure though – give it a punt and see what happens? 😀
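One way to give it a punt, assuming furrr: compare the process IDs and hostnames reported by the parent session and the workers. If the workers report the same hostname as the parent but different PIDs, they are background R processes on the same machine (sharing the session's total resources) rather than new sessions with their own allocations:

```r
library(furrr)

plan(multisession, workers = 4)

parent <- c(pid = Sys.getpid(), host = Sys.info()[["nodename"]])

# Each worker reports its own PID and hostname. (With 4 workers and 4
# items the PIDs are usually distinct, but the scheduler may reuse
# workers, so don't rely on seeing exactly 4 different PIDs.)
workers <- future_map(1:4, \(i) {
  c(pid = Sys.getpid(), host = Sys.info()[["nodename"]])
})

plan(sequential)
```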