Closed jeancochrane closed 2 weeks ago
- multicore didn't play well with RStudio, which mattered when we were running the model interactively back in the day.
Do you think this is an important enough consideration to continue to support running the script in RStudio @dfsnow? Maybe we could check interactive()
and use multisession
if true and multicore
if false?
- multicore didn't play well with RStudio, which mattered when we were running the model interactively back in the day.
Do you think this is an important enough consideration to continue to support running the script in RStudio @dfsnow? Maybe we could check
interactive()
and usemultisession
if true andmulticore
if false?
I haven't tried running multicore
in awhile in RStudio. Maybe it works fine now? Let's test it. If it wigs out we can add an interactive()
check.
RStudio does indeed seem to block multicore
parallelization:
> plan(multicore, workers = ceiling(num_threads / 2))
Warning message:
In supportsMulticoreAndRStudio(...) :
[ONE-TIME WARNING] Forked processing ('multicore') is not supported when running R
from RStudio because it is considered unstable. For more details, how to control forked
processing or not, and how to silence this warning in future R sessions, see ?parallelly::supportsMulticore
Checking htop
confirms that this causes furrr
to fall back to the sequential
strategy. I updated the code in 583dd38be48eb4a70ddaa44f946918fc3191e18d to use multicore
if supportsMulticore
is TRUE
and multisession
otherwise.
This PR updates the
evaluate
stage of the pipeline to switch from the multisession parallelization strategy to multicore. This change is intended to fix the behavior we've been seeing on the server whereby theevaluate
stage takes so long to complete that it makes development difficult.I'm not sure why this behavior appears to be different from last year, but my experiments with trying different numbers of workers using the "multisession" strategy on a minimal reprex revealed that execution begins to slow down when the number of workers increases past 5 (minimum runtime ~10 minutes). There must be some sort of overhead that the background R processes incur, but I couldn't find anything in the docs explaining it. Switching to the "multicore" strategy resolves this problem of diminishing returns, but incurs the risk of using more memory (due to forked process isolation) and more CPU resources (due to execution using logical cores rather than threads). In order to mitigate this risk, we reduce the number of workers to half the available logical cores on the machine that runs the pipeline. With 16 cores on the server, this causes the
evaluate
stage to execute fast enough (~80 seconds) that I'm not too worried about hogging resources.Note that this change also decreases the execution time for this stage on Batch from 200s to 60s.