ccao-data / model-condo-avm

Automated valuation model for all class 299 and 399 residential condominiums in Cook County
GNU Affero General Public License v3.0
5 stars 2 forks source link

Switch from multisession parallelization to multicore in `evaluate` stage #53

Closed jeancochrane closed 2 weeks ago

jeancochrane commented 2 weeks ago

This PR updates the evaluate stage of the pipeline to switch from the multisession parallelization strategy to multicore. This change is intended to fix the behavior we've been seeing on the server whereby the evaluate stage takes so long to complete that it makes development difficult.

I'm not sure why this behavior appears to be different from last year, but my experiments with trying different numbers of workers using the "multisession" strategy on a minimal reprex revealed that execution begins to slow down when the number of workers increases past 5 (minimum runtime ~10 minutes). There must be some sort of overhead that the background R processes incur, but I couldn't find anything in the docs explaining it. Switching to the "multicore" strategy resolves this problem of diminishing returns, but incurs the risk of using more memory (due to forked process isolation) and more CPU resources (due to execution using logical cores rather than threads). In order to mitigate this risk, we reduce the number of workers to half the available logical cores on the machine that runs the pipeline. With 16 cores on the server, this causes the evaluate stage to execute fast enough (~80 seconds) that I'm not too worried about hogging resources.

Note that this change also decreases the execution time for this stage on Batch from 200s to 60s.

jeancochrane commented 2 weeks ago
  1. multicore didn't play well with RStudio, which mattered when we were running the model interactively back in the day.

Do you think this is an important enough consideration to continue to support running the script in RStudio @dfsnow? Maybe we could check interactive() and use multisession if true and multicore if false?

dfsnow commented 2 weeks ago
  1. multicore didn't play well with RStudio, which mattered when we were running the model interactively back in the day.

Do you think this is an important enough consideration to continue to support running the script in RStudio @dfsnow? Maybe we could check interactive() and use multisession if true and multicore if false?

I haven't tried running multicore in awhile in RStudio. Maybe it works fine now? Let's test it. If it wigs out we can add an interactive() check.

jeancochrane commented 2 weeks ago

RStudio does indeed seem to block multicore parallelization:

> plan(multicore, workers = ceiling(num_threads / 2))

Warning message:
In supportsMulticoreAndRStudio(...) :
  [ONE-TIME WARNING] Forked processing ('multicore') is not supported when running R
from RStudio because it is considered unstable. For more details, how to control forked
processing or not, and how to silence this warning in future R sessions, see ?parallelly::supportsMulticore

Checking htop confirms that this causes furrr to fall back to the sequential strategy. I updated the code in 583dd38be48eb4a70ddaa44f946918fc3191e18d to use multicore if supportsMulticore is TRUE and multisession otherwise.