Bioconductor / BiocParallel

Bioconductor facilities for parallel evaluation
https://bioconductor.org/packages/BiocParallel

For Developers Vignette Section Lacks a Code Example #217

Closed DarioS closed 2 years ago

DarioS commented 2 years ago

"Developers wishing to invoke back-ends other than MulticoreParam, or to write code that works across Windows, macOS and Linux, need to take special care to ensure that required packages, data, and functions are available and loaded on the remote nodes."

I have recently reviewed a couple of Bioconductor packages, one before submission and one in submission, and I noticed that they simply call bplapply and don't allow the user to specify a backend. I would like an example to point their authors to that shows how to write a loop that parallelises on any operating system. Also, I wonder whether bpparam() could have ... as a formal argument. It currently chooses a suitable backend automatically depending on the operating system, but it would be nice if workers could be passed in via ... to customise the number of CPUs while still selecting the backend automatically.

"integer(1) Number of workers. Defaults to all cores available as determined by detectCores."

We have had issues with this on a shared departmental server: when two or more people run an analysis at the same time, the server gets overburdened. Also, I think the text is outdated. According to the Details section, and from what I see if I create a param object, the default is actually "the maximum of 1 and parallel::detectCores() - 2.", so the Arguments section text could use an update.
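The kind of example requested here might look like the following minimal sketch. The names chooseParam and rowTotals are hypothetical and are not part of BiocParallel; the sketch assumes that SnowParam is an acceptable choice on Windows and MulticoreParam elsewhere, and that exposing a BPPARAM argument is how a package would let its users override that choice.

```r
library(BiocParallel)

# Hypothetical helper: pick a back-end by operating system while honouring a
# caller-supplied worker count.
chooseParam <- function(workers = 2L) {
    if (.Platform$OS.type == "windows") {
        SnowParam(workers = workers)       # separate R processes
    } else {
        MulticoreParam(workers = workers)  # forked processes, not supported on Windows
    }
}

# Hypothetical package function: the caller can override the back-end
# through the BPPARAM argument.
rowTotals <- function(x, BPPARAM = chooseParam()) {
    totals <- bplapply(
        seq_len(nrow(x)),
        function(i, x) sum(x[i, ]),  # data is passed explicitly so it reaches every worker
        x = x,
        BPPARAM = BPPARAM
    )
    unlist(totals)
}

m <- matrix(1:6, nrow = 2)
# SerialParam() keeps this demonstration light; any other param object
# could be supplied instead.
result <- rowTotals(m, BPPARAM = SerialParam())
```

Passing x through the ... of bplapply, rather than relying on it being found in the calling environment, is one way to make sure the data is available regardless of which back-end is in use.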

mtmorgan commented 2 years ago

The quote

Developers wishing to invoke back-ends ...

is really about ensuring that the workers have appropriate packages accessible, etc., rather than choosing the backend appropriate for the OS. E.g., a worker function using findOverlaps() might instead need to specify GenomicRanges::findOverlaps(); this problem has been mitigated quite a bit with work from Jiefei in the last release cycle. Please feel free to provide a pull request with updated documentation.
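A minimal sketch of the namespace-qualification idea follows. To keep it free of extra dependencies it substitutes tools::file_ext() for GenomicRanges::findOverlaps(); that substitution, the function name fileExtension and the file names are illustrative assumptions only.

```r
library(BiocParallel)

# The worker function qualifies the call with its namespace, so it does not
# depend on the 'tools' package being attached in the worker process.
fileExtension <- function(path) {
    tools::file_ext(path)
}

paths <- c("reads.bam", "peaks.bed", "notes")

# SerialParam() keeps the sketch light; SnowParam(workers = 2) would run the
# same function in fresh R processes.
exts <- unlist(bplapply(paths, fileExtension, BPPARAM = SerialParam()))
```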

If the developer says bplapply(1:10, sqrt), then the user can choose a backend using, e.g., register(MulticoreParam(workers = 4)). I think this is a flexible and straightforward way of supporting typical use cases; there are additional ways (see next paragraph) for specifying worker number in particular.
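As a rough sketch of that division of labour (the function name squareRoots is hypothetical, and SerialParam() is registered here only so the example runs the same way on any machine):

```r
library(BiocParallel)

# Developer side: no back-end is hard-coded, so bplapply() falls back to
# whatever back-end is currently registered.
squareRoots <- function(x) {
    unlist(bplapply(x, sqrt))
}

# User side: register a back-end once; it becomes the default for later calls.
# On Linux or macOS this could be register(MulticoreParam(workers = 4)).
register(SerialParam())

roots <- squareRoots(c(1, 4, 9))
```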

The environment variables R_PARALLELLY_AVAILABLECORES_FALLBACK and BIOCPARALLEL_WORKER_NUMBER and the option mc.cores allow the user or administrator to regulate the number of cores. I personally feel this is better than implementing functionality (parallel evaluation) that by design does not exploit the available resources. You're right that the documentation is out of date and again a (separate) pull request is welcome.
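A small sketch of capping the default worker count from within a session, using the mc.cores option mentioned above. It assumes that a param object created without a workers argument consults this option; the environment variables would more typically be set outside R, for example in a shell profile or an Renviron file.

```r
library(BiocParallel)

# Cap the default number of workers for this R session.
options(mc.cores = 2L)

# No 'workers' argument is given, so the default is computed at construction.
# Constructing the object does not start any worker processes.
param <- SnowParam()

# Under the assumption above, this should not exceed the cap.
nWorkers <- bpnworkers(param)
```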