Bioconductor / BiocParallel

Bioconductor facilities for parallel evaluation
https://bioconductor.org/packages/BiocParallel
67 stars 29 forks source link

BiocParallel should default to serial execution, even with MulticoreParam() #141

Closed mtmorgan closed 3 years ago

mtmorgan commented 3 years ago

This seems to be the consensus of a slack conversation (https://community-bioc.slack.com/archives/CLUJWDQF4/p1621279407150700). parallelly::availableCores() is a more complicated solution that might appeal to some.

HenrikBengtsson commented 3 years ago

See https://parallelly.futureverse.org/#availablecores-vs-paralleldetectcores for a motivation behind parallelly::availableCores() and how it compares to low-level parallel::detectCores().

mtmorgan commented 3 years ago

As a user I want to be able to use, interactively bplapply(1:10, FUN) and have parallel evaluation, otherwise I would have used lapply(). So I'm reconsidering changing the default. Instead, the branch https://github.com/Bioconductor/BiocParallel/tree/BIOCPARALLEL_WORKER_NUMBER introduces an environment variable BIOCPARALLEL_WORKER_NUMBER that cluster administrators can set to limit the default number of workers.

HenrikBengtsson commented 3 years ago

Thanks for adding this type of feature. It'll help sysadms to keep shared computers running smoothly. I just want to make sure you're aware I've already defined the environment variable:

R_PARALLELLY_AVAILABLECORES_FALLBACK

in parallelly for this purpose, cf. https://parallelly.futureverse.org/reference/parallelly.options.html. For example, the sysadm can set:

R_PARALLELLY_AVAILABLECORES_FALLBACK=1

globally to "signal" that the preferred number of cores R should use is one, unless overridden by arguments, user settings, or schedulers.

I know it might be a bit obscure for pure BioParallel developers, but would you consider using the R_PARALLELLY_AVAILABLECORES_FALLBACK name instead? That way sysadms/users/devs only have to know about one rather that two environment variable names to control this.

It's been in future since Feb 2017 and moved over to parallelly when that was first released. As I write in https://parallelly.futureverse.org/#availablecores-vs-paralleldetectcores, I really wish R itself would define such a setting so it doesn't have to be re-invented in multiple places, e.g. https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17641.

DarwinAwardWinner commented 3 years ago

Is there any reason it can't support both? Give priority to BIOCPARALLEL_WORKER_NUMBER since that is package-specific, and if that is unset fall back to the more general R_PARALLELLY_AVAILABLECORES_FALLBACK. And perhaps, if one doesn't already exist, some kind of function to register the default param, with a verbose option to explain how it chose what param and options to use, e.g. something like "Found environment variable BIOCPARALLEL_WORKER_NUMBER=4, using registering MulticoreParam(4)".

HenrikBengtsson commented 3 years ago

Just to clarify my comment: If BIOCPARALLEL_WORKER_NUMBER had already existed, I probably wouldn't have introduced R_PARALLELLY_AVAILABLECORES_FALLBACK but instead had parallelly acknowledge BIOCPARALLEL_WORKER_NUMBER (*). My main concern here is that the people who are the most interested in setting this type of environment variables are sysadms and we shouldn't expect them to be well versed in the R ecosystem, so for their sake we should have a single common way of doing this and whatever internet searches they do should point to the same solution. So, I'm basically hoping to minimize https://xkcd.com/927/

(*) Either way, if BiocParallel introduces BIOCPARALLEL_WORKER_NUMBER, I'll update parallelly to recognize that as well so that such settings are respected by any code using parallelly::availableCores(), and parallelly::availableWorkers(), e.g. future::plan(multicore), future::plan(multisession), ..., https://cran.r-project.org/package=parallelly

HenrikBengtsson commented 3 years ago

Give priority to BIOCPARALLEL_WORKER_NUMBER since that is package-specific, ...

I'd like to argue that this type of settings should be global for R that all parallelization code should respect. I'm not sure if there's a good argument for such a setting to be package specific, or specific to Bioconductor pipelines.

mtmorgan commented 3 years ago

I added support for R_PARALLELLY_AVAILABLECORES_FALLBACK, which is over-ridden by BIOCPARALLEL_WORKER_NUMBER etc. I didn't support this initially because 'available cores' isn't really what is relevant, and because it's strange to use functionality from a package that is not even Suggest'ed. Also from the parallely documentation (man page and https://parallelly.futureverse.org/#availablecores-vs-paralleldetectcores) it seems like this env. variable is both more ambitious (better integrated with batch jobs, for instance) and more restrictive (not influencing PSOCK worker number?). And of course it's life and documentation is outside the control of the BiocParallel package.

BiocParallel tries to re-use code and concepts from the parallel package, so if there were relevant environment variable(s) there then these would certainly be preferable.

HenrikBengtsson commented 3 years ago

Thanks. I'll update parallelly to acknowledge BIOCPARALLEL_WORKER_NUMBER.

... because 'available cores' isn't really what is relevant ...

Could you elaborate a bit on this? Is it a technical and philosophical comment?

... because it's strange to use functionality from a package that is not even Suggest'ed.

I agree with this.

Also from the parallely documentation [...] it seems like this env. variable is both more ambitious (better integrated with batch jobs, for instance) ...

The R_PARALLELLY_AVAILABLECORES_FALLBACK environment variable is designed to take precedence over parallel:detectCores() if set (just like BIOCPARALLEL_WORKER_NUMBER). parallelly::availableCores() respects this design. Now, if there are other settings, e.g. option mc.cores or environment variables such as those set by HPC schedulers to communicate what's available to the job process, then, in turn, parallelly::availableCores() is implemented so those take precedence. This is because we trust the scheduler to report the right thing. The thing we trust the least is detectCores().

...and more restrictive (not influencing PSOCK worker number?)

Nah - the logic is parallelly::availableCores() is that looks at options, environment variables, parallel::detectCores() and nproc, and returns the number of cores according to above priority order. Is there a specific passage that I could update to make this more clear?

BTW, when using parallelly::makeClusterPSOCK(workers) one has to specify workers explicitly, but I'm considering making workers = parallelly::availableCores() the default at some point. When using futures, then plan(multisession) defaults to plan(multisession, workers = parallelly::availableCores()), which uses parallelly::makeClusterPSOCK(workers) to set up the PSOCK workers.

BiocParallel tries to re-use code and concepts from the parallel package, so if there were relevant environment variable(s) there then these would certainly be preferable.

That's the strategy I use for parallelly too. If you have the energy and spare cycles, it would be great if you could pitch in on https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17641 of introducing a de-facto standard environment variable for this in R, especially since you're part of the R Core team. From that thread, it sounds like Tomas K doesn't want to change the behavior of detectCores(), but if we could document something like:

R code that runs in parallel should not run using more CPU cores than environment variable R_MAX_CORES specified, if set.

I think that would be a step forward.

PS. Since CRAN package can depend on BiocParallel, you could also check environment variable _R_CHECK_LIMIT_CORES_, and if set, then return 2 (the maximum number parallel workers CRAN allows during checks). This would automatically save an CRAN package maintainer from overuse when one of their package dependencies rely on BiocParallel.

mtmorgan commented 3 years ago

Thanks Henrik for your input.

HenrikBengtsson commented 3 years ago

FYI, in the next release of parallelly we'll have:

One can't really control the number of cores used by a worker (e.g., perhaps R has a multi-threaded BLAS, or the worker itself uses some other parallelization package), and BiocParallel the number of cores is used to determine the default number of workers launched. So it seems appropriate to call the variable after what it's being used for, rather than what the variable is providing a proxy for.

I agree.

Regarding controlling nested parallelism: In the future framework (not parallelly, which I try to keep low-level), I'm trying to at least pass on info to workers to give parallel code a chance to know they're running in an already parallel worker. I do this by setting different options and environment variables (e.g. mc.cores=1 and soon also BIOCPARALLEL_WORKER_NUMBER=1), which is then hopefully acknowledged (which they for instance are if availableCores() is used). I'm also disabling some multi-threading by default when running in a forked child process because there, multi-threading has caused problems in the past. It's a long journey, but I hope to eventually reach 90-95% of the wild-wild-west nested parallelization that otherwise might take place.