Bioconductor / BiocParallel

Bioconductor facilities for parallel evaluation
https://bioconductor.org/packages/BiocParallel

Let environment variables override `workers = ` argument #229

Closed mtmorgan closed 1 year ago

mtmorgan commented 1 year ago

Setting an environment variable (e.g., by a server administrator or build system manager ;)) can be undermined when the user explicitly requests more workers:

> Sys.setenv(BIOCPARALLEL_WORKER_NUMBER=2)
> ## current...
> bpnworkers(MulticoreParam())
[1] 2
> ## undermined!
> bpnworkers(MulticoreParam(workers = 4))
[1] 4

Instead, enforce the environment variable as a maximum.
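A minimal sketch of that behavior might look like the following; `.capWorkers` is a hypothetical helper, not part of BiocParallel:

```r
.capWorkers <- function(workers) {
    ## hypothetical helper, not the BiocParallel implementation:
    ## treat BIOCPARALLEL_WORKER_NUMBER as a cap rather than a default
    limit <- suppressWarnings(
        as.integer(Sys.getenv("BIOCPARALLEL_WORKER_NUMBER", NA))
    )
    if (!is.na(limit) && workers > limit) {
        warning("requested ", workers, " workers; limited to ", limit)
        workers <- limit
    }
    workers
}

Sys.setenv(BIOCPARALLEL_WORKER_NUMBER = 2)
.capWorkers(4L)  # warns, returns 2
```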

Here's the current function

> BiocParallel:::.detectCores
function ()
{
    result <- .getEnvironmentVariable("R_PARALLELLY_AVAILABLECORES_FALLBACK")
    result <- .getEnvironmentVariable("BIOCPARALLEL_WORKER_NUMBER",
        result)
    if (is.na(result)) {
        result <- parallel::detectCores()
        if (is.na(result))
            result <- 1L
        result <- max(1L, result - 2L)
    }
    result <- getOption("mc.cores", result)
    tryCatch({
        result <- as.integer(result)
        if (length(result) != 1L || is.na(result) || result <
            1L)
            stop("number of cores must be a non-negative integer")
    }, error = function(e) {
        msg <- paste0(conditionMessage(e), ". ", "Did you mis-specify R_PARALLELLY_AVAILABLECORES_FALLBACK, ",
            "BIOCPARALLEL_WORKER_NUMBER, or options('mc.cores')?")
        .stop(msg)
    })
    if (nzchar(Sys.getenv("_R_CHECK_LIMIT_CORES_")))
        result <- min(result, 2L)
    if (nzchar(Sys.getenv("BBS_HOME")))
        result <- min(result, 4L)
    result
}

For determining maximum cores: do we just take this as the max regardless of what the user specifies, or is there some other ordering?

hpages commented 1 year ago

It would be nice to have two environment variables for this, e.g., BIOCPARALLEL_DEFAULT_WORKER_NUMBER and BIOCPARALLEL_MAX_WORKER_NUMBER.

BIOCPARALLEL_DEFAULT_WORKER_NUMBER would just be a rename of BIOCPARALLEL_WORKER_NUMBER, and it would keep doing what the latter does, that is, set the default number of workers.

BIOCPARALLEL_MAX_WORKER_NUMBER would guarantee that no matter what the user does, e.g.

PARAM <- MulticoreParam(workers=55)

or

PARAM <- MulticoreParam(); bpworkers(PARAM) <- 55

the back-end will never use more workers than the number specified via BIOCPARALLEL_MAX_WORKER_NUMBER. Things like MulticoreParam(workers=55) or bpworkers(PARAM) <- 55 should probably issue a warning that the number of workers that gets effectively set on the MulticoreParam object (or any other BiocParallelParam derivative) is different from the requested one.
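A sketch of the two-variable scheme proposed here; the variable names are this comment's suggestions, not an existing BiocParallel API:

```r
## BIOCPARALLEL_DEFAULT_WORKER_NUMBER / BIOCPARALLEL_MAX_WORKER_NUMBER
## are proposals from this comment, not an existing BiocParallel API
.envInt <- function(name) {
    suppressWarnings(as.integer(Sys.getenv(name, NA)))
}

.resolveWorkers <- function(workers = NULL) {
    if (is.null(workers))                       # no explicit request:
        workers <- .envInt("BIOCPARALLEL_DEFAULT_WORKER_NUMBER")
    if (is.na(workers))
        workers <- 1L                           # conservative fallback
    limit <- .envInt("BIOCPARALLEL_MAX_WORKER_NUMBER")
    if (!is.na(limit) && workers > limit) {
        warning("using ", limit, " workers instead of the requested ", workers)
        workers <- limit                        # cap always wins
    }
    workers
}

Sys.setenv(BIOCPARALLEL_MAX_WORKER_NUMBER = 8)
.resolveWorkers(55L)  # warns, returns 8
```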

Jiefei-Wang commented 1 year ago

I agree with Herve's comment. Recall that there are some cases where the number of workers needs to be larger than the number of cores to make parallel computing efficient, such as network resources, job dependencies, etc. It is not a good idea to give a hard limit on the max worker number without allowing the user to change it.

BIOCPARALLEL_MAX_WORKER_NUMBER could be a solution, but what is the use case for this feature?

mtmorgan commented 1 year ago

_R_CHECK_LIMIT_CORES_ is actually a maximum number of cores in the {parallel} package.

> Sys.setenv(`_R_CHECK_LIMIT_CORES_`=2)
> mclapply(1:8, function(...) Sys.getpid(), mc.cores=3)
Error in .check_ncores(cores) : 3 simultaneous processes spawned

so maybe it should just be re-implemented as such in BiocParallel; currently it's treated as a default:

> bpnworkers(BiocParallel::MulticoreParam())
[1] 2
> bpnworkers(BiocParallel::MulticoreParam(4))
[1] 4

@HenrikBengtsson {parallelly} also seems to treat _R_CHECK_LIMIT_CORES_ as a default rather than maximum?

> parallelly::makeClusterPSOCK(4)
Socket cluster with 4 nodes where 4 nodes are on host ‘localhost’ (R version 4.2.1 Patched (2022-10-03 r83008), platform aarch64-apple-darwin21.6.0)

@Jiefei-Wang the use case is when a system administrator or build system manager wants to set a hard limit on number of processes in use by a single user on a multi-user system.

EDIT: actually, _R_CHECK_LIMIT_CORES_ is meant to be TRUE, FALSE, or warn (R-ints documentation).

mtmorgan commented 1 year ago

Hmm, it's worth noting that it's not clear how to impose a hard limit; for instance, an environment variable or R option can be overwritten by the user doing 'crazy' things, e.g., Sys.setenv("_R_CHECK_LIMIT_CORES_" = FALSE); ...

Jiefei-Wang commented 1 year ago

I think the ultimate problem is that if the user does not want to obey our hard limit, they can break it anyway, either by changing the environment variable or simply by starting two separate R sessions. Therefore, I would call the environment variable the suggested number of processes, with the hope that the user will respect it.

With that in mind, I think we can define BIOCPARALLEL_MAX_WORKER_NUMBER. If the workers argument is provided and is larger than BIOCPARALLEL_MAX_WORKER_NUMBER, we can give the user a message, something like:

It looks like the number of workers you requested is more than the system administrator expects (show the limit here).
Please consider reducing the number of workers.

We need a new environment variable anyway, as the user might already use BIOCPARALLEL_WORKER_NUMBER to control the number of workers.

mtmorgan commented 1 year ago

I closed this with 3f65b9a7618861ea25b0515ee09ebc1daa418a96; it seems like there are situations where the requirement is to limit the number of workers as much as possible, so rather than providing advice, the limit is enforced with a warning. The environment variables enforcing this are BIOCPARALLEL_WORKER_MAX, _R_CHECK_LIMIT_CORES_ (2), BBS_HOME (4), and the intrinsic limit on the number of socket connections (128 minus the number of already open connections).
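A sketch of how the caps listed above might combine (mirroring, not quoting, the commit): the effective maximum is the smallest applicable limit.

```r
## illustrative sketch, not the literal code from 3f65b9a
.workerMax <- function() {
    result <- suppressWarnings(
        as.integer(Sys.getenv("BIOCPARALLEL_WORKER_MAX", NA))
    )
    if (is.na(result))
        result <- .Machine$integer.max        # no explicit cap set
    if (nzchar(Sys.getenv("_R_CHECK_LIMIT_CORES_")))
        result <- min(result, 2L)             # running under R CMD check
    if (nzchar(Sys.getenv("BBS_HOME")))
        result <- min(result, 4L)             # Bioconductor build system
    ## socket back-ends are additionally bounded by the connection limit:
    ## 128 minus the connections already open in this session
    min(result, 128L - nrow(showConnections()))
}
```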

HenrikBengtsson commented 1 year ago

Hi, I'm a bit late to the game:

@HenrikBengtsson {parallelly} also seems to treat _R_CHECK_LIMIT_CORES_ as a default rather than maximum?

parallelly::makeClusterPSOCK(4)
Socket cluster with 4 nodes where 4 nodes are on host ‘localhost’
(R version 4.2.1 Patched (2022-10-03 r83008), platform aarch64-apple-darwin21.6.0)

Correct-ish. To clarify, parallelly::availableCores() detects various settings, including _R_CHECK_LIMIT_CORES_, e.g.

> parallelly::availableCores(which = "all")
               system        cgroups.cpuset 
                    8                     4 
                nproc _R_CHECK_LIMIT_CORES_ 
                    4                     2

The default is which = "min", i.e. with the above settings it returns:

> parallelly::availableCores()
_R_CHECK_LIMIT_CORES_ 
                    2

However, the availableCores() function does not (and should not) enforce any limits. It's just used to query the system.

It would be up to whatever function creates a set of local workers to enforce a hard upper limit. It's correct that parallelly::makeClusterPSOCK() does not do this. There are two reasons for this: (i) I've punted on going down that path, and (ii) the purpose of parallelly is to be as similar to parallel as possible, while adding features and improvements. It's also true that future does not try to enforce an upper limit on local workers.

One way to think about this problem is like we think of environment variables and command-line options, i.e. a command-line option trumps an environment variable. In R, I think about it as an argument to a function trumps the default set by an R option, which in turn defaults to an environment variable (if set). To me, that's the baseline.

To enforce an upper limit, I think there are two things to be configured:

  1. the upper limit, i.e. an R option (e.g. parallel.max.workers), which defaults to an environment variable (e.g. R_PARALLEL_MAX_WORKERS). If neither is set, the default upper limit could be +Inf

  2. action to take when breaking the upper limit, e.g. ignore, warn, error, enforce-silently, and enforce-with-warning. This can be controlled by an R option (e.g. parallel.max.workers.action) and environment variable (e.g. R_PARALLEL_MAX_WORKERS_ACTION).
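The two knobs above could be sketched as follows; the option and variable names (parallel.max.workers, R_PARALLEL_MAX_WORKERS, and their .action counterparts) are the suggestions from this comment, not an existing R or parallelly API:

```r
## hypothetical names from this comment, not an existing API
maxWorkers <- function() {
    ## R option trumps environment variable; neither set means +Inf
    fallback <- suppressWarnings(
        as.numeric(Sys.getenv("R_PARALLEL_MAX_WORKERS", NA))
    )
    if (is.na(fallback))
        fallback <- Inf
    getOption("parallel.max.workers", fallback)
}

checkWorkers <- function(workers) {
    limit <- maxWorkers()
    if (workers <= limit)
        return(workers)
    action <- getOption(
        "parallel.max.workers.action",
        Sys.getenv("R_PARALLEL_MAX_WORKERS_ACTION", "warn")
    )
    switch(action,
        ignore = workers,
        warn = {
            warning(workers, " workers exceeds the limit of ", limit)
            workers
        },
        error = stop(workers, " workers exceeds the limit of ", limit),
        `enforce-silently` = limit,
        `enforce-with-warning` = {
            warning("reducing workers from ", workers, " to ", limit)
            limit
        }
    )
}
```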

Now, why do we want to enforce an upper limit? I think by far the most common use case is when a user uses too many CPU cores without realizing it. Sometimes they do it by mistake, sometimes because of incorrect assumptions by a script/package developer, where defaulting to detectCores() is by far the most common.

Also, and maybe that's where you're going, packages could be checked with R_PARALLEL_MAX_WORKERS=2 and R_PARALLEL_MAX_WORKERS_ACTION=error. That would catch R packages with unsafe defaults, which I think would be great.

I don't think it's worth worrying about "malicious" overuse - that can always be worked around. I also don't think it's worth worrying about:

cl1 <- parallelly::makeClusterPSOCK(2)
cl2 <- parallelly::makeClusterPSOCK(3)
cl3 <- parallelly::makeClusterPSOCK(c("another-machine-1", "another-machine-2"))
cl <- c(cl1, cl2, cl3)

That could be too complicated to enforce.

BTW, recall that R_PARALLELLY_AVAILABLECORES_FALLBACK, defined by parallelly and recognized by BiocParallel:::.detectCores(), is meant to give the sysadmin a way to set the default for parallelly::availableCores() when nothing else has been assigned to the user. This way, a sysadmin can let users run, say, two workers by default, although detectCores() reports 192. So, in a sense, that addresses the use case where a user uses more cores than they are aware of.