Bioconductor / BiocParallel

Bioconductor facilities for parallel evaluation
https://bioconductor.org/packages/BiocParallel

WISH: Support also SnowParam(type = "PSOCK") #231

Open HenrikBengtsson opened 1 year ago

HenrikBengtsson commented 1 year ago

Background

SnowParam() supports type = "SOCK" (default), type = "MPI", and type = "FORK". The former two stem from the days of the snow package, and the latter was introduced with the parallel package. The type argument is passed to parallel::makeCluster() as-is:

> parallel::makeCluster
function (spec, type = getClusterOption("type"), ...) 
{
    switch(type, PSOCK = makePSOCKcluster(names = spec, ...), 
        FORK = makeForkCluster(nnodes = spec, ...), SOCK = snow::makeSOCKcluster(names = spec, 
            ...), MPI = snow::makeMPIcluster(count = spec, ...), 
        stop("unknown cluster type"))
}
<environment: namespace:parallel>
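For reference, here is a minimal sketch of requesting a PSOCK cluster directly from parallel, which is the branch that type = "PSOCK" dispatches to in the listing above. The worker count and the toy task are arbitrary choices for illustration; this does not go through BiocParallel.

```r
library(parallel)

# Request a PSOCK cluster explicitly; this dispatches to makePSOCKcluster().
cl <- makeCluster(2L, type = "PSOCK")

# Inspect the class of the returned cluster object.
print(class(cl))

# Run a trivial task on each worker, then shut the workers down.
res <- parLapply(cl, 1:2, function(i) i * 10)
stopCluster(cl)
```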

Wish

Please also add support for type = "PSOCK", which is the default for parallel::makeCluster() [since day one, back in R 2.14.0 (2011), I think]. It looks like it would be quite straightforward to do this.

Why add this? Because PSOCK clusters have undergone lots of improvements since snow was incorporated into parallel. For example, in R (>= 4.0.0), the nodes ("workers") of a PSOCK cluster are set up in parallel instead of sequentially. This makes the setup much faster, e.g.

[Figure: PSOCK cluster setup timing; see source below]

Source: https://www.jottr.org/2021/06/10/parallelly-1.26.0/
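To compare the two setup strategies locally, something along these lines should work. It is a rough sketch that assumes R (>= 4.0.0), where makePSOCKcluster() accepts a setup_strategy option; the helper time_setup() and the worker count are made up for illustration, and the timings will vary by machine.

```r
library(parallel)

# Hypothetical helper: time the creation of a PSOCK cluster with n workers
# under the given setup strategy ("sequential" or "parallel").
time_setup <- function(n, strategy) {
  elapsed <- system.time({
    cl <- makePSOCKcluster(n, setup_strategy = strategy)
  })[["elapsed"]]
  stopCluster(cl)
  elapsed
}

n_workers <- 4L  # arbitrary; the difference grows with more workers

timings <- c(
  sequential = time_setup(n_workers, "sequential"),
  parallel   = time_setup(n_workers, "parallel")
)
print(timings)
```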

In addition, this parallel setup strategy avoids the port clashes that we saw with parallel in R (< 4.0.0), and still see with snow (since it's deprecated and no longer being improved), e.g.

  Error in `socketConnection(port = port, server = TRUE, blocking = TRUE, 
      open = "a+b")`: cannot open the connection

FYI, I haven't seen these types of errors since R (< 4.0.0), except when revdep checking packages that rely on snow; most recently while revdep checking the Bioconductor package DMCFB, which uses SnowParam in its package tests.

mtmorgan commented 1 year ago

Buried in the help page ?SnowParam is this note:

    NOTE: The \code{PSOCK} cluster from the \code{parallel} package does not
    support cluster options \code{scriptdir} and \code{useRscript}. \code{PSOCK}
    is not supported because these options are needed to re-direct to an
    alternate worker script located in BiocParallel.

But naive testing suggests this no longer seems to be the case (either because of changes in parallel or BiocParallel) so I have started a 'PSOCK' branch.

Is there an easy way to generate the socket connection error?

HenrikBengtsson commented 1 year ago

> Buried in the help page ?SnowParam is this note:
>
>     NOTE: The \code{PSOCK} cluster from the \code{parallel} package does not
>     support cluster options \code{scriptdir} and \code{useRscript}. \code{PSOCK}
>     is not supported because these options are needed to re-direct to an
>     alternate worker script located in BiocParallel.
>
> But naive testing suggests this no longer seems to be the case (either because of changes in parallel or BiocParallel) ...

I missed that note. I don't think I've ever seen the arguments scriptdir or useRscript in the parallel package. They don't appear if one searches https://hughjonesd.shinyapps.io/rcheology/.

Looking at snow, it looks like scriptdir is used to point to the R script that runs the parallel workers. If so, parallel handles that without a script, by calling an internal function, e.g.

'/path/to/lib/R/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.workRSOCK()' MASTER=localhost PORT=11312 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential

> ... so I have started a 'PSOCK' branch.

Excellent.

> Is there an easy way to generate the socket connection error?

I don't think so. It's a race condition that appears when many R processes try to create a cluster using the same port. Given that the default is to pick a random port from 11000:11999, it only happens once in a while, but if you check enough things in parallel, it happens often enough to add friction. Before R 4.0.0, I saw it once in a while with the future package on the CRAN servers, because I do tons of testing there. It disappeared at the next round of checks.
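One way to soften such a clash on the caller's side is to retry on a different port. The following is only a sketch: make_cluster_retry() is a hypothetical helper, not part of parallel or BiocParallel, and it relies on makePSOCKcluster() accepting a port option. It narrows the window for a clash but does not remove the underlying race, since another process can still grab the chosen port first.

```r
library(parallel)

# Hypothetical helper: try a few randomly chosen ports before giving up.
make_cluster_retry <- function(n, attempts = 5L, ports = 11000:11999) {
  for (i in seq_len(attempts)) {
    port <- sample(ports, size = 1L)
    cl <- tryCatch(
      makePSOCKcluster(n, port = port),
      error = function(e) {
        # Report the failed attempt and signal "no cluster" to the loop.
        message("attempt ", i, " on port ", port, " failed: ",
                conditionMessage(e))
        NULL
      }
    )
    if (!is.null(cl)) return(cl)
  }
  stop("could not create a cluster after ", attempts, " attempts")
}

cl <- make_cluster_retry(2L)
n_ok <- length(cl)  # number of workers that were started
stopCluster(cl)
```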

BTW, I'm not sure, but I think the race condition could also result in one R CMD check launching parallel workers and another check ending up connected to those workers. If the latter were fast enough, it could complete successfully, but if the original check terminated first, it would shut down those workers, breaking the check for the other package. The SOCK/PSOCK protocol does not prevent a non-owning R process from connecting, including processes run by other users. This is actually a security issue on multi-user servers, but that's another story.

mtmorgan commented 1 year ago

Yes, parallel's implementation doesn't allow customization of the worker startup script, whereas snow's (& therefore SOCK, MPI, FORK) can be customized, and is customized by BiocParallel.

A closer look suggests that BiocParallel's log = TRUE option would be affected, as you can see in the 'Log messages' and 'stderr and stdout' sections:

> res <- bplapply(1:2, message, BPPARAM = SnowParam(type = "PSOCK", log = TRUE))
############### LOG OUTPUT ###############
Task: 1
Node: 6
Timestamp: 2022-11-21 17:34:29.946785
Success: TRUE

Task duration:
   user  system elapsed 
  0.186   0.007   0.198 

Memory used:
          used (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
Ncells 1209511 64.6    2057557 109.9         NA  2057557 109.9
Vcells 2873984 22.0    8388608  64.0      32768  8388267  64.0

Log messages:

stderr and stdout:
...

versus

> res <- bplapply(1:2, message, BPPARAM = SnowParam(type = "SOCK", log = TRUE))
############### LOG OUTPUT ###############
Task: 2
Node: 5
Timestamp: 2022-11-21 17:34:36.612367
Success: TRUE

Task duration:
   user  system elapsed 
  0.090   0.006   0.109 

Memory used:
          used (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
Ncells 1209512 64.6    2057557 109.9         NA  2057557 109.9
Vcells 2873982 22.0    8388608  64.0      32768  8388267  64.0

Log messages:
INFO [2022-11-21 17:34:36] loading futile.logger package

stderr and stdout:
2

############### LOG OUTPUT ###############
...

HenrikBengtsson commented 1 year ago

> Yes, parallel's implementation doesn't allow customization of the worker startup script, whereas snow's (& therefore SOCK, MPI, FORK) can be customized, and is customized by BiocParallel.

You can probably use rscript_args to customize the startup process of each worker, e.g. rscript_args = c("-e", shQuote('setwd("/path/to")')).
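A small sketch of that idea with plain parallel, assuming the rscript_args option of makePSOCKcluster() and that the extra -e expression is evaluated before the worker loop starts. The option name demo.flag is invented purely for this illustration.

```r
library(parallel)

# Pass an extra '-e <expr>' to each worker's Rscript call so that the
# expression runs at worker startup. 'demo.flag' is a made-up option name.
startup <- c("-e", shQuote("options(demo.flag = TRUE)"))

cl <- makePSOCKcluster(2L, rscript_args = startup)

# Ask each worker whether the startup expression took effect.
flags <- clusterEvalQ(cl, getOption("demo.flag"))
stopCluster(cl)

print(flags)
```

Whether this is enough to replace a custom worker script, such as the one BiocParallel uses for logging, is a separate question; the sketch only shows that per-worker startup code can be injected this way.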

FWIW, I've made some of these things easier and more robust in parallelly::makeClusterPSOCK().