makeClusterPSOCK(): Add a wait-and-retry mechanism for setting up nodes when they fail

HenrikBengtsson commented 7 years ago

Add a wait-and-retry mechanism for setting up nodes when they fail. A node may fail to be set up due to the port being busy, e.g.

> cl <- makeClusterPSOCK(hosts)
  Warning in makeClusterPSOCK(workers, ...) : NAs introduced by coercion
  Warning in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, :
    port 11049 cannot be opened
  Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, : 
    cannot open the connection
  Calls: plan ... makeClusterPSOCK -> makeNode -> <Anonymous> -> socketConnection
  Execution halted

Comment: This type of error happens occasionally on the CRAN test servers because other parallel check processes may occupy one of the ports needed by the package being tested. I see this with the future package tests from time to time; the error often goes away in the next CRAN-check cycle.

A wait-and-retry feature could be:

cl <- makeClusterPSOCK(hosts, retry = 5L, sleep = 30.0)

which would wait 30 seconds before retrying to set up a node. If it fails five times, no more retries will be done and the error will propagate up.

The arguments could default to:

cl <- makeClusterPSOCK(hosts, 
      retry = getOption("future.makeClusterPSOCK.retry", 0L), 
      sleep = getOption("future.makeClusterPSOCK.sleep", 60))
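For illustration, the proposed semantics could be sketched as a generic wrapper. This is only a sketch under assumptions, not the actual implementation: retry_setup() and its setup argument are hypothetical names, and I'm assuming that retry counts the extra attempts made after the first one, so retry = 0L means "try once, never retry".

```r
# Hypothetical helper: call setup(...) and, on error, wait and try again.
# 'retry' is the number of additional attempts after the first one (assumption).
retry_setup <- function(setup, ..., retry = 0L, sleep = 60) {
  stopifnot(is.function(setup), retry >= 0L, sleep >= 0)
  for (attempt in seq_len(retry + 1L)) {
    # Capture the error as an object instead of letting it propagate
    res <- tryCatch(setup(...), error = identity)
    if (!inherits(res, "error")) return(res)
    # Only sleep if there is another attempt left
    if (attempt <= retry) Sys.sleep(sleep)
  }
  # All attempts failed: re-signal the last error
  stop(res)
}

# Intended usage (not run here):
# cl <- retry_setup(makeClusterPSOCK, hosts, retry = 5L, sleep = 30.0)
```

With retry = 0L the wrapper behaves like a plain call, which matches the proposed default of not retrying unless the option is set.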

HenrikBengtsson commented 6 years ago

I did some more investigation and it could be a bug in socketConnection(..., timeout). I've posted a question in R-devel thread 'parallel:::newPSOCKnode(): background worker fails immediately if socket on master is not set up in time (BUG?)' on 2018-03-08 (https://stat.ethz.ch/pipermail/r-devel/2018-March/075676.html) asking about this.

A workaround to lower the risk for the observed behaviour could be to add a little bit of startup delay before launching the background workers. This can be done by injecting -e "Sys.sleep(2)" in the system call to launch the background worker, e.g.

  system('R -e "Sys.sleep(2)" --slave --no-restore -e "parallel:::.slaveRSOCK()" --args MASTER=localhost PORT=11000 TIMEOUT=2592000 XDR=TRUE', wait = FALSE)

HenrikBengtsson commented 6 years ago

I've submitted patch PR17391 for R-devel (diff)

HenrikBengtsson commented 6 years ago

This patch has now been incorporated into R-devel r74417, so it'll be part of R 3.5.0 (targeted for April 2018).

I'll assume that this will fix most/all(?) of those sporadic "cannot open the connection" errors randomly seen on CRAN check and Travis CI. Closing, but will reopen if my assumption is wrong.

HenrikBengtsson commented 6 years ago

Hmm... the patch was added in R-devel r74417 and below there is a related error using r74420 on https://www.r-project.org/nosvn/R.check/r-devel-linux-x86_64-debian-gcc/future-00check.html:

* using R Under development (unstable) (2018-03-17 r74420)
* using platform: x86_64-pc-linux-gnu (64-bit)
* using session charset: UTF-8
* checking for file ‘future/DESCRIPTION’ ... OK
* this is package ‘future’ version ‘1.7.0’
* ...
* checking tests ... [15s/58s] ERROR
[...]
Running the tests in ‘tests/future,labels.R’ failed.
[...]
  - plan('multisession') ...
  plan(): plan_init() of 'multisession', 'cluster', 'multiprocess', 'future', 'function' ...
  multisession:
  - args: function (expr, envir = parent.frame(), substitute = TRUE, lazy = FALSE, seed = NULL, globals = TRUE, persistent = FALSE, workers = availableCores(), gc = FALSE, earlySignal = FALSE, label = NULL, ...)
  - tweaked: FALSE
  - call: plan(strategy)
  Workers: [n = 2] 'localhost', 'localhost'
  Warning in makeClusterPSOCK(workers, ...) : NAs introduced by coercion
  Base port: 11649
  Creating node 1 of 2 ...
  - setting up node
  Starting worker #1 on 'localhost': '/home/hornik/tmp/R.check/r-devel-gcc/Work/build/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=localhost PORT=11649 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE
  Waiting for worker #1 on 'localhost' to connect back
  Warning in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, :
    port 11649 cannot be opened
  Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, : 
    cannot open the connection
  Calls: plan ... makeClusterPSOCK -> makeNode -> <Anonymous> -> socketConnection
  Execution halted

BTW, as far as I can tell, the warning "NAs introduced by coercion" reported for makeClusterPSOCK(workers, ...) comes from the following two lines of makeClusterPSOCK():

      port <- Sys.getenv("R_PARALLEL_PORT", NA_character_)
      port <- as.integer(port)

For example:

> options(warn = 1L)
> Sys.setenv(R_PARALLEL_PORT = "dummy")
> cl <- future::makeClusterPSOCK("localhost", verbose = TRUE)
Workers: [n = 1] 'localhost'
Warning in future::makeClusterPSOCK("localhost", verbose = TRUE) :
  NAs introduced by coercion
Base port: 11410
[...]

There is no warning if R_PARALLEL_PORT is "" or holds an integer or other numeric value.
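One way to avoid the spurious warning would be to parse the environment variable defensively. A minimal sketch, assuming a non-numeric value should simply be treated as "not set"; parse_port() is a hypothetical helper, not something that exists in the package:

```r
# Hypothetical helper: read R_PARALLEL_PORT without emitting a coercion warning.
# Returns NA_integer_ if the value is missing, non-numeric, or outside 0..65535.
parse_port <- function(value = Sys.getenv("R_PARALLEL_PORT", NA_character_)) {
  # as.integer("dummy") warns; here the warning is deliberately suppressed
  port <- suppressWarnings(as.integer(value))
  if (is.na(port) || port < 0L || port > 65535L) NA_integer_ else port
}
```

The caller can then fall back to picking a random base port whenever parse_port() returns NA.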

renkun-ken commented 5 years ago

I've encountered this error on my production systems recently. I'm not sure whether the retry includes re-selecting a random port? If a port within the specified range (11000:11999 by default) is occupied by another process, re-selecting a port would be more effective than retrying with the same port.
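To make the suggestion concrete, here is a rough sketch of retrying with a freshly drawn port on each attempt. It is only an illustration under assumptions: with_port_retry() is a hypothetical name, and setup stands for any function that takes a port argument and throws an error when the port cannot be opened.

```r
# Hypothetical helper: try setup(port = ...) on distinct, randomly drawn ports.
# 'retry' is the number of additional attempts after the first one (assumption).
with_port_retry <- function(setup, ports = 11000:11999, retry = 5L, sleep = 0) {
  ports <- as.integer(ports)
  stopifnot(is.function(setup), length(ports) > 0L, retry >= 0L, sleep >= 0)
  # Draw distinct candidates; indexing via sample.int() avoids the
  # sample(n, ...) pitfall when 'ports' has length one
  n <- min(length(ports), retry + 1L)
  candidates <- ports[sample.int(length(ports), size = n)]
  for (i in seq_along(candidates)) {
    res <- tryCatch(setup(port = candidates[i]), error = identity)
    if (!inherits(res, "error")) return(res)
    # Only sleep if another candidate port is left to try
    if (i < length(candidates)) Sys.sleep(sleep)
  }
  # Every candidate failed: re-signal the last error
  stop(res)
}
```

Because each attempt uses a different port, a single port held by another process costs one failed attempt instead of exhausting all retries.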

HenrikBengtsson / parallelly

makeClusterPSOCK(): Add a wait-and-retry mechanism for setting up nodes when they fail #15