Closed · HenrikBengtsson closed this issue 3 years ago
I did some more investigation and it could be a bug in `socketConnection(..., timeout)`. I've posted a question in the R-devel thread 'parallel:::newPSOCKnode(): background worker fails immediately if socket on master is not set up in time (BUG?)' on 2018-03-08 (https://stat.ethz.ch/pipermail/r-devel/2018-March/075676.html) asking about this.
A workaround to lower the risk of the observed behaviour is to add a short startup delay before launching the background workers. This can be done by injecting `-e "Sys.sleep(2)"` into the system call that launches the background worker, e.g.

```r
system('R -e "Sys.sleep(2)" --slave --no-restore -e "parallel:::.slaveRSOCK()" --args MASTER=localhost PORT=11000 TIMEOUT=2592000 XDR=TRUE', wait = FALSE)
```
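An equivalent delay can also be injected when setting up workers via `makeClusterPSOCK()` itself. This is only a sketch; it assumes the `rscript_args` argument of `future::makeClusterPSOCK()` passes extra command-line options through to each worker's `Rscript` call:

```r
library(future)  # makeClusterPSOCK() lived in 'future' at the time of this issue

## Sketch, assuming 'rscript_args' forwards extra options to the worker's
## Rscript call: inject a 2-second startup delay before the worker tries
## to connect back to the master.
cl <- makeClusterPSOCK(
  "localhost",
  rscript_args = c("-e", shQuote("Sys.sleep(2)"))
)
parallel::stopCluster(cl)
```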
This patch has now been incorporated into R-devel r74417, so it'll be part of R 3.5.0 (targeted for April 2018).
I'll assume that this fixes most, if not all, of those sporadic "cannot open the connection" errors randomly seen on CRAN checks and Travis CI. Closing, but I will reopen if my assumption is wrong.
Hmm... the patch was added in R-devel r74417, yet below is a related error using r74420, from https://www.r-project.org/nosvn/R.check/r-devel-linux-x86_64-debian-gcc/future-00check.html:
```
* using R Under development (unstable) (2018-03-17 r74420)
* using platform: x86_64-pc-linux-gnu (64-bit)
* using session charset: UTF-8
* checking for file ‘future/DESCRIPTION’ ... OK
* this is package ‘future’ version ‘1.7.0’
* ...
* checking tests ... [15s/58s] ERROR
[...]
Running the tests in ‘tests/future,labels.R’ failed.
[...]
- plan('multisession') ...
plan(): plan_init() of 'multisession', 'cluster', 'multiprocess', 'future', 'function' ...
multisession:
- args: function (expr, envir = parent.frame(), substitute = TRUE, lazy = FALSE, seed = NULL, globals = TRUE, persistent = FALSE, workers = availableCores(), gc = FALSE, earlySignal = FALSE, label = NULL, ...)
- tweaked: FALSE
- call: plan(strategy)
Workers: [n = 2] 'localhost', 'localhost'
Warning in makeClusterPSOCK(workers, ...) : NAs introduced by coercion
Base port: 11649
Creating node 1 of 2 ...
- setting up node
Starting worker #1 on 'localhost': '/home/hornik/tmp/R.check/r-devel-gcc/Work/build/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=localhost PORT=11649 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE
Waiting for worker #1 on 'localhost' to connect back
Warning in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, :
  port 11649 cannot be opened
Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, :
  cannot open the connection
Calls: plan ... makeClusterPSOCK -> makeNode -> <Anonymous> -> socketConnection
Execution halted
```
BTW, as far as I can tell, the `Warning in makeClusterPSOCK(workers, ...) : NAs introduced by coercion` occurs in the following two lines of `makeClusterPSOCK()`:

```r
port <- Sys.getenv("R_PARALLEL_PORT", NA_character_)
port <- as.integer(port)
```
For example:

```r
> options(warn = 1L)
> Sys.setenv(R_PARALLEL_PORT = "dummy")
> cl <- future::makeClusterPSOCK("localhost", verbose = TRUE)
Workers: [n = 1] 'localhost'
Warning in future::makeClusterPSOCK("localhost", verbose = TRUE) :
  NAs introduced by coercion
Base port: 11410
[...]
```
There is no warning if `R_PARALLEL_PORT=""` or if it is set to an integer or a numeric value.
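A minimal sketch of how that coercion warning could be avoided — the validation below is illustrative only, not the actual fix in `makeClusterPSOCK()`:

```r
## Sketch only: validate R_PARALLEL_PORT before coercing, so that a
## non-numeric value (e.g. "dummy") falls back to NA_integer_ without
## triggering "NAs introduced by coercion".
port <- Sys.getenv("R_PARALLEL_PORT", NA_character_)
port <- if (!is.na(port) && grepl("^[[:digit:]]+$", port)) {
  as.integer(port)  # a well-formed integer string
} else {
  NA_integer_       # unset, empty, or non-numeric: silently fall back
}
```

With `R_PARALLEL_PORT=dummy`, `port` becomes `NA_integer_` silently, and the caller can then fall back to picking a random port from the default range.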
I've encountered this error on my production systems recently. I'm not sure whether the retry includes re-selecting a random port? If a port within the specified range (`11000:11999` by default) is occupied by another process, re-selecting a port is more effective than retrying with the same port.
Add a wait-and-retry mechanism for setting up nodes when they fail. A node may fail to be set up because its port is busy, e.g. producing the "cannot open the connection" error shown above.

Comment: This type of error happens occasionally on the CRAN test servers because other parallel check processes may occupy one of the ports needed by the package being tested. I see this with the future package tests from time to time; the error often goes away in the next CRAN-check cycle.
A wait-and-retry feature would wait, say, 30 seconds before retrying to set up a node. If it fails five times, no more retries are attempted and the error propagates up.
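Such a wait-and-retry loop might look as follows. This is a sketch only: `setupNodeWithRetry`, `tries`, and `delay` are hypothetical names, not an existing API, and `makeNode` stands in for whatever function sets up a single worker:

```r
## Sketch only: retry node setup up to 'tries' times, sleeping 'delay'
## seconds between attempts; the names here are hypothetical.
setupNodeWithRetry <- function(makeNode, tries = 5L, delay = 30) {
  for (kk in seq_len(tries)) {
    node <- tryCatch(makeNode(), error = identity)
    if (!inherits(node, "error")) return(node)  # success
    if (kk == tries) stop(node)  # out of retries: propagate last error
    Sys.sleep(delay)             # wait before the next attempt
  }
}

## Usage sketch: a setup function that fails twice, then succeeds
count <- 0L
flaky <- function() {
  count <<- count + 1L
  if (count < 3L) stop("cannot open the connection")
  "node"
}
node <- setupNodeWithRetry(flaky, tries = 5L, delay = 0)
```

A port re-selection step (e.g. drawing a fresh port from `11000:11999` inside the loop) could be combined with this, per the comment above.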
The arguments could default to the values above, i.e. five tries with a 30-second delay.