HenrikBengtsson / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

makeNodePSOCK(): add `connectTimeout` argument #105

Closed HenrikBengtsson closed 8 years ago

HenrikBengtsson commented 8 years ago

When makeClusterPSOCK() is used to create a cluster of workers, each worker is created using makeNodePSOCK(). A call to makeNodePSOCK() does two things:

  1. Launches an R worker in the background using system(..., wait = FALSE).
  2. Creates a socket connection and waits for the R worker to connect to it.

Once the connection between the main process and the background R worker is established, makeNodePSOCK() returns and the next R worker is created, and so on. Eventually all of the workers in the cluster are up and running with established connections, at which point makeClusterPSOCK() returns too.
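The two steps can be sketched with base R only. This is a minimal illustration, not the actual makeNodePSOCK() code: the port number, the temporary worker script and the "hello from worker" message are all made up for the example, and it assumes a Unix-like system where Rscript lives in R.home("bin"). A real worker would also retry the connection instead of trying once.

```r
# Hypothetical port; assumed to be free on this machine
port <- 11234L

# A stand-in "worker": connects back to the main process and sends one message
worker_script <- tempfile(fileext = ".R")
writeLines(c(
  sprintf('con <- socketConnection("localhost", port = %d, server = FALSE,', port),
  '                        blocking = TRUE, open = "a+b", timeout = 60)',
  'serialize("hello from worker", con)',
  'close(con)'
), worker_script)

# Step 1: launch the worker in the background; this call returns immediately
rscript <- file.path(R.home("bin"), "Rscript")
system(paste(shQuote(rscript), shQuote(worker_script)), wait = FALSE)

# Step 2: open a server socket and block until the worker connects to it
# (or until 'timeout' seconds have passed without a connection)
con <- socketConnection("localhost", port = port, server = TRUE,
                        blocking = TRUE, open = "a+b", timeout = 60)
msg <- unserialize(con)
close(con)
unlink(worker_script)
```

Note that the main process only learns anything about the worker in step 2; step 1 gives no feedback on whether the launch succeeded.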

Issue

If launching a background R worker fails, the error is never caught. It cannot be, because we have to use system(..., wait = FALSE), which always returns zero (= no error). Instead, the only way makeNodePSOCK() can become aware that the R worker failed to start is that the socket connection eventually times out, because the worker never connected back to it.
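A small sketch of why the launch error goes unnoticed, assuming a Unix-like system where system() runs the command through a shell. The command "exit 3" is just a stand-in for a worker launch that fails.

```r
# Foreground call: R waits for the command and gets its non-zero exit status
status_fg <- system("exit 3", wait = TRUE)

# Background call: R does not wait, so the value is 0 (as documented in
# ?system for wait = FALSE), even though the command itself fails
status_bg <- system("exit 3", wait = FALSE)

print(c(foreground = status_fg, background = status_bg))
```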

Now, this timeout defaults to 30 days (sic!). It is this large in order to allow for long-running cluster jobs. Some jobs take hours or even days to finish, and since the connection times out whenever there is no communication within the timeout period, a shorter timeout would cause errors for long-running jobs. So, we want to keep this timeout long.

However, when it comes to setting up the connection in the first place, it would be nice to be able to use a short timeout period, say, a few minutes. That should be enough time to connect to the worker machine, have it launch, say, Docker and R within it, install the future package if needed, and eventually set up the PSOCK worker itself.

One possible solution is to make use of the base::setTimeLimit() functionality when waiting for the R worker to connect back. A more convenient implementation is R.utils::withTimeout(). For instance,

# Give up if the worker has not connected back within connectTimeout seconds
R.utils::withTimeout({
  con <- socketConnection("localhost", port = port, server = TRUE,
              blocking = TRUE, open = "a+b", timeout = timeout)
}, timeout = connectTimeout)

where connectTimeout defaults to 300 seconds (5 minutes).
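For reference, here is a rough sketch of the underlying base::setTimeLimit() mechanism without R.utils. The helper with_time_limit() is hypothetical and written only for this illustration. As ?setTimeLimit explains, limits are only checked at points where a user interrupt could occur, so the example uses an R-level loop of short sleeps. A single long blocking call in native code may not be interrupted in the same way.

```r
# Hypothetical helper: evaluate 'expr' under an elapsed-time limit
with_time_limit <- function(expr, seconds) {
  setTimeLimit(elapsed = seconds, transient = TRUE)
  # Always remove the limit again, whether or not it was reached
  on.exit(setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE))
  expr  # lazily evaluated here, i.e. while the limit is active
}

res <- tryCatch(
  with_time_limit({
    # About 10 seconds of work, done in small interruptible steps
    for (i in 1:100) Sys.sleep(0.1)
    "finished"
  }, seconds = 0.5),
  error = function(e) paste("stopped:", conditionMessage(e))
)
print(res)
```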

This issue was triggered by @MarkEdmondson1234's question in https://github.com/cloudyr/googleComputeEngineR/issues/16.

HenrikBengtsson commented 8 years ago

Hmm... there seems to be a bug in how base::setTimeLimit() works, which prevents it from doing what we want, at least on Linux. See https://stat.ethz.ch/pipermail/r-devel/2016-October/073309.html

HenrikBengtsson commented 8 years ago

@MarkEdmondson1234, the develop version provides makeClusterPSOCK(..., connectTimeout = 2*60), i.e. a 2-minute connection timeout. Unfortunately, due to the limitations in R (see the comment above), it's currently not that useful on Linux and macOS (except in RStudio, where I think it works).