Closed HenrikBengtsson closed 8 years ago
Hmm... there seems to be a bug in how base::setTimeLimit() works, preventing it from doing what we want, at least on Linux. See https://stat.ethz.ch/pipermail/r-devel/2016-October/073309.html
@MarkEdmondson1234, the develop version provides makeClusterPSOCK(..., connectTimeout = 2*60) with a 2-minute connection timeout. Unfortunately, due to the limitations in R (see the comment above), it's currently not that useful on Linux and macOS (except in RStudio, where I think it works).
When makeClusterPSOCK() is used to create a cluster of workers, each worker is created using makeNodePSOCK(). A call to makeNodePSOCK() does two things: it launches a background R worker using system(..., wait = FALSE), and it waits for that worker to connect back over a socket connection.
When the connection between the main process and the background R worker is established, makeNodePSOCK() returns, then the next R worker is created, and so on. Eventually all of the workers in the cluster are up and running with established connections. At this point, makeClusterPSOCK() returns too.
Issue
If there is an error launching a background R worker, this error is never caught. It cannot be, because we have to use system(..., wait = FALSE), which in turn causes the return value to always be zero (= no error). Instead, the only way makeNodePSOCK() can become aware that the R worker failed to start is that the socket connection eventually times out, because the worker never connected back to it.
Now, this timeout defaults to 30 days (sic!). The reason it is so large is to allow for long-running cluster jobs. Some jobs may take hours or days to finish, and since the connection times out whenever there is no communication within the timeout period, a short timeout would produce an error for long-running jobs. So, we want to keep this timeout long.
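To illustrate the point above about system(..., wait = FALSE) always returning zero, here is a minimal sketch. It assumes that the Rscript executable lives under R.home("bin"), and it uses a child process that deliberately exits with status 3 as a stand-in for a worker that fails to launch; the variable names are made up for the illustration.

```r
# Build a command that starts a child R process which exits with status 3,
# standing in for a background worker that fails to launch.
rscript <- file.path(R.home("bin"), "Rscript")
cmd <- paste(shQuote(rscript), "-e", shQuote("quit(save = 'no', status = 3)"))

# wait = TRUE blocks until the child exits and returns its exit status,
# so the failure is visible as a non-zero value.
status_wait <- system(cmd, wait = TRUE)

# wait = FALSE returns immediately with 0, regardless of what the child
# does afterwards, so the failure cannot be detected from the return value.
status_nowait <- system(cmd, wait = FALSE)

print(c(wait = status_wait, nowait = status_nowait))
```

Because the return value carries no information in the wait = FALSE case, the main process can only infer a failed launch from the worker never connecting back.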
However, when it comes to setting up the connection in the first place, it would be nice to be able to use a short timeout period, say, a few minutes. That leaves enough time to connect to the worker machine, have it launch, say, Docker and R within it, then install the future package if needed, and eventually set up the PSOCK worker itself.
One possible solution is to make use of the base::setTimeLimit() functionality when waiting for the R worker to connect back. A more convenient implementation is R.utils::withTimeout(). For instance, the wait could be wrapped in withTimeout() with a timeout of connectTimeout seconds, where connectTimeout = 300 defaults to 5 minutes.
This issue was triggered by @MarkEdmondson1234's question in https://github.com/cloudyr/googleComputeEngineR/issues/16.
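The setTimeLimit() idea can be sketched in base R roughly as follows. This is only an illustration under stated assumptions: with_connect_timeout() and slow_wait() are hypothetical names, not functions in the future package, and the polling loop stands in for the real wait on a socket. As the comment at the top of this thread notes, the time limit may not actually interrupt a blocking socket wait on Linux, so the sketch shows the mechanism rather than a working fix.

```r
# Hypothetical helper (not part of the future package): evaluate `expr`,
# but signal an error if it takes longer than `connectTimeout` seconds.
with_connect_timeout <- function(expr, connectTimeout = 300) {
  # transient = TRUE: the limit applies only to the rest of the
  # current top-level computation.
  setTimeLimit(elapsed = connectTimeout, transient = TRUE)
  # Always lift the limit again, whether `expr` finished or not.
  on.exit(setTimeLimit(elapsed = Inf, transient = FALSE))
  expr  # the lazily evaluated argument is forced here, under the limit
}

# Stand-in for "wait for the worker to connect back": polls for up to
# about 10 seconds. Time limits are checked in R code and in Sys.sleep().
slow_wait <- function() {
  for (i in 1:100) Sys.sleep(0.1)
  "connected"
}

# With a 0.5 second limit, the wait is aborted with an error instead of
# running to completion.
result <- tryCatch(
  with_connect_timeout(slow_wait(), connectTimeout = 0.5),
  error = function(e) e
)
print(inherits(result, "error"))
```

Keeping the short limit around the connection setup only, and resetting it in on.exit(), leaves the long socket timeout for running jobs untouched.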