Long Running Remote Future Terminated

bwbioinfo commented 6 years ago

I have a long-running future that runs remotely. It gets killed after ~2 hours. I've run future in debug mode and looked at some of the logs on my system. At some point the system management software seems to cause a blip in SSH connectivity, and the future is killed if that blip coincides with a poll (the run has been successful in the past). Is it possible to pass an option to retry the poll or extend the wait time?

bwbioinfo commented 6 years ago

Update: I re-ran it with a longer time between polls and still get the same error:

Error in unserialize(node$con) :
  Failed to retrieve the value of ClusterFuture from cluster node #1 (on ‘myserver’). The reason reported was ‘error reading from connection’
Calls: source ... value -> value.Future -> result -> result.ClusterFuture
Execution halted

HenrikBengtsson commented 6 years ago

The "cluster" backend is a wrapper around the clusterApply framework of the parallel package (?parallel::clusterApply) with a PSOCK cluster (?parallel::makeCluster). To connect to other machines, the default protocol is SSH. I haven't tried it, but the ssh client accepts lots of -o options. For instance, maybe you can use -o ConnectTimeout=<seconds> to work around timeouts in the connection, if that's the underlying cause.
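As an untested sketch of how such options could be passed: makeClusterPSOCK() in the future package has an rshopts argument whose elements are handed to the SSH client. Note that ConnectTimeout only governs how long ssh waits while establishing a connection, so the example below instead assumes that keepalive options (ServerAliveInterval, ServerAliveCountMax, both standard OpenSSH client options) are closer to what a mid-run connectivity blip needs. That is an assumption, not something confirmed in this thread; the values and the connect_worker() helper are purely illustrative.

```r
library(future)

# Illustrative values, not tuned: send a keepalive probe every 60 seconds and
# tolerate up to 10 unanswered probes before ssh gives up on the connection.
ssh_opts <- c(
  "-o", "ServerAliveInterval=60",
  "-o", "ServerAliveCountMax=10"
)

# Hypothetical helper: launch one PSOCK worker on 'host' over SSH with the
# options above and make it the active backend for subsequent futures.
connect_worker <- function(host = "myserver", opts = ssh_opts) {
  cl <- makeClusterPSOCK(host, rshopts = opts)
  plan(cluster, workers = cl)
  invisible(cl)
}
```

Calling connect_worker("myserver") before creating the future would set this up. Whether it helps depends on how long the blip lasts and on what the system management software does to the session.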

If the connection drops, I'm not sure there's anything that can be "fixed" in the future package per se, other than fixing the connection itself. There is also the discussion on adding support for restarting futures (e.g. Issues #188, #205).
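Until something like that exists, a restart can be approximated in user code. The following is only a minimal sketch, not a feature of the future package: value_with_retry() and its arguments are hypothetical names, and it assumes the dropped connection surfaces as a condition of class FutureError. Errors raised by the future's own expression are ordinary errors and are deliberately not retried.

```r
library(future)

# Hypothetical helper: 'make_future' is a function that (re)creates the future.
# For a remote worker it would typically also need to re-establish the backend
# with plan(), since the old connection may be dead.
value_with_retry <- function(make_future, max_tries = 3L, wait = 30) {
  for (attempt in seq_len(max_tries)) {
    failure <- NULL
    result <- tryCatch(
      value(make_future()),
      FutureError = function(e) {
        failure <<- e  # remember the infrastructure failure
        NULL
      }
    )
    if (is.null(failure)) return(result)
    message(sprintf("Attempt %d of %d failed: %s",
                    attempt, max_tries, conditionMessage(failure)))
    if (attempt < max_tries) Sys.sleep(wait)  # give the connection time to recover
  }
  stop(failure)  # re-signal the last failure once all attempts are used up
}
```

Each retry reruns the whole computation from scratch, so for a job that takes hours this only makes sense if the work is checkpointed to disk or the failure is rare.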

Having said this, nothing prevents someone from implementing a more robust future backend that, for instance, can reconnect and restart a remote worker if it goes down. But the underlying PSOCK workers provided by the parallel package don't support this, and I'm pretty sure they never will. It's possible that batchtools has some mechanism for this. I'm not fully up to date with its features, but I know people have asked about restarting batchtools if R crashes. If batchtools supports this, then you could try the future.batchtools backend.

futureverse / future

Long Running Remote Future Terminated #236