Open HenrikBengtsson opened 6 years ago
This would be a super helpful feature! Working on HPC via future.batchtools, I have often seen a single worker fail for unknown reasons, and then the whole batch fails just because of that. It would be awesome if the single task that failed could be relaunched automatically.
Background
Being able to relaunch a future, that is, to re-evaluate a future expression that has already been evaluated in full or in part due to a failure, is useful when, for instance, the communication between master and worker fails. See Issues #154 and #188 for examples of such needs.
Implications
However, introducing the possibility to relaunch a future requires that the original state can be reproduced, including global variables, which are currently not recorded for 'sequential' and 'multicore' futures when using `lazy = FALSE`. In other words, this addition to the Future API has implications for the contract that future backends need to adhere to.

Moreover, not all R expressions can be re-evaluated, e.g. a mutable object may have been changed, or a resource such as the buffer of a read connection may have been consumed. This suggests that we need a way to indicate whether an R expression can be re-evaluated or not and, if so, what the default should be. An alternative approach is to ignore this constraint and simply rely on the caller (the code that orchestrates the futures) to handle such problems, which also means that each caller will implement their own solution, which is less ideal.
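To make the consumed-resource concern concrete, here is a small base-R illustration (no futures involved). The name `read_next` is made up for this example; the point is only that evaluating the very same expression a second time gives a different result, because the first evaluation advanced the connection.

```r
# A read connection is stateful: each read consumes part of its buffer.
con <- textConnection(c("first", "second"))

# Hypothetical helper standing in for "the expression of a future"
read_next <- function() readLines(con, n = 1L)

a <- read_next()  # first evaluation
b <- read_next()  # "re-evaluation" of the same expression

close(con)

# The two evaluations do not agree, so a naive relaunch would not
# reproduce the original computation.
same_result <- identical(a, b)
```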
Another question to be asked is whether we can assume all backends support re-evaluation or not. Maybe "re-launchability" should be an optional feature of a future, cf. Issue #172 (DESIGN: Future API - Minimal/Essential API and Extended/Optional API).
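For comparison, the caller-side alternative mentioned above could look roughly like the sketch below. It does not relaunch an existing future; it simply creates a fresh one on each attempt, using only `future()` and `value()`. The names `relaunch_value`, `make_task`, and `max_tries` are hypothetical and not part of the future package, and the failure is simulated.

```r
library(future)
plan(sequential)

# Hypothetical caller-side helper: create a new future per attempt and
# return the first value that is obtained without an error.
relaunch_value <- function(make_future, max_tries = 3L) {
  for (attempt in seq_len(max_tries)) {
    f <- make_future(attempt)
    res <- tryCatch(value(f), error = identity)
    if (!inherits(res, "error")) return(res)
  }
  stop(res)  # all attempts failed: re-signal the last error
}

# Simulated flaky task: fails on the first two attempts
make_task <- function(attempt) {
  future({
    if (attempt < 3L) stop("simulated worker failure")
    attempt * 10
  })
}

result <- relaunch_value(make_task, max_tries = 5L)
```

Note that this only works because `make_task()` can rebuild the expression and its globals from scratch, which is exactly the reproducibility requirement discussed above.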
Prototype
Ignoring all of the above complications, it's quite easy to relaunch a future. All that is needed is to reset the future (drop all collected results/values and reset the internal "state") and then relaunch it. None of this is part of the public frontend API; I just wanted to say that the actual resetting part is easy. It's all of the above that holds us back from adding support for relaunching a (failed) future.