Hey there,
In a multi-worker run, e.g. on a cluster using Slurm, is there a good way to free up some resources, i.e., to stop or kill a worker without losing the result for the parameter combination that worker was testing? Currently, if I just kill a worker, the result for the corresponding parameter combination is lost, and the next free worker does not continue or restart that combination. Is there a way to kill a worker so that the next free worker restarts or continues the killed worker's job?
Thanks,
Thomas