JuliaParallel / ClusterManagers.jl


Job dies when a node fails in SLURM #120

Closed mkschleg closed 5 years ago

mkschleg commented 5 years ago

When running on a cluster with SLURM, a node will occasionally fail on you, meaning one or more of your "tasks" gets killed or never starts. Currently, as soon as a single task hits this, the entire job is killed. What we would want instead is for the failed task to stop receiving new functions to run, while the other tasks are allowed to continue.

Is this possible here, or would it require a change in the Distributed.jl stdlib? I've been looking into it, but I don't have the expertise to know where to start.
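For concreteness, here is a minimal sketch of the behavior I'm hoping for, using the `on_error` and `retry_delays` hooks that `pmap` in Distributed already exposes. The worker count and `work` function are hypothetical, and I'm not sure this actually survives a node-level failure under SLURM (that's part of the question):

```julia
using Distributed, ClusterManagers

# Hypothetical allocation: 4 SLURM tasks.
addprocs(SlurmManager(4))

@everywhere work(x) = x^2  # stand-in for the real computation

# A task that raises (e.g. a ProcessExitedException from a dead
# worker) is retried a few times, then recorded as `missing`,
# rather than taking the whole job down.
results = pmap(work, 1:100;
               retry_delays = ExponentialBackOff(n = 3),
               on_error = e -> (@warn "task failed" exception = e; missing))
```

In other words, the ideal would be for a dead worker to simply drop out of the pool so calls like the above keep running on the surviving workers.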

mkschleg commented 5 years ago

This might just be me being a dumb dumb.