Open eddienko opened 1 year ago
I agree that this would be a good improvement. Perhaps instead of making the `DaskJob` behave the same way as `Job`, we should replace the internal `Pod` in the `DaskJob` with a `Job` so that we can leverage the existing functionality.
Would it be possible to add `backoffLimit` to DaskJobs? Kubernetes Jobs have this field so that the job is reported as failed only if the pod fails a certain number of times (see below). Could we add this to DaskJobs as well? I have been using it in Jobs because Dask sometimes "just hangs/crashes" in very long jobs, and restarting the job fixes that.
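For reference, here is a rough sketch of where `backoffLimit` sits on a plain Kubernetes `batch/v1` Job, built as a Python dict. The helper name `make_job_manifest`, the job name, the image, and the command are placeholders made up for illustration; this shows only the standard Job field, not anything the `DaskJob` resource currently supports.

```python
import json


def make_job_manifest(name, image, command, backoff_limit=4):
    """Build a batch/v1 Job manifest as a plain dict (illustrative helper)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Number of retries before Kubernetes marks the Job as failed.
            "backoffLimit": backoff_limit,
            "template": {
                "spec": {
                    # Job pod templates must use "Never" or "OnFailure".
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Placeholder values; substitute your own image and entrypoint.
    manifest = make_job_manifest(
        name="long-running-dask-script",
        image="ghcr.io/dask/dask:latest",
        command=["python", "my_script.py"],
        backoff_limit=4,
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be passed to `kubectl apply -f -`. With `restartPolicy: Never`, each retry starts a fresh pod, which matches the "restart the job when it hangs or crashes" workaround described above.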