dask / dask-kubernetes

Native Kubernetes integration for Dask
https://kubernetes.dask.org
BSD 3-Clause "New" or "Revised" License

Add backoffLimit to DaskJobs #695

Open eddienko opened 1 year ago

eddienko commented 1 year ago

Would it be possible to add backoffLimit to DaskJobs? Kubernetes Jobs have this argument so that the job is reported as failed only if the pod fails a certain number of times (see below). Could we add this to DaskJobs as well? I have been using this argument in jobs because Dask sometimes "just hangs/crashes" in very long jobs, and restarting the job fixes that.

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4

jacobtomlinson commented 1 year ago

I agree that this would be a good improvement. Perhaps instead of making the DaskJob behave the same way as Job we should replace the internal Pod in the DaskJob with a Job so that we can leverage the existing functionality.
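
As a rough illustration of that idea (not the dask-kubernetes implementation), the sketch below wraps a pod spec in a batch/v1 Job manifest so that Kubernetes itself handles retries through backoffLimit. The helper name wrap_pod_in_job and its parameters are hypothetical. The pod spec reuses the pi example from the issue, and the manifest is built as a plain dict with no Kubernetes client involved.

```python
import copy


def wrap_pod_in_job(name, pod_spec, backoff_limit=4):
    """Return a batch/v1 Job manifest (plain dict) that runs pod_spec.

    Hypothetical helper for illustration only; it is not part of
    dask-kubernetes.
    """
    spec = copy.deepcopy(pod_spec)  # leave the caller's dict untouched
    # A Job's pod template only accepts restartPolicy Never or OnFailure,
    # so default to Never when the pod spec does not set one.
    spec.setdefault("restartPolicy", "Never")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": backoff_limit,
            "template": {"spec": spec},
        },
    }


# Pod spec taken from the Job example earlier in the thread.
pi_pod_spec = {
    "containers": [
        {
            "name": "pi",
            "image": "perl:5.34.0",
            "command": ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"],
        }
    ]
}

job_manifest = wrap_pod_in_job("pi", pi_pod_spec, backoff_limit=4)
```

The resulting dict has the same shape as the YAML Job above, so the operator would only need to create a Job rather than a bare Pod and could leave the retry bookkeeping to Kubernetes.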