beacon-biosignals / julia_pod

k8s native julia development
MIT License

avoid zombie pod retries #14

Closed kolia closed 3 years ago

kolia commented 3 years ago

Currently, julia pods that exit with a non-zero status trigger a job retry: the job re-spawns a pod, which ends up being a zombie.

Apparently setting restartPolicy: Never on the pod spec is not enough.

Maybe setting the job spec's backoffLimit will do the trick.

Repro: start a julia_pod and run exit(42) in the Julia REPL; the job will respawn a pod that nobody is attached to, i.e. a zombie.

omus commented 3 years ago

Keep in mind that the restartPolicy applies to the Pod, and not to the Job itself: there is no automatic Job restart once the Job status is type: Failed. That is, the Job termination mechanisms activated with .spec.activeDeadlineSeconds and .spec.backoffLimit result in a permanent Job failure that requires manual intervention to resolve.

https://kubernetes.io/docs/concepts/workloads/controllers/job/

kolia commented 3 years ago

@omus any thoughts on how to make it so that exit(42) doesn't lead to a zombie?

omus commented 3 years ago

backoffLimit: 0 seems to be what you want in combination with restartPolicy: Never.
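
For reference, here is a minimal sketch of where the two settings sit in a Job manifest: backoffLimit belongs to the Job spec, while restartPolicy belongs to the pod template spec. The helper build_job_manifest, the job name, and the image are hypothetical placeholders for illustration, not what julia_pod actually generates.

```python
import json


def build_job_manifest(name, image, command):
    """Build a batch/v1 Job manifest (as a dict) that does not retry failed pods.

    All arguments are illustrative placeholders, not julia_pod's real values.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Job-level setting: 0 retries, so the Job is marked Failed
            # after the first pod failure instead of spawning a new pod.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    # Pod-level setting: do not restart containers in place.
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                }
            },
        },
    }


# Hypothetical example values.
manifest = build_job_manifest("julia-pod-example", "julia:1.6", ["julia"])
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed manifest could be piped into kubectl apply -f - to check whether exit(42) still leaves a respawned pod behind.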