appuio / nagios-plugins-openshift

Nagios/Icinga 2 Plugins for monitoring OpenShift clusters
BSD 3-Clause "New" or "Revised" License

Ignore failed count when determining whether Job succeeded #51

Closed: simu closed this 5 years ago

simu commented 5 years ago

The Kubernetes Job controller only sets the completion time on a Job once the Job's succeeded count is >= the Job's .spec.completions field if that field is set, or >= 1 otherwise. Until then, the controller keeps retrying the Job unless the number of retries has exceeded .spec.backoffLimit.

Therefore a Job can have both a failed count > 0 and a succeeded count > 0 and still count as successful, which the previous implementation of the object_stats check did not allow for.

This commit removes the requirement that the failed count be 0 for a Job execution to be treated as successful. The presence of a completion timestamp, together with an active count of 0 and a succeeded count > 0, is sufficient to determine that a Job completed successfully; see the Job controller implementation [1].

[1] https://github.com/kubernetes/kubernetes/blob/8211cabfb2bf3b2b531b13589843130cb47df1b1/pkg/controller/job/job_controller.go#L518-L570
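
For illustration, the change in condition can be sketched roughly as follows. This is a minimal sketch, not the plugin's actual code: the function names `job_succeeded` and `job_succeeded_old` are hypothetical, and the input is assumed to be a Job's `.status` object parsed from JSON, with the `completionTime`, `active`, `succeeded` and `failed` fields.

```python
from typing import Any, Mapping


def job_succeeded(status: Mapping[str, Any]) -> bool:
    """Return True if a Job's .status indicates successful completion.

    Hypothetical helper for illustration; not the plugin's real code.
    """
    has_completion_time = bool(status.get("completionTime"))
    # Counters that are zero are typically omitted from the status,
    # so treat missing (or null) values as 0.
    active = status.get("active") or 0
    succeeded = status.get("succeeded") or 0
    # The failed count is deliberately ignored: a Job may have been
    # retried after failures and still have completed successfully.
    return has_completion_time and active == 0 and succeeded > 0


def job_succeeded_old(status: Mapping[str, Any]) -> bool:
    """Previous behaviour as described above: also require failed == 0."""
    failed = status.get("failed") or 0
    return job_succeeded(status) and failed == 0
```

With a status such as `{"completionTime": "2019-01-01T00:00:00Z", "succeeded": 1, "failed": 2}`, the new condition reports success while the old one does not.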

simu commented 5 years ago

Technically, the check could simply be the presence of the completion timestamp, as that field is only set when the Job controller determines that the Job has completed successfully, cf. the linked code.
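
A sketch of that simpler variant, assuming the input is the JSON list document produced by something like `oc get jobs -o json` (an object with an `items` array). The function name `count_completed_jobs` is hypothetical and only serves as an illustration.

```python
import json


def count_completed_jobs(job_list_json: str) -> int:
    """Count Jobs whose status carries a completion timestamp.

    Hypothetical helper: relies solely on .status.completionTime being set.
    """
    items = json.loads(job_list_json).get("items", [])
    return sum(
        1
        for job in items
        if (job.get("status") or {}).get("completionTime")
    )
```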