Return ok when resource cannot be found

agapoff / check_kubernetes

Nagios/Icinga/Zabbix style plugin for checking Kubernetes


Return ok when resource cannot be found #13

Closed: hydrapolic closed this issue 3 years ago

hydrapolic commented 3 years ago

Hello, I'm trying to run all checks against a test GCP k8s cluster, and two of the checks return exit code 2:

apiserver
OK. Kuberenetes apiserver health is OK
0

components
OK. Healthy:  etcd-0 scheduler controller-manager etcd-1
0

nodes
OK. 1 nodes are Ready
0

pods
13 pods ready, 0 pods succeeded, 0 pods not ready
0

daemonsets
OK. 7 daemonsets are ready
0

unboundpvs
OK. 0 persistentvolumes correctly bound.
0

replicasets
OK. 12 replicasets are ready
0

statefulsets
No statefulsets found
2

tls
OK. 2 TLS secrets are OK
0

jobs
API call failed: the server could not find the requested resource
2

When run from Nagios, those two checks are reported as CRITICAL. Could we adjust them so that statefulsets and jobs behave like unboundpvs, for example? There, no objects are found and the check returns OK with exit code 0.
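For reference, Nagios-style monitoring maps a plugin's exit code to a service state: 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. That mapping is why the two checks that exit with 2 show up as CRITICAL. The following minimal Python sketch applies the mapping to the exit codes reported above. The dictionary and function names are illustrative and are not part of check_kubernetes or Nagios. Treating out-of-range codes as UNKNOWN is an assumption of this sketch.

```python
# Standard Nagios plugin exit codes and the service state each one maps to.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def nagios_state(exit_code: int) -> str:
    """Return the service state for a plugin exit code.

    Codes outside 0-3 are treated as UNKNOWN (an assumption of this sketch).
    """
    return NAGIOS_STATES.get(exit_code, "UNKNOWN")


# Exit codes copied from the check_kubernetes run shown in this issue.
results = {
    "apiserver": 0,
    "components": 0,
    "nodes": 0,
    "pods": 0,
    "daemonsets": 0,
    "unboundpvs": 0,
    "replicasets": 0,
    "statefulsets": 2,
    "tls": 0,
    "jobs": 2,
}

if __name__ == "__main__":
    # Print how each mode would be displayed by the monitoring system.
    for mode, code in results.items():
        print(f"{mode}: exit {code} -> {nagios_state(code)}")
```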

Thanks for check_kubernetes :)

agapoff commented 3 years ago

Hi, I think the absence of objects that are expected to be present is a critical issue. So if you have no jobs or statefulsets, just don't create those services in your monitoring setup. Ignoring the absence of objects could at least be made configurable, but it should not be enabled by default.

agapoff commented 3 years ago

That said, the behavior of the jobs mode could perhaps be changed. It was contributed in a PR by external contributors, and I haven't spent any time thinking about it.

hydrapolic commented 3 years ago

I thought a check is more about saying whether there really is a problem. For example, should check_load fire CRITICAL when the load is 0.0? Or should check_mailq fire CRITICAL when there is no mail in the queue? Similarly, I thought that having 0 objects of a given type (which also means 0 problematic ones) should return OK.

If the behavior can be made configurable, that's fine by me. And yes, I could attach only the specific checks that apply to each k8s cluster, but that's tedious. It's much easier to say: these are my k8s clusters, monitor everything on them, and report only if there is a real issue (like failed jobs or deployments that restart forever).

agapoff commented 3 years ago

I have fixed the "jobs" mode; it was using the wrong API. Now the absence of jobs is ignored and treated as success. However, the CRITICAL exit code for missing statefulsets is explicitly defined in the code (which was also contributed in a PR), so I can't change this predefined default behavior.

hydrapolic commented 3 years ago

Thank you, the jobs check works as expected now. Maybe introduce a flag (OK_IF_MISSING) that would change EXITCODE=2 to 0 for statefulsets? If not, that's OK; I'll change it locally in my repo :)
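The following minimal Python sketch shows one way the proposed opt-in flag could behave. It assumes a hypothetical `OK_IF_MISSING` option, which is not an existing check_kubernetes setting, and the plugin's real code is not shown in this thread. When no statefulsets are found, the check keeps the current CRITICAL (2) default unless the flag is enabled, in which case it returns OK (0). The function name, its parameters and the "not ready"/"ready" messages are illustrative only. Reading the flag from an environment variable is likewise an assumption.

```python
import os
from typing import Tuple

OK, CRITICAL = 0, 2  # Nagios exit codes used by this sketch


def statefulsets_exit_code(found: int, not_ready: int,
                           ok_if_missing: bool = False) -> Tuple[int, str]:
    """Hypothetical statefulsets check with an opt-in OK_IF_MISSING flag.

    found: number of statefulsets returned by the API
    not_ready: how many of those statefulsets are not ready
    """
    if found == 0:
        # Default stays CRITICAL, as in the plugin; the flag only opts out.
        if ok_if_missing:
            return OK, "OK. No statefulsets found"
        return CRITICAL, "No statefulsets found"
    if not_ready > 0:
        return CRITICAL, f"{not_ready} of {found} statefulsets are not ready"
    return OK, f"OK. {found} statefulsets are ready"


if __name__ == "__main__":
    # Hypothetical: enable the flag with OK_IF_MISSING=1 in the environment.
    flag = os.environ.get("OK_IF_MISSING", "") == "1"
    code, message = statefulsets_exit_code(found=0, not_ready=0, ok_if_missing=flag)
    print(message, code)
```

Keeping the flag off by default preserves the maintainer's position that missing objects are critical. A real plugin would then exit with `code` (for example via `sys.exit(code)`) so that Nagios picks up the state.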

agapoff commented 3 years ago

Sure, your merge request is welcome.