Open ChrsMark opened 1 year ago
Hi! We just realized that we haven't looked into this issue in a while. We're sorry! We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:. Thank you for your contribution!
Background context
One basic SLI/SLO for Kubernetes clusters should be on `Leadership Switch`. Kubernetes native components, including cluster-autoscaler, kube-controller-manager, and kube-scheduler, are using leader-with-lease in client-go.
As Kubernetes operators we would like to monitor:
Running without a leader for any period of time is critical for a production Kubernetes cluster. Hence we need to define proper SLIs/SLOs based on these observations.
This information can be retrieved from the `kube-state-metrics` Service, and the SLOs could look like the following:

SLO_a: `kube_lease_owner` should not be equal to zero for more than 30 seconds. That should indicate a CRITICAL error.

SLO_b: `avg(kube_lease_renew_time)` should not be greater than `0.5s` over the last 10 minutes. That should indicate a WARNING.

At the moment we don't have a metricset/data_stream that specifically collects this information from `kube-state-metrics`. Hence the goal of this issue is the following:

TODOs

`lease` information from `kube-state-metrics`
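
To make the two proposed SLOs concrete, here is a minimal Python sketch of how they could be evaluated once the lease samples have been collected. It is illustrative only: the `Sample` type and the `slo_a_critical` / `slo_b_warning` helpers are hypothetical names, not part of any existing metricset, and the sketch takes the metric values as given (how `kube_lease_renew_time` would be turned into a value comparable to `0.5s` is an assumption left outside the sketch).

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    """One scraped metric sample (hypothetical helper type)."""
    timestamp: float  # scrape time, in seconds
    value: float      # metric value as reported


def slo_a_critical(owner_samples, now, max_zero_seconds=30.0):
    """SLO_a: CRITICAL if kube_lease_owner has been zero for more than 30s.

    Looks for the run of zero-valued samples that is still ongoing at `now`
    and compares its length to the allowed maximum.
    """
    zero_since = None
    for s in sorted(owner_samples, key=lambda s: s.timestamp):
        if s.timestamp > now:
            break
        if s.value == 0:
            if zero_since is None:
                zero_since = s.timestamp  # start of a leaderless run
        else:
            zero_since = None  # a leader was observed again
    return zero_since is not None and now - zero_since > max_zero_seconds


def slo_b_warning(renew_samples, now, window_seconds=600.0, threshold=0.5):
    """SLO_b: WARNING if the average over the last 10 minutes exceeds 0.5."""
    recent = [
        s.value
        for s in renew_samples
        if now - window_seconds <= s.timestamp <= now
    ]
    return bool(recent) and mean(recent) > threshold
```

For example, an owner series that drops to zero at `t=10` and is still zero at `t=45` would trip SLO_a, while the same series evaluated at `t=30` would not yet do so.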
FYI @gizas @rameshelastic @mlunadia