elastic / integrations

Elastic Integrations
https://www.elastic.co/integrations

Add state_lease data_stream for kubernetes integration #5363

Open ChrsMark opened 1 year ago

ChrsMark commented 1 year ago

Background context

One basic SLI/SLO for Kubernetes clusters should cover leadership switches.

Kubernetes native components, including cluster-autoscaler, kube-controller-manager, and kube-scheduler, use lease-based leader election (leader-with-lease) from client-go.

As Kubernetes operators we would like to monitor:

  1. (SLI_a) Leaderless: when there is no leader (need to define the SLI which would mean a critical error)
  2. (SLI_b) Time of leadership switch (need to define the SLI which would be a warning)

Running without a leader for any period of time is critical for a production Kubernetes cluster. Hence we need to define proper SLIs/SLOs based on these observations.

This information can be retrieved from the kube-state-metrics Service and looks like the following:

# HELP kube_lease_owner Information about the Lease's owner.
# TYPE kube_lease_owner gauge
kube_lease_owner{lease="kind-control-plane",owner_kind="Node",owner_name="kind-control-plane"} 1
# HELP kube_lease_renew_time Kube lease renew time.
# TYPE kube_lease_renew_time gauge
kube_lease_renew_time{lease="kind-control-plane"} 1.676268601e+09
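To illustrate how these samples could be consumed, here is a minimal Python sketch that parses the text exposition format shown above into (name, labels, value) tuples. The function name `parse_samples` and the regular expressions are illustrative assumptions, not part of kube-state-metrics or the Elastic integration; a real collector would more likely rely on an existing Prometheus parser, and this sketch skips lines that carry a trailing timestamp.

```python
import re

# Hypothetical helper patterns: a sample line is `name{labels} value`.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<val>(?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples, skipping comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a plain `name{labels} value` line
        labels = {
            m.group("key"): m.group("val")
            for m in LABEL_RE.finditer(match.group("labels") or "")
        }
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


EXAMPLE = """\
# HELP kube_lease_owner Information about the Lease's owner.
# TYPE kube_lease_owner gauge
kube_lease_owner{lease="kind-control-plane",owner_kind="Node",owner_name="kind-control-plane"} 1
# HELP kube_lease_renew_time Kube lease renew time.
# TYPE kube_lease_renew_time gauge
kube_lease_renew_time{lease="kind-control-plane"} 1.676268601e+09
"""

if __name__ == "__main__":
    for name, labels, value in parse_samples(EXAMPLE):
        print(name, labels, value)
```

Note that the value of kube_lease_renew_time in this sample is a Unix timestamp in seconds, so any duration has to be derived from it rather than read directly.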

SLO_a: kube_lease_owner should not be equal to zero for more than 30 seconds. A violation indicates a CRITICAL error.

SLO_b: avg(kube_lease_renew_time) should not be greater than 0.5s over the last 10 minutes. A violation indicates a WARNING.
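As a rough sketch of how SLO_a might be evaluated, the Python below takes a time-ordered series of kube_lease_owner observations for one lease and reports CRITICAL when the lease has been continuously ownerless for more than 30 seconds. The function names and the observation format are hypothetical. It is also an assumption that an ownerless lease shows up either as a value of 0 or as a missing sample (represented here as None); the sketch treats both the same way. SLO_b is left out because the way the switch time is derived from the metrics still needs to be defined.

```python
LEADERLESS_CRITICAL_SECONDS = 30.0  # threshold taken from SLO_a


def leaderless_duration(observations):
    """Seconds the lease has been continuously ownerless up to the last observation.

    `observations` is a list of (unix_ts, owner_value) pairs sorted by time.
    An owner_value of 0 or None is treated as "no owner" (an assumption).
    """
    if not observations:
        return 0.0
    last_ts = observations[-1][0]
    start = None
    # Walk backwards to find where the current ownerless run began.
    for ts, value in reversed(observations):
        if value not in (0, None):
            break
        start = ts
    return 0.0 if start is None else float(last_ts - start)


def check_slo_a(observations, threshold=LEADERLESS_CRITICAL_SECONDS):
    """Return "CRITICAL" if the ownerless run exceeds the threshold, else "OK"."""
    if leaderless_duration(observations) > threshold:
        return "CRITICAL"
    return "OK"


if __name__ == "__main__":
    # Owner present at t=0, then absent from t=10 through t=50.
    print(check_slo_a([(0, 1), (10, 0), (20, 0), (50, 0)]))
    # Owner came back at the latest observation.
    print(check_slo_a([(0, 0), (10, 1)]))
```

In an alerting rule or Watcher the same logic would normally be expressed as a query over the collected data stream; the function form here only makes the threshold logic explicit.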

At the moment we don't have a metricset/data_stream that collects this information from kube-state-metrics. Hence the goals of this issue are the following:

TODOs

  1. create the metricset/data_stream that specifically collects the lease information from kube-state-metrics
  2. provide some basic Watchers/Alerts, similar to those in https://github.com/elastic/integrations/issues/4997

FYI @gizas @rameshelastic @mlunadia

botelastic[bot] commented 7 months ago

Hi! We just realized that we haven't looked into this issue in a while. We're sorry! We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:. Thank you for your contribution!