elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

Expose state of reconciliation as operator metrics #3559

Open pebrc opened 4 years ago

pebrc commented 4 years ago

It could be useful to expose metrics that reflect the internal state of the operator for monitoring purposes. This would enable admins to monitor and alert on ECK not being able to make progress. A few examples I can think of:

One open question is whether we should expose these events implicitly, through errors that we could then pick up via APM (which already comes integrated with Kibana Alerting, for example). Alternatively, this could just be an additional Prometheus metric that we expose.
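For the Prometheus route, here is a minimal sketch (not the actual implementation) of how a custom per-resource metric could be registered on the operator's existing controller-runtime /metrics endpoint. The metric name eck_reconciliation_errors_total and the resource label are made up for illustration:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"

	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// reconciliationErrors counts reconciliation attempts that ended in an error,
// labelled by the namespaced name of the reconciled resource
// (e.g. resource="default/elasticsearch").
var reconciliationErrors = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "eck_reconciliation_errors_total",
		Help: "Total number of reconciliation attempts that returned an error.",
	},
	[]string{"resource"},
)

func init() {
	// Anything registered with the controller-runtime registry is served on
	// the manager's existing /metrics endpoint, so it can be scraped and
	// alerted on without extra plumbing.
	ctrlmetrics.Registry.MustRegister(reconciliationErrors)
}
```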

anyasabo commented 4 years ago

What metrics might we expose? I'm struggling to think of a concrete example where we know we cannot make progress without some third-party intervention, rather than it probably being a transient state.

pebrc commented 4 years ago

I guess we could instrument the reconciliation and count whenever we return early, and reset that count when a reconciliation succeeds:

aborted_reconciliations_count{resource="default/elasticsearch"} 25
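A rough sketch of that idea, assuming a hypothetical doReconcile helper that reports whether the reconciliation ran to completion; because the value is reset on success, a Prometheus gauge (which can be set back to zero) fits the semantics better than a counter:

```go
package elasticsearch

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
	ctrl "sigs.k8s.io/controller-runtime"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// A gauge rather than a counter, because Prometheus counters cannot be reset
// and we want the value to drop back to zero on a successful reconciliation.
var abortedReconciliations = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "aborted_reconciliations_count",
		Help: "Consecutive reconciliations that returned early, per resource.",
	},
	[]string{"resource"},
)

func init() {
	ctrlmetrics.Registry.MustRegister(abortedReconciliations)
}

// Reconciler stands in for the actual ECK reconciler.
type Reconciler struct{}

// doReconcile is a hypothetical helper; done reports whether the
// reconciliation ran to completion instead of returning early.
func (r *Reconciler) doReconcile(ctx context.Context, req ctrl.Request) (done bool, res ctrl.Result, err error) {
	// ... actual reconciliation logic ...
	return true, ctrl.Result{}, nil
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	resource := req.NamespacedName.String() // e.g. "default/elasticsearch"

	done, result, err := r.doReconcile(ctx, req)
	if !done {
		// Returned early: bump the per-resource abort count.
		abortedReconciliations.WithLabelValues(resource).Inc()
		return result, err
	}

	// A full reconciliation completed: reset the count.
	abortedReconciliations.WithLabelValues(resource).Set(0)
	return result, err
}
```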

We could then either expose this metric directly, or derive from it another metric with a threshold that expresses the stuck state, something like:

stuck_reconciliation = aborted_reconciliations_count > 100
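As a sketch, assuming a fixed threshold of 100 consecutive aborts (the number above), the derived metric could be maintained in the operator itself; all names here are hypothetical:

```go
package elasticsearch

import (
	"github.com/prometheus/client_golang/prometheus"

	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// Hypothetical fixed threshold; 100 is taken from the example above.
const stuckThreshold = 100

// stuckReconciliation is 1 while a resource is considered stuck, 0 otherwise.
var stuckReconciliation = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "stuck_reconciliation",
		Help: "1 if the resource exceeded the aborted-reconciliation threshold, 0 otherwise.",
	},
	[]string{"resource"},
)

func init() {
	ctrlmetrics.Registry.MustRegister(stuckReconciliation)
}

// updateStuckState would be called wherever the abort count is tracked,
// e.g. right after incrementing or resetting aborted_reconciliations_count.
func updateStuckState(resource string, consecutiveAborts int) {
	if consecutiveAborts > stuckThreshold {
		stuckReconciliation.WithLabelValues(resource).Set(1)
	} else {
		stuckReconciliation.WithLabelValues(resource).Set(0)
	}
}
```

The threshold could equally live in a Prometheus alerting rule over aborted_reconciliations_count instead of in the operator, which would keep the operator simpler and let admins tune the threshold per environment.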

But there are some subtle issues with this approach. For example, data migration would also count as an aborted reconciliation, but it can be completely OK to be "stuck" in this state for prolonged periods of time, e.g. on a downscale of a cluster holding a lot of data. Unless, of course, the data migration itself is not progressing, which can be the case if there are contradictory allocation filters set on some indices.

SeanPlacchetti commented 4 years ago

@pebrc We literally had this happen during a rolling upgrade, and monitoring would have been amazing. It's definitely a blind spot: we use Kibana/Beats as an observability tool elsewhere, but for ECK we're still checking pod logs and events.

I'd add that we run multiple clusters and tend to update our observability cluster last, and that cluster only has system-generated indices, so it isn't down or otherwise an issue while we're doing upgrades. Also, we ended up having to delete an index from two of our three data pods while one was looping, then blow the looping one away and let Kubernetes spin up a fresh one and reallocate the shards.