cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

Add Prometheus metrics reporting additional node liveness statuses #79390

Open thtruo opened 2 years ago

thtruo commented 2 years ago

Is your feature request related to a problem? Please describe.

Currently, monitoring and alerting on the liveness_livenodes metric from _status/vars produces too many false positives, especially when autoscaling or temporary workload conditions leave nodes unable to respond to liveness even though they are not technically dead. The liveness_livenodes metric does not distinguish whether a node is LIVE, UNAVAILABLE, DECOMMISSIONING, DECOMMISSIONED, or DEAD (see node states).

Ideally, CRDB would export a separate Prometheus metric for each node state, so that whoever is monitoring CRDB can tell exactly how many nodes are in each liveness status.
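To make the request concrete, here is a minimal sketch of what per-status counts could look like in the Prometheus text exposition format. The metric name liveness_nodes_by_status, the status label, and the example node map are all hypothetical and are not existing CockroachDB metrics or data; only the five state names come from the description above.

```python
from collections import Counter

# Node states named in the issue description.
STATUSES = ("LIVE", "UNAVAILABLE", "DECOMMISSIONING", "DECOMMISSIONED", "DEAD")


def render_status_gauges(node_statuses, metric="liveness_nodes_by_status"):
    """Render one gauge sample per status in Prometheus text format.

    `metric` is a hypothetical name chosen for illustration only.
    `node_statuses` maps a node ID to one of the names in STATUSES.
    """
    counts = Counter(node_statuses.values())
    unknown = set(counts) - set(STATUSES)
    if unknown:
        raise ValueError(f"unrecognized statuses: {sorted(unknown)}")
    lines = [f"# TYPE {metric} gauge"]
    # Emit every status, including zero counts, so alert rules never
    # have to deal with a missing series.
    for status in STATUSES:
        lines.append(f'{metric}{{status="{status.lower()}"}} {counts[status]}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up cluster state: node ID -> status.
    example = {1: "LIVE", 2: "LIVE", 3: "UNAVAILABLE", 4: "DECOMMISSIONING"}
    print(render_status_gauges(example), end="")
```

With counts split out like this, an alert could fire only on the dead series and ignore nodes that are merely unavailable or decommissioning, which is the disambiguation liveness_livenodes alone cannot provide.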

Jira issue: CRDB-14788

Epic CRDB-32131

thtruo commented 2 years ago

Internal CRL note: this was originally raised as an internal Jira epic. We're tracking this explicitly as a GitHub issue instead.

dhartunian commented 2 years ago

I think this can be managed as a set of Metric gauges that can be set to the liveness enum value integer.
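One possible reading of this suggestion, sketched under assumptions: a gauge per node whose value is the integer of that node's liveness enum. The metric name liveness_node_status, the node_id label, and the integer values in the enum below are illustrative assumptions and are not copied from the CockroachDB source.

```python
from enum import IntEnum


class LivenessStatus(IntEnum):
    # Integer values are assumptions for illustration only.
    UNKNOWN = 0
    DEAD = 1
    UNAVAILABLE = 2
    LIVE = 3
    DECOMMISSIONING = 4
    DECOMMISSIONED = 5


def render_enum_gauges(node_statuses, metric="liveness_node_status"):
    """Render one gauge per node, valued with the status enum integer.

    `metric` and the `node_id` label are hypothetical names.
    `node_statuses` maps a node ID to a LivenessStatus member.
    """
    lines = [f"# TYPE {metric} gauge"]
    for node_id in sorted(node_statuses):
        value = int(node_statuses[node_id])
        lines.append(f'{metric}{{node_id="{node_id}"}} {value}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up cluster state: node ID -> status.
    example = {
        1: LivenessStatus.LIVE,
        2: LivenessStatus.UNAVAILABLE,
        3: LivenessStatus.DEAD,
    }
    print(render_enum_gauges(example), end="")
```

A consumer of such gauges would compare values for equality against a known enum integer. Enum-valued gauges cannot be meaningfully summed or averaged, so per-status counts may be easier to aggregate; which shape fits better is a design question for the thread.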