Add Prometheus metrics reporting additional node liveness statuses

thtruo commented 2 years ago

Is your feature request related to a problem? Please describe.

Currently, monitoring and alerting on the liveness_livenodes metric from _status/vars produces too many false positives, especially when autoscaling or temporary workload conditions leave nodes unable to respond to liveness checks even though they are not technically dead. The current liveness_livenodes metric does not distinguish whether a node is LIVE, UNAVAILABLE, DECOMMISSIONING, DECOMMISSIONED, or DEAD (see node states).

Ideally, CRDB would export a separate Prometheus metric for each node state, so that whoever is monitoring CRDB can tell exactly how many nodes are in each liveness status.

Jira issue: CRDB-14788

Epic CRDB-32131

thtruo commented 2 years ago

Internal CRL note: this was originally raised as an internal Jira epic. We're now tracking it explicitly as a GitHub issue instead.

dhartunian commented 2 years ago

I think this can be managed as a set of Metric gauges, each set to the integer value of the liveness enum.
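A rough sketch of that idea, assuming one gauge per node whose value is the enum's integer. The `LivenessStatus` type, its integer values, the `nodeGauge` type, and the metric name `liveness_node_status` are all hypothetical stand-ins; the real enum values and metric types in CockroachDB may differ.

```go
package main

import "fmt"

// LivenessStatus is a hypothetical stand-in for the liveness enum. The
// integer values are illustrative and may not match CockroachDB's.
type LivenessStatus int

const (
	StatusLive LivenessStatus = iota
	StatusUnavailable
	StatusDecommissioning
	StatusDecommissioned
	StatusDead
)

// nodeGauge models a gauge that holds one integer value for one node.
type nodeGauge struct {
	nodeID int
	value  int64
}

// Update stores the enum's integer value in the gauge.
func (g *nodeGauge) Update(s LivenessStatus) { g.value = int64(s) }

// Sample renders the gauge as a single Prometheus text-format sample.
func (g *nodeGauge) Sample() string {
	return fmt.Sprintf("liveness_node_status{node_id=\"%d\"} %d", g.nodeID, g.value)
}

func main() {
	g := &nodeGauge{nodeID: 3}
	g.Update(StatusDecommissioning)
	fmt.Println(g.Sample())
}
```

One trade-off with this encoding is that consumers need to know the enum-to-integer mapping to write alert rules, whereas per-status counts are self-describing.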

cockroachdb/cockroach #79390