Open thtruo opened 2 years ago
Internal CRL note: this was originally raised as an internal Jira epic. We're tracking it explicitly as a GH issue instead.
I think this can be managed as a set of Metric gauges that can be set to the liveness enum value integer.
Is your feature request related to a problem? Please describe.

Currently, monitoring and alerting on the liveness_livenodes metric from _status/vars results in too many false positives, especially when autoscaling or temporary workload conditions leave nodes unable to respond to liveness checks even though they are not technically dead. The current liveness_livenodes metric does not disambiguate whether a node is LIVE, UNAVAILABLE, DECOMMISSIONING, DECOMMISSIONED, or DEAD (see node states).

Ideally, CRDB can export specific Prometheus metrics that correspond to a specific node state so that the end consumer who's monitoring CRDB can tell exactly how many nodes are in a particular liveness status.
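
To make the request concrete, here is a minimal sketch of what per-status gauges could look like in the Prometheus text format. It is not CockroachDB code: the metric name `liveness_nodes`, the `status` label, and the helper functions are hypothetical, and the status names are simply the five states listed above.

```python
from collections import Counter
from enum import Enum


class NodeStatus(Enum):
    # The five states named in this issue; label values are illustrative.
    LIVE = "live"
    UNAVAILABLE = "unavailable"
    DECOMMISSIONING = "decommissioning"
    DECOMMISSIONED = "decommissioned"
    DEAD = "dead"


def count_by_status(node_statuses):
    """Count nodes per liveness status, reporting 0 for unseen statuses."""
    counts = Counter(node_statuses.values())
    return {status: counts.get(status, 0) for status in NodeStatus}


def render_prometheus(counts, metric_name="liveness_nodes"):
    """Render one gauge sample per status (hypothetical metric name)."""
    lines = [f"# TYPE {metric_name} gauge"]
    for status in NodeStatus:
        lines.append(f'{metric_name}{{status="{status.value}"}} {counts[status]}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical cluster: node ID -> current liveness status.
    statuses = {
        1: NodeStatus.LIVE,
        2: NodeStatus.LIVE,
        3: NodeStatus.UNAVAILABLE,
        4: NodeStatus.DECOMMISSIONING,
    }
    print(render_prometheus(count_by_status(statuses)), end="")
```

With gauges shaped like this, an alert could target only the states that matter, for example firing on `liveness_nodes{status="dead"} > 0` while ignoring nodes that are temporarily unavailable or decommissioning.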
Jira issue: CRDB-14788
Epic CRDB-32131